Introductory Econometrics: A Modern Approach
Sixth Edition

Jeffrey M. Wooldridge
Michigan State University

Australia · Brazil · Mexico · Singapore · United Kingdom · United States

Copyright © 2016 Cengage Learning. All Rights Reserved. May not be copied, scanned, or duplicated, in whole or in part. Due to electronic rights, some third party content may be suppressed from the eBook and/or eChapter(s). Editorial review has deemed that any suppressed content does not materially affect the overall learning experience. Cengage Learning reserves the right to remove additional content at any time if subsequent rights restrictions require it.

This is an electronic version of the print textbook. Due to electronic rights restrictions, some third party content may be suppressed. Editorial review has deemed that any suppressed content does not materially affect the overall learning experience. The publisher reserves the right to remove content from this title at any time if subsequent rights restrictions require it. For valuable information on pricing, previous editions, changes to current editions, and alternate formats, please visit www.cengage.com/highered to search by ISBN, author, title, or keyword for materials in your areas of interest.

Important Notice: Media content referenced within the product description or the product text may not be available in the eBook version.

Printed in the United States of America. Print Number: 01. Print Year: 2015.

© 2016, 2013 Cengage Learning. ALL RIGHTS RESERVED. No part of this work covered by the copyright herein may be reproduced, transmitted, stored, or used in any form or by any means (graphic, electronic, or mechanical, including but not limited to photocopying, recording, scanning, digitizing, taping, Web distribution, information networks, or information storage and retrieval systems), except as permitted under Section 107 or 108 of the 1976 United States Copyright Act, without the prior written permission of the publisher.

Introductory Econometrics, 6e
Jeffrey M. Wooldridge

Vice President, General Manager, Social Science & Qualitative Business: Erin Joyner
Product Director: Mike Worls
Associate Product Manager: Tara Singer
Content Developer: Chris Rader
Marketing Director: Kristen Hurd
Marketing Manager: Katie Jergens
Marketing Coordinator: Chris Walz
Art and Cover Direction, Production Management, and Composition: Lumina Datamatics, Inc.
Intellectual Property Analyst: Jennifer Nonenmacher
Project Manager: Sarah Shainwald
Manufacturing Planner: Kevin Kluck
Cover Image: kentoh/Shutterstock
Unless otherwise noted, all items © Cengage Learning.

For product information and technology assistance, contact us at Cengage Learning Customer & Sales Support, 1-800-354-9706. For permission to use material from this text or product, submit all requests online at www.cengage.com/permissions. Further permissions questions can be emailed to permissionrequest@cengage.com.

Library of Congress Control Number: 2015944828
Student Edition ISBN: 978-1-305-27010-7

Cengage Learning
20 Channel Center Street
Boston, MA 02210
USA

Cengage Learning is a leading provider of customized learning solutions with employees residing in nearly 40 different
countries and sales in more than 125 countries around the world. Find your local representative at www.cengage.com. Cengage Learning products are represented in Canada by Nelson Education, Ltd. To learn more about Cengage Learning Solutions, visit www.cengage.com. Purchase any of our products at your local college store or at our preferred online store, www.cengagebrain.com.

WCN: 02-200-203

Brief Contents

Chapter 1 The Nature of Econometrics and Economic Data

Part 1: Regression Analysis with Cross-Sectional Data
Chapter 2 The Simple Regression Model
Chapter 3 Multiple Regression Analysis: Estimation
Chapter 4 Multiple Regression Analysis: Inference
Chapter 5 Multiple Regression Analysis: OLS Asymptotics
Chapter 6 Multiple Regression Analysis: Further Issues
Chapter 7 Multiple Regression Analysis with Qualitative Information: Binary (or Dummy) Variables
Chapter 8 Heteroskedasticity
Chapter 9 More on Specification and Data Issues

Part 2: Regression Analysis with Time Series Data
Chapter 10 Basic Regression Analysis with Time Series Data
Chapter 11 Further Issues in Using OLS with Time Series Data
Chapter 12 Serial Correlation and Heteroskedasticity in Time Series Regressions

Part 3: Advanced Topics
Chapter 13 Pooling Cross Sections Across Time: Simple Panel Data Methods
Chapter 14 Advanced Panel Data Methods
Chapter 15 Instrumental Variables Estimation and Two Stage Least Squares
Chapter 16 Simultaneous Equations Models
Chapter 17 Limited Dependent Variable Models and Sample Selection Corrections
Chapter 18 Advanced Time Series Topics
Chapter 19 Carrying Out an Empirical Project

Appendices
Appendix A Basic Mathematical Tools
Appendix B Fundamentals of Probability
Appendix C Fundamentals of Mathematical Statistics
Appendix D Summary of Matrix Algebra
Appendix E The Linear Regression Model in Matrix Form
Appendix F Answers to Chapter Questions
Appendix G Statistical Tables
References
Glossary
Index

Contents

Preface
About the Author

Chapter 1 The Nature of Econometrics and Economic Data
1.1 What Is Econometrics?
1.2 Steps in Empirical Economic Analysis
1.3 The Structure of Economic Data
1.3a Cross-Sectional Data
1.3b Time Series Data
1.3c Pooled Cross Sections
1.3d Panel or Longitudinal Data
1.3e A Comment on Data Structures
1.4 Causality and the Notion of Ceteris Paribus in Econometric Analysis
Summary, Key Terms, Problems, Computer Exercises

Part 1: Regression Analysis with Cross-Sectional Data

Chapter 2 The Simple Regression Model
2.1 Definition of the Simple Regression Model
2.2 Deriving the Ordinary Least Squares Estimates
2.2a A Note on Terminology
2.3 Properties of OLS on Any Sample of Data
2.3a Fitted Values and Residuals
2.3b Algebraic Properties of OLS Statistics
2.3c Goodness-of-Fit
2.4 Units of Measurement and Functional Form
2.4a The Effects of Changing Units of Measurement on OLS Statistics
2.4b Incorporating Nonlinearities in Simple Regression
2.4c The Meaning of "Linear" Regression
2.5 Expected Values and Variances of the OLS Estimators
2.5a Unbiasedness of OLS
2.5b Variances of the OLS Estimators
2.5c Estimating the Error Variance
2.6 Regression through the Origin and Regression on a Constant
Summary, Key Terms, Problems, Computer Exercises, Appendix 2A

Chapter 3 Multiple Regression Analysis: Estimation
3.1 Motivation for Multiple Regression
3.1a The Model with Two Independent Variables
3.1b The Model with k Independent Variables
3.2 Mechanics and Interpretation of Ordinary Least Squares
3.2a Obtaining the OLS Estimates
3.2b Interpreting the OLS Regression Equation
3.2c On the Meaning of "Holding Other Factors Fixed" in Multiple Regression
3.2d Changing More Than One Independent Variable Simultaneously
3.2e OLS Fitted Values and Residuals
3.2f A "Partialling Out" Interpretation of Multiple Regression
3.2g Comparison of Simple and Multiple Regression Estimates
3.2h Goodness-of-Fit
3.2i Regression through the Origin
3.3 The Expected Value of the OLS Estimators
3.3a Including Irrelevant Variables in a Regression Model
3.3b Omitted Variable Bias: The Simple Case
3.3c Omitted Variable Bias: More General Cases
3.4 The Variance of the OLS Estimators
3.4a The Components of the OLS Variances: Multicollinearity
3.4b Variances in Misspecified Models
3.4c Estimating σ²: Standard Errors of the OLS Estimators
3.5 Efficiency of OLS: The Gauss-Markov Theorem
3.6 Some Comments on the Language of Multiple Regression Analysis
Summary, Key Terms, Problems, Computer Exercises, Appendix 3A

Chapter 4 Multiple Regression Analysis: Inference
4.1 Sampling Distributions of the OLS Estimators
4.2 Testing Hypotheses about a Single Population Parameter: The t Test
4.2a Testing against One-Sided Alternatives
4.2b Two-Sided Alternatives
4.2c Testing Other Hypotheses about βj
4.2d Computing p-Values for t Tests
4.2e A Reminder on the Language of Classical Hypothesis Testing
4.2f Economic, or Practical, versus Statistical Significance
4.3 Confidence Intervals
4.4 Testing Hypotheses about a Single Linear Combination of the Parameters
4.5 Testing Multiple Linear Restrictions: The F Test
4.5a Testing Exclusion Restrictions
4.5b Relationship between F and t Statistics
4.5c The R-Squared Form of the F Statistic
4.5d Computing p-Values for F Tests
4.5e The F Statistic for Overall Significance of a Regression
4.5f Testing General Linear Restrictions
4.6 Reporting Regression Results
Summary, Key Terms, Problems, Computer Exercises

Chapter 5 Multiple Regression Analysis: OLS Asymptotics
5.1 Consistency
5.1a Deriving the Inconsistency in OLS
5.2 Asymptotic Normality and Large Sample Inference
5.2a Other Large Sample Tests: The Lagrange Multiplier Statistic
5.3 Asymptotic Efficiency of OLS
Summary, Key Terms, Problems, Computer Exercises, Appendix 5A

Chapter 6 Multiple Regression Analysis: Further Issues
6.1 Effects of Data Scaling on OLS Statistics
6.1a Beta Coefficients
6.2 More on Functional Form
6.2a More on Using Logarithmic Functional Forms
6.2b Models with Quadratics
6.2c Models with Interaction Terms
6.2d Computing Average Partial Effects
6.3 More on Goodness-of-Fit and Selection of Regressors
6.3a Adjusted R-Squared
6.3b Using Adjusted R-Squared to Choose between Nonnested Models
6.3c Controlling for Too Many Factors in Regression Analysis
6.3d Adding Regressors to Reduce the Error Variance
6.4 Prediction and Residual Analysis
6.4a Confidence Intervals for Predictions
6.4b Residual Analysis
6.4c Predicting y When log(y) Is the Dependent Variable
6.4d Predicting y When the Dependent Variable Is log(y)
Summary, Key Terms, Problems, Computer Exercises, Appendix 6A

Chapter 7 Multiple Regression Analysis with Qualitative Information: Binary (or Dummy) Variables
7.1 Describing Qualitative Information
7.2 A Single Dummy Independent Variable
7.2a Interpreting Coefficients on Dummy Explanatory Variables When the Dependent Variable Is log(y)
7.3 Using Dummy Variables for Multiple Categories
7.3a Incorporating Ordinal Information by Using Dummy Variables
7.4 Interactions Involving Dummy Variables
7.4a Interactions among Dummy Variables
7.4b Allowing for Different Slopes
7.4c Testing for Differences in Regression Functions across Groups
7.5 A Binary Dependent Variable: The Linear Probability Model
7.6 More on Policy Analysis and Program Evaluation
7.7 Interpreting Regression Results with Discrete Dependent Variables
Summary, Key Terms, Problems, Computer Exercises

Chapter 8 Heteroskedasticity
8.1 Consequences of Heteroskedasticity for OLS
8.2 Heteroskedasticity-Robust Inference after OLS Estimation
8.2a Computing Heteroskedasticity-Robust LM Tests
8.3 Testing for Heteroskedasticity
8.3a The White Test for Heteroskedasticity
8.4 Weighted Least Squares Estimation
8.4a The Heteroskedasticity Is Known up to a Multiplicative Constant
8.4b The Heteroskedasticity Function Must Be Estimated: Feasible GLS
8.4c What If the Assumed Heteroskedasticity Function Is Wrong?
8.4d Prediction and Prediction Intervals with Heteroskedasticity
8.5 The Linear Probability Model Revisited
Summary, Key Terms, Problems, Computer Exercises

Chapter 9 More on Specification and Data Issues
9.1 Functional Form Misspecification
9.1a RESET as a General Test for Functional Form Misspecification
9.1b Tests against Nonnested Alternatives
9.2 Using Proxy Variables for Unobserved Explanatory Variables
9.2a Using Lagged Dependent Variables as Proxy Variables
9.2b A Different Slant on Multiple Regression
9.3 Models with Random Slopes
9.4 Properties of OLS under Measurement Error
9.4a Measurement Error in the Dependent Variable
9.4b Measurement Error in an Explanatory Variable
9.5 Missing Data, Nonrandom Samples, and Outlying Observations
9.5a Missing Data
9.5b Nonrandom Samples
9.5c Outliers and Influential Observations
9.6 Least Absolute Deviations Estimation
Summary, Key Terms, Problems, Computer Exercises

Part 2: Regression Analysis with Time Series Data

Chapter 10 Basic Regression Analysis with Time Series Data
10.1 The Nature of Time Series Data
10.2 Examples of Time Series Regression Models
10.2a Static Models
10.2b Finite Distributed Lag Models
10.2c A Convention about the Time Index
10.3 Finite Sample Properties of OLS under Classical Assumptions
10.3a Unbiasedness of OLS
10.3b The Variances of the OLS Estimators and the Gauss-Markov Theorem
10.3c Inference under the Classical Linear Model Assumptions
10.4 Functional Form, Dummy Variables, and Index Numbers
10.5 Trends and Seasonality
10.5a Characterizing Trending Time Series
10.5b Using Trending Variables in Regression Analysis
10.5c A Detrending Interpretation of Regressions with a Time Trend
10.5d Computing R-Squared When the Dependent Variable Is Trending
10.5e Seasonality
Summary, Key Terms, Problems, Computer Exercises

Chapter 11 Further Issues in Using OLS with Time Series Data
11.1 Stationary and Weakly Dependent Time Series
11.1a Stationary and Nonstationary Time Series
11.1b Weakly Dependent Time Series
11.2 Asymptotic Properties of OLS
11.3 Using Highly Persistent Time Series in Regression Analysis
11.3a Highly Persistent Time Series
11.3b Transformations on Highly Persistent Time Series
11.3c Deciding Whether a Time Series Is I(1)
11.4 Dynamically Complete Models and the Absence of Serial Correlation
11.5 The Homoskedasticity Assumption for Time Series Models
Summary, Key Terms, Problems, Computer Exercises

Chapter 12 Serial Correlation and Heteroskedasticity in Time Series Regressions
12.1 Properties of OLS with Serially Correlated Errors
12.1a Unbiasedness and Consistency
12.1b Efficiency and Inference
12.1c Goodness of Fit
12.1d Serial Correlation in the Presence of Lagged Dependent Variables
12.2 Testing for Serial Correlation
12.2a A t Test for AR(1) Serial Correlation with Strictly Exogenous Regressors
12.2b The Durbin-Watson Test under Classical Assumptions
12.2c Testing for AR(1) Serial Correlation without Strictly Exogenous Regressors
12.2d Testing for Higher Order Serial Correlation
12.3 Correcting for Serial Correlation with Strictly Exogenous Regressors
12.3a Obtaining the Best Linear Unbiased Estimator in the AR(1) Model
12.3b Feasible GLS Estimation with AR(1) Errors
12.3c Comparing OLS and FGLS
12.3d Correcting for Higher Order Serial Correlation
12.4 Differencing and Serial Correlation
12.5 Serial Correlation-Robust Inference after OLS
12.6 Heteroskedasticity in Time Series Regressions
12.6a Heteroskedasticity-Robust Statistics
12.6b Testing for Heteroskedasticity
12.6c Autoregressive Conditional Heteroskedasticity
12.6d Heteroskedasticity and Serial Correlation in Regression Models
Summary, Key Terms, Problems, Computer Exercises

Part 3: Advanced Topics

Chapter 13 Pooling Cross Sections Across Time: Simple Panel Data Methods
13.1 Pooling Independent Cross Sections across Time
13.1a The Chow Test for Structural Change across Time
13.2 Policy Analysis with Pooled Cross Sections
13.3 Two-Period Panel Data Analysis
13.3a Organizing Panel Data
13.4 Policy Analysis with Two-Period Panel Data
13.5 Differencing with More Than Two Time Periods
13.5a Potential Pitfalls in First Differencing Panel Data
Summary, Key Terms, Problems, Computer Exercises, Appendix 13A

Chapter 14 Advanced Panel Data Methods
14.1 Fixed Effects Estimation
14.1a The Dummy Variable Regression
14.1b Fixed Effects or First Differencing?
14.1c Fixed Effects with Unbalanced Panels
14.2 Random Effects Models
14.2a Random Effects or Fixed Effects?
14.3 The Correlated Random Effects Approach
14.3a Unbalanced Panels
14.4 Applying Panel Data Methods to Other Data Structures
Summary, Key Terms, Problems, Computer Exercises, Appendix 14A

Chapter 15 Instrumental Variables Estimation and Two Stage Least Squares
15.1 Motivation: Omitted Variables in a Simple Regression Model
15.1a Statistical Inference with the IV Estimator
15.1b Properties of IV with a Poor Instrumental Variable
15.1c Computing R-Squared after IV Estimation
15.2 IV Estimation of the Multiple Regression Model
15.3 Two Stage Least Squares
15.3a A Single Endogenous Explanatory Variable
15.3b Multicollinearity and 2SLS
15.3c Detecting Weak Instruments
15.3d Multiple Endogenous Explanatory Variables
15.3e Testing Multiple Hypotheses after 2SLS Estimation
15.4 IV Solutions to Errors-in-Variables Problems
15.5 Testing for Endogeneity and Testing Overidentifying Restrictions
15.5a Testing for Endogeneity
15.5b Testing Overidentification Restrictions
15.6 2SLS with Heteroskedasticity
15.7 Applying 2SLS to Time Series Equations
15.8 Applying 2SLS to Pooled Cross Sections and Panel Data
Summary, Key Terms, Problems, Computer Exercises, Appendix 15A

Chapter 16 Simultaneous Equations Models
16.1 The Nature of Simultaneous Equations Models
16.2 Simultaneity Bias in OLS
16.3 Identifying and Estimating a Structural Equation
16.3a Identification in a Two-Equation System
16.3b Estimation by 2SLS
16.4 Systems with More Than Two Equations
16.4a Identification in Systems with Three or More Equations
16.4b Estimation
16.5 Simultaneous Equations Models with Time Series
16.6 Simultaneous Equations Models with Panel Data
Summary, Key Terms, Problems, Computer Exercises

Chapter 17 Limited Dependent Variable Models and Sample Selection Corrections
17.1 Logit and Probit Models for Binary Response
17.1a Specifying Logit and Probit Models
17.1b Maximum Likelihood Estimation of Logit and Probit Models
17.1c Testing Multiple Hypotheses
17.1d Interpreting the Logit and Probit Estimates
17.2 The Tobit Model for Corner Solution Responses
17.2a Interpreting the Tobit Estimates
17.2b Specification Issues in Tobit Models
17.3 The Poisson Regression Model
17.4 Censored and Truncated Regression Models
17.4a Censored Regression Models
17.4b Truncated Regression Models
17.5 Sample Selection Corrections
17.5a When Is OLS on the Selected Sample Consistent?
17.5b Incidental Truncation
Summary, Key Terms, Problems, Computer Exercises, Appendix 17A, Appendix 17B

Chapter 18 Advanced Time Series Topics
18.1 Infinite Distributed Lag Models
18.1a The Geometric (or Koyck) Distributed Lag
18.1b Rational Distributed Lag Models
18.2 Testing for Unit Roots
18.3 Spurious Regression
18.4 Cointegration and Error Correction Models
18.4a Cointegration
18.4b Error Correction Models
18.5 Forecasting
18.5a Types of Regression Models Used for Forecasting
18.5b One-Step-Ahead Forecasting
18.5c Comparing One-Step-Ahead Forecasts
18.5d Multiple-Step-Ahead Forecasts
18.5e Forecasting Trending, Seasonal, and Integrated Processes
Summary, Key Terms, Problems, Computer Exercises

Chapter 19 Carrying Out an Empirical Project
19.1 Posing a Question
19.2 Literature Review
19.3 Data Collection
19.3a Deciding on the Appropriate Data Set
19.3b Entering and Storing Your Data
19.3c Inspecting, Cleaning, and Summarizing Your Data
19.4 Econometric Analysis
19.5 Writing an Empirical Paper
19.5a Introduction
19.5b Conceptual (or Theoretical) Framework
19.5c Econometric Models and Estimation Methods
19.5d The Data
19.5e Results
19.5f Conclusions
19.5g Style Hints
Summary, Key Terms, Sample Empirical Projects, List of Journals, Data Sources

Appendix A Basic Mathematical Tools
A.1 The Summation Operator and Descriptive Statistics
A.2 Properties of Linear Functions
A.3 Proportions and Percentages
A.4 Some Special Functions and Their Properties
A.4a Quadratic Functions
A.4b The Natural Logarithm
A.4c The Exponential Function
A.5 Differential Calculus
Summary, Key Terms, Problems

Appendix B Fundamentals of Probability
B.1 Random Variables and Their Probability Distributions
B.1a Discrete Random Variables
B.1b Continuous Random Variables
B.2 Joint Distributions, Conditional Distributions, and Independence
B.2a Joint Distributions and Independence
B.2b Conditional Distributions
B.3 Features of Probability Distributions
B.3a A Measure of Central Tendency: The Expected Value
B.3b Properties of Expected Values
B.3c Another Measure of Central Tendency: The Median
B.3d Measures of Variability: Variance and Standard Deviation
B.3e Variance
B.3f Standard Deviation
B.3g Standardizing a Random Variable
B.3h Skewness and Kurtosis
B.4 Features of Joint and Conditional Distributions
B.4a Measures of Association: Covariance and Correlation
B.4b Covariance
B.4c Correlation Coefficient
B.4d Variance of Sums of Random Variables
B.4e Conditional Expectation
B.4f Properties of Conditional Expectation
B.4g Conditional Variance
B.5 The Normal and Related Distributions
B.5a The Normal Distribution
B.5b The Standard Normal Distribution
B.5c Additional Properties of the Normal Distribution
B.5d The Chi-Square Distribution
B.5e The t Distribution
B.5f The F Distribution
Summary, Key Terms, Problems

Appendix C Fundamentals of Mathematical Statistics
C.1 Populations, Parameters, and Random Sampling
C.1a Sampling
C.2 Finite Sample Properties of Estimators
C.2a Estimators and Estimates
C.2b Unbiasedness
C.2d The Sampling Variance of Estimators
C.2e Efficiency
C.3 Asymptotic or Large Sample Properties of Estimators
C.3a Consistency
C.3b Asymptotic Normality
C.4 General Approaches to Parameter Estimation
C.4a Method of Moments
C.4b Maximum Likelihood
C.4c Least Squares
C.5 Interval Estimation and Confidence Intervals
C.5a The Nature of Interval Estimation
C.5b Confidence Intervals for the Mean from a Normally Distributed Population
C.5c A Simple Rule of Thumb for a 95% Confidence Interval
C.5d Asymptotic Confidence Intervals for Nonnormal Populations
C.6 Hypothesis Testing
C.6a Fundamentals of Hypothesis Testing
C.6b Testing Hypotheses about the Mean in a Normal Population
C.6c Asymptotic Tests for Nonnormal Populations
C.6d Computing and Using p-Values
C.6e The Relationship between Confidence Intervals and Hypothesis Testing
C.6f Practical versus Statistical Significance
C.7 Remarks on Notation
Summary, Key Terms, Problems

Appendix D Summary of Matrix Algebra
D.1 Basic Definitions
D.2 Matrix Operations
D.2a Matrix Addition
D.2b Scalar Multiplication
D.2c Matrix Multiplication
D.2d Transpose
D.2e Partitioned Matrix Multiplication
D.2f Trace
D.2g Inverse
D.3 Linear Independence and Rank of a Matrix
D.4 Quadratic Forms and Positive Definite Matrices
D.5 Idempotent Matrices
D.6 Differentiation of Linear and Quadratic Forms
D.7 Moments and Distributions of Random Vectors
D.7a Expected Value
D.7b Variance-Covariance Matrix
D.7c Multivariate Normal Distribution
D.7d Chi-Square Distribution
D.7e t Distribution
D.7f F Distribution
Summary, Key Terms, Problems
Appendix E The Linear Regression Model in Matrix Form
E.1 The Model and Ordinary Least Squares Estimation
E.1a The Frisch-Waugh Theorem
E.2 Finite Sample Properties of OLS
E.3 Statistical Inference
E.4 Some Asymptotic Analysis
E.4a Wald Statistics for Testing Multiple Hypotheses
Summary, Key Terms, Problems

Appendix F Answers to Chapter Questions
Appendix G Statistical Tables
References
Glossary
Index

Preface

My motivation for writing the first edition of Introductory Econometrics: A Modern Approach was that I saw a fairly wide gap between how econometrics is taught to undergraduates and how empirical researchers think about and apply econometric methods. I became convinced that teaching introductory econometrics from the perspective of professional users of econometrics would actually simplify the presentation, in addition to making the subject much more interesting.

Based on the positive reactions to earlier editions, it appears that my hunch was correct. Many instructors, having a variety of backgrounds and interests and teaching students with different levels of preparation, have embraced the modern approach to econometrics espoused in this text. The emphasis in this edition is still on applying econometrics to real-world problems. Each econometric method is motivated by a particular issue facing researchers analyzing nonexperimental data. The focus in the main text is on understanding and interpreting the assumptions in light of actual empirical applications; the mathematics required is no more than college algebra and basic probability and statistics.

Organized for Today's Econometrics Instructor

The sixth edition preserves the overall organization of the fifth. The most noticeable feature that distinguishes this text from most others is the separation of topics by the kind of data being analyzed. This is a clear departure from the traditional approach, which presents a linear model, lists all assumptions that may be needed at some future point in the analysis, and then proves or asserts results without clearly connecting them to the assumptions. My approach is first to treat, in Part 1, multiple regression analysis with cross-sectional data, under the assumption of random sampling. This setting is natural to students because they are familiar with random sampling from a population in their introductory statistics courses. Importantly, it allows us to distinguish assumptions made about the underlying population regression model (assumptions that can be given economic or behavioral content) from assumptions about how the data were sampled. Discussions about the consequences of nonrandom sampling can be treated in an intuitive fashion after the students have a good grasp of the multiple regression model estimated using random samples.

An important feature of a modern approach is that the explanatory variables, along with the dependent variable, are treated as outcomes of random variables. For the social sciences, allowing random explanatory variables is much more realistic than the traditional assumption of nonrandom explanatory variables.
As a nontrivial benefit, the population model/random sampling approach reduces the number of assumptions that students must absorb and understand. Ironically, the classical approach to regression analysis, which treats the explanatory variables as fixed in repeated samples and is still pervasive in introductory texts, literally applies to data collected in an experimental setting. In addition, the contortions required to state and explain assumptions can be confusing to students.

My focus on the population model emphasizes that the fundamental assumptions underlying regression analysis, such as the zero mean assumption on the unobservable error term, are properly stated conditional on the explanatory variables. This leads to a clear understanding of the kinds of problems, such as heteroskedasticity (nonconstant variance), that can invalidate standard inference procedures. By focusing on the population, I am also able to dispel several misconceptions that arise in econometrics texts at all levels. For example, I explain why the usual R-squared is still valid as a goodness-of-fit measure in the presence of heteroskedasticity (Chapter 8) or serially correlated errors (Chapter 12); I provide a simple demonstration that tests for functional form should not be viewed as general tests of omitted variables (Chapter 9); and I explain why one should always include in a regression model extra control variables that are uncorrelated with the explanatory variable of interest, which is often a key policy variable (Chapter 6).

Because the assumptions for cross-sectional analysis are relatively straightforward yet realistic, students can get involved early with serious cross-sectional applications without having to worry about the thorny issues of trends, seasonality, serial correlation, high persistence, and spurious regression that are ubiquitous in time series regression models. Initially, I figured that my treatment of regression with cross-sectional data followed by regression with time series data would find favor with instructors whose own research interests are in applied microeconomics, and that appears to be the case. It has been gratifying that adopters of the text with an applied time series bent have been equally enthusiastic about the structure of the text. By postponing the econometric analysis of time series data, I am able to put proper focus on the potential pitfalls in analyzing time series data that do not arise with cross-sectional data. In effect, time series econometrics finally gets the serious treatment it deserves in an introductory text.

As in the earlier editions, I have consciously chosen topics that are important for reading journal articles and for conducting basic empirical research. Within each topic, I have deliberately omitted many tests and estimation procedures that, while traditionally included in textbooks, have not withstood the empirical test of time. Likewise, I have emphasized more recent topics that have clearly demonstrated their usefulness, such as obtaining test statistics that are robust to heteroskedasticity or serial correlation of unknown form,
using multiple years of data for policy analysis, or solving the omitted variable problem by instrumental variables methods. I appear to have made fairly good choices, as I have received only a handful of suggestions for adding or deleting material.

I take a systematic approach throughout the text, by which I mean that each topic is presented by building on the previous material in a logical fashion, and assumptions are introduced only as they are needed to obtain a conclusion. For example, empirical researchers who use econometrics in their research understand that not all of the Gauss-Markov assumptions are needed to show that the ordinary least squares (OLS) estimators are unbiased. Yet the vast majority of econometrics texts introduce a complete set of assumptions (many of which are redundant or, in some cases, even logically conflicting) before proving the unbiasedness of OLS. Similarly, the normality assumption is often included among the assumptions that are needed for the Gauss-Markov Theorem, even though it is fairly well known that normality plays no role in showing that the OLS estimators are the best linear unbiased estimators.

My systematic approach is illustrated by the order of assumptions that I use for multiple regression in Part 1. This structure results in a natural progression for briefly summarizing the role of each assumption:

MLR.1: Introduce the population model and interpret the population parameters (which we hope to estimate).
MLR.2: Introduce random sampling from the population and describe the data that we use to estimate the population parameters.
MLR.3: Add the assumption on the explanatory variables that allows us to compute the estimates from our sample; this is the so-called no perfect collinearity assumption.
MLR.4: Assume that, in the population, the mean of the unobservable error does not depend on the values of the explanatory variables; this is the "mean independence" assumption combined with a zero population mean for the error, and it is the key assumption that delivers unbiasedness of OLS.

After introducing Assumptions MLR.1 to MLR.3, one can discuss the algebraic properties of ordinary least squares, that is, the properties of OLS for a particular set of data. By adding Assumption MLR.4, we can show that OLS is unbiased (and consistent). Assumption MLR.5 (homoskedasticity) is added for the Gauss-Markov Theorem and for the usual OLS variance formulas to be valid. Assumption MLR.6 (normality), which is not introduced until Chapter 4, is added to round out the classical linear model assumptions. The six assumptions are used to obtain exact statistical inference and to conclude that the OLS estimators have the smallest variances among all unbiased estimators.
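In symbols (a compact sketch of the standard statement, consistent with the progression just described), Assumptions MLR.1 and MLR.4 say that the population model is linear in the parameters and that the error has zero mean conditional on the explanatory variables:

$$ y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k + u, \qquad \mathrm{E}(u \mid x_1, x_2, \ldots, x_k) = 0, $$

so that the conditional expectation of y given the explanatory variables is exactly the population regression function, $\mathrm{E}(y \mid x_1, \ldots, x_k) = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k$.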
I use parallel approaches when I turn to the study of large-sample properties and when I treat regression for time series data in Part 2. The careful presentation and discussion of assumptions makes it relatively easy to transition to Part 3, which covers advanced topics that include using pooled cross-sectional data, exploiting panel data structures, and applying instrumental variables methods. Generally, I have strived to provide a unified view of econometrics, where all estimators and test statistics are obtained using just a few intuitively reasonable principles of estimation and testing (which, of course, also have rigorous justification). For example, regression-based tests for heteroskedasticity and serial correlation are easy for students to grasp because they already have a solid understanding of regression. This is in contrast to treatments that give a set of disjointed recipes for outdated econometric testing procedures.

Throughout the text, I emphasize ceteris paribus relationships, which is why, after one chapter on the simple regression model, I move to multiple regression analysis. The multiple regression setting motivates students to think about serious applications early. I also give prominence to policy analysis with all kinds of data structures. Practical topics, such as using proxy variables to obtain ceteris paribus effects and interpreting partial effects in models with interaction terms, are covered in a simple fashion.

New to This Edition

I have added new exercises to almost every chapter, including the appendices. Most of the new computer exercises use new data sets, including a data set on student performance and attending a Catholic high school and a time series data set on presidential approval ratings and gasoline prices. I have also added some harder problems that require derivations.

There are several changes to the text worth noting. Chapter 2 contains a more extensive discussion about the relationship between the simple regression coefficient and the correlation coefficient. Chapter 3 clarifies issues with comparing R-squareds from models when data are missing on some variables, thereby reducing sample sizes available for regressions with more explanatory variables.

Chapter 6 introduces the notion of an average partial effect (APE) for models linear in the parameters but including nonlinear functions, primarily quadratics and interaction terms. The notion of an APE, which was implicit in previous editions, has become an important concept in empirical work; understanding how to compute and interpret APEs in the context of OLS is a valuable skill. For more advanced classes, the introduction in Chapter 6 eases the way to the discussion of APEs in the nonlinear models studied in Chapter 17, which also includes an expanded discussion of APEs, including now showing APEs in tables alongside coefficients in logit, probit, and Tobit applications.
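As a small worked illustration of the APE idea (my example, using the quadratic case mentioned above): in the model

$$ y = \beta_0 + \beta_1 x + \beta_2 x^2 + u, \qquad \frac{\partial\, \mathrm{E}(y \mid x)}{\partial x} = \beta_1 + 2\beta_2 x, $$

the partial effect of x varies with x. Averaging the estimated effect across the sample gives a single summary number,

$$ \widehat{\mathrm{APE}} = \frac{1}{n}\sum_{i=1}^{n}\left( \hat{\beta}_1 + 2\hat{\beta}_2 x_i \right) = \hat{\beta}_1 + 2\hat{\beta}_2 \bar{x}, $$

which is the estimated effect of a one-unit increase in x for a "typical" member of the population.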
In Chapter 8, I refine some of the discussion involving the issue of heteroskedasticity, including an expanded discussion of Chow tests and a more precise description of weighted least squares when the weights must be estimated. Chapter 9, which contains some optional, slightly more advanced topics, defines terms that appear often in the large literature on missing data. A common practice in empirical work is to create indicator variables for missing data and to include them in a multiple regression analysis; Chapter 9 discusses how this method can be implemented and when it will produce unbiased and consistent estimators.

The treatment of unobserved effects panel data models in Chapter 14 has been expanded to include more of a discussion of unbalanced panel data sets, including how the fixed effects, random effects, and correlated random effects approaches still can be applied. Another important addition is a much more detailed discussion on applying fixed effects and random effects methods to cluster samples. I also include discussion of some subtle issues that can arise in using clustered standard errors when the data have been obtained from a random sampling scheme. Chapter 15 now has a more detailed discussion of the problem of weak instrumental variables, so that students can access the basics without having to track down more advanced sources.

Targeted at Undergraduates, Adaptable for Master's Students

The text is designed for undergraduate economics majors who have taken college algebra and one semester of introductory probability and statistics. (Appendices A, B, and C contain the requisite background material.) A one-semester or one-quarter econometrics course would not be expected to cover all, or even any, of the more advanced material in Part 3. A typical introductory course includes Chapters 1 through 8, which cover the basics of simple and multiple regression for cross-sectional data. Provided the emphasis is on intuition and interpreting the empirical examples, the material from the first eight chapters should be accessible to undergraduates in most economics departments. Most instructors will also want to cover at least parts of the chapters on regression analysis with time series data, Chapters 10 and 12, in varying degrees of depth. In the one-semester course that I teach at Michigan State, I cover Chapter 10 fairly carefully, give an overview of the material in Chapter 11, and cover the material on serial correlation in Chapter 12. I find that this basic one-semester course puts students on a solid footing to write empirical papers, such as a term paper, a senior seminar paper, or a senior thesis. Chapter 9 contains more specialized topics that arise in analyzing cross-sectional data, including data problems such as outliers and nonrandom sampling; for a one-semester course, it can be skipped without loss of continuity.

The structure of the text makes it ideal for a course with a cross-sectional or policy analysis focus: the time series chapters can be skipped in lieu of topics from Chapters 9 or 15. Chapter 13 is advanced only in the sense that it treats two new data structures: independently pooled cross sections and two-period panel data analysis. Such data structures are especially useful for policy analysis, and the chapter provides several examples. Students with a good grasp of Chapters 1 through 8 will have little difficulty with Chapter 13. Chapter 14 covers more advanced panel data methods and would probably be covered only in a second course. A good way to end a course on cross-sectional methods is to cover the rudiments of instrumental variables estimation in Chapter 15.

I have used selected material in Part 3, including Chapters 13 and 17, in a senior seminar geared to producing a serious research paper. Along with the basic one-semester course, students who have been exposed to basic panel data analysis, instrumental variables estimation, and limited dependent variable models are in a position to read large segments of the applied social sciences literature. Chapter 17 provides an introduction to the most common limited dependent variable models.

The text is also well suited for an introductory master's level course where the emphasis is on
applications rather than on derivations using matrix algebra. Several instructors have used the text to teach policy analysis at the master's level. For instructors wanting to present the material in matrix form, Appendices D and E are self-contained treatments of the matrix algebra and the multiple regression model in matrix form. At Michigan State, PhD students in many fields that require data analysis, including accounting, agricultural economics, development economics, economics of education, finance, international economics, labor economics, macroeconomics, political science, and public finance, have found the text to be a useful bridge between the empirical work that they read and the more theoretical econometrics they learn at the PhD level.

Design Features

Numerous in-text questions are scattered throughout, with answers supplied in Appendix F. These questions are intended to provide students with immediate feedback. Each chapter contains many numbered examples. Several of these are case studies drawn from recently published papers, though I have used my judgment to simplify the analysis, hopefully without sacrificing the main point. The end-of-chapter problems and computer exercises are heavily oriented toward empirical work, rather than complicated derivations; the students are asked to reason carefully based on what they have learned. The computer exercises often expand on the in-text examples. Several exercises use data sets from published works or similar data sets that are motivated by published research in economics and other fields.

A pioneering feature of this introductory econometrics text is the extensive glossary. The short definitions and descriptions are a helpful refresher for students studying for exams or reading empirical research that uses econometric methods. I have added and updated several entries for the fifth edition.

Data Sets: Available in Six Formats

This edition adds R data sets as an additional format for viewing and analyzing data. In response to popular demand, this edition also provides the Minitab format. With more than 100 data sets in six different formats, including Stata, EViews, Minitab, Microsoft Excel, and R, the instructor has many options for problem sets, examples, and term projects. Because most of the data sets come from actual research, some are very large. Except for partial lists of data sets to illustrate the various data structures, the data sets are not reported in the text. This book is geared to a course where computer work plays an integral role.
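The Stata and Excel files are plain, widely supported formats, so they can also be read directly from general-purpose languages. The snippet below is a minimal sketch of my own, not one of the text's supplements; it assumes a Stata-format file such as WAGE1.DTA from the text's data sets has been downloaded to the working directory.

```python
import pandas as pd  # pandas reads Stata (.dta) and Excel files directly

# Load one of the text's data sets from its Stata-format file.
# Substitute any of the .dta files distributed with the text.
wage1 = pd.read_stata("WAGE1.DTA")

print(wage1.shape)       # number of observations and variables
print(wage1.describe())  # quick summary statistics for each variable
```

The same pattern works for the Excel versions of the files via pandas.read_excel.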
Updated Data Sets Handbook

An extensive data description manual is also available online. This manual contains a list of data sources, along with suggestions for ways to use the data sets that are not described in the text. This unique handbook, created by author Jeffrey M. Wooldridge, lists the source of all data sets for quick reference and how each might be used. Because the data book contains page numbers, it is easy to see how the author used the data in the text. Students may want to view the descriptions of each data set, and it can help guide instructors in generating new homework exercises, exam problems, or term projects. The author also provides suggestions on improving the data sets in this detailed resource, which is available on the book's companion website at http://login.cengage.com; students can access it free at www.cengagebrain.com.

Instructor Supplements

Instructor's Manual with Solutions

The Instructor's Manual with Solutions contains answers to all problems and exercises, as well as teaching tips on how to present the material in each chapter. The Instructor's Manual also contains
your future workplace Automatic grading and immediate feedback helps you master content the right way the first time Copyright 2016 Cengage Learning All Rights Reserved May not be copied scanned or duplicated in whole or in part Due to electronic rights some third party content may be suppressed from the eBook andor eChapters Editorial review has deemed that any suppressed content does not materially affect the overall learning experience Cengage Learning reserves the right to remove additional content at any time if subsequent rights restrictions require it xviii Preface Student Solutions Manual Now you can maximize your study time and further your course success with this dynamic online resource This helpful Solutions Manual includes detailed steps and solutions to oddnumbered prob lems as well as computer exercises in the text This supplement is available as a free resource at wwwcengagebraincom Suggestions for Designing Your Course I have already commented on the contents of most of the chapters as well as possible outlines for courses Here I provide more specific comments about material in chapters that might be covered or skipped Chapter 9 has some interesting examples such as a wage regression that includes IQ score as an explanatory variable The rubric of proxy variables does not have to be formally introduced to present these kinds of examples and I typically do so when finishing up crosssectional analysis In Chapter 12 for a onesemester course I skip the material on serial correlation robust inference for ordinary least squares as well as dynamic models of heteroskedasticity Even in a second course I tend to spend only a little time on Chapter 16 which covers simultane ous equations analysis I have found that instructors differ widely in their opinions on the importance of teaching simultaneous equations models to undergraduates Some think this material is funda mental others think it is rarely applicable My own view is that simultaneous equations models are overused see Chapter 16 for a discussion If one reads applications carefully omitted variables and measurement error are much more likely to be the reason one adopts instrumental variables estima tion and this is why I use omitted variables to motivate instrumental variables estimation in Chapter 15 Still simultaneous equations models are indispensable for estimating demand and supply func tions and they apply in some other important cases as well Chapter 17 is the only chapter that considers models inherently nonlinear in their parameters and this puts an extra burden on the student The first material one should cover in this chapter is on probit and logit models for binary response My presentation of Tobit models and censored regression still appears to be novel in introductory texts I explicitly recognize that the Tobit model is applied to corner solution outcomes on random samples while censored regression is applied when the data col lection process censors the dependent variable at essentially arbitrary thresholds Chapter 18 covers some recent important topics from time series econometrics including test ing for unit roots and cointegration I cover this material only in a secondsemester course at either the undergraduate or masters level A fairly detailed introduction to forecasting is also included in Chapter 18 Chapter 19 which would be added to the syllabus for a course that requires a term paper is much more extensive than similar chapters in other texts It summarizes some of the methods appropriate for various 
Acknowledgments

I would like to thank those who reviewed and provided helpful comments for this and previous editions of the text:

Erica Johnson, Gonzaga University
Mary Ellen Benedict, Bowling Green State University
Yan Li, Temple University
Melissa Tartari, Yale University
Michael Allgrunn, University of South Dakota
Gregory Colman, Pace University
Yoo-Mi Chin, Missouri University of Science and Technology
Arsen Melkumian, Western Illinois University
Kevin J. Murphy, Oakland University
Kristine Grimsrud, University of New Mexico
Will Melick, Kenyon College
Philip H. Brown, Colby College
Argun Saatcioglu, University of Kansas
Ken Brown, University of Northern Iowa
Michael R. Jonas, University of San Francisco
Melissa Yeoh, Berry College
Nikolaos Papanikolaou, SUNY at New Paltz
Konstantin Golyaev, University of Minnesota
Soren Hauge, Ripon College
Kevin Williams, University of Minnesota
Hailong Qian, Saint Louis University
Rod Hissong, University of Texas at Arlington
Steven Cuellar, Sonoma State University
Yanan Di, Wagner College
John Fitzgerald, Bowdoin College
Philip N. Jefferson, Swarthmore College
Yongsheng Wang, Washington and Jefferson College
Sheng-Kai Chang, National Taiwan University
Damayanti Ghosh, Binghamton University
Susan Averett, Lafayette College
Kevin J. Mumford, Purdue University
Nicolai V. Kuminoff, Arizona State University
Subarna K. Samanta, The College of New Jersey
Jing Li, South Dakota State University
Gary Wagner, University of Arkansas-Little Rock
Kelly Cobourn, Boise State University
Timothy Dittmer, Central Washington University
Daniel Fischmar, Westminster College
Subha Mani, Fordham University
John Maluccio, Middlebury College
James Warner, College of Wooster
Christopher Magee, Bucknell University
Andrew Ewing, Eckerd College
Debra Israel, Indiana State University
Jay Goodliffe, Brigham Young University
Stanley R. Thompson, The Ohio State University
Michael Robinson, Mount Holyoke College
Ivan Jeliazkov, University of California, Irvine
Heather O'Neill, Ursinus College
Leslie Papke, Michigan State University
Timothy Vogelsang, Michigan State University
Stephen Woodbury, Michigan State University

Some of the changes I discussed earlier were driven by comments I received from people on this list, and I continue to mull over other specific suggestions made by one or more reviewers. Many students and teaching assistants, too numerous to list, have caught mistakes in earlier editions or have suggested rewording some paragraphs. I am grateful to them.
As always, it was a pleasure working with the team at Cengage Learning. Mike Worls, my longtime Product Director, has learned very well how to guide me with a firm yet gentle hand. Chris Rader has quickly mastered the difficult challenges of being the developmental editor of a dense, technical textbook. His careful reading of the manuscript and fine eye for detail have improved this sixth edition considerably.

This book is dedicated to my wife, Leslie Papke, who contributed materially to this edition by writing the initial versions of the Scientific Word slides for the chapters in Part 3; she then used the slides in her public policy course. Our children have contributed, too. Edmund has helped me keep the data handbook current, and Gwenyth keeps us entertained with her artistic talents.

Jeffrey M. Wooldridge

About the Author

Jeffrey M. Wooldridge is University Distinguished Professor of Economics at Michigan State University, where he has taught since 1991. From 1986 to 1991, he was an assistant professor of economics at the Massachusetts Institute of Technology. He received his bachelor of arts, with majors in computer science and economics, from the University of California, Berkeley, in 1982, and received his doctorate in economics in 1986 from the University of California, San Diego. He has published more than 60 articles in internationally recognized journals, as well as several book chapters. He is also the author of Econometric Analysis of Cross Section and Panel Data, second edition. His awards include an Alfred P. Sloan Research Fellowship, the Plura Scripsit award from Econometric Theory, the Sir Richard Stone prize from the Journal of Applied Econometrics, and three graduate teacher-of-the-year awards from MIT. He is a fellow of the Econometric Society and of the Journal of Econometrics. He is past editor of the Journal of Business and Economic Statistics and past econometrics coeditor of Economics Letters. He has served on the editorial boards of Econometric Theory, the Journal of Economic Literature, the Journal of Econometrics, the Review of Economics and Statistics, and the Stata Journal. He has also acted as an occasional econometrics consultant for Arthur Andersen, Charles River Associates, the Washington State Institute for Public Policy, Stratus Consulting, and Industrial Economics, Incorporated.
Chapter 1
The Nature of Econometrics and Economic Data

Chapter 1 discusses the scope of econometrics and raises general issues that arise in the application of econometric methods. Section 1.1 provides a brief discussion about the purpose and scope of econometrics and how it fits into economic analysis. Section 1.2 provides examples of how one can start with an economic theory and build a model that can be estimated using data. Section 1.3 examines the kinds of data sets that are used in business, economics, and other social sciences. Section 1.4 provides an intuitive discussion of the difficulties associated with the inference of causality in the social sciences.

1.1 What Is Econometrics?

Imagine that you are hired by your state government to evaluate the effectiveness of a publicly funded job training program. Suppose this program teaches workers various ways to use computers in the manufacturing process. The 20-week program offers courses during nonworking hours. Any hourly manufacturing worker may participate, and enrollment in all or part of the program is voluntary. You are to determine what, if any, effect the training program has on each worker's subsequent hourly wage.

Now suppose you work for an investment bank. You are to study the returns on different investment strategies involving short-term U.S. treasury bills to decide whether they comply with implied economic theories.

The task of answering such questions may seem daunting at first. At this point, you may only have a vague idea of the kind of data you would need to collect. By the end of this introductory econometrics course, you should know how to use econometric methods to formally evaluate a job training program or to test a simple economic theory.

Econometrics is based upon the development of statistical methods for estimating economic relationships, testing economic theories, and evaluating and implementing government and business policy. The most common application of econometrics is the forecasting of such important macroeconomic variables as interest rates, inflation rates, and gross domestic product (GDP). Whereas forecasts of economic indicators are highly visible and often widely published, econometric methods can be used in economic areas that have nothing to do with macroeconomic forecasting. For example, we will study the effects of political campaign expenditures on voting outcomes. We will consider the effect of school spending on student performance in the field of education. In addition, we will learn how to use econometric methods for forecasting economic time series.

Econometrics has evolved as a separate discipline from mathematical statistics because the former focuses on the problems inherent in collecting and analyzing nonexperimental economic data. Nonexperimental data are not accumulated through controlled experiments on individuals, firms, or segments of the economy. (Nonexperimental data are sometimes called observational data, or retrospective data, to emphasize the fact that the researcher is a passive collector of the data.)
Experimental data are often collected in laboratory environments in the natural sciences, but they are much more difficult to obtain in the social sciences. Although some social experiments can be devised, it is often impossible, prohibitively expensive, or morally repugnant to conduct the kinds of controlled experiments that would be needed to address economic issues. We give some specific examples of the differences between experimental and nonexperimental data in Section 1.4.

Naturally, econometricians have borrowed from mathematical statisticians whenever possible. The method of multiple regression analysis is the mainstay in both fields, but its focus and interpretation can differ markedly. In addition, economists have devised new techniques to deal with the complexities of economic data and to test the predictions of economic theories.

1.2 Steps in Empirical Economic Analysis

Econometric methods are relevant in virtually every branch of applied economics. They come into play either when we have an economic theory to test or when we have a relationship in mind that has some importance for business decisions or policy analysis. An empirical analysis uses data to test a theory or to estimate a relationship.

How does one go about structuring an empirical economic analysis? It may seem obvious, but it is worth emphasizing that the first step in any empirical analysis is the careful formulation of the question of interest. The question might deal with testing a certain aspect of an economic theory, or it might pertain to testing the effects of a government policy. In principle, econometric methods can be used to answer a wide range of questions.

In some cases, especially those that involve the testing of economic theories, a formal economic model is constructed. An economic model consists of mathematical equations that describe various relationships. Economists are well known for their building of models to describe a vast array of behaviors. For example, in intermediate microeconomics, individual consumption decisions, subject to a budget constraint, are described by mathematical models. The basic premise underlying these models is utility maximization. The assumption that individuals make choices to maximize their well-being, subject to resource constraints, gives us a very powerful framework for creating tractable economic models and making clear predictions.

In the context of consumption decisions, utility maximization leads to a set of demand equations. In a demand equation, the quantity demanded of each commodity depends on the price of the goods, the price of substitute and complementary goods, the consumer's income, and the individual's characteristics that affect taste. These equations can form the basis of an econometric analysis of consumer demand.

Economists have used basic economic tools, such as the utility maximization framework, to explain behaviors that at first glance may appear to be noneconomic in nature. A classic example is Becker's (1968) economic model of criminal behavior.
Example 1.1 Economic Model of Crime

In a seminal article, Nobel Prize winner Gary Becker postulated a utility maximization framework to describe an individual's participation in crime. Certain crimes have clear economic rewards, but most criminal behaviors have costs. The opportunity costs of crime prevent the criminal from participating in other activities, such as legal employment. In addition, there are costs associated with the possibility of being caught and then, if convicted, the costs associated with incarceration. From Becker's perspective, the decision to undertake illegal activity is one of resource allocation, with the benefits and costs of competing activities taken into account.

Under general assumptions, we can derive an equation describing the amount of time spent in criminal activity as a function of various factors. We might represent such a function as

$y = f(x_1, x_2, x_3, x_4, x_5, x_6, x_7)$,   (1.1)

where
y = hours spent in criminal activities,
x1 = "wage" for an hour spent in criminal activity,
x2 = hourly wage in legal employment,
x3 = income other than from crime or employment,
x4 = probability of getting caught,
x5 = probability of being convicted if caught,
x6 = expected sentence if convicted, and
x7 = age.

Other factors generally affect a person's decision to participate in crime, but the list above is representative of what might result from a formal economic analysis. As is common in economic theory, we have not been specific about the function f in (1.1). This function depends on an underlying utility function, which is rarely known. Nevertheless, we can use economic theory, or introspection, to predict the effect that each variable would have on criminal activity. This is the basis for an econometric analysis of individual criminal activity.

Formal economic modeling is sometimes the starting point for empirical analysis, but it is more common to use economic theory less formally, or even to rely entirely on intuition. You may agree that the determinants of criminal behavior appearing in equation (1.1) are reasonable based on common sense; we might arrive at such an equation directly, without starting from utility maximization. This view has some merit, although there are cases in which formal derivations provide insights that intuition can overlook.

Next is an example of an equation that we can derive through somewhat informal reasoning.

Example 1.2 Job Training and Worker Productivity

Consider the problem posed at the beginning of Section 1.1. A labor economist would like to examine the effects of job training on worker productivity. In this case, there is little need for formal economic theory. Basic economic understanding is sufficient for realizing that factors such as education, experience, and training affect worker productivity. Also, economists are well aware that workers are paid commensurate with their productivity. This simple reasoning leads to a model such as

$wage = f(educ, exper, training)$,   (1.2)

where wage is the hourly wage, educ is years of formal education, exper is years of workforce experience, and training is weeks spent in job training. Again, other factors generally affect the wage rate, but equation (1.2) captures the essence of the problem.
After we specify an economic model, we need to turn it into what we call an econometric model. Because we will deal with econometric models throughout this text, it is important to know how an econometric model relates to an economic model. Take equation (1.1) as an example. The form of the function f must be specified before we can undertake an econometric analysis. A second issue concerning (1.1) is how to deal with variables that cannot reasonably be observed. For example, consider the wage that a person can earn in criminal activity. In principle, such a quantity is well defined, but it would be difficult, if not impossible, to observe this wage for a given individual. Even variables such as the probability of being arrested cannot realistically be obtained for a given individual, but at least we can observe relevant arrest statistics and derive a variable that approximates the probability of arrest. Many other factors affect criminal behavior that we cannot even list, let alone observe, but we must somehow account for them.

The ambiguities inherent in the economic model of crime are resolved by specifying a particular econometric model:

$crime = \beta_0 + \beta_1 wage_m + \beta_2 othinc + \beta_3 freqarr + \beta_4 freqconv + \beta_5 avgsen + \beta_6 age + u$,   (1.3)

where
crime = some measure of the frequency of criminal activity,
wage_m = the wage that can be earned in legal employment,
othinc = the income from other sources (assets, inheritance, and so on),
freqarr = the frequency of arrests for prior infractions (to approximate the probability of arrest),
freqconv = the frequency of conviction, and
avgsen = the average sentence length after conviction.

The choice of these variables is determined by the economic theory, as well as data considerations. The term u contains unobserved factors, such as the wage for criminal activity, moral character, family background, and errors in measuring things like criminal activity and the probability of arrest. We could add family background variables to the model, such as number of siblings, parents' education, and so on, but we can never eliminate u entirely. In fact, dealing with this error term, or disturbance term, is perhaps the most important component of any econometric analysis.

The constants $\beta_0, \beta_1, \ldots, \beta_6$ are the parameters of the econometric model, and they describe the directions and strengths of the relationship between crime and the factors used to determine crime in the model.

A complete econometric model for Example 1.2 might be

$wage = \beta_0 + \beta_1 educ + \beta_2 exper + \beta_3 training + u$,   (1.4)

where the term u contains factors such as innate ability, quality of education, family background, and the myriad other factors that can influence a person's wage. If we are specifically concerned about the effects of job training, then $\beta_3$ is the parameter of interest.
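To see what estimating an econometric model like (1.4) looks like in practice, the following sketch simulates a data set and fits the equation by ordinary least squares. All numbers here (sample size, coefficient values, noise) are illustrative assumptions, and the Python libraries used (pandas, statsmodels) are one convenient choice among many; the estimation and testing theory itself is developed beginning in Chapter 2.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 526  # sample size chosen to echo the WAGE1 data set

# Simulate a cross section patterned on equation (1.4); the "true"
# coefficients below are made-up values for illustration only.
educ = rng.integers(8, 19, size=n)       # years of education
exper = rng.integers(0, 45, size=n)      # years of workforce experience
training = rng.integers(0, 9, size=n)    # weeks of job training
u = rng.normal(0, 2.0, size=n)           # unobserved factors
wage = 1.0 + 0.50 * educ + 0.04 * exper + 0.20 * training + u

df = pd.DataFrame({"wage": wage, "educ": educ,
                   "exper": exper, "training": training})

# Estimate the parameters beta_0, ..., beta_3 by ordinary least squares.
results = smf.ols("wage ~ educ + exper + training", data=df).fit()
print(results.summary())
```

The printed summary reports an estimate, a standard error, and a t statistic for each parameter; asking whether job training matters amounts to testing $\beta_3 = 0$, exactly the kind of hypothesis about unknown parameters discussed next.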
For the most part, econometric analysis begins by specifying an econometric model, without consideration of the details of the model's creation. We generally follow this approach, largely because careful derivation of something like the economic model of crime is time consuming and can take us into some specialized and often difficult areas of economic theory. Economic reasoning will play a role in our examples, and we will merge any underlying economic theory into the econometric model specification. In the economic model of crime example, we would start with an econometric model such as (1.3) and use economic reasoning and common sense as guides for choosing the variables. Although this approach loses some of the richness of economic analysis, it is commonly and effectively applied by careful researchers.

Once an econometric model such as (1.3) or (1.4) has been specified, various hypotheses of interest can be stated in terms of the unknown parameters. For example, in equation (1.3), we might hypothesize that wage_m, the wage that can be earned in legal employment, has no effect on criminal behavior. In the context of this particular econometric model, the hypothesis is equivalent to $\beta_1 = 0$.

An empirical analysis, by definition, requires data. After data on the relevant variables have been collected, econometric methods are used to estimate the parameters in the econometric model and to formally test hypotheses of interest. In some cases, the econometric model is used to make predictions in either the testing of a theory or the study of a policy's impact.

Because data collection is so important in empirical work, Section 1.3 will describe the kinds of data that we are likely to encounter.

1.3 The Structure of Economic Data

Economic data sets come in a variety of types. Whereas some econometric methods can be applied with little or no modification to many different kinds of data sets, the special features of some data sets must be accounted for or should be exploited. We next describe the most important data structures encountered in applied work.

1.3a Cross-Sectional Data

A cross-sectional data set consists of a sample of individuals, households, firms, cities, states, countries, or a variety of other units, taken at a given point in time. Sometimes, the data on all units do not correspond to precisely the same time period. For example, several families may be surveyed during different weeks within a year. In a pure cross-sectional analysis, we would ignore any minor timing differences in collecting the data. If a set of families was surveyed during different weeks of the same year, we would still view this as a cross-sectional data set.

An important feature of cross-sectional data is that we can often assume that they have been obtained by random sampling from the underlying population. For example, if we obtain information on wages, education, experience, and other characteristics by randomly drawing 500 people from the working population, then we have a random sample from the population of all working people. Random sampling is the sampling scheme covered in introductory statistics courses, and it simplifies the analysis of cross-sectional data. A review of random sampling is contained in Appendix C.

Sometimes, random sampling is not appropriate as an assumption for analyzing cross-sectional data. For example, suppose we are interested in studying factors that influence the accumulation of family wealth. We could survey a random sample of families, but some families might refuse to report their wealth. If, for example, wealthier families are less likely to disclose their wealth, then the resulting sample on wealth is not a random sample from the population of all families. This is an illustration of a sample selection problem, an advanced topic that we will discuss in Chapter 17.
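A small simulation makes the sample selection problem concrete. The population, the response rule, and all numbers below are hypothetical; the point is only that when the probability of reporting falls with wealth, the observed sample systematically understates the population mean.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000

# Hypothetical population of family wealth (made-up distribution).
wealth = rng.lognormal(mean=11.0, sigma=1.0, size=n)

# A true random sample is representative of the population.
random_sample = rng.choice(wealth, size=5_000, replace=False)

# Now suppose wealthier families are less likely to report their
# wealth: the response probability falls as wealth rises.
report_prob = 1.0 / (1.0 + wealth / wealth.mean())
responded = rng.random(n) < report_prob
selected_sample = wealth[responded]

print(f"population mean:      {wealth.mean():12.0f}")
print(f"random sample mean:   {random_sample.mean():12.0f}")
print(f"selected sample mean: {selected_sample.mean():12.0f}")
```

The random sample's mean comes out close to the population mean, while the self-selected sample's mean is biased downward, which is why the random sampling assumption is more than a technical convenience.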
Another violation of random sampling occurs when we sample from units that are large relative to the population, particularly geographical units. The potential problem in such cases is that the population is not large enough to reasonably assume the observations are independent draws. For example, if we want to explain new business activity across states as a function of wage rates, energy prices, corporate and property tax rates, services provided, quality of the workforce, and other state characteristics, it is unlikely that business activities in states near one another are independent. It turns out that the econometric methods that we discuss do work in such situations, but they sometimes need to be refined. For the most part, we will ignore the intricacies that arise in analyzing such situations and treat these problems in a random sampling framework, even when it is not technically correct to do so.

Cross-sectional data are widely used in economics and other social sciences. In economics, the analysis of cross-sectional data is closely aligned with the applied microeconomics fields, such as labor economics, state and local public finance, industrial organization, urban economics, demography, and health economics. Data on individuals, households, firms, and cities at a given point in time are important for testing microeconomic hypotheses and evaluating economic policies.

The cross-sectional data used for econometric analysis can be represented and stored in computers. Table 1.1 contains, in abbreviated form, a cross-sectional data set on 526 working individuals for the year 1976. (This is a subset of the data in the file WAGE1.) The variables include wage (in dollars per hour), educ (years of education), exper (years of potential labor force experience), female (an indicator for gender), and married (marital status). These last two variables are binary (zero-one) in nature and serve to indicate qualitative features of the individual (the person is female or not; the person is married or not). We will have much to say about binary variables in Chapter 7 and beyond.

Table 1.1 A Cross-Sectional Data Set on Wages and Other Individual Characteristics

obsno   wage    educ   exper   female   married
1       3.10    11     2       1        0
2       3.24    12     22      1        1
3       3.00    11     2       0        0
4       6.00    8      44      0        1
5       5.30    12     7       0        1
...     ...     ...    ...     ...      ...
525     11.56   16     5       0        1
526     3.50    14     5       1        0

The variable obsno in Table 1.1 is the observation number assigned to each person in the sample. Unlike the other variables, it is not a characteristic of the individual. All econometrics and statistics software packages assign an observation number to each data unit. Intuition should tell you that, for data such as that in Table 1.1, it does not matter which person is labeled as observation 1, which person is called observation 2, and so on. The fact that the ordering of the data does not matter for econometric analysis is a key feature of cross-sectional data sets obtained from random sampling.
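As an illustration of how a data set like Table 1.1 is stored and summarized in software, the sketch below reproduces the first five rows in a pandas DataFrame; pandas is simply one common choice, and any statistics package stores the data in essentially this row-by-column form.

```python
import pandas as pd

# First five observations from Table 1.1 (a subset of WAGE1).
wage1 = pd.DataFrame({
    "wage":    [3.10, 3.24, 3.00, 6.00, 5.30],
    "educ":    [11, 12, 11, 8, 12],
    "exper":   [2, 22, 2, 44, 7],
    "female":  [1, 1, 0, 0, 0],
    "married": [0, 1, 0, 1, 1],
})

# The row label plays the role of obsno; because the data come from
# random sampling, reordering the rows would change nothing.
print(wage1)
print(wage1.describe())  # summary statistics for each variable
```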
Different variables sometimes correspond to different time periods in cross-sectional data sets. For example, to determine the effects of government policies on long-term economic growth, economists have studied the relationship between growth in real per capita GDP over a certain period (say, 1960 to 1985) and variables determined in part by government policy in 1960 (government consumption as a percentage of GDP and adult secondary education rates). Such a data set might be represented as in Table 1.2, which constitutes part of the data set used in the study of cross-country growth rates by De Long and Summers (1991).

Table 1.2 A Data Set on Economic Growth Rates and Country Characteristics

obsno   country     gpcrgdp   govcons60   second60
1       Argentina   0.89      9           32
2       Austria     3.32      16          50
3       Belgium     2.56      13          69
4       Bolivia     1.24      18          12
...     ...         ...       ...         ...
61      Zimbabwe    2.30      17          6

The variable gpcrgdp represents average growth in real per capita GDP over the period 1960 to 1985. The fact that govcons60 (government consumption as a percentage of GDP) and second60 (percentage of adult population with a secondary education) correspond to the year 1960, while gpcrgdp is the average growth over the period from 1960 to 1985, does not lead to any special problems in treating this information as a cross-sectional data set. The observations are listed alphabetically by country, but nothing about this ordering affects any subsequent analysis.

1.3b Time Series Data

A time series data set consists of observations on a variable or several variables over time. Examples of time series data include stock prices, money supply, consumer price index, GDP, annual homicide rates, and automobile sales figures. Because past events can influence future events and lags in behavior are prevalent in the social sciences, time is an important dimension in a time series data set. Unlike the arrangement of cross-sectional data, the chronological ordering of observations in a time series conveys potentially important information.

A key feature of time series data that makes them more difficult to analyze than cross-sectional data is that economic observations can rarely, if ever, be assumed to be independent across time. Most economic and other time series are related, often strongly related, to their recent histories. For example, knowing something about the GDP from last quarter tells us quite a bit about the likely range of the GDP during this quarter, because GDP tends to remain fairly stable from one quarter to the next. Although most econometric procedures can be used with both cross-sectional and time series data, more needs to be done in specifying econometric models for time series data before standard econometric methods can be justified. In addition, modifications and embellishments to standard econometric techniques have been developed to account for and exploit the dependent nature of economic time series and to address other issues, such as the fact that some economic variables tend to display clear trends over time.

Another feature of time series data that can require special attention is the data frequency at which the data are collected. In economics, the most common frequencies are daily, weekly, monthly, quarterly, and annually.
Stock prices are recorded at daily intervals (excluding Saturday and Sunday). The money supply in the U.S. economy is reported weekly. Many macroeconomic series are tabulated monthly, including inflation and unemployment rates. Other macro series are recorded less frequently, such as every three months (every quarter). GDP is an important example of a quarterly series. Other time series, such as infant mortality rates for states in the United States, are available only on an annual basis.

Many weekly, monthly, and quarterly economic time series display a strong seasonal pattern, which can be an important factor in a time series analysis. For example, monthly data on housing starts differ across the months simply due to changing weather conditions. We will learn how to deal with seasonal time series in Chapter 10.

Table 1.3 contains a time series data set obtained from an article by Castillo-Freeman and Freeman (1992) on minimum wage effects in Puerto Rico. The earliest year in the data set is the first observation, and the most recent year available is the last observation. When econometric methods are used to analyze time series data, the data should be stored in chronological order.

Table 1.3 Minimum Wage, Unemployment, and Related Data for Puerto Rico

obsno   year   avgmin   avgcov   prunemp   prgnp
1       1950   0.20     20.1     15.4      878.7
2       1951   0.21     20.7     16.0      925.0
3       1952   0.23     22.6     14.8      1015.9
...     ...    ...      ...      ...       ...
37      1986   3.35     58.1     18.9      4281.6
38      1987   3.35     58.2     16.8      4496.7

The variable avgmin refers to the average minimum wage for the year, avgcov is the average coverage rate (the percentage of workers covered by the minimum wage law), prunemp is the unemployment rate, and prgnp is the gross national product, in millions of 1954 dollars. We will use these data later in a time series analysis of the effect of the minimum wage on employment.

1.3c Pooled Cross Sections

Some data sets have both cross-sectional and time series features. For example, suppose that two cross-sectional household surveys are taken in the United States, one in 1985 and one in 1990. In 1985, a random sample of households is surveyed for variables such as income, savings, family size, and so on. In 1990, a new random sample of households is taken using the same survey questions. To increase our sample size, we can form a pooled cross section by combining the two years.

Pooling cross sections from different years is often an effective way of analyzing the effects of a new government policy. The idea is to collect data from the years before and after a key policy change. As an example, consider the following data set on housing prices taken in 1993 and 1995, before and after a reduction in property taxes in 1994. Suppose we have data on 250 houses for 1993 and on 270 houses for 1995. One way to store such a data set is given in Table 1.4.

Table 1.4 Pooled Cross Sections: Two Years of Housing Prices

obsno   year   hprice    proptax   sqrft   bdrms   bthrms
1       1993   85500     42        1600    3       2.0
2       1993   67300     36        1440    3       2.5
3       1993   134000    38        2000    4       2.5
...     ...    ...       ...       ...     ...     ...
250     1993   243600    41        2600    4       3.0
251     1995   65000     16        1250    2       1.0
252     1995   182400    20        2200    4       2.0
253     1995   97500     15        1540    3       2.0
...     ...    ...       ...       ...     ...     ...
520     1995   57200     16        1100    2       1.5

Observations 1 through 250 correspond to the houses sold in 1993, and observations 251 through 520 correspond to the 270 houses sold in 1995. Although the order in which we store the data turns out not to be crucial, keeping track of the year for each observation is usually very important. This is why we enter year as a separate variable.
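The following sketch shows why recording the year matters when two cross sections are pooled. The handful of observations is copied from Table 1.4, and pandas is again an illustrative choice of tool.

```python
import pandas as pd

# A few house sales from each year of Table 1.4.
sales_1993 = pd.DataFrame({"hprice": [85500, 67300, 134000],
                           "proptax": [42, 36, 38]})
sales_1995 = pd.DataFrame({"hprice": [65000, 182400, 97500],
                           "proptax": [16, 20, 15]})

# Record the year before stacking, exactly as Table 1.4 does.
sales_1993["year"] = 1993
sales_1995["year"] = 1995
pooled = pd.concat([sales_1993, sales_1995], ignore_index=True)

# With year retained, we can compare the cross sections, for example
# average house prices before and after the 1994 property tax change.
print(pooled.groupby("year")["hprice"].mean())
```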
A pooled cross section is analyzed much like a standard cross section, except that we often need to account for secular differences in the variables across the time. In fact, in addition to increasing the sample size, the point of a pooled cross-sectional analysis is often to see how a key relationship has changed over time.

1.3d Panel or Longitudinal Data

A panel data (or longitudinal data) set consists of a time series for each cross-sectional member in the data set. As an example, suppose we have wage, education, and employment history for a set of individuals followed over a 10-year period. Or we might collect information, such as investment and financial data, about the same set of firms over a five-year time period. Panel data can also be collected on geographical units. For example, we can collect data for the same set of counties in the United States on immigration flows, tax rates, wage rates, government expenditures, and so on, for the years 1980, 1985, and 1990.

The key feature of panel data that distinguishes them from a pooled cross section is that the same cross-sectional units (individuals, firms, or counties in the preceding examples) are followed over a given time period. The data in Table 1.4 are not considered a panel data set because the houses sold are likely to be different in 1993 and 1995; if there are any duplicates, the number is likely to be so small as to be unimportant. In contrast, Table 1.5 contains a two-year panel data set on crime and related statistics for 150 cities in the United States.

Table 1.5 A Two-Year Panel Data Set on City Crime Statistics

obsno   city   year   murders   population   unem   police
1       1      1986   5         350000       8.7    440
2       1      1990   8         359200       7.2    471
3       2      1986   2         64300        5.4    75
4       2      1990   1         65100        5.5    75
...     ...    ...    ...       ...          ...    ...
297     149    1986   10        260700       9.6    286
298     149    1990   6         245000       9.8    334
299     150    1986   25        543000       4.3    520
300     150    1990   32        546200       5.2    493

There are several interesting features in Table 1.5. First, each city has been given a number from 1 through 150. Which city we decide to call city 1, city 2, and so on, is irrelevant. As with a pure cross section, the ordering in the cross section of a panel data set does not matter. We could use the city name in place of a number, but it is often useful to have both.

A second point is that the two years of data for city 1 fill the first two rows or observations. Observations 3 and 4 correspond to city 2, and so on. Because each of the 150 cities has two rows of data, any econometrics package will view this as 300 observations. This data set can be treated as a pooled cross section, where the same cities happen to show up in each year. But as we will see in Chapters 13 and 14, we can also use the panel structure to analyze questions that cannot be answered by simply viewing this as a pooled cross section.

In organizing the observations in Table 1.5, we place the two years of data for each city adjacent to one another, with the first year coming before the second in all cases. For just about every practical purpose, this is the preferred way for ordering panel data sets. Contrast this organization with the way the pooled cross sections are stored in Table 1.4. In short, the reason for ordering panel data as in Table 1.5 is that we will need to perform data transformations for each city across the two years.
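As a glimpse of the kind of transformation the panel ordering makes easy, the sketch below takes the first two cities of Table 1.5 and computes the change in each variable from 1986 to 1990, city by city (the layout follows Table 1.5; the tool choice is again illustrative).

```python
import pandas as pd

# First two cities from Table 1.5, with the two years of data for
# each city stored adjacent to one another.
crime = pd.DataFrame({
    "city":    [1, 1, 2, 2],
    "year":    [1986, 1990, 1986, 1990],
    "murders": [5, 8, 2, 1],
    "police":  [440, 471, 75, 75],
})

# Change from 1986 to 1990 within each city.
changes = (crime.sort_values(["city", "year"])
                .groupby("city")[["murders", "police"]]
                .diff()
                .dropna())
print(changes)
```

Differencing across years removes anything about a city that did not change between 1986 and 1990, a preview of how panel data help control for unobserved characteristics in Chapters 13 and 14.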
Because panel data require replication of the same units over time, panel data sets, especially those on individuals, households, and firms, are more difficult to obtain than pooled cross sections. Not surprisingly, observing the same units over time leads to several advantages over cross-sectional data or even pooled cross-sectional data. The benefit that we will focus on in this text is that having multiple observations on the same units allows us to control for certain unobserved characteristics of individuals, firms, and so on. As we will see, the use of more than one observation can facilitate causal inference in situations where inferring causality would be very difficult if only a single cross section were available. A second advantage of panel data is that they often allow us to study the importance of lags in behavior or the result of decision making. This information can be significant because many economic policies can be expected to have an impact only after some time has passed.

Most books at the undergraduate level do not contain a discussion of econometric methods for panel data. However, economists now recognize that some questions are difficult, if not impossible, to answer satisfactorily without panel data. As you will see, we can make considerable progress with simple panel data analysis, a method that is not much more difficult than dealing with a standard cross-sectional data set.

1.3e A Comment on Data Structures

Part 1 of this text is concerned with the analysis of cross-sectional data, because this poses the fewest conceptual and technical difficulties. At the same time, it illustrates most of the key themes of econometric analysis. We will use the methods and insights from cross-sectional analysis in the remainder of the text.

Although the econometric analysis of time series uses many of the same tools as cross-sectional analysis, it is more complicated because of the trending, highly persistent nature of many economic time series. Examples that have been traditionally used to illustrate the manner in which econometric methods can be applied to time series data are now widely believed to be flawed. It makes little sense to use such examples initially, since this practice will only reinforce poor econometric practice. Therefore, we will postpone the treatment of time series econometrics until Part 2, when the important issues concerning trends, persistence, dynamics, and seasonality will be introduced.

In Part 3, we will treat pooled cross sections and panel data explicitly. The analysis of independently pooled cross sections and simple panel data analysis are fairly straightforward extensions of pure cross-sectional analysis. Nevertheless, we will wait until Chapter 13 to deal with these topics.
dently pooled cross sections and simple panel data analysis are fairly straightforward extensions of pure crosssectional analysis Nevertheless we will wait until Chapter 13 to deal with these topics 14 Causality and the Notion of Ceteris Paribus in Econometric Analysis In most tests of economic theory and certainly for evaluating public policy the economists goal is to infer that one variable such as education has a causal effect on another variable such as worker productivity Simply finding an association between two or more variables might be suggestive but unless causality can be established it is rarely compelling The notion of ceteris paribuswhich means other relevant factors being equalplays an important role in causal analysis This idea has been implicit in some of our earlier discussion par ticularly Examples 11 and 12 but thus far we have not explicitly mentioned it You probably remember from introductory economics that most economic questions are ceteris paribus by nature For example in analyzing consumer demand we are interested in knowing the ef fect of changing the price of a good on its quantity demanded while holding all other factorssuch as income prices of other goods and individual tastesfixed If other factors are not held fixed then we cannot know the causal effect of a price change on quantity demanded Holding other factors fixed is critical for policy analysis as well In the job training example Example 12 we might be interested in the effect of another week of job training on wages with all other components being equal in particular education and experience If we succeed in holding all other relevant factors fixed and then find a link between job training and wages we can conclude that job training has a causal effect on worker productivity Although this may seem pretty simple even at this early stage it should be clear that except in very special cases it will not be possible to literally hold all else equal The key question in most empirical studies is Have enough other factors Copyright 2016 Cengage Learning All Rights Reserved May not be copied scanned or duplicated in whole or in part Due to electronic rights some third party content may be suppressed from the eBook andor eChapters Editorial review has deemed that any suppressed content does not materially affect the overall learning experience Cengage Learning reserves the right to remove additional content at any time if subsequent rights restrictions require it CHAPTER 1 The Nature of Econometrics and Economic Data 11 been held fixed to make a case for causality Rarely is an econometric study evaluated without raising this issue In most serious applications the number of factors that can affect the variable of interestsuch as criminal activity or wagesis immense and the isolation of any particular variable may seem like a hopeless effort However we will eventually see that when carefully applied econometric methods can simulate a ceteris paribus experiment At this point we cannot yet explain how econometric methods can be used to estimate ceteris paribus effects so we will consider some problems that can arise in trying to infer causality in eco nomics We do not use any equations in this discussion For each example the problem of inferring causality disappears if an appropriate experiment can be carried out Thus it is useful to describe how such an experiment might be structured and to observe that in most cases obtaining experimental data is impractical It is also helpful to think about why the available data 
We rely, for now, on your intuitive understanding of such terms as random, independence, and correlation, all of which should be familiar from an introductory probability and statistics course. (These concepts are reviewed in Appendix B.) We begin with an example that illustrates some of these important issues.

Example 1.3 Effects of Fertilizer on Crop Yield

Some early econometric studies [for example, Griliches (1957)] considered the effects of new fertilizers on crop yields. Suppose the crop under consideration is soybeans. Since fertilizer amount is only one factor affecting yields (some others include rainfall, quality of land, and presence of parasites), this issue must be posed as a ceteris paribus question. One way to determine the causal effect of fertilizer amount on soybean yield is to conduct an experiment, which might include the following steps. Choose several one-acre plots of land. Apply different amounts of fertilizer to each plot and subsequently measure the yields; this gives us a cross-sectional data set. Then, use statistical methods (to be introduced in Chapter 2) to measure the association between yields and fertilizer amounts.

As described earlier, this may not seem like a very good experiment because we have said nothing about choosing plots of land that are identical in all respects except for the amount of fertilizer. In fact, choosing plots of land with this feature is not feasible: some of the factors, such as land quality, cannot even be fully observed. How do we know the results of this experiment can be used to measure the ceteris paribus effect of fertilizer? The answer depends on the specifics of how fertilizer amounts are chosen. If the levels of fertilizer are assigned to plots independently of other plot features that affect yield (that is, if other characteristics of plots are completely ignored when deciding on fertilizer amounts), then we are in business. We will justify this statement in Chapter 2.
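This claim can be previewed with a simulation. In the sketch below, land quality is unobserved, yet because fertilizer is assigned independently of it, the simple least squares slope of yield on fertilizer comes out close to the true effect (every number in the simulation is a made-up assumption).

```python
import numpy as np

rng = np.random.default_rng(42)
n_plots = 1_000

# Land quality affects yield but is never observed by the analyst.
land_quality = rng.normal(50, 10, size=n_plots)

# Fertilizer is assigned at random, independently of land quality.
fertilizer = rng.uniform(0, 200, size=n_plots)  # pounds per acre
true_effect = 0.05                              # assumed, bushels per pound
soybean_yield = (20 + true_effect * fertilizer
                 + 0.5 * land_quality
                 + rng.normal(0, 2, size=n_plots))

# Slope of the least squares line of yield on fertilizer.
slope = np.polyfit(fertilizer, soybean_yield, 1)[0]
print(f"estimated effect {slope:.3f}, true effect {true_effect:.3f}")
```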
The next example is more representative of the difficulties that arise when inferring causality in applied economics.

Example 1.4 Measuring the Return to Education

Labor economists and policy makers have long been interested in the "return to education." Somewhat informally, the question is posed as follows: If a person is chosen from the population and given another year of education, by how much will his or her wage increase? As with the previous examples, this is a ceteris paribus question, which implies that all other factors are held fixed while another year of education is given to the person.

We can imagine a social planner designing an experiment to get at this issue, much as the agricultural researcher can design an experiment to estimate fertilizer effects. Assume, for the moment, that the social planner has the ability to assign any level of education to any person. How would this planner emulate the fertilizer experiment in Example 1.3? The planner would choose a group of people and randomly assign each person an amount of education; some people are given an eighth-grade education, some are given a high school education, some are given two years of college, and so on. Subsequently, the planner measures wages for this group of people (where we assume that each person then works in a job). The people here are like the plots in the fertilizer example, where education plays the role of fertilizer and wage rate plays the role of soybean yield. As with Example 1.3, if levels of education are assigned independently of other characteristics that affect productivity (such as experience and innate ability), then an analysis that ignores these other factors will yield useful results. Again, it will take some effort in Chapter 2 to justify this claim; for now, we state it without support.

Unlike the fertilizer-yield example, the experiment described in Example 1.4 is unfeasible. The ethical issues, not to mention the economic costs, associated with randomly determining education levels for a group of individuals are obvious. As a logistical matter, we could not give someone only an eighth-grade education if he or she already has a college degree.

Even though experimental data cannot be obtained for measuring the return to education, we can certainly collect nonexperimental data on education levels and wages for a large group by sampling randomly from the population of working people. Such data are available from a variety of surveys used in labor economics, but these data sets have a feature that makes it difficult to estimate the ceteris paribus return to education. People choose their own levels of education; therefore, education levels are probably not determined independently of all other factors affecting wage. This problem is a feature shared by most nonexperimental data sets.

One factor that affects wage is experience in the workforce. Since pursuing more education generally requires postponing entering the workforce, those with more education usually have less experience. Thus, in a nonexperimental data set on wages and education, education is likely to be negatively associated with a key variable that also affects wage. It is also believed that people with more innate ability often choose higher levels of education. Since higher ability leads to higher wages, we again have a correlation between education and a critical factor that affects wage.

The omitted factors of experience and ability in the wage example have analogs in the fertilizer example. Experience is generally easy to measure, and therefore is similar to a variable such as rainfall. Ability, on the other hand, is nebulous and difficult to quantify; it is similar to land quality in the fertilizer example. As we will see throughout this text, accounting for other observed factors, such as experience, when estimating the ceteris paribus effect of another variable, such as education, is relatively straightforward. We will also find that accounting for inherently unobservable factors, such as ability, is much more problematic. It is fair to say that many of the advances in econometric methods have tried to deal with unobserved factors in econometric models.

One final parallel can be drawn between Examples 1.3 and 1.4. Suppose that, in the fertilizer example, the fertilizer amounts were not entirely determined at random. Instead, the assistant who chose the fertilizer levels thought it would be better to put more fertilizer on the higher-quality plots of land. (Agricultural researchers should have a rough idea about which plots of land are of better quality, even though they may not be able to fully quantify the differences.) This situation is completely analogous to the level of schooling being related to unobserved ability in Example 1.4. Because better land leads to higher yields, and more fertilizer was used on the better plots, any observed relationship between yield and fertilizer might be spurious.
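Rerunning the simulation idea from Example 1.3 with the assistant's behavior built in shows the problem directly: once fertilizer amounts depend on land quality, the same least squares slope no longer recovers the true effect (again, all numbers are made-up assumptions).

```python
import numpy as np

rng = np.random.default_rng(7)
n_plots = 1_000
land_quality = rng.normal(50, 10, size=n_plots)

# The assistant deliberately fertilizes the better plots more heavily,
# so fertilizer is now correlated with unobserved land quality.
fertilizer = 2.0 * land_quality + rng.normal(0, 5, size=n_plots)

true_effect = 0.05
soybean_yield = (20 + true_effect * fertilizer
                 + 0.5 * land_quality
                 + rng.normal(0, 2, size=n_plots))

# The yield-on-fertilizer slope now absorbs the land quality effect,
# badly overstating the causal effect of fertilizer.
slope = np.polyfit(fertilizer, soybean_yield, 1)[0]
print(f"estimated effect {slope:.3f}, true effect {true_effect:.3f}")
```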
Difficulty in inferring causality can also arise when studying data at fairly high levels of aggregation, as the next example on city crime rates shows.

Example 1.5 The Effect of Law Enforcement on City Crime Levels

The issue of how best to prevent crime has been, and will probably continue to be, with us for some time. One especially important question in this regard is: Does the presence of more police officers on the street deter crime?

The ceteris paribus question is easy to state: If a city is randomly chosen and given, say, ten additional police officers, by how much would its crime rates fall? Another way to state the question is: If two cities are the same in all respects, except that city A has ten more police officers than city B, by how much would the two cities' crime rates differ?

It would be virtually impossible to find pairs of communities identical in all respects except for the size of their police force. Fortunately, econometric analysis does not require this. What we do need to know is whether the data we can collect on community crime levels and the size of the police force can be viewed as experimental. We can certainly imagine a true experiment involving a large collection of cities, where we dictate how many police officers each city will use for the upcoming year.

Although policies can be used to affect the size of police forces, we clearly cannot tell each city how many police officers it can hire. If, as is likely, a city's decision on how many police officers to hire is correlated with other city factors that affect crime, then the data must be viewed as nonexperimental. In fact, one way to view this problem is to see that a city's choice of police force size and the amount of crime are simultaneously determined. We will explicitly address such problems in Chapter 16.

The first three examples we have discussed have dealt with cross-sectional data at various levels of aggregation (for example, at the individual or city levels). The same hurdles arise when inferring causality in time series problems.

Example 1.6 The Effect of the Minimum Wage on Unemployment

An important, and perhaps contentious, policy issue concerns the effect of the minimum wage on unemployment rates for various groups of workers. Although this problem can be studied in a variety of data settings (cross-sectional, time series, or panel data), time series data are often used to look at aggregate effects. An example of a time series data set on unemployment rates and minimum wages was given in Table 1.3.

Standard supply and demand analysis implies that, as the minimum wage is increased above the market clearing wage, we slide up the demand curve for labor and total employment decreases. (Labor supply exceeds labor demand.) To quantify this effect, we can study the relationship between employment and the minimum wage over time.
In addition to some special difficulties that can arise in dealing with time series data, there are possible problems with inferring causality. The minimum wage in the United States is not determined in a vacuum. Various economic and political forces impinge on the final minimum wage for any given year. (The minimum wage, once determined, is usually in place for several years, unless it is indexed for inflation.) Thus, it is probable that the amount of the minimum wage is related to other factors that have an effect on employment levels.

We can imagine the U.S. government conducting an experiment to determine the employment effects of the minimum wage (as opposed to worrying about the welfare of low-wage workers). The minimum wage could be randomly set by the government each year, and then the employment outcomes could be tabulated. The resulting experimental time series data could then be analyzed using fairly simple econometric methods. But this scenario hardly describes how minimum wages are set.

If we can control enough other factors relating to employment, then we can still hope to estimate the ceteris paribus effect of the minimum wage on employment. In this sense, the problem is very similar to the previous cross-sectional examples.

Even when economic theories are not most naturally described in terms of causality, they often have predictions that can be tested using econometric methods. The following example demonstrates this approach.

Example 1.7 The Expectations Hypothesis

The expectations hypothesis from financial economics states that, given all information available to investors at the time of investing, the expected return on any two investments is the same. For example, consider two possible investments with a three-month investment horizon, purchased at the same time: (1) buy a three-month T-bill with a face value of $10,000, for a price below $10,000; in three months, you receive $10,000; (2) buy a six-month T-bill (at a price below $10,000) and, in three months, sell it as a three-month T-bill. Each investment requires roughly the same amount of initial capital, but there is an important difference. For the first investment, you know exactly what the return is at the time of purchase because you know the initial price of the three-month T-bill, along with its face value. This is not true for the second investment: although you know the price of a six-month T-bill when you purchase it, you do not know the price you can sell it for in three months. Therefore, there is uncertainty in this investment for someone who has a three-month investment horizon.

The actual returns on these two investments will usually be different. According to the expectations hypothesis, the expected return from the second investment, given all information at the time of investment, should equal the return from purchasing a three-month T-bill. This theory turns out to be fairly easy to test, as we will see in Chapter 11.
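A quick calculation with hypothetical prices shows the asymmetry between the two investments: the first return is known on the purchase date, while the second depends on a resale price that is not.

```python
# Hypothetical T-bill prices; each bill has a face value of $10,000.
p_3month = 9_880.00   # price of a 3-month T-bill today
p_6month = 9_760.00   # price of a 6-month T-bill today
p_resale = 9_885.00   # resale price of the 6-month bill in 3 months,
                      # unknown at the time of purchase

# Investment 1: hold the 3-month bill to maturity (return known today).
r1 = (10_000 - p_3month) / p_3month

# Investment 2: buy the 6-month bill and sell it in 3 months.
r2 = (p_resale - p_6month) / p_6month

print(f"investment 1 return: {100 * r1:.2f}%")
print(f"investment 2 return: {100 * r2:.2f}%")
```

The expectations hypothesis does not say these realized returns are equal; it says only that, given the information available at purchase, the expected return on the second strategy equals the known return on the first.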
to test economic theories, to inform government and private policy makers, and to predict economic time series. Sometimes, an econometric model is derived from a formal economic model, but in other cases, econometric models are based on informal economic reasoning and intuition. The goals of any econometric analysis are to estimate the parameters in the model and to test hypotheses about these parameters; the values and signs of the parameters determine the validity of an economic theory and the effects of certain policies.

Cross-sectional, time series, pooled cross-sectional, and panel data are the most common types of data structures that are used in applied econometrics. Data sets involving a time dimension, such as time series and panel data, require special treatment because of the correlation across time of most economic time series. Other issues, such as trends and seasonality, arise in the analysis of time series data but not cross-sectional data.

In Section 1.4, we discussed the notions of ceteris paribus and causal inference. In most cases, hypotheses in the social sciences are ceteris paribus in nature: all other relevant factors must be fixed when studying the relationship between two variables. Because of the nonexperimental nature of most data collected in the social sciences, uncovering causal relationships is very challenging.

Key Terms

Causal Effect; Ceteris Paribus; Cross-Sectional Data Set; Data Frequency; Econometric Model; Economic Model; Empirical Analysis; Experimental Data; Nonexperimental Data; Observational Data; Panel Data; Pooled Cross Section; Random Sampling; Retrospective Data; Time Series Data

Problems

1  Suppose that you are asked to conduct a study to determine whether smaller class sizes lead to improved student performance of fourth graders.
(i) If you could conduct any experiment you want, what would you do? Be specific.
(ii) More realistically, suppose you can collect observational data on several thousand fourth graders in a given state. You can obtain the size of their fourth-grade class and a standardized test score taken at the end of fourth grade. Why might you expect a negative correlation between class size and test score?
(iii) Would a negative correlation necessarily show that smaller class sizes cause better performance? Explain.

2  A justification for job training programs is that they improve worker productivity. Suppose that you are asked to evaluate whether more job training makes workers more productive. However, rather than having data on individual workers, you have access to data on manufacturing firms in Ohio. In particular, for each firm, you have information on hours of job training per worker (training) and number of nondefective items produced per worker hour (output).
(i) Carefully state the ceteris paribus thought experiment underlying this policy question.
(ii) Does it seem likely that a firm's decision to train its workers will be independent of worker characteristics? What are some of those measurable and unmeasurable worker characteristics?
(iii) Name a factor, other than worker characteristics, that can
affect worker productivity.
(iv) If you find a positive correlation between output and training, would you have convincingly established that job training makes workers more productive? Explain.

3  Suppose at your university you are asked to find the relationship between weekly hours spent studying (study) and weekly hours spent working (work). Does it make sense to characterize the problem as inferring whether study "causes" work or work "causes" study? Explain.

4  States (and provinces) that have control over taxation sometimes reduce taxes in an attempt to spur economic growth. Suppose that you are hired by a state to estimate the effect of corporate tax rates on, say, the growth in per capita gross state product (GSP).
(i) What kind of data would you need to collect to undertake a statistical analysis?
(ii) Is it feasible to do a controlled experiment? What would be required?
(iii) Is a correlation analysis between GSP growth and tax rates likely to be convincing? Explain.

Computer Exercises

C1  Use the data in WAGE1 for this exercise.
(i) Find the average education level in the sample. What are the lowest and highest years of education?
(ii) Find the average hourly wage in the sample. Does it seem high or low?
(iii) The wage data are reported in 1976 dollars. Using the Internet or a printed source, find the Consumer Price Index (CPI) for the years 1976 and 2013.
(iv) Use the CPI values from part (iii) to find the average hourly wage in 2013 dollars. Now does the average hourly wage seem reasonable?
(v) How many women are in the sample? How many men?

C2  Use the data in BWGHT to answer this question.
(i) How many women are in the sample, and how many report smoking during pregnancy?
(ii) What is the average number of cigarettes smoked per day? Is the average a good measure of the "typical" woman in this case? Explain.
(iii) Among women who smoked during pregnancy, what is the average number of cigarettes smoked per day? How does this compare with your answer from part (ii), and why?
(iv) Find the average of fatheduc in the sample. Why are only 1,192 observations used to compute this average?
(v) Report the average family income and its standard deviation in dollars.

C3  The data in MEAP01 are for the state of Michigan in the year 2001. Use these data to answer the following questions.
(i) Find the largest and smallest values of math4. Does the range make sense? Explain.
(ii) How many schools have a perfect pass rate on the math test? What percentage is this of the total sample?
(iii) How many schools have math pass rates of exactly 50?
(iv) Compare the average pass rates for the math and reading scores. Which test is harder to pass?
(v) Find the correlation between math4 and read4. What do you conclude?
(vi) The variable exppp is expenditure per pupil. Find the average of exppp along with its standard deviation. Would you say there is wide variation in per pupil spending?
(vii) Suppose School A spends $6,000 per student and School B spends $5,500 per student. By what percentage does School A's spending exceed School B's? Compare this to $100 \cdot [\log(6{,}000) - \log(5{,}500)]$, which is the approximation percentage difference based on the
difference in the natural logs. (See Section A.4 in Appendix A.)

C4  The data in JTRAIN2 come from a job training experiment conducted for low-income men during 1976-1977; see Lalonde (1986).
(i) Use the indicator variable train to determine the fraction of men receiving job training.
(ii) The variable re78 is earnings from 1978, measured in thousands of 1982 dollars. Find the averages of re78 for the sample of men receiving job training and the sample not receiving job training. Is the difference economically large?
(iii) The variable unem78 is an indicator of whether a man is unemployed or not in 1978. What fraction of the men who received job training are unemployed? What about for men who did not receive job training? Comment on the difference.
(iv) From parts (ii) and (iii), does it appear that the job training program was effective? What would make our conclusions more convincing?

C5  The data in FERTIL2 were collected on women living in the Republic of Botswana in 1988. The variable children refers to the number of living children. The variable electric is a binary indicator equal to one if the woman's home has electricity, and zero if not.
(i) Find the smallest and largest values of children in the sample. What is the average of children?
(ii) What percentage of women have electricity in the home?
(iii) Compute the average of children for those without electricity and do the same for those with electricity. Comment on what you find.
(iv) From part (iii), can you infer that having electricity causes women to have fewer children? Explain.

C6  Use the data in COUNTYMURDERS to answer this question. Use only the year 1996. The variable murders is the number of murders reported in the county. The variable execs is the number of executions that took place of people sentenced to death in the given county. Most states in the United States have the death penalty, but several do not.
(i) How many counties are there in the data set? Of these, how many have zero murders? What percentage of counties have zero executions? (Remember, use only the 1996 data.)
(ii) What is the largest number of murders? What is the largest number of executions? Why is the average number of executions so small?
(iii) Compute the correlation coefficient between murders and execs and describe what you find.
(iv) You should have computed a positive correlation in part (iii). Do you think that more executions cause more murders to occur? What might explain the positive correlation?

C7  The data set in ALCOHOL contains information on a sample of men in the United States. Two key variables are self-reported employment status and alcohol abuse (along with many other variables). The variables employ and abuse are both binary, or indicator, variables: they take on only the values zero and one.
(i) What percentage of the men in the sample report abusing alcohol? What is the employment rate?
(ii) Consider the group of men who abuse alcohol. What is the employment rate?
(iii) What is the employment rate for the group of men who do not abuse alcohol?
(iv) Discuss the difference in your answers to parts (ii) and (iii). Does this allow you to conclude that alcohol abuse causes unemployment?
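The calculations these exercises call for reduce to sample averages, extrema, and group means. Below is a minimal sketch of how exercises like C1 and C7 might be set up in Python with pandas; the CSV file names and the CPI figures used for the 1976-to-2013 conversion are illustrative assumptions, not part of the text.

```python
# A minimal sketch for Computer Exercises C1 and C7. It assumes the book's data
# sets have been exported to CSV files (e.g., "WAGE1.csv", "ALCOHOL.csv") with
# the variable names used in the text; adjust paths to your own setup.
import pandas as pd

wage1 = pd.read_csv("WAGE1.csv")        # hypothetical export of the WAGE1 data set

# C1(i)-(ii): average, minimum, and maximum education; average hourly wage
print(wage1["educ"].mean(), wage1["educ"].min(), wage1["educ"].max())
print(wage1["wage"].mean())

# C1(iv): convert 1976 dollars to 2013 dollars; the CPI values below are
# illustrative and should be verified from an official source, as the
# exercise instructs
cpi_1976, cpi_2013 = 56.9, 233.0
print(wage1["wage"].mean() * cpi_2013 / cpi_1976)

alcohol = pd.read_csv("ALCOHOL.csv")    # hypothetical export of the ALCOHOL data set

# C7(ii)-(iii): employment rates for men who do and do not abuse alcohol
print(alcohol.groupby("abuse")["employ"].mean())
```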
Part 1  Regression Analysis with Cross-Sectional Data

Part 1 of the text covers regression analysis with cross-sectional data. It builds upon a solid base of college algebra and basic concepts in probability and statistics. Appendices A, B, and C contain complete reviews of these topics.

Chapter 2 begins with the simple linear regression model, where we explain one variable in terms of another variable. Although simple regression is not widely used in applied econometrics, it is used occasionally and serves as a natural starting point because the algebra and interpretations are relatively straightforward.

Chapters 3 and 4 cover the fundamentals of multiple regression analysis, where we allow more than one variable to affect the variable we are trying to explain. Multiple regression is still the most commonly used method in empirical research, and so these chapters deserve careful attention. Chapter 3 focuses on the algebra of the method of ordinary least squares (OLS), while also establishing conditions under which the OLS estimator is unbiased and best linear unbiased. Chapter 4 covers the important topic of statistical inference.

Chapter 5 discusses the large sample, or asymptotic, properties of the OLS estimators. This provides justification of the inference procedures in Chapter 4 when the errors in a regression model are not normally distributed. Chapter 6 covers some additional topics in regression analysis, including advanced functional form issues, data scaling, prediction, and goodness-of-fit. Chapter 7 explains how qualitative information can be incorporated into multiple regression models.

Chapter 8 illustrates how to test for and correct the problem of heteroskedasticity, or nonconstant variance, in the error terms. We show how the usual OLS statistics can be adjusted, and we also present an extension of OLS, known as weighted least squares, which explicitly accounts for different variances in the errors. Chapter 9 delves further into the very important problem of correlation between the error term and one or more of the explanatory variables. We demonstrate how the availability of a proxy variable can solve the omitted variables problem. In addition, we establish the bias and inconsistency in the OLS estimators in the presence of certain kinds of measurement errors in the variables. Various data problems are also discussed, including the problem of outliers.
Chapter 2  The Simple Regression Model

The simple regression model can be used to study the relationship between two variables. For reasons we will see, the simple regression model has limitations as a general tool for empirical analysis. Nevertheless, it is sometimes appropriate as an empirical tool. Learning how to interpret the simple regression model is good practice for studying multiple regression, which we will do in subsequent chapters.

2.1 Definition of the Simple Regression Model

Much of applied econometric analysis begins with the following premise: y and x are two variables, representing some population, and we are interested in "explaining y in terms of x," or in "studying how y varies with changes in x." We discussed some examples in Chapter 1, including: y is soybean crop yield and x is amount of fertilizer; y is hourly wage and x is years of education; and y is a community crime rate and x is number of police officers.

In writing down a model that will "explain y in terms of x," we must confront three issues. First, since there is never an exact relationship between two variables, how do we allow for other factors to affect y? Second, what is the functional relationship between y and x? And third, how can we be sure we are capturing a ceteris paribus relationship between y and x (if that is a desired goal)?

We can resolve these ambiguities by writing down an equation relating y to x. A simple equation is

$y = \beta_0 + \beta_1 x + u.$  (2.1)

Equation (2.1), which is assumed to hold in the population of interest, defines the simple linear regression model. It is also called the two-variable linear regression model or bivariate linear regression model because it relates the two variables x and y. We now discuss the meaning of each of the quantities in equation (2.1). (Incidentally, the term "regression" has origins that are not especially important for most modern econometric applications, so we will not explain it here. See Stigler (1986) for an engaging history of regression analysis.)
When related by equation (2.1), the variables y and x have several different names used interchangeably, as follows: y is called the dependent variable, the explained variable, the response variable, the predicted variable, or the regressand; x is called the independent variable, the explanatory variable, the control variable, the predictor variable, or the regressor. (The term covariate is also used for x.) The terms "dependent variable" and "independent variable" are frequently used in econometrics. But be aware that the label "independent" here does not refer to the statistical notion of independence between random variables (see Appendix B).

The terms "explained" and "explanatory" variables are probably the most descriptive. "Response" and "control" are used mostly in the experimental sciences, where the variable x is under the experimenter's control. We will not use the terms "predicted variable" and "predictor," although you sometimes see these in applications that are purely about prediction and not causality. Our terminology for simple regression is summarized in Table 2.1.

Table 2.1  Terminology for Simple Regression

    y                      x
    Dependent variable     Independent variable
    Explained variable     Explanatory variable
    Response variable      Control variable
    Predicted variable     Predictor variable
    Regressand             Regressor

The variable u, called the error term or disturbance in the relationship, represents factors other than x that affect y. A simple regression analysis effectively treats all factors affecting y other than x as being unobserved. You can usefully think of u as standing for "unobserved."

Equation (2.1) also addresses the issue of the functional relationship between y and x. If the other factors in u are held fixed, so that the change in u is zero, $\Delta u = 0$, then x has a linear effect on y:

$\Delta y = \beta_1 \Delta x$ if $\Delta u = 0.$  (2.2)

Thus, the change in y is simply $\beta_1$ multiplied by the change in x. This means that $\beta_1$ is the slope parameter in the relationship between y and x, holding the other factors in u fixed; it is of primary interest in applied economics. The intercept parameter $\beta_0$, sometimes called the constant term, also has its uses, although it is rarely central to an analysis.

Example 2.1  Soybean Yield and Fertilizer

Suppose that soybean yield is determined by the model

$yield = \beta_0 + \beta_1 fertilizer + u,$  (2.3)

so that y = yield and x = fertilizer. The agricultural researcher is interested in the effect of fertilizer on yield, holding other factors fixed. This effect is given by $\beta_1$. The error term u contains factors such as land quality, rainfall, and so on. The coefficient $\beta_1$ measures the effect of fertilizer on yield, holding other factors fixed: $\Delta yield = \beta_1 \Delta fertilizer$.

Example 2.2  A Simple Wage Equation

A model relating a person's wage to observed education and other unobserved factors is

$wage = \beta_0 + \beta_1 educ + u.$  (2.4)

If wage is measured in dollars per hour and educ is years of education, then $\beta_1$ measures the change in hourly wage given another year of education, holding all other factors fixed. Some of those factors include labor force experience, innate ability, tenure with current employer, work ethic, and numerous other things.
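To make equations (2.1) and (2.2) concrete, here is a small Python sketch, with made-up parameter values, showing that when the unobservables u are held fixed, a change in x moves y by exactly $\beta_1 \Delta x$.

```python
# A small simulation of equation (2.1), y = b0 + b1*x + u, illustrating (2.2):
# if the unobserved factors u are held fixed (delta u = 0), changing x by dx
# changes y by exactly b1*dx. All numbers are made up for illustration.
b0, b1 = 1.0, 0.5
u = -0.3                      # one fixed draw of the unobservables

def y(x, u):
    return b0 + b1 * x + u    # the population relationship (2.1)

dx = 2.0
dy = y(3.0 + dx, u) - y(3.0, u)
print(dy, b1 * dx)            # both equal 1.0: delta y = b1 * delta x when u is fixed
```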
The linearity of equation (2.1) implies that a one-unit change in x has the same effect on y, regardless of the initial value of x. This is unrealistic for many economic applications. For example, in the wage-education example, we might want to allow for increasing returns: the next year of education has a larger effect on wages than did the previous year. We will see how to allow for such possibilities in Section 2.4.

The most difficult issue to address is whether model (2.1) really allows us to draw ceteris paribus conclusions about how x affects y. We just saw in equation (2.2) that $\beta_1$ does measure the effect of x on y, holding all other factors (in u) fixed. Is this the end of the causality issue? Unfortunately, no. How can we hope to learn in general about the ceteris paribus effect of x on y, holding other factors fixed, when we are ignoring all those other factors?

Section 2.5 will show that we are only able to get reliable estimators of $\beta_0$ and $\beta_1$ from a random sample of data when we make an assumption restricting how the unobservable u is related to the explanatory variable x. Without such a restriction, we will not be able to estimate the ceteris paribus effect, $\beta_1$. Because u and x are random variables, we need a concept grounded in probability.

Before we state the key assumption about how x and u are related, we can always make one assumption about u. As long as the intercept $\beta_0$ is included in the equation, nothing is lost by assuming that the average value of u in the population is zero. Mathematically,

$E(u) = 0.$  (2.5)

Assumption (2.5) says nothing about the relationship between u and x, but simply makes a statement about the distribution of the unobserved factors in the population. Using the previous examples for illustration, we can see that assumption (2.5) is not very restrictive. In Example 2.1, we lose nothing by normalizing the unobserved factors affecting soybean yield, such as land quality, to have an average of zero in the population of all cultivated plots. The same is true of the unobserved factors in Example 2.2. Without loss of generality, we can assume that things such as average ability are zero in the population of all working people. If you are not convinced, you should work through Problem 2 to see that we can always redefine the intercept in equation (2.1) to make equation (2.5) true.

We now turn to the crucial assumption regarding how u and x are related. A natural measure of the association between two random variables is the correlation coefficient. (See Appendix B for definition and properties.) If u and x are uncorrelated, then, as random variables, they are not linearly related. Assuming that u and x are uncorrelated goes a long way toward defining the sense in which u and x should be unrelated in equation (2.1). But it does not go far enough, because correlation measures only linear dependence between u and x. Correlation has a somewhat counterintuitive feature: it is possible for u to be uncorrelated with x while being correlated with functions of x, such as $x^2$. (See Section B.4 for further discussion.) This possibility is not acceptable for most regression purposes, as it causes problems for interpreting the model and for deriving statistical properties. A better assumption involves the expected value of u given x.

Because u and x are random variables, we can define the conditional distribution of u given any value of x. In particular, for any x, we can obtain the expected (or average) value of u for that slice of the population described by the value of x. The crucial assumption is that the average value of u does not depend on the value of x. We can write this assumption as

$E(u|x) = E(u).$  (2.6)

Equation (2.6) says that the average value of the unobservables is the same across all slices of the population determined by the value of x and that the common average is necessarily equal to the average of u over the entire population. When assumption (2.6) holds, we say that u is mean
independent of x. (Of course, mean independence is implied by full independence between u and x, an assumption often used in basic probability and statistics.) When we combine mean independence with assumption (2.5), we obtain the zero conditional mean assumption, $E(u|x) = 0$. It is critical to remember that equation (2.6) is the assumption with impact; assumption (2.5) essentially defines the intercept, $\beta_0$.

Let us see what equation (2.6) entails in the wage example. To simplify the discussion, assume that u is the same as innate ability. Then (2.6) requires that the average level of ability is the same, regardless of years of education. For example, if $E(abil|8)$ denotes the average ability for the group of all people with eight years of education, and $E(abil|16)$ denotes the average ability among people in the population with sixteen years of education, then (2.6) implies that these must be the same. In fact, the average ability level must be the same for all education levels. If, for example, we think that average ability increases with years of education, then (2.6) is false. (This would happen if, on average, people with more ability choose to become more educated.) As we cannot observe innate ability, we have no way of knowing whether or not average ability is the same for all education levels. But this is an issue that we must address before relying on simple regression analysis.

In the fertilizer example, if fertilizer amounts are chosen independently of other features of the plots, then (2.6) will hold: the average land quality will not depend on the amount of fertilizer. However, if more fertilizer is put on the higher-quality plots of land, then the expected value of u changes with the level of fertilizer, and (2.6) fails.

The zero conditional mean assumption gives $\beta_1$ another interpretation that is often useful. Taking the expected value of (2.1) conditional on x and using $E(u|x) = 0$ gives

$E(y|x) = \beta_0 + \beta_1 x.$  (2.8)

Equation (2.8) shows that the population regression function (PRF), $E(y|x)$, is a linear function of x. The linearity means that a one-unit increase in x changes the expected value of y by the amount $\beta_1$. For any given value of x, the distribution of y is centered about $E(y|x)$, as illustrated in Figure 2.1.

It is important to understand that equation (2.8) tells us how the average value of y changes with x; it does not say that y equals $\beta_0 + \beta_1 x$ for all units in the population. For example, suppose that x is the high school grade point average and y is the college GPA, and we happen to know that $E(colGPA|hsGPA) = 1.5 + 0.5\,hsGPA$. (Of course, in practice, we never know the population intercept and slope, but it is useful to pretend momentarily that we do, to understand the nature of equation (2.8).) This GPA equation tells us the average college GPA among all students who have a given high school GPA. So suppose that hsGPA = 3.6. Then the average colGPA for all high school graduates who attend college with hsGPA = 3.6 is 1.5 + 0.5(3.6) = 3.3. We are certainly not saying that every student with hsGPA = 3.6 will have a 3.3 college GPA; this is clearly false. The PRF gives us a relationship between the average level of y at different levels of x. Some students with hsGPA = 3.6 will have a college GPA higher than 3.3, and some will have a lower college GPA. Whether the actual colGPA is above or below 3.3 depends on the unobserved factors in u, and those differ among students even within the slice of the population with hsGPA = 3.6.
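The PRF interpretation can be checked by simulation. The sketch below uses hypothetical numbers, with u generated so that the zero conditional mean assumption holds by construction; it reproduces the GPA example, in which the average colGPA within each slice of hsGPA is close to $1.5 + 0.5\,hsGPA$.

```python
# A sketch illustrating the population regression function (2.8) with the
# (hypothetical) relationship E(colGPA | hsGPA) = 1.5 + 0.5*hsGPA from the text.
# We simulate a large "population" in which E(u | hsGPA) = 0 and check that the
# average colGPA within each slice of hsGPA lines up with the PRF.
import numpy as np

rng = np.random.default_rng(42)
hs_gpa = rng.choice([2.8, 3.2, 3.6, 4.0], size=1_000_000)
u = rng.normal(0.0, 0.3, size=hs_gpa.size)      # zero conditional mean by construction
col_gpa = 1.5 + 0.5 * hs_gpa + u

for g in (2.8, 3.2, 3.6, 4.0):
    avg = col_gpa[hs_gpa == g].mean()
    print(g, round(avg, 3), 1.5 + 0.5 * g)      # conditional average is close to the PRF
```

Individual simulated students with hsGPA = 3.6 have colGPA values scattered above and below 3.3; only their average matches the PRF, which is exactly the point of equation (2.8).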
Exploring Further 2.1

Suppose that a score on a final exam, score, depends on classes attended (attend) and unobserved factors that affect exam performance (such as student ability):

$score = \beta_0 + \beta_1 attend + u.$  (2.7)

When would you expect this model to satisfy equation (2.6)?

[Figure 2.1: $E(y|x)$ as a linear function of x.]

Given the zero conditional mean assumption $E(u|x) = 0$, it is useful to view equation (2.1) as breaking y into two components. The piece $\beta_0 + \beta_1 x$, which represents $E(y|x)$, is called the systematic part of y (that is, the part of y explained by x), and u is called the unsystematic part, or the part of y not explained by x. In Chapter 3, when we introduce more than one explanatory variable, we will discuss how to determine how large the systematic part is relative to the unsystematic part.

In the next section, we will use assumptions (2.5) and (2.6) to motivate estimators of $\beta_0$ and $\beta_1$ given a random sample of data. The zero conditional mean assumption also plays a crucial role in the statistical analysis in Section 2.5.

2.2 Deriving the Ordinary Least Squares Estimates

Now that we have discussed the basic ingredients of the simple regression model, we will address the important issue of how to estimate the parameters $\beta_0$ and $\beta_1$ in equation (2.1). To do this, we need a sample from the population. Let $\{(x_i, y_i): i = 1, \ldots, n\}$ denote a random sample of size n from the population. Because these data come from (2.1), we can write

$y_i = \beta_0 + \beta_1 x_i + u_i$  (2.9)

for each i. Here, $u_i$ is the error term for observation i because it contains all factors affecting $y_i$ other than $x_i$.

As an example, $x_i$ might be the annual income and $y_i$ the annual savings for family i during a particular year. If we have collected data on 15 families, then n = 15. A scatterplot of such a data set is given in Figure 2.2, along with the (necessarily fictitious) population regression function. We must decide how to use these data to obtain estimates of the intercept and slope in the population regression of savings on income.

There are several ways to motivate the following estimation procedure. We will use equation (2.5) and an important implication of assumption (2.6): in the population, u is uncorrelated with x. Therefore, we see that u has zero expected value and that the covariance between x and u is zero:

$E(u) = 0$  (2.10)

and

$\mathrm{Cov}(x, u) = E(xu) = 0,$  (2.11)

where the first equality in (2.11) follows from (2.10). (See Section B.4 for the definition and properties of covariance.) In terms of the observable variables x and y and the unknown parameters $\beta_0$ and $\beta_1$, equations (2.10) and (2.11) can be written as

$E(y - \beta_0 - \beta_1 x) = 0$  (2.12)

and

$E[x(y - \beta_0 - \beta_1 x)] = 0,$  (2.13)

respectively. Equations (2.12) and (2.13) imply two restrictions on the joint probability distribution of (x, y) in the population.
Since there are two unknown parameters to estimate, we might hope that equations (2.12) and (2.13) can be used to obtain good estimators of $\beta_0$ and $\beta_1$. In fact, they can be. Given a sample of data, we choose estimates $\hat\beta_0$ and $\hat\beta_1$ to solve the sample counterparts of equations (2.12) and (2.13):

$n^{-1} \sum_{i=1}^{n} (y_i - \hat\beta_0 - \hat\beta_1 x_i) = 0$  (2.14)

and

$n^{-1} \sum_{i=1}^{n} x_i (y_i - \hat\beta_0 - \hat\beta_1 x_i) = 0.$  (2.15)

This is an example of the method of moments approach to estimation. (See Section C.4 for a discussion of different estimation approaches.) These equations can be solved for $\hat\beta_0$ and $\hat\beta_1$.

[Figure 2.2: Scatterplot of savings and income for 15 families, and the population regression $E(savings|income) = \beta_0 + \beta_1 income$.]

Using the basic properties of the summation operator from Appendix A, equation (2.14) can be rewritten as

$\bar{y} = \hat\beta_0 + \hat\beta_1 \bar{x},$  (2.16)

where $\bar{y} = n^{-1} \sum_{i=1}^{n} y_i$ is the sample average of the $y_i$, and likewise for $\bar{x}$. This equation allows us to write $\hat\beta_0$ in terms of $\hat\beta_1$, $\bar{y}$, and $\bar{x}$:

$\hat\beta_0 = \bar{y} - \hat\beta_1 \bar{x}.$  (2.17)

Therefore, once we have the slope estimate $\hat\beta_1$, it is straightforward to obtain the intercept estimate $\hat\beta_0$, given $\bar{y}$ and $\bar{x}$.

Dropping the $n^{-1}$ in (2.15) (since it does not affect the solution) and plugging (2.17) into (2.15) yields

$\sum_{i=1}^{n} x_i [y_i - (\bar{y} - \hat\beta_1 \bar{x}) - \hat\beta_1 x_i] = 0,$

which, upon rearrangement, gives

$\sum_{i=1}^{n} x_i (y_i - \bar{y}) = \hat\beta_1 \sum_{i=1}^{n} x_i (x_i - \bar{x}).$

From basic properties of the summation operator (see (A.7) and (A.8)),

$\sum_{i=1}^{n} x_i (x_i - \bar{x}) = \sum_{i=1}^{n} (x_i - \bar{x})^2$ and $\sum_{i=1}^{n} x_i (y_i - \bar{y}) = \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}).$

Therefore, provided that

$\sum_{i=1}^{n} (x_i - \bar{x})^2 > 0,$  (2.18)

the estimated slope is

$\hat\beta_1 = \dfrac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}.$  (2.19)

Equation (2.19) is simply the sample covariance between $x_i$ and $y_i$ divided by the sample variance of $x_i$. Using simple algebra, we can also write $\hat\beta_1$ as

$\hat\beta_1 = \hat\rho_{xy} \left( \dfrac{\hat\sigma_y}{\hat\sigma_x} \right),$

where $\hat\rho_{xy}$ is the sample correlation between $x_i$ and $y_i$, and $\hat\sigma_x$, $\hat\sigma_y$ denote the sample standard deviations. (See Appendix C for definitions of correlation and standard deviation. Dividing all sums by $n - 1$ does not affect the formulas.) An immediate implication is that if $x_i$ and $y_i$ are positively correlated in the sample, then $\hat\beta_1 > 0$; if $x_i$ and $y_i$ are negatively correlated, then $\hat\beta_1 < 0$.

Not surprisingly, the formula for $\hat\beta_1$ in terms of the sample correlation and sample standard deviations is the sample analog of the population relationship

$\beta_1 = \rho_{xy} \left( \dfrac{\sigma_y}{\sigma_x} \right),$

where all quantities are defined for the entire population. Recognition that $\beta_1$ is just a scaled version of $\rho_{xy}$ highlights an important limitation of simple regression when we do not have experimental data: in effect, simple regression is an analysis of correlation between two variables, and so one must be careful in inferring causality.

Although the method for obtaining (2.17) and (2.19) is motivated by (2.6), the only assumption needed to compute the estimates for a particular sample is (2.18). This is hardly an assumption at all: (2.18) is true provided the $x_i$ in the sample are not all equal to the same value. If (2.18) fails, then we have either been unlucky in obtaining our sample from the population or we have not specified an interesting problem (x does not vary in the population).
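Equations (2.17) and (2.19) are easy to compute directly. A minimal Python implementation on simulated data, together with the correlation form of $\hat\beta_1$, might look as follows; the data are made up, and any econometrics package would produce the same estimates.

```python
# A minimal implementation of the OLS formulas (2.17) and (2.19) on made-up data.
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=200)
y = 2.0 + 0.7 * x + rng.normal(0, 1, size=200)   # simulated data; true b0=2, b1=0.7

xbar, ybar = x.mean(), y.mean()
b1_hat = np.sum((x - xbar) * (y - ybar)) / np.sum((x - xbar) ** 2)   # equation (2.19)
b0_hat = ybar - b1_hat * xbar                                        # equation (2.17)
print(b0_hat, b1_hat)

# The same slope via the sample correlation and standard deviations:
r_xy = np.corrcoef(x, y)[0, 1]
print(r_xy * y.std(ddof=1) / x.std(ddof=1))      # equals b1_hat

# Cross-check against a library routine (np.polyfit returns [slope, intercept]):
print(np.polyfit(x, y, 1))
```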
For example, if y = wage and x = educ, then (2.18) fails only if everyone in the sample has the same amount of education (for example, if everyone is a high school graduate; see Figure 2.3). If just one person has a different amount of education, then (2.18) holds, and the estimates can be computed.

[Figure 2.3: A scatterplot of wage against education when $educ_i = 12$ for all i.]

The estimates given in (2.17) and (2.19) are called the ordinary least squares (OLS) estimates of $\beta_0$ and $\beta_1$. To justify this name, for any $\hat\beta_0$ and $\hat\beta_1$ define a fitted value for y when $x = x_i$ as

$\hat{y}_i = \hat\beta_0 + \hat\beta_1 x_i.$  (2.20)

This is the value we predict for y when $x = x_i$ for the given intercept and slope. There is a fitted value for each observation in the sample. The residual for observation i is the difference between the actual $y_i$ and its fitted value:

$\hat{u}_i = y_i - \hat{y}_i = y_i - \hat\beta_0 - \hat\beta_1 x_i.$  (2.21)

Again, there are n such residuals. (These are not the same as the errors in (2.9), a point we return to in Section 2.5.) The fitted values and residuals are indicated in Figure 2.4.

Now, suppose we choose $\hat\beta_0$ and $\hat\beta_1$ to make the sum of squared residuals,

$\sum_{i=1}^{n} \hat{u}_i^2 = \sum_{i=1}^{n} (y_i - \hat\beta_0 - \hat\beta_1 x_i)^2,$  (2.22)

as small as possible. The appendix to this chapter shows that the conditions necessary for $(\hat\beta_0, \hat\beta_1)$ to minimize (2.22) are given exactly by equations (2.14) and (2.15), without $n^{-1}$. Equations (2.14) and (2.15) are often called the first order conditions for the OLS estimates, a term that comes from optimization using calculus (see Appendix A). From our previous calculations, we know that the solutions to the OLS first order conditions are given by (2.17) and (2.19). The name "ordinary least squares" comes from the fact that these estimates minimize the sum of squared residuals.
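One way to see that the first order conditions really do locate the minimum of (2.22) is to compare the closed-form estimates with a generic numerical minimizer. The following sketch, on simulated data, assumes SciPy is available; it is a check on the algebra, not how OLS is computed in practice.

```python
# Verify, on simulated data, that the closed-form OLS estimates from (2.17) and
# (2.19) minimize the sum of squared residuals (2.22), by comparing them with a
# generic numerical minimizer started from an arbitrary point.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
x = rng.uniform(0, 5, size=100)
y = 1.0 + 2.0 * x + rng.normal(0, 0.5, size=100)

def ssr(params):
    b0, b1 = params
    return np.sum((y - b0 - b1 * x) ** 2)        # equation (2.22)

res = minimize(ssr, x0=[0.0, 0.0])               # numerical minimization of the SSR
b1_ols = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0_ols = y.mean() - b1_ols * x.mean()
print(res.x, (b0_ols, b1_ols))                   # the two answers agree
```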
When we view ordinary least squares as minimizing the sum of squared residuals, it is natural to ask: Why not minimize some other function of the residuals, such as the absolute values of the residuals? In fact, as we will discuss in the more advanced Section 9.4, minimizing the sum of the absolute values of the residuals is sometimes very useful. But it does have some drawbacks. First, we cannot obtain formulas for the resulting estimators; given a data set, the estimates must be obtained by numerical optimization routines. As a consequence, the statistical theory for estimators that minimize the sum of the absolute residuals is very complicated. Minimizing other functions of the residuals, say, the sum of the residuals each raised to the fourth power, has similar drawbacks. (We would never choose our estimates to minimize, say, the sum of the residuals themselves, as residuals large in magnitude but with opposite signs would tend to cancel out.) With OLS, we will be able to derive unbiasedness, consistency, and other important statistical properties relatively easily. Plus, as the motivation in equations (2.12) and (2.13) suggests, and as we will see in Section 2.5, OLS is suited for estimating the parameters appearing in the conditional mean function (2.8).

Once we have determined the OLS intercept and slope estimates, we form the OLS regression line:

$\hat{y} = \hat\beta_0 + \hat\beta_1 x,$  (2.23)

where it is understood that $\hat\beta_0$ and $\hat\beta_1$ have been obtained using equations (2.17) and (2.19). The notation $\hat{y}$, read as "y hat," emphasizes that the predicted values from equation (2.23) are estimates. The intercept, $\hat\beta_0$, is the predicted value of y when x = 0, although in some cases it will not make sense to set x = 0. In those situations, $\hat\beta_0$ is not, in itself, very interesting. When using (2.23) to compute predicted values of y for various values of x, we must account for the intercept in the calculations. Equation (2.23) is also called the sample regression function (SRF) because it is the estimated version of the population regression function $E(y|x) = \beta_0 + \beta_1 x$. It is important to remember that the PRF is something fixed, but unknown, in the population. Because the SRF is obtained for a given sample of data, a new sample will generate a different slope and intercept in equation (2.23).

[Figure 2.4: Fitted values and residuals. For each observation, $\hat{y}_i$ is the fitted value on the line $\hat{y} = \hat\beta_0 + \hat\beta_1 x$, and $\hat{u}_i$ is the residual.]

In most cases, the slope estimate, which we can write as

$\hat\beta_1 = \Delta\hat{y}/\Delta x,$  (2.24)

is of primary interest. It tells us the amount by which $\hat{y}$ changes when x increases by one unit. Equivalently,

$\Delta\hat{y} = \hat\beta_1 \Delta x,$  (2.25)

so that given any change in x (whether positive or negative), we can compute the predicted change in y.

We now present several examples of simple regression obtained by using real data. In other words, we find the intercept and slope estimates with equations (2.17) and (2.19). Since these examples involve many observations, the calculations were done using an econometrics software package. At this point, you should be careful not to read too much into these regressions; they are not necessarily uncovering a causal relationship. We have said nothing so far about the statistical properties of OLS. In Section 2.5, we consider statistical properties after we explicitly impose assumptions on the population model equation (2.1).
Example 2.3  CEO Salary and Return on Equity

For the population of chief executive officers, let y be annual salary (salary) in thousands of dollars. Thus, y = 856.3 indicates an annual salary of $856,300, and y = 1,452.6 indicates a salary of $1,452,600. Let x be the average return on equity (roe) for the CEO's firm for the previous three years. (Return on equity is defined in terms of net income as a percentage of common equity.) For example, if roe = 10, then average return on equity is 10%.

To study the relationship between this measure of firm performance and CEO compensation, we postulate the simple model

$salary = \beta_0 + \beta_1 roe + u.$

The slope parameter $\beta_1$ measures the change in annual salary, in thousands of dollars, when return on equity increases by one percentage point. Because a higher roe is good for the company, we think $\beta_1 > 0$.

The data set CEOSAL1 contains information on 209 CEOs for the year 1990; these data were obtained from Business Week (5/6/91). In this sample, the average annual salary is $1,281,120, with the smallest and largest being $223,000 and $14,822,000, respectively. The average return on equity for the years 1988, 1989, and 1990 is 17.18%, with the smallest and largest values being 0.5% and 56.3%, respectively.

Using the data in CEOSAL1, the OLS regression line relating salary to roe is

$\widehat{salary} = 963.191 + 18.501\,roe$  (2.26)
n = 209,

where the intercept and slope estimates have been rounded to three decimal places; we use "salary hat" to indicate that this is an estimated equation. How do we interpret the equation? First, if the return on equity is zero, roe = 0, then the predicted salary is the intercept, 963.191, which equals $963,191 since salary is measured in thousands. Next, we can write the predicted change in salary as a function of the change in roe: $\Delta\widehat{salary} = 18.501\,(\Delta roe)$. This means that if the return on equity increases by one percentage point, $\Delta roe = 1$, then salary is predicted to change by about 18.5, or $18,500. Because (2.26) is a linear equation, this is the estimated change regardless of the initial salary.

We can easily use (2.26) to compare predicted salaries at different values of roe. Suppose roe = 30. Then $\widehat{salary} = 963.191 + 18.501(30) = 1{,}518.221$, which is just over $1.5 million. However, this does not mean that a particular CEO whose firm had a roe = 30 earns $1,518,221. Many other factors affect salary. This is just our prediction from the OLS regression line (2.26). The estimated line is graphed in Figure 2.5, along with the population regression function $E(salary|roe)$. We will never know the PRF, so we cannot tell how close the SRF is to the PRF. Another sample of data will give a different regression line, which may or may not be closer to the population regression line.

[Figure 2.5: The OLS regression line $\widehat{salary} = 963.191 + 18.501\,roe$ and the (unknown) population regression function.]
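The predictions discussed in Example 2.3 are just evaluations of the estimated line (2.26). A tiny sketch using the rounded coefficients reported above:

```python
# Forming predictions from the estimated line (2.26). The coefficients are the
# rounded estimates reported in the chapter; a real application would compute
# them from the CEOSAL1 data with an econometrics package.
def salary_hat(roe):
    return 963.191 + 18.501 * roe   # fitted salary, in thousands of dollars

print(salary_hat(0))    # 963.191  -> $963,191 predicted at roe = 0
print(salary_hat(30))   # 1518.221 -> about $1.52 million predicted at roe = 30
```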
Example 2.4  Wage and Education

For the population of people in the workforce in 1976, let y = wage, where wage is measured in dollars per hour. Thus, for a particular person, if wage = 6.75, the hourly wage is $6.75. Let x = educ denote years of schooling; for example, educ = 12 corresponds to a complete high school education. Since the average wage in the sample is $5.90, the Consumer Price Index indicates that this amount is equivalent to $19.06 in 2003 dollars.

Using the data in WAGE1 where n = 526 individuals, we obtain the following OLS regression line (or sample regression function):

$\widehat{wage} = -0.90 + 0.54\,educ$  (2.27)
n = 526.

We must interpret this equation with caution. The intercept of -0.90 literally means that a person with no education has a predicted hourly wage of -90 cents an hour. This, of course, is silly. It turns out that only 18 people in the sample of 526 have less than eight years of education. Consequently, it is not surprising that the regression line does poorly at very low levels of education.

For a person with eight years of education, the predicted wage is $\widehat{wage} = -0.90 + 0.54(8) = 3.42$, or $3.42 per hour (in 1976 dollars). The slope estimate in (2.27) implies that one more year of education increases hourly wage by 54 cents an hour. Therefore, four more years of education increase the predicted wage by 4(0.54) = 2.16, or $2.16 per hour. These are fairly large effects. Because of the linear nature of (2.27), another year of education increases the wage by the same amount, regardless of the initial level of education. In Section 2.4, we discuss some methods that allow for nonconstant marginal effects of our explanatory variables.

Exploring Further 2.2

The estimated wage from (2.27), when educ = 8, is $3.42 in 1976 dollars. What is this value in 2003 dollars? (Hint: You have enough information in Example 2.4 to answer this question.)

Example 2.5  Voting Outcomes and Campaign Expenditures

The file VOTE1 contains data on election outcomes and campaign expenditures for 173 two-party races for the U.S. House of Representatives in 1988. There are two candidates in each race, A and B. Let voteA be the percentage of the vote received by Candidate A and shareA be the percentage of total campaign expenditures accounted for by Candidate A. Many factors other than shareA affect the election outcome (including the quality of the candidates and possibly the dollar amounts spent by A and B). Nevertheless, we can estimate a simple regression model to find out whether spending more relative to one's challenger implies a higher percentage of the vote.

The estimated equation using the 173 observations is

$\widehat{voteA} = 26.81 + 0.464\,shareA$  (2.28)
n = 173.

This means that if Candidate A's share of spending increases by one percentage point, Candidate A receives almost one-half a percentage point (0.464) more of the total vote. Whether or not this is a causal effect is unclear, but it is not unbelievable. If shareA = 50, voteA is predicted to be about 50, or half the vote.

Exploring Further 2.3

In Example 2.5, what is the predicted vote for Candidate A if shareA = 60 (which means 60%)? Does this answer seem reasonable?

In some cases, regression analysis is not used to determine causality but to simply look at whether two variables are positively or negatively related, much like a standard correlation analysis. An example of this occurs in Computer Exercise C3, where you are asked to use data from Biddle and Hamermesh (1990) on time spent sleeping and working to investigate the tradeoff between these two factors.

2.2a A Note on Terminology

In most cases, we will indicate the estimation of a relationship through OLS by writing an equation such as (2.26), (2.27), or (2.28). Sometimes, for the sake of brevity, it is useful to indicate that an OLS regression has been run without actually writing out the equation. We will often indicate that equation (2.23) has been obtained by OLS in saying that we run the regression of

y on x,  (2.29)

or simply that we regress y on x. The positions of y and x in (2.29) indicate which is the dependent variable and which is the independent variable: we always regress the dependent variable on the independent variable. For specific applications, we replace y and x with their names.
Thus, to obtain (2.26), we regress salary on roe, or to obtain (2.28), we regress voteA on shareA. When we use such terminology in (2.29), we will always mean that we plan to estimate the intercept, $\hat\beta_0$, along with the slope, $\hat\beta_1$. This case is appropriate for the vast majority of applications. Occasionally, we may want to estimate the relationship between y and x assuming that the intercept is zero (so that x = 0 implies that $\hat{y} = 0$); we cover this case briefly in Section 2.6. Unless explicitly stated otherwise, we always estimate an intercept along with a slope.

2.3 Properties of OLS on Any Sample of Data

In the previous section, we went through the algebra of deriving the formulas for the OLS intercept and slope estimates. In this section, we cover some further algebraic properties of the fitted OLS regression line. The best way to think about these properties is to remember that they hold, by construction, for any sample of data. The harder task, considering the properties of OLS across all possible random samples of data, is postponed until Section 2.5.

Several of the algebraic properties we are going to derive will appear mundane. Nevertheless, having a grasp of these properties helps us to figure out what happens to the OLS estimates and related statistics when the data are manipulated in certain ways, such as when the measurement units of the dependent and independent variables change.

2.3a Fitted Values and Residuals

We assume that the intercept and slope estimates, $\hat\beta_0$ and $\hat\beta_1$, have been obtained for the given sample of data. Given $\hat\beta_0$ and $\hat\beta_1$, we can obtain the fitted value $\hat{y}_i$ for each observation. This is given by equation (2.20). By definition, each fitted value of $\hat{y}_i$ is on the OLS regression line. The OLS residual associated with observation i, $\hat{u}_i$, is the difference between $y_i$ and its fitted value, as given in equation (2.21). If $\hat{u}_i$ is positive, the line underpredicts $y_i$; if $\hat{u}_i$ is negative, the line overpredicts $y_i$. The ideal case for observation i is when $\hat{u}_i = 0$, but in most cases, every residual is not equal to zero. In other words, none of the data points must actually lie on the OLS line.

Example 2.6  CEO Salary and Return on Equity

Table 2.2 contains a listing of the first 15 observations in the CEO data set, along with the fitted values (called salaryhat) and the residuals (called uhat). The first four CEOs have lower salaries than what we predicted from the OLS regression line (2.26); in other words, given only the firm's roe, these CEOs make less than what we predicted. As can be seen from the positive uhat, the fifth CEO makes more than predicted from the OLS regression line.

Table 2.2  Fitted Values and Residuals for the First 15 CEOs

obsno   roe     salary    salaryhat    uhat
 1      14.1    1,095     1,224.058    -129.0581
 2      10.9    1,001     1,164.854    -163.8542
 3      23.5    1,122     1,397.969    -275.9692
 4       5.9      578     1,072.348    -494.3484
 5      13.8    1,368     1,218.508     149.4923
 6      20.0    1,145     1,333.215    -188.2151
 7      16.4    1,078     1,266.611    -188.6108
 8      16.3    1,094     1,264.761    -170.7606
 9      10.5    1,237     1,157.454      79.54626
10      26.3      833     1,449.773    -616.7726
11      25.9      567     1,442.372    -875.3721
12      26.8      933     1,459.023    -526.0231
13      14.8    1,339     1,237.009     101.9911
14      22.3      937     1,375.768    -438.7678
15      56.3    2,011     2,004.808       6.191895

2.3b Algebraic Properties of OLS Statistics

There are several useful algebraic properties of OLS estimates and their associated statistics. We now cover the three most important of these.

1. The sum, and therefore the sample average, of the OLS residuals is zero. Mathematically,

$\sum_{i=1}^{n} \hat{u}_i = 0.$  (2.30)

This property needs no proof; it follows immediately from the OLS first order condition (2.14), when we remember that the residuals are defined by $\hat{u}_i = y_i - \hat\beta_0 - \hat\beta_1 x_i$. In other words, the OLS estimates $\hat\beta_0$ and $\hat\beta_1$ are chosen to make the residuals add up to zero (for any data set). This says nothing about the residual for any particular observation i.
2. The sample covariance between the regressors and the OLS residuals is zero. This follows from the first order condition (2.15), which can be written in terms of the residuals as

$\sum_{i=1}^{n} x_i \hat{u}_i = 0.$  (2.31)

The sample average of the OLS residuals is zero, so the left-hand side of (2.31) is proportional to the sample covariance between $x_i$ and $\hat{u}_i$.

3. The point $(\bar{x}, \bar{y})$ is always on the OLS regression line. In other words, if we take equation (2.23) and plug in $\bar{x}$ for x, then the predicted value is $\bar{y}$. This is exactly what equation (2.16) showed us.

Example 2.7  Wage and Education

For the data in WAGE1, the average hourly wage in the sample is $5.90, rounded to two decimal places, and the average education is 12.56. If we plug educ = 12.56 into the OLS regression line (2.27), we get $\widehat{wage} = -0.90 + 0.54(12.56) = 5.8824$, which equals 5.9 when rounded to the first decimal place. These figures do not exactly agree because we have rounded the average wage and education, as well as the intercept and slope estimates. If we did not initially round any of the values, we would get the answers to agree more closely, but to little useful effect.

Writing each $y_i$ as its fitted value, plus its residual, provides another way to interpret an OLS regression. For each i, write

$y_i = \hat{y}_i + \hat{u}_i.$  (2.32)

From property (1), the average of the residuals is zero; equivalently, the sample average of the fitted values, $\hat{y}_i$, is the same as the sample average of the $y_i$, or $\bar{\hat{y}} = \bar{y}$. Further, properties (1) and (2) can be used to show that the sample covariance between $\hat{y}_i$ and $\hat{u}_i$ is zero. Thus, we can view OLS as decomposing each $y_i$ into two parts, a fitted value and a residual. The fitted values and residuals are uncorrelated in the sample.
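Because the three properties hold by construction, they can be verified numerically for any data set. A short sketch on simulated data (made-up numbers; the sums are zero only up to floating-point error):

```python
# Numerical check of the algebraic properties of OLS: residuals sum to zero,
# the sample covariance between x and the residuals is zero, and the point
# (xbar, ybar) lies on the OLS regression line.
import numpy as np

rng = np.random.default_rng(7)
x = rng.normal(10, 2, size=500)
y = 3.0 - 1.5 * x + rng.normal(0, 1, size=500)

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
resid = y - b0 - b1 * x

print(resid.sum())                   # property 1: essentially zero
print(np.sum(x * resid))             # property 2: essentially zero
print(b0 + b1 * x.mean(), y.mean())  # property 3: the line passes through (xbar, ybar)
```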
Define the total sum of squares (SST), the explained sum of squares (SSE), and the residual sum of squares (SSR), also known as the sum of squared residuals, as follows:

$\mathrm{SST} \equiv \sum_{i=1}^{n} (y_i - \bar{y})^2.$  (2.33)

$\mathrm{SSE} \equiv \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2.$  (2.34)

$\mathrm{SSR} \equiv \sum_{i=1}^{n} \hat{u}_i^2.$  (2.35)

SST is a measure of the total sample variation in the $y_i$; that is, it measures how spread out the $y_i$ are in the sample. If we divide SST by $n - 1$, we obtain the sample variance of y, as discussed in Appendix C. Similarly, SSE measures the sample variation in the $\hat{y}_i$ (where we use the fact that $\bar{\hat{y}} = \bar{y}$), and SSR measures the sample variation in the $\hat{u}_i$. The total variation in y can always be expressed as the sum of the explained variation and the unexplained variation SSR. Thus,

$\mathrm{SST} = \mathrm{SSE} + \mathrm{SSR}.$  (2.36)

Proving (2.36) is not difficult, but it requires us to use all of the properties of the summation operator covered in Appendix A. Write

$\sum_{i=1}^{n} (y_i - \bar{y})^2 = \sum_{i=1}^{n} [(y_i - \hat{y}_i) + (\hat{y}_i - \bar{y})]^2$
$= \sum_{i=1}^{n} [\hat{u}_i + (\hat{y}_i - \bar{y})]^2$
$= \sum_{i=1}^{n} \hat{u}_i^2 + 2 \sum_{i=1}^{n} \hat{u}_i (\hat{y}_i - \bar{y}) + \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2$
$= \mathrm{SSR} + 2 \sum_{i=1}^{n} \hat{u}_i (\hat{y}_i - \bar{y}) + \mathrm{SSE}.$

Now, (2.36) holds if we show that

$\sum_{i=1}^{n} \hat{u}_i (\hat{y}_i - \bar{y}) = 0.$  (2.37)

But we have already claimed that the sample covariance between the residuals and the fitted values is zero, and this covariance is just (2.37) divided by $n - 1$. Thus, we have established (2.36).

Some words of caution about SST, SSE, and SSR are in order. There is no uniform agreement on the names or abbreviations for the three quantities defined in equations (2.33), (2.34), and (2.35). The total sum of squares is called either SST or TSS, so there is little confusion here. Unfortunately, the explained sum of squares is sometimes called the "regression sum of squares." If this term is given its natural abbreviation, it can easily be confused with the term "residual sum of squares." Some regression packages refer to the explained sum of squares as the "model sum of squares." To make matters even worse, the residual sum of squares is often called the "error sum of squares." This is especially unfortunate because, as we will see in Section 2.5, the errors and the residuals are different quantities. Thus, we will always call (2.35) the residual sum of squares or the sum of squared residuals. We prefer to use the abbreviation SSR to denote the sum of squared residuals, because it is more common in econometric packages.

2.3c Goodness-of-Fit

So far, we have no way of measuring how well the explanatory or independent variable, x, explains the dependent variable, y. It is often useful to compute a number that summarizes how well the OLS regression line fits the data. In the following discussion, be sure to remember that we assume that an intercept is estimated along with the slope.

Assuming that the total sum of squares, SST, is not equal to zero, which is true except in the very unlikely event that all the $y_i$ equal the same value, we can divide (2.36) by SST to get $1 = \mathrm{SSE}/\mathrm{SST} + \mathrm{SSR}/\mathrm{SST}$. The R-squared of the regression, sometimes called the coefficient of determination, is defined as

$R^2 \equiv \mathrm{SSE}/\mathrm{SST} = 1 - \mathrm{SSR}/\mathrm{SST}.$  (2.38)

$R^2$ is the ratio of the explained variation compared to the total variation; thus, it is interpreted as the fraction of the sample variation in y that is explained by x. The second equality in (2.38) provides another way for computing $R^2$. From (2.36), the value of $R^2$ is always between zero and one, because SSE can be no greater than SST.
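The decomposition (2.36) and the two expressions for R-squared in (2.38) can be checked directly. The following sketch uses simulated data and also confirms the squared-correlation interpretation of $R^2$ mentioned just below:

```python
# Computing SST, SSE, and SSR as defined in (2.33)-(2.35) and verifying the
# decomposition SST = SSE + SSR and the R-squared formulas in (2.38).
import numpy as np

rng = np.random.default_rng(3)
x = rng.uniform(0, 1, size=300)
y = 0.5 + 2.0 * x + rng.normal(0, 0.4, size=300)

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
y_hat = b0 + b1 * x
u_hat = y - y_hat

sst = np.sum((y - y.mean()) ** 2)
sse = np.sum((y_hat - y.mean()) ** 2)
ssr = np.sum(u_hat ** 2)
print(sst, sse + ssr)                    # equal, up to floating-point error
print(sse / sst, 1 - ssr / sst)          # two equivalent ways to get R-squared
print(np.corrcoef(y, y_hat)[0, 1] ** 2)  # R-squared as a squared correlation
```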
From (2.36), the value of $R^2$ is always between zero and one, because SSE can be no greater than SST. When interpreting $R^2$, we usually multiply it by 100 to change it into a percent: $100 \cdot R^2$ is the percentage of the sample variation in $y$ that is explained by $x$.

If the data points all lie on the same line, OLS provides a perfect fit to the data. In this case, $R^2 = 1$. A value of $R^2$ that is nearly equal to zero indicates a poor fit of the OLS line: very little of the variation in the $y_i$ is captured by the variation in the $\hat{y}_i$ (which all lie on the OLS regression line). In fact, it can be shown that $R^2$ is equal to the square of the sample correlation coefficient between $y_i$ and $\hat{y}_i$. This is where the term "R-squared" came from. (The letter R was traditionally used to denote an estimate of a population correlation coefficient, and its usage has survived in regression analysis.)

Example 2.8 CEO Salary and Return on Equity

In the CEO salary regression, we obtain the following:

$$\widehat{salary} = 963.191 + 18.501\,roe \qquad (2.39)$$
n = 209, R² = 0.0132.

We have reproduced the OLS regression line and the number of observations for clarity. Using the R-squared (rounded to four decimal places) reported for this equation, we can see how much of the variation in salary is actually explained by the return on equity. The answer is: not much. The firm's return on equity explains only about 1.3% of the variation in salaries for this sample of 209 CEOs. That means that 98.7% of the salary variation for these CEOs is left unexplained! This lack of explanatory power may not be too surprising because many other characteristics of both the firm and the individual CEO should influence salary; these factors are necessarily included in the errors in a simple regression analysis.

In the social sciences, low R-squareds in regression equations are not uncommon, especially for cross-sectional analysis. We will discuss this issue more generally under multiple regression analysis, but it is worth emphasizing now that a seemingly low R-squared does not necessarily mean that an OLS regression equation is useless. It is still possible that (2.39) is a good estimate of the ceteris paribus relationship between salary and roe; whether or not this is true does not depend directly on the size of R-squared. Students who are first learning econometrics tend to put too much weight on the size of the R-squared in evaluating regression equations. For now, be aware that using R-squared as the main gauge of success for an econometric analysis can lead to trouble.

Sometimes, the explanatory variable explains a substantial part of the sample variation in the dependent variable.

Example 2.9 Voting Outcomes and Campaign Expenditures

In the voting outcome equation in (2.28), R² = 0.856. Thus, the share of campaign expenditures explains over 85% of the variation in the election outcomes for this sample. This is a sizable portion.

2.4 Units of Measurement and Functional Form

Two important issues in applied economics are (1) understanding how changing the units of measurement of the dependent and/or independent variables affects OLS estimates and (2) knowing how to
incorporate popular functional forms used in economics into regression analysis. The mathematics needed for a full understanding of functional form issues is reviewed in Appendix A.

2.4a The Effects of Changing Units of Measurement on OLS Statistics

In Example 2.3, we chose to measure annual salary in thousands of dollars, and the return on equity was measured as a percentage (rather than as a decimal). It is crucial to know how salary and roe are measured in this example in order to make sense of the estimates in equation (2.39).

We must also know that OLS estimates change in entirely expected ways when the units of measurement of the dependent and independent variables change. In Example 2.3, suppose that, rather than measuring salary in thousands of dollars, we measure it in dollars. Let salardol be salary in dollars (salardol = 845,761 would be interpreted as $845,761). Of course, salardol has a simple relationship to the salary measured in thousands of dollars: salardol = 1,000·salary. We do not need to actually run the regression of salardol on roe to know that the estimated equation is

$$\widehat{salardol} = 963{,}191 + 18{,}501\,roe. \qquad (2.40)$$

We obtain the intercept and slope in (2.40) simply by multiplying the intercept and the slope in (2.39) by 1,000. This gives equations (2.39) and (2.40) the same interpretation. Looking at (2.40), if roe = 0, then $\widehat{salardol}$ = 963,191, so the predicted salary is $963,191 [the same value we obtained from equation (2.39)]. Furthermore, if roe increases by one, then the predicted salary increases by $18,501; again, this is what we concluded from our earlier analysis of equation (2.39).

Generally, it is easy to figure out what happens to the intercept and slope estimates when the dependent variable changes units of measurement. If the dependent variable is multiplied by the constant c, which means each value in the sample is multiplied by c, then the OLS intercept and slope estimates are also multiplied by c. (This assumes nothing has changed about the independent variable.) In the CEO salary example, c = 1,000 in moving from salary to salardol.

We can also use the CEO salary example to see what happens when we change the units of measurement of the independent variable. Define roedec = roe/100 to be the decimal equivalent of roe; thus, roedec = 0.23 means a return on equity of 23%.

Exploring Further 2.4: Suppose that salary is measured in hundreds of dollars, rather than in thousands of dollars, say salarhun. What will be the OLS intercept and slope estimates in the regression of salarhun on roe?
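A small continuation of the earlier simulated sketch illustrates both rescaling rules. The helper function ols below is a generic simple-regression routine written for these sketches, not code from the text, and the data remain invented:

    def ols(x, y):
        """Intercept and slope of a simple OLS regression, eqs. (2.17) and (2.19)."""
        b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
        return y.mean() - b1 * x.mean(), b1

    b0, b1 = ols(x, y)               # y in "thousands" (like salary)
    c0, c1 = ols(x, 1000.0 * y)      # dependent variable rescaled (like salardol)
    print(c0 / b0, c1 / b1)          # both ratios equal 1000

    d0, d1 = ols(x / 100.0, y)       # independent variable rescaled (like roedec)
    print(d0 - b0, d1 / b1)          # intercept unchanged; slope multiplied by 100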
To focus on changing the units of measurement of the independent variable, we return to our original dependent variable, salary, which is measured in thousands of dollars. When we regress salary on roedec, we obtain

$$\widehat{salary} = 963.191 + 1{,}850.1\,roedec. \qquad (2.41)$$

The coefficient on roedec is 100 times the coefficient on roe in (2.39). This is as it should be. Changing roe by one percentage point is equivalent to Δroedec = 0.01. From (2.41), if Δroedec = 0.01, then Δsalary-hat = 1,850.1(0.01) = 18.501, which is what is obtained by using (2.39). Note that, in moving from (2.39) to (2.41), the independent variable was divided by 100, and so the OLS slope estimate was multiplied by 100, preserving the interpretation of the equation. Generally, if the independent variable is divided or multiplied by some nonzero constant, c, then the OLS slope coefficient is multiplied or divided by c, respectively.

The intercept has not changed in (2.41) because roedec = 0 still corresponds to a zero return on equity. In general, changing the units of measurement of only the independent variable does not affect the intercept.

In the previous section, we defined R-squared as a goodness-of-fit measure for OLS regression. We can also ask what happens to R² when the unit of measurement of either the independent or the dependent variable changes. Without doing any algebra, we should know the result: the goodness of fit of the model should not depend on the units of measurement of our variables. For example, the amount of variation in salary explained by the return on equity should not depend on whether salary is measured in dollars or in thousands of dollars or on whether return on equity is a percentage or a decimal. This intuition can be verified mathematically: using the definition of R², it can be shown that R² is, in fact, invariant to changes in the units of y or x.

2.4b Incorporating Nonlinearities in Simple Regression

So far, we have focused on linear relationships between the dependent and independent variables. As we mentioned in Chapter 1, linear relationships are not nearly general enough for all economic applications. Fortunately, it is rather easy to incorporate many nonlinearities into simple regression analysis by appropriately defining the dependent and independent variables. Here, we will cover two possibilities that often appear in applied work.

In reading applied work in the social sciences, you will often encounter regression equations where the dependent variable appears in logarithmic form. Why is this done? Recall the wage-education example, where we regressed hourly wage on years of education. We obtained a slope estimate of 0.54 [see equation (2.27)], which means that each additional year of education is predicted to increase hourly wage by 54 cents. Because of the linear nature of (2.27), 54 cents is the increase for either the first year of education or the twentieth year; this may not be reasonable.

Probably a better characterization of how wage changes with education is that each year of education increases wage by a constant percentage. For example, an increase in education from 5 years to 6 years increases wage by, say, 8% (ceteris paribus), and an increase in education from 11 to 12 years also increases wage by 8%. A model that gives (approximately) a constant percentage effect is

$$\log(wage) = \beta_0 + \beta_1 educ + u, \qquad (2.42)$$

where log(·) denotes the natural logarithm. (See Appendix A for a review of logarithms.) In particular, if Δu = 0, then

$$\%\Delta wage \approx (100 \cdot \beta_1)\Delta educ. \qquad (2.43)$$

Notice how we multiply β₁ by 100 to get the percentage change in wage given one additional year of education. Since the percentage change in wage is the same for each additional year of education, the change in wage for an extra year of education increases as education increases; in other words, (2.42) implies an increasing return to education. By exponentiating (2.42), we can write wage = exp(β₀ + β₁educ + u). This equation is graphed in Figure 2.6, with u = 0.
Example 2.10 A Log Wage Equation

Using the same data as in Example 2.4, but using log(wage) as the dependent variable, we obtain the following relationship:

$$\widehat{\log(wage)} = 0.584 + 0.083\,educ \qquad (2.44)$$
n = 526, R² = 0.186.

The coefficient on educ has a percentage interpretation when it is multiplied by 100: wage increases by 8.3% for every additional year of education. This is what economists mean when they refer to the "return to another year of education."

It is important to remember that the main reason for using the log of wage in (2.42) is to impose a constant percentage effect of education on wage. Once equation (2.44) is obtained, the natural log of wage is rarely mentioned. In particular, it is not correct to say that another year of education increases log(wage) by 8.3%.

The intercept in (2.44) is not very meaningful because it gives the predicted log(wage) when educ = 0. The R-squared shows that educ explains about 18.6% of the variation in log(wage) (not wage). Finally, equation (2.44) might not capture all of the nonlinearity in the relationship between wage and schooling. If there are "diploma effects," then the twelfth year of education (graduation from high school) could be worth much more than the eleventh year. We will learn how to allow for this kind of nonlinearity in Chapter 7.

[Figure 2.6: wage = exp(β₀ + β₁educ), with β₁ > 0.]

Estimating a model such as (2.42) is straightforward when using simple regression. Just define the dependent variable, y, to be y = log(wage). The independent variable is represented by x = educ. The mechanics of OLS are the same as before: the intercept and slope estimates are given by the formulas (2.17) and (2.19). In other words, we obtain $\hat{\beta}_0$ and $\hat{\beta}_1$ from the OLS regression of log(wage) on educ.

Another important use of the natural log is in obtaining a constant elasticity model.

Example 2.11 CEO Salary and Firm Sales

We can estimate a constant elasticity model relating CEO salary to firm sales. The data set is the same one used in Example 2.3, except we now relate salary to sales. Let sales be annual firm sales, measured in millions of dollars. A constant elasticity model is

$$\log(salary) = \beta_0 + \beta_1 \log(sales) + u, \qquad (2.45)$$

where β₁ is the elasticity of salary with respect to sales. This model falls under the simple regression model by defining the dependent variable to be y = log(salary) and the independent variable to be x = log(sales). Estimating this equation by OLS gives

$$\widehat{\log(salary)} = 4.822 + 0.257\,\log(sales) \qquad (2.46)$$
n = 209, R² = 0.211.

The coefficient on log(sales) is the estimated elasticity of salary with respect to sales. It implies that a 1% increase in firm sales increases CEO salary by about 0.257%; this is the usual interpretation of an elasticity.
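A constant elasticity model such as (2.45) is estimated by running ordinary OLS on the transformed variables. The sketch below, continuing the earlier snippets, generates hypothetical sales and salary data with a known elasticity of 0.26 (a made-up value, not the CEOSAL estimate) and recovers it:

    # Hypothetical log-log (constant elasticity) data; 0.26 is an invented elasticity.
    sales = rng.uniform(100.0, 50000.0, size=n)            # "firm sales," made up
    logsalary = 4.8 + 0.26 * np.log(sales) + rng.normal(0.0, 0.5, size=n)

    a0, a1 = ols(np.log(sales), logsalary)                 # OLS on transformed data
    print(a1)    # estimated elasticity, close to the invented 0.26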
The two functional forms covered in this section will often arise in the remainder of this text. We have covered models containing natural logarithms here because they appear so frequently in applied work. The interpretation of such models will not be much different in the multiple regression case.

It is also useful to note what happens to the intercept and slope estimates if we change the units of measurement of the dependent variable when it appears in logarithmic form. Because the change to logarithmic form approximates a proportionate change, it makes sense that nothing happens to the slope. We can see this by writing the rescaled variable as $c_1 y_i$ for each observation $i$. The original equation is $\log(y_i) = \beta_0 + \beta_1 x_i + u_i$. If we add $\log(c_1)$ to both sides, we get $\log(c_1) + \log(y_i) = [\log(c_1) + \beta_0] + \beta_1 x_i + u_i$, or

$$\log(c_1 y_i) = [\log(c_1) + \beta_0] + \beta_1 x_i + u_i.$$

(Remember that the sum of the logs is equal to the log of their product, as shown in Appendix A.) Therefore, the slope is still β₁, but the intercept is now log(c₁) + β₀. Similarly, if the independent variable is log(x), and we change the units of measurement of x before taking the log, the slope remains the same, but the intercept changes. You will be asked to verify these claims in Problem 9.

We end this subsection by summarizing four combinations of functional forms available from using either the original variable or its natural log. In Table 2.3, x and y stand for the variables in their original form. The model with y as the dependent variable and x as the independent variable is called the level-level model because each variable appears in its level form. The model with log(y) as the dependent variable and x as the independent variable is called the log-level model. We will not explicitly discuss the level-log model here, because it arises less often in practice. In any case, we will see examples of this model in later chapters.

The last column in Table 2.3 gives the interpretation of β₁. In the log-level model, 100·β₁ is sometimes called the semi-elasticity of y with respect to x. As we mentioned in Example 2.11, in the log-log model, β₁ is the elasticity of y with respect to x. Table 2.3 warrants careful study, as we will refer to it often in the remainder of the text.

Table 2.3 Summary of Functional Forms Involving Logarithms

Model         Dependent Variable   Independent Variable   Interpretation of β₁
Level-level   y                    x                      Δy = β₁Δx
Level-log     y                    log(x)                 Δy = (β₁/100)%Δx
Log-level     log(y)               x                      %Δy = (100β₁)Δx
Log-log       log(y)               log(x)                 %Δy = β₁%Δx

2.4c The Meaning of "Linear" Regression

The simple regression model that we have studied in this chapter is also called the simple linear regression model. Yet, as we have just seen, the general model also allows for certain nonlinear relationships. So what does "linear" mean here? You can see by looking at equation (2.1) that y = β₀ + β₁x + u. The key is that this equation is linear in the parameters β₀ and β₁. There are no restrictions on how y and x relate to the original explained and explanatory variables of interest. As we saw in Examples 2.10 and 2.11, y and x can be natural logs of variables, and this is quite common in applications. But we need not stop there. For example, nothing prevents us from using simple regression to estimate a model such as cons = β₀ + β₁√inc + u, where cons is annual consumption and inc is annual income.
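Because OLS requires linearity only in β₀ and β₁, the same two formulas fit a model like the consumption equation above once x is defined as √inc. A short sketch (with invented parameter values and data) makes the point:

    # "Linear regression" means linear in the parameters: the regressor can be
    # any transformation of an underlying variable, here sqrt(inc).
    inc = rng.uniform(10.0, 100.0, size=500)              # made-up income data
    cons = 5.0 + 20.0 * np.sqrt(inc) + rng.normal(0.0, 8.0, size=500)

    g0, g1 = ols(np.sqrt(inc), cons)
    print(g0, g1)    # close to the invented values 5 and 20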
Whereas the mechanics of simple regression do not depend on how y and x are defined, the interpretation of the coefficients does depend on their definitions. For successful empirical work, it is much more important to become proficient at interpreting coefficients than to become efficient at computing formulas such as (2.19). We will get much more practice with interpreting the estimates in OLS regression lines when we study multiple regression.

Plenty of models cannot be cast as a linear regression model because they are not linear in their parameters; an example is cons = 1/(β₀ + β₁inc) + u. Estimation of such models takes us into the realm of the nonlinear regression model, which is beyond the scope of this text. For most applications, choosing a model that can be put into the linear regression framework is sufficient.

2.5 Expected Values and Variances of the OLS Estimators

In Section 2.1, we defined the population model y = β₀ + β₁x + u, and we claimed that the key assumption for simple regression analysis to be useful is that the expected value of u given any value of x is zero. In Sections 2.2, 2.3, and 2.4, we discussed the algebraic properties of OLS estimation. We now return to the population model and study the statistical properties of OLS. In other words, we now view $\hat{\beta}_0$ and $\hat{\beta}_1$ as estimators for the parameters β₀ and β₁ that appear in the population model. This means that we will study properties of the distributions of $\hat{\beta}_0$ and $\hat{\beta}_1$ over different random samples from the population. (Appendix C contains definitions of estimators and reviews some of their important properties.)

2.5a Unbiasedness of OLS

We begin by establishing the unbiasedness of OLS under a simple set of assumptions. For future reference, it is useful to number these assumptions using the prefix "SLR" for simple linear regression. The first assumption defines the population model.

Assumption SLR.1 (Linear in Parameters)

In the population model, the dependent variable, y, is related to the independent variable, x, and the error (or disturbance), u, as

$$y = \beta_0 + \beta_1 x + u, \qquad (2.47)$$

where β₀ and β₁ are the population intercept and slope parameters, respectively.

To be realistic, y, x, and u are all viewed as random variables in stating the population model. We discussed the interpretation of this model at some length in Section 2.1 and gave several examples. In the previous section, we learned that equation (2.47) is not as restrictive as it initially seems; by choosing y and x appropriately, we can obtain interesting nonlinear relationships (such as constant elasticity models).

We are interested in using data on y and x to estimate the parameters β₀ and, especially, β₁. We assume that our data were obtained as a random sample. (See Appendix C for a review of random sampling.)

Assumption SLR.2 (Random Sampling)

We have a random sample of size n, {(xᵢ, yᵢ): i = 1, 2, …, n}, following the population model in equation (2.47).

We will have to address failure of the random sampling assumption in later chapters that deal with time series analysis and sample selection problems. Not all cross-sectional samples can be viewed as outcomes
of random samples, but many can be. We can write (2.47) in terms of the random sample as

$$y_i = \beta_0 + \beta_1 x_i + u_i, \quad i = 1, 2, \ldots, n, \qquad (2.48)$$

where $u_i$ is the error or disturbance for observation $i$ (for example, person $i$, firm $i$, city $i$, and so on). Thus, $u_i$ contains the unobservables for observation $i$ that affect $y_i$. The $u_i$ should not be confused with the residuals, $\hat{u}_i$, that we defined in Section 2.3. Later on, we will explore the relationship between the errors and the residuals. For interpreting β₀ and β₁ in a particular application, (2.47) is most informative, but (2.48) is also needed for some of the statistical derivations.

The relationship (2.48) can be plotted for a particular outcome of data, as shown in Figure 2.7.

[Figure 2.7: Graph of $y_i = \beta_0 + \beta_1 x_i + u_i$, with the population regression function (PRF) E(y|x) = β₀ + β₁x and the errors uᵢ measured as vertical deviations from it.]

As we already saw in Section 2.2, the OLS slope and intercept estimates are not defined unless we have some sample variation in the explanatory variable. We now add variation in the $x_i$ to our list of assumptions.

Assumption SLR.3 (Sample Variation in the Explanatory Variable)

The sample outcomes on x, namely, {xᵢ: i = 1, …, n}, are not all the same value.

This is a very weak assumption, certainly not worth emphasizing, but needed nevertheless. If x varies in the population, random samples on x will typically contain variation, unless the population variation is minimal or the sample size is small. Simple inspection of summary statistics on $x_i$ reveals whether Assumption SLR.3 fails: if the sample standard deviation of $x_i$ is zero, then Assumption SLR.3 fails; otherwise, it holds.

Finally, in order to obtain unbiased estimators of β₀ and β₁, we need to impose the zero conditional mean assumption that we discussed in some detail in Section 2.1. We now explicitly add it to our list of assumptions.

Assumption SLR.4 (Zero Conditional Mean)

The error u has an expected value of zero given any value of the explanatory variable. In other words, E(u|x) = 0.

For a random sample, this assumption implies that E(uᵢ|xᵢ) = 0 for all i = 1, 2, …, n.

In addition to restricting the relationship between u and x in the population, the zero conditional mean assumption, coupled with the random sampling assumption, allows for a convenient technical simplification. In particular, we can derive the statistical properties of the OLS estimators as conditional on the values of the $x_i$ in our sample. Technically, in statistical derivations, conditioning on the sample values of the independent variable is the same as treating the $x_i$ as fixed in repeated samples, which we think of as follows. We first choose n sample values for $x_1, x_2, \ldots, x_n$. (These can be repeated.) Given these values, we then obtain a sample on y (effectively by obtaining a random sample of the $u_i$). Next, another sample of y is obtained, using the same values for $x_1, x_2, \ldots, x_n$. Then another sample of y is obtained, again using the same $x_1, x_2, \ldots, x_n$. And so on.

The fixed-in-repeated-samples scenario is not very realistic in nonexperimental contexts. For instance, in sampling individuals for the wage-education example, it makes little sense to think of choosing the values of educ ahead of time and
then sampling individuals with those particular levels of education. Random sampling, where individuals are chosen randomly and their wage and education are both recorded, is representative of how most data sets are obtained for empirical analysis in the social sciences. Once we assume that E(u|x) = 0, and we have random sampling, nothing is lost in derivations by treating the $x_i$ as nonrandom. The danger is that the fixed-in-repeated-samples assumption always implies that $u_i$ and $x_i$ are independent. In deciding when simple regression analysis is going to produce unbiased estimators, it is critical to think in terms of Assumption SLR.4.

Now, we are ready to show that the OLS estimators are unbiased. To this end, we use the fact that $\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) = \sum_{i=1}^{n} (x_i - \bar{x})y_i$ (see Appendix A) to write the OLS slope estimator in equation (2.19) as

$$\hat{\beta}_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})y_i}{\sum_{i=1}^{n} (x_i - \bar{x})^2}. \qquad (2.49)$$

Because we are now interested in the behavior of $\hat{\beta}_1$ across all possible samples, $\hat{\beta}_1$ is properly viewed as a random variable.

We can write $\hat{\beta}_1$ in terms of the population coefficient and errors by substituting the right-hand side of (2.48) into (2.49). We have

$$\hat{\beta}_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})y_i}{\text{SST}_x} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(\beta_0 + \beta_1 x_i + u_i)}{\text{SST}_x}, \qquad (2.50)$$

where we have defined the total variation in the $x_i$ as $\text{SST}_x = \sum_{i=1}^{n} (x_i - \bar{x})^2$ to simplify the notation. (This is not quite the sample variance of the $x_i$ because we do not divide by $n - 1$.) Using the algebra of the summation operator, write the numerator of $\hat{\beta}_1$ as

$$\sum_{i=1}^{n} (x_i - \bar{x})\beta_0 + \sum_{i=1}^{n} (x_i - \bar{x})\beta_1 x_i + \sum_{i=1}^{n} (x_i - \bar{x})u_i = \beta_0 \sum_{i=1}^{n} (x_i - \bar{x}) + \beta_1 \sum_{i=1}^{n} (x_i - \bar{x})x_i + \sum_{i=1}^{n} (x_i - \bar{x})u_i. \qquad (2.51)$$

As shown in Appendix A, $\sum_{i=1}^{n} (x_i - \bar{x}) = 0$ and $\sum_{i=1}^{n} (x_i - \bar{x})x_i = \sum_{i=1}^{n} (x_i - \bar{x})^2 = \text{SST}_x$. Therefore, we can write the numerator of $\hat{\beta}_1$ as $\beta_1 \text{SST}_x + \sum_{i=1}^{n} (x_i - \bar{x})u_i$. Putting this over the denominator gives

$$\hat{\beta}_1 = \beta_1 + \frac{\sum_{i=1}^{n} (x_i - \bar{x})u_i}{\text{SST}_x} = \beta_1 + (1/\text{SST}_x)\sum_{i=1}^{n} d_i u_i, \qquad (2.52)$$

where $d_i = x_i - \bar{x}$. We now see that the estimator $\hat{\beta}_1$ equals the population slope, β₁, plus a term that is a linear combination in the errors $\{u_1, u_2, \ldots, u_n\}$. Conditional on the values of $x_i$, the randomness in $\hat{\beta}_1$ is due entirely to the errors in the sample. The fact that these errors are generally different from zero is what causes $\hat{\beta}_1$ to differ from β₁.

Using the representation in (2.52), we can prove the first important statistical property of OLS.

Theorem 2.1 (Unbiasedness of OLS)

Using Assumptions SLR.1 through SLR.4,

$$E(\hat{\beta}_0) = \beta_0 \quad \text{and} \quad E(\hat{\beta}_1) = \beta_1, \qquad (2.53)$$

for any values of β₀ and β₁. In other words, $\hat{\beta}_0$ is unbiased for β₀, and $\hat{\beta}_1$ is unbiased for β₁.

PROOF: In this proof, the expected values are conditional on the sample values of the independent variable. Because $\text{SST}_x$ and $d_i$ are functions only of the $x_i$, they are nonrandom in the conditioning. Therefore, from (2.52), and keeping the conditioning on $\{x_1, x_2, \ldots, x_n\}$ implicit, we have

$$E(\hat{\beta}_1) = \beta_1 + E\Big[(1/\text{SST}_x)\sum_{i=1}^{n} d_i u_i\Big] = \beta_1 + (1/\text{SST}_x)\sum_{i=1}^{n} E(d_i u_i) = \beta_1 + (1/\text{SST}_x)\sum_{i=1}^{n} d_i E(u_i) = \beta_1 + (1/\text{SST}_x)\sum_{i=1}^{n} d_i \cdot 0 = \beta_1,$$

where we have used the fact that the expected value of each $u_i$ (conditional on $\{x_1, x_2, \ldots, x_n\}$) is zero under Assumptions SLR.2 and SLR.4. Since unbiasedness holds for
any outcome on $\{x_1, x_2, \ldots, x_n\}$, unbiasedness also holds without conditioning on $\{x_1, x_2, \ldots, x_n\}$.

The proof for $\hat{\beta}_0$ is now straightforward. Average (2.48) across $i$ to get $\bar{y} = \beta_0 + \beta_1\bar{x} + \bar{u}$, and plug this into the formula for $\hat{\beta}_0$:

$$\hat{\beta}_0 = \bar{y} - \hat{\beta}_1\bar{x} = \beta_0 + \beta_1\bar{x} + \bar{u} - \hat{\beta}_1\bar{x} = \beta_0 + (\beta_1 - \hat{\beta}_1)\bar{x} + \bar{u}.$$

Then, conditional on the values of the $x_i$,

$$E(\hat{\beta}_0) = \beta_0 + E[(\beta_1 - \hat{\beta}_1)\bar{x}] + E(\bar{u}) = \beta_0 + E[(\beta_1 - \hat{\beta}_1)]\bar{x},$$

since $E(\bar{u}) = 0$ by Assumptions SLR.2 and SLR.4. But we showed that $E(\hat{\beta}_1) = \beta_1$, which implies that $E[(\hat{\beta}_1 - \beta_1)] = 0$. Thus, $E(\hat{\beta}_0) = \beta_0$. Both of these arguments are valid for any values of β₀ and β₁, and so we have established unbiasedness.

Remember that unbiasedness is a feature of the sampling distributions of $\hat{\beta}_1$ and $\hat{\beta}_0$, which says nothing about the estimate that we obtain for a given sample. We hope that, if the sample we obtain is somehow "typical," then our estimate should be "near" the population value. Unfortunately, it is always possible that we could obtain an unlucky sample that would give us a point estimate far from β₁, and we can never know for sure whether this is the case. You may want to review the material on unbiased estimators in Appendix C, especially the simulation exercise in Table C.1 that illustrates the concept of unbiasedness.

Unbiasedness generally fails if any of our four assumptions fail. This means that it is important to think about the veracity of each assumption for a particular application. Assumption SLR.1 requires that y and x be linearly related, with an additive disturbance. This can certainly fail. But we also know that y and x can be chosen to yield interesting nonlinear relationships. Dealing with the failure of (2.47) requires more advanced methods that are beyond the scope of this text.

Later, we will have to relax Assumption SLR.2, the random sampling assumption, for time series analysis. But what about using it for cross-sectional analysis? Random sampling can fail in a cross section when samples are not representative of the underlying population; in fact, some data sets are constructed by intentionally oversampling different parts of the population. We will discuss problems of nonrandom sampling in Chapters 9 and 17.

As we have already discussed, Assumption SLR.3 almost always holds in interesting regression applications. Without it, we cannot even obtain the OLS estimates.

The assumption we should concentrate on for now is SLR.4. If SLR.4 holds, the OLS estimators are unbiased. Likewise, if SLR.4 fails, the OLS estimators generally will be biased. There are ways to determine the likely direction and size of the bias, which we will study in Chapter 3.

The possibility that x is correlated with u is almost always a concern in simple regression analysis with nonexperimental data, as we indicated with several examples in Section 2.1. Using simple regression when u contains factors affecting y that are also correlated with x can result in spurious correlation; that is, we find a relationship between y and x that is really due to other unobserved factors that affect y and also happen to be correlated with x.
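Theorem 2.1 is a statement about averages across repeated samples, and a small Monte Carlo exercise, in the spirit of the simulation in Table C.1, makes this concrete. The population values 3 and 2 below are invented, and the sketch continues the earlier snippets:

    # Monte Carlo illustration of unbiasedness: with E(u|x) = 0 imposed by
    # construction, the OLS estimates average out to the population values.
    beta0, beta1 = 3.0, 2.0                    # invented population parameters
    estimates = np.empty((10000, 2))
    for r in range(10000):
        xs = rng.normal(0.0, 1.0, size=100)
        ys = beta0 + beta1 * xs + rng.normal(0.0, 1.0, size=100)
        estimates[r] = ols(xs, ys)
    print(estimates.mean(axis=0))              # approximately (3.0, 2.0)

Any single replication can land far from (3, 2); only the average across replications is pinned down by unbiasedness.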
Example 2.12 Student Math Performance and the School Lunch Program

Let math10 denote the percentage of tenth graders at a high school receiving a passing score on a standardized mathematics exam. Suppose we wish to estimate the effect of the federally funded school lunch program on student performance. If anything, we expect the lunch program to have a positive ceteris paribus effect on performance: all other factors being equal, if a student who is too poor to eat regular meals becomes eligible for the school lunch program, his or her performance should improve. Let lnchprg denote the percentage of students who are eligible for the lunch program. Then, a simple regression model is

$$math10 = \beta_0 + \beta_1 lnchprg + u, \qquad (2.54)$$

where u contains school and student characteristics that affect overall school performance. Using the data in MEAP93 on 408 Michigan high schools for the 1992–1993 school year, we obtain

$$\widehat{math10} = 32.14 - 0.319\,lnchprg$$
n = 408, R² = 0.171.

This equation predicts that if student eligibility in the lunch program increases by 10 percentage points, the percentage of students passing the math exam falls by about 3.2 percentage points. Do we really believe that higher participation in the lunch program actually causes worse performance? Almost certainly not. A better explanation is that the error term u in equation (2.54) is correlated with lnchprg. In fact, u contains factors such as the poverty rate of children attending school, which affects student performance and is highly correlated with eligibility in the lunch program. Variables such as school quality and resources are also contained in u, and these are likely correlated with lnchprg. It is important to remember that the estimate −0.319 is only for this particular sample, but its sign and magnitude make us suspect that u and x are correlated, so that simple regression is biased.

In addition to omitted variables, there are other reasons for x to be correlated with u in the simple regression model. Because the same issues arise in multiple regression analysis, we will postpone a systematic treatment of the problem until then.

2.5b Variances of the OLS Estimators

In addition to knowing that the sampling distribution of $\hat{\beta}_1$ is centered about β₁ ($\hat{\beta}_1$ is unbiased), it is important to know how far we can expect $\hat{\beta}_1$ to be away from β₁ on average. Among other things, this allows us to choose the best estimator among all, or at least a broad class of, unbiased estimators. The measure of spread in the distribution of $\hat{\beta}_1$ (and $\hat{\beta}_0$) that is easiest to work with is the variance or its square root, the standard deviation. (See Appendix C for a more detailed discussion.)

It turns out that the variance of the OLS estimators can be computed under Assumptions SLR.1 through SLR.4. However, these expressions would be somewhat complicated. Instead, we add an assumption that is traditional for cross-sectional analysis. This assumption states that the variance of the unobservable, u, conditional on x, is constant. This is known as the homoskedasticity or "constant variance" assumption.

Assumption SLR.5 (Homoskedasticity)

The error u has the same variance given any
value of the explanatory variable. In other words, Var(u|x) = σ².

We must emphasize that the homoskedasticity assumption is quite distinct from the zero conditional mean assumption, E(u|x) = 0. Assumption SLR.4 involves the expected value of u, while Assumption SLR.5 concerns the variance of u (both conditional on x). Recall that we established the unbiasedness of OLS without Assumption SLR.5: the homoskedasticity assumption plays no role in showing that $\hat{\beta}_0$ and $\hat{\beta}_1$ are unbiased. We add Assumption SLR.5 because it simplifies the variance calculations for $\hat{\beta}_0$ and $\hat{\beta}_1$ and because it implies that ordinary least squares has certain efficiency properties, which we will see in Chapter 3. If we were to assume that u and x are independent, then the distribution of u given x does not depend on x, and so E(u|x) = E(u) = 0 and Var(u|x) = σ². But independence is sometimes too strong of an assumption.

Because Var(u|x) = E(u²|x) − [E(u|x)]² and E(u|x) = 0, σ² = E(u²|x), which means σ² is also the unconditional expectation of u². Therefore, σ² = E(u²) = Var(u), because E(u) = 0. In other words, σ² is the unconditional variance of u, and so σ² is often called the error variance or disturbance variance. The square root of σ², σ, is the standard deviation of the error. A larger σ means that the distribution of the unobservables affecting y is more spread out.

It is often useful to write Assumptions SLR.4 and SLR.5 in terms of the conditional mean and conditional variance of y:

$$E(y|x) = \beta_0 + \beta_1 x. \qquad (2.55)$$

$$\text{Var}(y|x) = \sigma^2. \qquad (2.56)$$

In other words, the conditional expectation of y given x is linear in x, but the variance of y given x is constant. This situation is graphed in Figure 2.8, where β₀ > 0 and β₁ > 0.

[Figure 2.8: The simple regression model under homoskedasticity; the conditional densities f(y|x) at x₁, x₂, x₃ have the same spread around E(y|x) = β₀ + β₁x.]

When Var(u|x) depends on x, the error term is said to exhibit heteroskedasticity (or nonconstant variance). Because Var(u|x) = Var(y|x), heteroskedasticity is present whenever Var(y|x) is a function of x.

Example 2.13 Heteroskedasticity in a Wage Equation

In order to get an unbiased estimator of the ceteris paribus effect of educ on wage, we must assume that E(u|educ) = 0, and this implies E(wage|educ) = β₀ + β₁educ. If we also make the homoskedasticity assumption, then Var(u|educ) = σ² does not depend on the level of education, which is the same as assuming Var(wage|educ) = σ². Thus, while average wage is allowed to increase with education level (it is this rate of increase that we are interested in estimating), the variability in wage about its mean is assumed to be constant across all education levels. This may not be realistic. It is likely that people with more education have a wider variety of interests and job opportunities, which could lead to more wage variability at higher levels of education. People with very low levels of education have fewer opportunities and often must work at the minimum wage; this serves to reduce wage variability at low education levels. This situation is shown in Figure 2.9. Ultimately, whether Assumption SLR.5 holds is an empirical issue, and in Chapter 8 we will show how to test Assumption SLR.5.
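Heteroskedasticity of the kind Example 2.13 describes is easy to see in simulated data where sd(u|x) grows with x; all the numbers below are invented, and the sketch continues the earlier snippets:

    # Simulated failure of SLR.5: sd(u|x) = 0.4x, so Var(u|x) depends on x.
    xs = rng.uniform(8.0, 16.0, size=5000)               # education-like values
    us = rng.normal(0.0, 1.0, size=5000) * 0.4 * xs
    for lo, hi in [(8, 10), (11, 13), (14, 16)]:
        band = (xs >= lo) & (xs < hi)
        print(lo, hi, round(us[band].std(), 2))          # spread rises with x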
[Figure 2.9: Var(wage|educ) increasing with educ; the conditional densities f(wage|educ) at educ = 8, 12, 16 spread out as educ rises around E(wage|educ) = β₀ + β₁educ.]

With the homoskedasticity assumption in place, we are ready to prove the following.

Theorem 2.2 (Sampling Variances of the OLS Estimators)

Under Assumptions SLR.1 through SLR.5,

$$\text{Var}(\hat{\beta}_1) = \frac{\sigma^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2} = \sigma^2/\text{SST}_x \qquad (2.57)$$

and

$$\text{Var}(\hat{\beta}_0) = \frac{\sigma^2 \, n^{-1} \sum_{i=1}^{n} x_i^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2}, \qquad (2.58)$$

where these are conditional on the sample values $\{x_1, \ldots, x_n\}$.

PROOF: We derive the formula for $\text{Var}(\hat{\beta}_1)$, leaving the other derivation as Problem 10. The starting point is equation (2.52): $\hat{\beta}_1 = \beta_1 + (1/\text{SST}_x)\sum_{i=1}^{n} d_i u_i$. Because β₁ is just a constant, and we are conditioning on the $x_i$, $\text{SST}_x$ and $d_i = x_i - \bar{x}$ are also nonrandom. Furthermore, because the $u_i$ are independent random variables across $i$ (by random sampling), the variance of the sum is the sum of the variances. Using these facts, we have

$$\text{Var}(\hat{\beta}_1) = (1/\text{SST}_x)^2 \, \text{Var}\Big(\sum_{i=1}^{n} d_i u_i\Big) = (1/\text{SST}_x)^2 \Big(\sum_{i=1}^{n} d_i^2 \, \text{Var}(u_i)\Big) = (1/\text{SST}_x)^2 \Big(\sum_{i=1}^{n} d_i^2 \sigma^2\Big) \quad [\text{since Var}(u_i) = \sigma^2 \text{ for all } i]$$

$$= \sigma^2 (1/\text{SST}_x)^2 \Big(\sum_{i=1}^{n} d_i^2\Big) = \sigma^2 (1/\text{SST}_x)^2 \, \text{SST}_x = \sigma^2/\text{SST}_x,$$

which is what we wanted to show.
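Equation (2.57) can also be checked by simulation: holding the $x_i$ fixed across replications (the conditioning used in the proof) and redrawing the errors, the sampling variance of $\hat{\beta}_1$ should approach σ²/SSTₓ. The parameter values below are invented, and the sketch continues the earlier snippets:

    # Monte Carlo check of Var(b1hat) = sigma^2 / SSTx, with the x's held fixed.
    sigma = 2.0
    xs = rng.normal(0.0, 1.0, size=100)        # one fixed set of regressor values
    SSTx = np.sum((xs - xs.mean()) ** 2)
    slopes = np.empty(10000)
    for r in range(10000):
        ys = 1.0 + 0.5 * xs + rng.normal(0.0, sigma, size=100)
        slopes[r] = ols(xs, ys)[1]
    print(slopes.var(), sigma ** 2 / SSTx)     # the two numbers nearly agree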
Equations (2.57) and (2.58) are the "standard" formulas for simple regression analysis, which are invalid in the presence of heteroskedasticity. This will be important when we turn to confidence intervals and hypothesis testing in multiple regression analysis.

For most purposes, we are interested in $\text{Var}(\hat{\beta}_1)$. It is easy to summarize how this variance depends on the error variance, σ², and the total variation in $\{x_1, x_2, \ldots, x_n\}$, $\text{SST}_x$. First, the larger the error variance, the larger is $\text{Var}(\hat{\beta}_1)$. This makes sense because more variation in the unobservables affecting y makes it more difficult to precisely estimate β₁. On the other hand, more variability in the independent variable is preferred: as the variability in the $x_i$ increases, the variance of $\hat{\beta}_1$ decreases. This also makes intuitive sense: the more spread out is the sample of independent variables, the easier it is to trace out the relationship between E(y|x) and x; that is, the easier it is to estimate β₁. If there is little variation in the $x_i$, then it can be hard to pinpoint how E(y|x) varies with x. As the sample size increases, so does the total variation in the $x_i$. Therefore, a larger sample size results in a smaller variance for $\hat{\beta}_1$.

This analysis shows that, if we are interested in β₁ and we have a choice, then we should choose the $x_i$ to be as spread out as possible. This is sometimes possible with experimental data, but rarely do we have this luxury in the social sciences: usually, we must take the $x_i$ that we obtain via random sampling. Sometimes, we have an opportunity to obtain larger sample sizes, although this can be costly.

Exploring Further 2.5: Show that, when estimating β₀, it is best to have x̄ = 0. What is Var($\hat{\beta}_0$) in this case? [Hint: For any sample of numbers, $\sum_{i=1}^{n} x_i^2 \geq \sum_{i=1}^{n} (x_i - \bar{x})^2$, with equality only if x̄ = 0.]

For the purposes of constructing confidence intervals and deriving test statistics, we will need to work with the standard deviations of $\hat{\beta}_1$ and $\hat{\beta}_0$, sd($\hat{\beta}_1$) and sd($\hat{\beta}_0$). Recall that these are obtained by taking the square roots of the variances in (2.57) and (2.58). In particular, $\text{sd}(\hat{\beta}_1) = \sigma/\sqrt{\text{SST}_x}$, where σ is the square root of σ², and $\sqrt{\text{SST}_x}$ is the square root of $\text{SST}_x$.

2.5c Estimating the Error Variance

The formulas in (2.57) and (2.58) allow us to isolate the factors that contribute to $\text{Var}(\hat{\beta}_1)$ and $\text{Var}(\hat{\beta}_0)$. But these formulas are unknown, except in the extremely rare case that σ² is known. Nevertheless, we can use the data to estimate σ², which then allows us to estimate $\text{Var}(\hat{\beta}_1)$ and $\text{Var}(\hat{\beta}_0)$.

This is a good place to emphasize the difference between the errors (or disturbances) and the residuals, since this distinction is crucial for constructing an estimator of σ². Equation (2.48) shows how to write the population model in terms of a randomly sampled observation as $y_i = \beta_0 + \beta_1 x_i + u_i$, where $u_i$ is the error for observation $i$. We can also express $y_i$ in terms of its fitted value and residual, as in equation (2.32): $y_i = \hat{\beta}_0 + \hat{\beta}_1 x_i + \hat{u}_i$. Comparing these two equations, we see that the error shows
get the other two residuals by using the restrictions implied by the first order conditions in 260 Thus there are only n 2 2 degrees of freedom in the OLS residuals as opposed to n degrees of freedom in the errors It is important to understand that if we replace u i with ui in 260 the restrictions would no longer hold The unbiased estimator of s2 that we will use makes a degrees of freedom adjustment s 2 5 1 1n 2 22 a n i51u 2 i 5 SSR1n 2 22 261 This estimator is sometimes denoted as S2 but we continue to use the convention of putting hats over estimators UnbiaSEd EStimatiOn OF s2 Under Assumptions SLR1 through SLR5 E1s 22 5 s2 PROOF If we average equation 259 across all i and use the fact that the OLS residuals average out to zero we have 0 5 u 2 1b 0 2 b02 2 1b 1 2 b12x subtracting this from 259 gives u i 5 1ui 2 u2 2 1b 1 2 b12 1xi 2 x2 Therefore u 2 i 5 1ui 2 u2 2 1 1b 1 2 b12 21xi 2 x2 2 2 21ui 2 u2 1b 1 2 b12 1xi 2 x2 Summing across all i gives g n i51u 2 i 5 g n i511ui 2 u2 2 1 1b 1 2 b12 2 g n i511xi 2 x2 2 2 21b 1 2 b12 g n i51 ui1xi 2 x2 Now the expected value of the first term is 1n 2 12s2 something that is shown in Appendix C The expected value of the second term is simply s2 because E3 1b 1 2 b12 24 5 Var1b 12 5 s2SSTx Finally the third term can be written as 221b 1 2 b12 2SSTx taking expectations gives 22s2 Putting these three terms together gives E1 g n i51u 2 i 2 5 1n 2 12s2 1 s2 2 2s2 5 1n 2 22s2 so that E3SSR1n 2 22 4 5 s2 thEorEm 23 Copyright 2016 Cengage Learning All Rights Reserved May not be copied scanned or duplicated in whole or in part Due to electronic rights some third party content may be suppressed from the eBook andor eChapters Editorial review has deemed that any suppressed content does not materially affect the overall learning experience Cengage Learning reserves the right to remove additional content at any time if subsequent rights restrictions require it PART 1 Regression Analysis with CrossSectional Data 50 If s 2 is plugged into the variance formulas 257 and 258 then we have unbiased estimators of Var1b 12 and Var1b 02 Later on we will need estimators of the standard deviations of b 1 and b 0 and this requires estimating s The natural estimator of s is s 5 s 2 262 and is called the standard error of the regression SER Other names for s are the standard error of the estimate and the root mean squared error but we will not use these Although s is not an unbi ased estimator of s we can show that it is a consistent estimator of s see Appendix C and it will serve our purposes well The estimate s is interesting because it is an estimate of the standard deviation in the unobservables affecting y equivalently it estimates the standard deviation in y after the effect of x has been taken out Most regression packages report the value of s along with the Rsquared intercept slope and other OLS statistics under one of the several names listed above For now our primary interest is in using s to esti mate the standard deviations of b 0 and b 1 Since sd1b 12 5 sSSTx the natural estimator of sd1b 12is se1b 12 5 s SSTx 5 s a a n i51 1xi 2 x2 2b 12 this is called the standard error of b 1 Note that se1b 12 is viewed as a random variable when we think of running OLS over different samples of y this is true because s varies with different samples For a given sample se1b 12 is a number just as b 1 is simply a number when we compute it from the given data Similarly se1b 02 is obtained from sd1b 02 by replacing s with s The standard error of any esti mate gives us an idea of 
Standard errors play a central role throughout this text; we will use them to construct test statistics and confidence intervals for every econometric procedure we cover, starting in Chapter 4.

2.6 Regression through the Origin and Regression on a Constant

In rare cases, we wish to impose the restriction that, when x = 0, the expected value of y is zero. There are certain relationships for which this is reasonable. For example, if income (x) is zero, then income tax revenues (y) must also be zero. In addition, there are settings where a model that originally has a nonzero intercept is transformed into a model without an intercept.

Formally, we now choose a slope estimator, which we call $\tilde{\beta}_1$, and a line of the form

$$\tilde{y} = \tilde{\beta}_1 x, \qquad (2.63)$$

where the tildes over $\tilde{\beta}_1$ and $\tilde{y}$ are used to distinguish this problem from the much more common problem of estimating an intercept along with a slope. Obtaining (2.63) is called regression through the origin because the line (2.63) passes through the point x = 0, $\tilde{y}$ = 0. To obtain the slope estimate in (2.63), we still rely on the method of ordinary least squares, which in this case minimizes the sum of squared residuals:

$$\sum_{i=1}^{n} (y_i - \tilde{\beta}_1 x_i)^2. \qquad (2.64)$$

Using one-variable calculus, it can be shown that $\tilde{\beta}_1$ must solve the first order condition

$$\sum_{i=1}^{n} x_i(y_i - \tilde{\beta}_1 x_i) = 0. \qquad (2.65)$$

From this, we can solve for $\tilde{\beta}_1$:

$$\tilde{\beta}_1 = \frac{\sum_{i=1}^{n} x_i y_i}{\sum_{i=1}^{n} x_i^2}, \qquad (2.66)$$

provided that not all the $x_i$ are zero, a case we rule out.

Note how $\tilde{\beta}_1$ compares with the slope estimate when we also estimate the intercept (rather than set it equal to zero). These two estimates are the same if, and only if, x̄ = 0. [See equation (2.49) for $\hat{\beta}_1$.] Obtaining an estimate of β₁ using regression through the origin is not done very often in applied work, and for good reason: if the intercept β₀ ≠ 0, then $\tilde{\beta}_1$ is a biased estimator of β₁. You will be asked to prove this in Problem 8.

In cases where regression through the origin is deemed appropriate, one must be careful in interpreting the R-squared that is typically reported with such regressions. Usually, unless stated otherwise, the R-squared is obtained without removing the sample average of {yᵢ: i = 1, …, n} in obtaining SST. In other words, the R-squared is computed as

$$1 - \frac{\sum_{i=1}^{n} (y_i - \tilde{\beta}_1 x_i)^2}{\sum_{i=1}^{n} y_i^2}. \qquad (2.67)$$

The numerator here makes sense because it is the sum of squared residuals, but the denominator acts as if we know the average value of y in the population is zero. One reason this version of the R-squared is used is that, if we use the usual total sum of squares, that is, we compute R-squared as

$$1 - \frac{\sum_{i=1}^{n} (y_i - \tilde{\beta}_1 x_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}, \qquad (2.68)$$

it can actually be negative. If expression (2.68) is negative, then it means that using the sample average ȳ to predict $y_i$ provides a better fit than using $x_i$ in a regression through the origin. Therefore, (2.68) is actually more attractive than equation (2.67), because (2.68) tells us whether using x is better than ignoring x altogether.
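Equation (2.66) is one line of code. The comparison below, again using the simulated data from the earlier sketches, shows that forcing the line through the origin generally changes the slope and also computes the two goodness-of-fit measures:

    # Regression through the origin, eq. (2.66), versus OLS with an intercept,
    # plus the two goodness-of-fit measures (2.67) and (2.68).
    b1_tilde = np.sum(x * y) / np.sum(x ** 2)
    b0_hat, b1_hat = ols(x, y)
    print(b1_tilde, b1_hat)           # differ unless the sample mean of x is zero

    resid0 = y - b1_tilde * x
    print(1 - np.sum(resid0 ** 2) / np.sum(y ** 2))               # eq. (2.67)
    print(1 - np.sum(resid0 ** 2) / np.sum((y - y.mean()) ** 2))  # eq. (2.68), can be < 0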
This discussion about regression through the origin, and different ways to measure goodness-of-fit, prompts another question: what happens if we only regress on a constant? That is, we set the slope to zero (which means we need not even have an x) and estimate an intercept only. The answer is simple: the intercept is ȳ. This fact is usually shown in basic statistics, where it is shown that the constant that produces the smallest sum of squared deviations is always the sample average. In this light, (2.68) can be seen as comparing regression on x through the origin with regression only on a constant.

Summary

We have introduced the simple linear regression model in this chapter, and we have covered its basic properties. Given a random sample, the method of ordinary least squares is used to estimate the slope and intercept parameters in the population model. We have demonstrated the algebra of the OLS regression line, including computation of fitted values and residuals, and the obtaining of predicted changes in the dependent variable for a given change in the independent variable. In Section 2.4, we discussed two issues of practical importance: (1) the behavior of the OLS estimates when we change the units of measurement of the dependent variable or the independent variable and (2) the use of the natural log to allow for constant elasticity and constant semi-elasticity models.

In Section 2.5, we showed that, under the four Assumptions SLR.1 through SLR.4, the OLS estimators are unbiased. The key assumption is that the error term u has zero mean given any value of the independent variable x. Unfortunately, there are reasons to think this is false in many social science applications of simple regression, where the omitted factors in u are often correlated with x. When we add the assumption that the variance of the error given x is constant, we get simple formulas for the sampling variances of the OLS
estimators. As we saw, the variance of the slope estimator $\hat{\beta}_1$ increases as the error variance increases, and it decreases when there is more sample variation in the independent variable. We also derived an unbiased estimator for σ² = Var(u).

In Section 2.6, we briefly discussed regression through the origin, where the slope estimator is obtained under the assumption that the intercept is zero. Sometimes, this is useful, but it appears infrequently in applied work.

Much work is left to be done. For example, we still do not know how to test hypotheses about the population parameters β₀ and β₁. Thus, although we know that OLS is unbiased for the population parameters under Assumptions SLR.1 through SLR.4, we have no way of drawing inferences about the population. Other topics, such as the efficiency of OLS relative to other possible procedures, have also been omitted. The issues of confidence intervals, hypothesis testing, and efficiency are central to multiple regression analysis as well. Since the way we construct confidence intervals and test statistics is very similar for multiple regression, and because simple regression is a special case of multiple regression, our time is better spent moving on to multiple regression, which is much more widely applicable than simple regression. Our purpose in Chapter 2 was to get you thinking about the issues that arise in econometric analysis in a fairly simple setting.

The Gauss-Markov Assumptions for Simple Regression

For convenience, we summarize the Gauss-Markov assumptions that we used in this chapter. It is important to remember that only SLR.1 through SLR.4 are needed to show $\hat{\beta}_0$ and $\hat{\beta}_1$ are unbiased. We added the homoskedasticity assumption, SLR.5, to obtain the usual OLS variance formulas (2.57) and (2.58).

Assumption SLR.1 (Linear in Parameters): In the population model, the dependent variable, y, is related to the independent variable, x, and the error (or disturbance), u, as y = β₀ + β₁x + u, where β₀ and β₁ are the population intercept and slope parameters, respectively.

Assumption SLR.2 (Random Sampling): We have a random sample of size n, {(xᵢ, yᵢ): i = 1, 2, …, n}, following the population model in Assumption SLR.1.

Assumption SLR.3 (Sample Variation in the Explanatory Variable): The sample outcomes on x, namely, {xᵢ: i = 1, …, n}, are not all the same value.

Assumption SLR.4 (Zero Conditional Mean): The error u has an expected value of zero given any value of the explanatory variable. In other words, E(u|x) = 0.

Assumption SLR.5 (Homoskedasticity): The error u has the same variance given any value of the explanatory variable. In other words, Var(u|x) = σ².

Key Terms

Coefficient of Determination; Constant Elasticity Model; Control Variable; Covariate; Degrees of Freedom; Dependent Variable; Elasticity; Error Term (Disturbance); Error Variance; Explained Sum of Squares (SSE); Explained Variable; Explanatory Variable; First Order Conditions; Fitted Value; Gauss-Markov Assumptions; Heteroskedasticity; Homoskedasticity; Independent Variable; Intercept Parameter; Mean Independent; OLS Regression Line; Ordinary Least Squares (OLS); Population Regression Function (PRF); Predicted Variable; Predictor Variable; Regressand; Regression through the Origin; Regressor; Residual; Residual Sum of Squares (SSR); Response Variable; R-squared; Sample Regression Function (SRF); Semi-elasticity; Simple Linear Regression Model; Slope Parameter; Standard Error of β̂₁; Standard Error of the Regression (SER); Sum of Squared Residuals (SSR); Total Sum of Squares (SST); Zero Conditional Mean Assumption

Problems

1. Let kids denote the number of children ever born to a woman, and let educ denote years of education for the woman. A simple model relating fertility to years of education is

kids = β₀ + β₁educ + u,

where u is the unobserved error.
(i) What kinds of factors are contained in u? Are these likely to be correlated with level of education?
(ii) Will a simple regression analysis uncover the ceteris paribus effect of education on fertility? Explain.

2. In the simple linear regression model y = β₀ + β₁x + u, suppose that E(u) ≠ 0. Letting α₀ = E(u), show that the model can always be rewritten with the same slope, but a new intercept and error, where the new error has a zero expected value.

3. The following table contains the ACT scores and the GPA (grade point average) for eight college students. Grade point average is based on a four-point scale and has been rounded to one digit after the decimal.

Student   GPA   ACT
1         2.8   21
2         3.4   24
3         3.0   26
4         3.5   27
5         3.6   29
6         3.0   25
7         2.7   25
8         3.7   30
(i) Estimate the relationship between GPA and ACT using OLS; that is, obtain the intercept and slope estimates in the equation
$$\widehat{GPA} = \hat{\beta}_0 + \hat{\beta}_1 ACT.$$
Comment on the direction of the relationship. Does the intercept have a useful interpretation here? Explain. How much higher is the GPA predicted to be if the ACT score is increased by five points?
(ii) Compute the fitted values and residuals for each observation, and verify that the residuals (approximately) sum to zero.
(iii) What is the predicted value of GPA when ACT = 20?
(iv) How much of the variation in GPA for these eight students is explained by ACT? Explain.
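For readers who want to check their hand computations in Problem 3, the following sketch (ours; it assumes NumPy) applies the usual OLS formulas, slope equals sample covariance over sample variance and intercept equals $\bar{y} - \hat{\beta}_1\bar{x}$, to the eight observations in the table.

```python
# The eight (GPA, ACT) pairs from the table in Problem 3.
import numpy as np

gpa = np.array([2.8, 3.4, 3.0, 3.5, 3.6, 3.0, 2.7, 3.7])
act = np.array([21.0, 24.0, 26.0, 27.0, 29.0, 25.0, 25.0, 30.0])

dx = act - act.mean()
slope = (dx * (gpa - gpa.mean())).sum() / (dx ** 2).sum()
intercept = gpa.mean() - slope * act.mean()
resid = gpa - (intercept + slope * act)

print(intercept, slope)   # estimates for part (i)
print(resid.sum())        # part (ii): approximately zero
```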
4. The data set BWGHT contains data on births to women in the United States. Two variables of interest are the dependent variable, infant birth weight in ounces (bwght), and an explanatory variable, average number of cigarettes the mother smoked per day during pregnancy (cigs). The following simple regression was estimated using data on n = 1,388 births:
$$\widehat{bwght} = 119.77 - 0.514\, cigs.$$
(i) What is the predicted birth weight when cigs = 0? What about when cigs = 20 (one pack per day)? Comment on the difference.
(ii) Does this simple regression necessarily capture a causal relationship between the child's birth weight and the mother's smoking habits? Explain.
(iii) To predict a birth weight of 125 ounces, what would cigs have to be? Comment.
(iv) The proportion of women in the sample who do not smoke while pregnant is about .85. Does this help reconcile your finding from part (iii)?

5. In the linear consumption function
$$\widehat{cons} = \hat{\beta}_0 + \hat{\beta}_1 inc,$$
the (estimated) marginal propensity to consume (MPC) out of income is simply the slope, $\hat{\beta}_1$, while the average propensity to consume (APC) is $\widehat{cons}/inc = \hat{\beta}_0/inc + \hat{\beta}_1$. Using observations for 100 families on annual income and consumption (both measured in dollars), the following equation is obtained:
$$\widehat{cons} = -124.84 + 0.853\, inc, \quad n = 100, \quad R^2 = 0.692.$$
(i) Interpret the intercept in this equation, and comment on its sign and magnitude.
(ii) What is the predicted consumption when family income is $30,000?
(iii) With inc on the x-axis, draw a graph of the estimated MPC and APC.

6. Using data from 1988 for houses sold in Andover, Massachusetts, from Kiel and McClain (1995), the following equation relates housing price (price) to the distance from a recently built garbage incinerator (dist):
$$\widehat{\log(price)} = 9.40 + 0.312 \log(dist), \quad n = 135, \quad R^2 = 0.162.$$
(i) Interpret the coefficient on log(dist). Is the sign of this estimate what you expect it to be?
(ii) Do you think simple regression provides an unbiased estimator of the ceteris paribus elasticity of price with respect to dist? (Think about the city's decision on where to put the incinerator.)
(iii) What other factors about a house affect its price? Might these be correlated with distance from the incinerator?

7. Consider the savings function
$$sav = \beta_0 + \beta_1 inc + u, \quad u = \sqrt{inc}\cdot e,$$
where $e$ is a random variable with $E(e) = 0$ and $\text{Var}(e) = \sigma_e^2$. Assume that $e$ is independent of inc.
(i) Show that $E(u|inc) = 0$, so that the key zero conditional mean assumption (Assumption SLR.4) is satisfied. [Hint: If $e$ is independent of inc, then $E(e|inc) = E(e)$.]
(ii) Show that $\text{Var}(u|inc) = \sigma_e^2\, inc$, so that the homoskedasticity Assumption SLR.5 is violated. In particular, the variance of sav increases with inc. [Hint: $\text{Var}(e|inc) = \text{Var}(e)$ if $e$ and inc are independent.]
(iii) Provide a discussion that supports the assumption that the variance of savings increases with family income.

8. Consider the standard simple regression model $y = \beta_0 + \beta_1 x + u$ under the Gauss-Markov Assumptions SLR.1 through SLR.5. The usual OLS estimators $\hat{\beta}_0$ and $\hat{\beta}_1$ are unbiased for their respective population parameters. Let $\tilde{\beta}_1$ be the estimator of $\beta_1$ obtained by assuming the intercept is zero (see Section 2.6).
(i) Find $E(\tilde{\beta}_1)$ in terms of the $x_i$, $\beta_0$, and $\beta_1$. Verify that $\tilde{\beta}_1$ is unbiased for $\beta_1$ when the population intercept ($\beta_0$) is zero. Are there other cases where $\tilde{\beta}_1$ is unbiased?
(ii) Find the variance of $\tilde{\beta}_1$. (Hint: The variance does not depend on $\beta_0$.)
(iii) Show that $\text{Var}(\tilde{\beta}_1) \leq \text{Var}(\hat{\beta}_1)$. [Hint: For any sample of data, $\sum_{i=1}^n x_i^2 \geq \sum_{i=1}^n (x_i - \bar{x})^2$, with strict inequality unless $\bar{x} = 0$.]
(iv) Comment on the tradeoff between bias and variance when choosing between $\hat{\beta}_1$ and $\tilde{\beta}_1$.

9. (i) Let $\hat{\beta}_0$ and $\hat{\beta}_1$ be the intercept and slope from the regression of $y_i$ on $x_i$, using $n$ observations. Let $c_1$ and $c_2$, with $c_2 \neq 0$, be constants. Let $\tilde{\beta}_0$ and $\tilde{\beta}_1$ be the intercept and slope from the regression of $c_1 y_i$ on $c_2 x_i$. Show that $\tilde{\beta}_1 = (c_1/c_2)\hat{\beta}_1$ and $\tilde{\beta}_0 = c_1\hat{\beta}_0$, thereby verifying the claims on units of measurement in Section 2.4. [Hint: To obtain $\tilde{\beta}_1$, plug the scaled versions of $x$ and $y$ into (2.19). Then, use (2.17) for $\tilde{\beta}_0$, being sure to plug in the scaled $x$ and $y$ and the correct slope.]
(ii) Now, let $\tilde{\beta}_0$ and $\tilde{\beta}_1$ be from the regression of $(c_1 + y_i)$ on $(c_2 + x_i)$ (with no restriction on $c_1$ or $c_2$). Show that $\tilde{\beta}_1 = \hat{\beta}_1$ and $\tilde{\beta}_0 = \hat{\beta}_0 + c_1 - c_2\hat{\beta}_1$.
(iii) Now, let $\hat{\beta}_0$ and $\hat{\beta}_1$ be the OLS estimates from the regression of $\log(y_i)$ on $x_i$, where we must assume $y_i > 0$ for all $i$. For $c_1 > 0$, let $\tilde{\beta}_0$ and $\tilde{\beta}_1$ be the intercept and slope from the regression of $\log(c_1 y_i)$ on $x_i$. Show that $\tilde{\beta}_1 = \hat{\beta}_1$ and $\tilde{\beta}_0 = \log(c_1) + \hat{\beta}_0$.
(iv) Now, assuming that $x_i > 0$ for all $i$, let $\tilde{\beta}_0$ and $\tilde{\beta}_1$ be the intercept and slope from the regression of $y_i$ on $\log(c_2 x_i)$. How do $\tilde{\beta}_0$ and $\tilde{\beta}_1$ compare with the intercept and slope from the regression of $y_i$ on $\log(x_i)$?

10. Let $\hat{\beta}_0$ and $\hat{\beta}_1$ be the OLS intercept and slope estimators, respectively, and let $\bar{u}$ be the sample average of the errors (not the residuals!).
(i) Show that $\hat{\beta}_1$ can be written as $\hat{\beta}_1 = \beta_1 + \sum_{i=1}^n w_i u_i$, where $w_i = d_i/\text{SST}_x$ and $d_i = x_i - \bar{x}$.
(ii) Use part (i), along with $\sum_{i=1}^n w_i = 0$, to show that $\hat{\beta}_1$ and $\bar{u}$ are uncorrelated. [Hint: You are being asked to show that $E[(\hat{\beta}_1 - \beta_1)\bar{u}] = 0$.]
(iii) Show that $\hat{\beta}_0$ can be written as $\hat{\beta}_0 = \beta_0 + \bar{u} - (\hat{\beta}_1 - \beta_1)\bar{x}$.
(iv) Use parts (ii) and (iii) to show that $\text{Var}(\hat{\beta}_0) = \sigma^2/n + \sigma^2\bar{x}^2/\text{SST}_x$.
(v) Do the algebra to simplify the expression in part (iv) to equation (2.58). [Hint: $\text{SST}_x/n = n^{-1}\sum_{i=1}^n x_i^2 - \bar{x}^2$.]

11. Suppose you are interested in estimating the effect of hours spent in an SAT preparation course (hours) on total SAT score (sat). The population is all college-bound high school seniors for a particular year.
(i) Suppose you are given a grant to run a controlled experiment. Explain how you would structure the experiment in order to estimate the causal effect of hours on sat.
(ii) Consider the more realistic case where students choose how much time to spend in a preparation course, and you can only randomly sample sat and hours from the population. Write the population model as
$$sat = \beta_0 + \beta_1 hours + u,$$
where (as usual in a model with an intercept) we can assume $E(u) = 0$. List at least two factors contained in $u$. Are these likely to have positive or negative correlation with hours?
(iii) In the equation from part (ii), what should be the sign of $\beta_1$ if the preparation course is effective?
(iv) In the equation from part (ii), what is the interpretation of $\beta_0$?

12. Consider the problem described at the end of Section 2.6: running a regression and only estimating an intercept.
(i) Given a sample $\{y_i: i = 1, 2, \ldots, n\}$, let $\hat{\beta}_0$ be the solution to
$$\min_{b_0} \sum_{i=1}^n (y_i - b_0)^2.$$
Show that $\hat{\beta}_0 = \bar{y}$, that is, the sample average minimizes the sum of squared residuals. (Hint: You may use one-variable calculus, or you can show the result directly by adding and subtracting $\bar{y}$ inside the squared residual and then doing a little algebra.)
(ii) Define residuals $\hat{u}_i = y_i - \bar{y}$. Argue that these residuals always sum to zero.

Computer Exercises

C1 The data in 401K are a subset of data analyzed by Papke (1995) to study the relationship between participation in a 401(k) pension plan and the generosity of the plan. The variable prate is the percentage of eligible workers with an active account; this is the variable we would like to explain. The measure of generosity is the plan match rate, mrate. This variable gives the average amount the firm contributes to each worker's plan for each $1 contribution by the worker. For example, if mrate = 0.50, then a $1 contribution by the worker is matched by a 50¢ contribution by the firm.
(i) Find the average participation rate and the average match rate in the sample of plans.
(ii) Now, estimate the simple regression equation
$$\widehat{prate} = \hat{\beta}_0 + \hat{\beta}_1\, mrate,$$
and report the results along with the sample size and R-squared.
(iii) Interpret the intercept in your equation. Interpret the coefficient on mrate.
(iv) Find the predicted prate when mrate = 3.5. Is this a reasonable prediction? Explain what is happening here.
(v) How much of the variation in prate is explained by mrate? Is this a lot in your opinion?
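The text leaves the choice of software open. As one possibility, the sketch below shows how C1, parts (i) and (ii), might be done in Python with pandas and statsmodels; the file name "401K.csv" is a hypothetical export of the 401K data set with columns prate and mrate.

```python
# A sketch under the assumption that the 401K data are available as 401K.csv.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("401K.csv")              # hypothetical file name
print(df[["prate", "mrate"]].mean())      # part (i): sample averages

res = smf.ols("prate ~ mrate", data=df).fit()
print(res.params)                         # part (ii): intercept and slope
print(int(res.nobs), res.rsquared)        # sample size and R-squared
```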
C2 The data set in CEOSAL2 contains information on chief executive officers for U.S. corporations. The variable salary is annual compensation, in thousands of dollars, and ceoten is prior number of years as company CEO.
(i) Find the average salary and the average tenure in the sample.
(ii) How many CEOs are in their first year as CEO (that is, ceoten = 0)? What is the longest tenure as a CEO?
(iii) Estimate the simple regression model
$$\log(salary) = \beta_0 + \beta_1 ceoten + u,$$
and report your results in the usual form. What is the (approximate) predicted percentage increase in salary given one more year as a CEO?

C3 Use the data in SLEEP75 from Biddle and Hamermesh (1990) to study whether there is a tradeoff between the time spent sleeping per week and the time spent in paid work. We could use either variable as the dependent variable. For concreteness, estimate the model
$$sleep = \beta_0 + \beta_1 totwrk + u,$$
where sleep is minutes spent sleeping at night per week and totwrk is total minutes worked during the week.
(i) Report your results in equation form along with the number of observations and $R^2$. What does the intercept in this equation mean?
(ii) If totwrk increases by 2 hours, by how much is sleep estimated to fall? Do you find this to be a large effect?

C4 Use the data in WAGE2 to estimate a simple regression explaining monthly salary (wage) in terms of IQ score (IQ).
(i) Find the average salary and average IQ in the sample. What is the sample standard deviation of IQ? (IQ scores are standardized so that the average in the population is 100 with a standard deviation equal to 15.)
(ii) Estimate a simple regression model where a one-point increase in IQ changes wage by a constant dollar amount. Use this model to find the predicted increase in wage for an increase in IQ of 15 points. Does IQ explain most of the variation in wage?
(iii) Now, estimate a model where each one-point increase in IQ has the same percentage effect on wage. If IQ increases by 15 points, what is the approximate percentage increase in predicted wage?

C5 For the population of firms in the chemical industry, let rd denote annual expenditures on research and development, and let sales denote annual sales (both are in millions of dollars).
(i) Write down a model (not an estimated equation) that implies a constant elasticity between rd and sales. Which parameter is the elasticity?
(ii) Now, estimate the model using the data in RDCHEM. Write out the estimated equation in the usual form. What is the estimated elasticity of rd with respect to sales? Explain in words what this elasticity means.

C6 We used the data in MEAP93 for Example 2.12. Now we want to explore the relationship between the math pass rate (math10) and spending per student (expend).
(i) Do you think each additional dollar spent has the same effect on the pass rate, or does a diminishing effect seem more appropriate? Explain.
(ii) In the population model
$$math10 = \beta_0 + \beta_1 \log(expend) + u,$$
argue that $\beta_1/10$ is the percentage point change in math10 given a 10% increase in expend.
(iii) Use the data in MEAP93 to estimate the model from part (ii). Report the estimated equation in the usual way, including the sample size and R-squared.
(iv) How big is the estimated spending effect? Namely, if spending increases by 10%, what is the estimated percentage point increase in math10?
(v) One might worry that regression analysis can produce fitted values for math10 that are greater than 100. Why is this not much of a worry in this data set?

C7 Use the data in CHARITY (obtained from Franses and Paap, 2001) to answer the following questions:
(i) What is the average gift in the sample of 4,268 people (in Dutch guilders)? What percentage of people gave no gift?
(ii) What is the average mailings per year? What are the minimum and maximum values?
(iii) Estimate the model
$$gift = \beta_0 + \beta_1 mailsyear + u$$
by OLS and report the results in the usual way, including the sample size and R-squared.
(iv) Interpret the slope coefficient. If each mailing costs one guilder, is the charity expected to make a net gain on each mailing? Does this mean the charity makes a net gain on every mailing? Explain.
(v) What is the smallest predicted charitable contribution in the sample? Using this simple regression analysis, can you ever predict zero for gift?
C8 To complete this exercise you need a software package that allows you to generate data from the uniform and normal distributions.
(i) Start by generating 500 observations on $x_i$, the explanatory variable, from the uniform distribution with range [0,10]. (Most statistical packages have a command for the Uniform(0,1) distribution; just multiply those observations by 10.) What are the sample mean and sample standard deviation of the $x_i$?
(ii) Randomly generate 500 errors, $u_i$, from the Normal(0,36) distribution. (If you generate a Normal(0,1), as is commonly available, simply multiply the outcomes by six.) Is the sample average of the $u_i$ exactly zero? Why or why not? What is the sample standard deviation of the $u_i$?
(iii) Now generate the $y_i$ as
$$y_i = 1 + 2x_i + u_i = \beta_0 + \beta_1 x_i + u_i;$$
that is, the population intercept is one and the population slope is two. Use the data to run the regression of $y_i$ on $x_i$. What are your estimates of the intercept and slope? Are they equal to the population values in the above equation? Explain.
(iv) Obtain the OLS residuals, $\hat{u}_i$, and verify that equation (2.60) holds (subject to rounding error).
(v) Compute the same quantities in equation (2.60) but use the errors, $u_i$, in place of the residuals. Now what do you conclude?
(vi) Repeat parts (i), (ii), and (iii) with a new sample of data, starting with generating the $x_i$. Now what do you obtain for $\hat{\beta}_0$ and $\hat{\beta}_1$? Why are these different from what you obtained in part (iii)?
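A sketch of how the simulation in C8 might be carried out in Python with NumPy follows; the seed and library choice are ours, and any package with uniform and normal generators works just as well. It also previews parts (iv) and (v): the OLS residuals satisfy the residual identities exactly, while the errors do so only approximately.

```python
import numpy as np

rng = np.random.default_rng(123)
n = 500
x = rng.uniform(0, 10, size=n)          # part (i): Uniform[0, 10]
u = rng.normal(0, 6, size=n)            # part (ii): Normal(0, 36), sd = 6
y = 1 + 2 * x + u                       # part (iii): beta0 = 1, beta1 = 2

dx = x - x.mean()
b1 = (dx * (y - y.mean())).sum() / (dx ** 2).sum()
b0 = y.mean() - b1 * x.mean()
print(b0, b1)                           # close to, but not exactly, 1 and 2

uhat = y - (b0 + b1 * x)                # part (iv): residual identities hold
print(uhat.sum(), (x * uhat).sum())     # both are (numerically) zero
print(u.sum(), (x * u).sum())           # part (v): the errors need not sum to zero
```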
C9 Use the data in COUNTYMURDERS to answer these questions. Use only the data for 1996.
(i) How many counties had zero murders in 1996? How many counties had at least one execution? What is the largest number of executions?
(ii) Estimate the equation
$$murders = \beta_0 + \beta_1 execs + u$$
by OLS and report the results in the usual way, including sample size and R-squared.
(iii) Interpret the slope coefficient reported in part (ii). Does the estimated equation suggest a deterrent effect of capital punishment?
(iv) What is the smallest number of murders that can be predicted by the equation? What is the residual for a county with zero executions and zero murders?
(v) Explain why a simple regression analysis is not well suited for determining whether capital punishment has a deterrent effect on murders.

C10 The data set in CATHOLIC includes test score information on over 7,000 students in the United States who were in eighth grade in 1988. The variables math12 and read12 are scores on twelfth grade standardized math and reading tests, respectively.
(i) How many students are in the sample? Find the means and standard deviations of math12 and read12.
(ii) Run the simple regression of math12 on read12 to obtain the OLS intercept and slope estimates. Report the results in the form
$$\widehat{math12} = \hat{\beta}_0 + \hat{\beta}_1\, read12, \quad n = ?, \quad R^2 = ?,$$
where you fill in the values for $\hat{\beta}_0$ and $\hat{\beta}_1$ and also replace the question marks.
(iii) Does the intercept reported in part (ii) have a meaningful interpretation? Explain.
(iv) Are you surprised by the $\hat{\beta}_1$ that you found? What about $R^2$?
(v) Suppose that you present your findings to a superintendent of a school district, and the superintendent says, "Your findings show that to improve math scores we just need to improve reading scores, so we should hire more reading tutors." How would you respond to this comment? (Hint: If you instead run the regression of read12 on math12, what would you expect to find?)

Appendix 2A

Minimizing the Sum of Squared Residuals

We show that the OLS estimates $\hat{\beta}_0$ and $\hat{\beta}_1$ do minimize the sum of squared residuals, as asserted in Section 2.2. Formally, the problem is to characterize the solutions $\hat{\beta}_0$ and $\hat{\beta}_1$ to the minimization problem
$$\min_{b_0, b_1} \sum_{i=1}^n (y_i - b_0 - b_1 x_i)^2,$$
where $b_0$ and $b_1$ are the dummy arguments for the optimization problem; for simplicity, call this function $Q(b_0, b_1)$. By a fundamental result from multivariable calculus (see Appendix A), a necessary condition for $\hat{\beta}_0$ and $\hat{\beta}_1$ to solve the minimization problem is that the partial derivatives of $Q(b_0, b_1)$ with respect to $b_0$ and $b_1$ must be zero when evaluated at $\hat{\beta}_0$, $\hat{\beta}_1$: $\partial Q(\hat{\beta}_0, \hat{\beta}_1)/\partial b_0 = 0$ and $\partial Q(\hat{\beta}_0, \hat{\beta}_1)/\partial b_1 = 0$. Using the chain rule from calculus, these two equations become
$$-2\sum_{i=1}^n (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i) = 0$$
$$-2\sum_{i=1}^n x_i(y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i) = 0.$$
These two equations are just (2.14) and (2.15) multiplied by $-2n$ and, therefore, are solved by the same $\hat{\beta}_0$ and $\hat{\beta}_1$.

How do we know that we have actually minimized the sum of squared residuals? The first order conditions are necessary but not sufficient conditions. One way to verify that we have minimized the sum of squared residuals is to write, for any $b_0$ and $b_1$,
$$Q(b_0, b_1) = \sum_{i=1}^n [y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i + (\hat{\beta}_0 - b_0) + (\hat{\beta}_1 - b_1)x_i]^2$$
$$= \sum_{i=1}^n [\hat{u}_i + (\hat{\beta}_0 - b_0) + (\hat{\beta}_1 - b_1)x_i]^2$$
$$= \sum_{i=1}^n \hat{u}_i^2 + n(\hat{\beta}_0 - b_0)^2 + (\hat{\beta}_1 - b_1)^2\sum_{i=1}^n x_i^2 + 2(\hat{\beta}_0 - b_0)(\hat{\beta}_1 - b_1)\sum_{i=1}^n x_i,$$
where we have used equations (2.30) and (2.31). The first term does not depend on $b_0$ or $b_1$, while the sum of the last three terms can be written as
$$\sum_{i=1}^n [(\hat{\beta}_0 - b_0) + (\hat{\beta}_1 - b_1)x_i]^2,$$
as can be verified by straightforward algebra. Because this is a sum of squared terms, the smallest it can be is zero. Therefore, it is smallest when $b_0 = \hat{\beta}_0$ and $b_1 = \hat{\beta}_1$.

Chapter 3

Multiple Regression Analysis: Estimation

In Chapter 2, we learned how to use simple regression analysis to explain a dependent variable, $y$, as a function of a single independent variable, $x$. The primary drawback in using simple regression analysis for empirical work is that it is very difficult to draw ceteris paribus conclusions about how $x$ affects $y$: the key assumption, SLR.4 (that all other factors affecting $y$ are uncorrelated with $x$) is often unrealistic.
Multiple regression analysis is more amenable to ceteris paribus analysis because it allows us to explicitly control for many other factors that simultaneously affect the dependent variable. This is important both for testing economic theories and for evaluating policy effects when we must rely on nonexperimental data. Because multiple regression models can accommodate many explanatory variables that may be correlated, we can hope to infer causality in cases where simple regression analysis would be misleading.

Naturally, if we add more factors to our model that are useful for explaining $y$, then more of the variation in $y$ can be explained. Thus, multiple regression analysis can be used to build better models for predicting the dependent variable.

An additional advantage of multiple regression analysis is that it can incorporate fairly general functional form relationships. In the simple regression model, only one function of a single explanatory variable can appear in the equation. As we will see, the multiple regression model allows for much more flexibility.

Section 3.1 formally introduces the multiple regression model and further discusses the advantages of multiple regression over simple regression. In Section 3.2, we demonstrate how to estimate the parameters in the multiple regression model using the method of ordinary least squares. In Sections 3.3, 3.4, and 3.5, we describe various statistical properties of the OLS estimators, including unbiasedness and efficiency.

The multiple regression model is still the most widely used vehicle for empirical analysis in economics and other social sciences. Likewise, the method of ordinary least squares is popularly used for estimating the parameters of the multiple regression model.

3.1 Motivation for Multiple Regression

3.1a The Model with Two Independent Variables

We begin with some simple examples to show how multiple regression analysis can be used to solve problems that cannot be solved by simple regression.

The first example is a simple variation of the wage equation introduced in Chapter 2 for obtaining the effect of education on hourly wage:
$$wage = \beta_0 + \beta_1 educ + \beta_2 exper + u, \qquad (3.1)$$
where exper is years of labor market experience. Thus, wage is determined by the two explanatory or independent variables, education and experience, and by other unobserved factors, which are contained in $u$. We are still primarily interested in the effect of educ on wage, holding fixed all other factors affecting wage; that is, we are interested in the parameter $\beta_1$.

Compared with a simple regression analysis relating wage to educ, equation (3.1) effectively takes exper out of the error term and puts it explicitly in the equation. Because exper appears in the equation, its coefficient, $\beta_2$, measures the ceteris paribus effect of exper on wage, which is also of some interest.

Not surprisingly, just as with simple regression, we will have to make assumptions about how $u$ in (3.1) is related to the independent variables, educ and exper. However, as we will see in Section 3.2, there is one thing of which we can be confident: because (3.1) contains experience explicitly, we will be able to measure the effect of education on wage, holding experience fixed. In a simple regression analysis (which puts exper in the error term) we would have to assume that experience is uncorrelated with education, a tenuous assumption.
As a second example, consider the problem of explaining the effect of per-student spending (expend) on the average standardized test score (avgscore) at the high school level. Suppose that the average test score depends on funding, average family income (avginc), and other unobserved factors:
$$avgscore = \beta_0 + \beta_1 expend + \beta_2 avginc + u. \qquad (3.2)$$
The coefficient of interest for policy purposes is $\beta_1$, the ceteris paribus effect of expend on avgscore. By including avginc explicitly in the model, we are able to control for its effect on avgscore. This is likely to be important because average family income tends to be correlated with per-student spending: spending levels are often determined by both property and local income taxes. In simple regression analysis, avginc would be included in the error term, which would likely be correlated with expend, causing the OLS estimator of $\beta_1$ in the two-variable model to be biased.

In the two previous similar examples, we have shown how observable factors other than the variable of primary interest [educ in equation (3.1) and expend in equation (3.2)] can be included in a regression model. Generally, we can write a model with two independent variables as
$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + u, \qquad (3.3)$$
where $\beta_0$ is the intercept, $\beta_1$ measures the change in $y$ with respect to $x_1$, holding other factors fixed, and $\beta_2$ measures the change in $y$ with respect to $x_2$, holding other factors fixed.

Multiple regression analysis is also useful for generalizing functional relationships between variables. As an example, suppose family consumption (cons) is a quadratic function of family income (inc):
$$cons = \beta_0 + \beta_1 inc + \beta_2 inc^2 + u, \qquad (3.4)$$
where $u$ contains other factors affecting consumption. In this model, consumption depends on only one observed factor, income; so it might seem that it can be handled in a simple regression framework. But the model falls outside simple regression because it contains two functions of income, $inc$ and $inc^2$, and therefore three parameters, $\beta_0$, $\beta_1$, and $\beta_2$. Nevertheless, the consumption function is easily written as a regression model with two independent variables by letting $x_1 = inc$ and $x_2 = inc^2$.

Mechanically, there will be no difference in using the method of ordinary least squares (introduced in Section 3.2) to estimate equations as different as (3.1) and (3.4). Each equation can be written as (3.3), which is all that matters for computation. There is, however, an important difference in how one interprets the parameters. In equation (3.1), $\beta_1$ is the ceteris paribus effect of educ on wage. The parameter $\beta_1$ has no such interpretation in (3.4). In other words, it makes no sense to measure the effect of inc on cons while holding $inc^2$ fixed, because if inc changes, then so must $inc^2$! Instead, the change in consumption with respect to the change in income (the marginal propensity to consume) is approximated by
$$\frac{\Delta cons}{\Delta inc} \approx \beta_1 + 2\beta_2\, inc.$$
See Appendix A for the calculus needed to derive this equation. In other words, the marginal effect of income on consumption depends on $\beta_2$ as well as on $\beta_1$ and the level of income.
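To make the point concrete, the following sketch evaluates $\beta_1 + 2\beta_2\, inc$ at a few income levels; the coefficient values are made up for illustration and are not estimates from the text.

```python
# Hypothetical coefficients for the quadratic consumption function.
beta1, beta2 = 0.80, -0.002

for inc in (10.0, 50.0, 100.0):          # income levels (arbitrary units)
    mpc = beta1 + 2 * beta2 * inc        # marginal effect of inc on cons
    print(f"inc = {inc:5.1f}: MPC = {mpc:.3f}")
```

The printed values fall as income rises (because the hypothetical $\beta_2$ is negative), illustrating that a single quadratic model implies a different marginal effect at every income level.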
This example shows that, in any particular application, the definitions of the independent variables are crucial. But for the theoretical development of multiple regression, we can be vague about such details. We will study examples like this more completely in Chapter 6.

In the model with two independent variables, the key assumption about how $u$ is related to $x_1$ and $x_2$ is
$$E(u|x_1, x_2) = 0. \qquad (3.5)$$
The interpretation of condition (3.5) is similar to the interpretation of Assumption SLR.4 for simple regression analysis. It means that, for any values of $x_1$ and $x_2$ in the population, the average of the unobserved factors is equal to zero. As with simple regression, the important part of the assumption is that the expected value of $u$ is the same for all combinations of $x_1$ and $x_2$; that this common value is zero is no assumption at all as long as the intercept $\beta_0$ is included in the model (see Section 2.1).

How can we interpret the zero conditional mean assumption in the previous examples? In equation (3.1), the assumption is $E(u|educ, exper) = 0$. This implies that other factors affecting wage are not related on average to educ and exper. Therefore, if we think innate ability is part of $u$, then we will need average ability levels to be the same across all combinations of education and experience in the working population. This may or may not be true, but, as we will see in Section 3.3, this is the question we need to ask in order to determine whether the method of ordinary least squares produces unbiased estimators.

Exploring Further 3.1: A simple model to explain city murder rates (murdrate) in terms of the probability of conviction (prbconv) and average sentence length (avgsen) is
$$murdrate = \beta_0 + \beta_1 prbconv + \beta_2 avgsen + u.$$
What are some factors contained in $u$? Do you think the key assumption (3.5) is likely to hold?

The example measuring student performance [equation (3.2)] is similar to the wage equation. The zero conditional mean assumption is $E(u|expend, avginc) = 0$, which means that other factors affecting test scores (school or student characteristics) are on average unrelated to per-student funding and average family income.

When applied to the quadratic consumption function in (3.4), the zero conditional mean assumption has a slightly different interpretation. Written literally, equation (3.5) becomes $E(u|inc, inc^2) = 0$. Since $inc^2$ is known when inc is known, including $inc^2$ in the expectation is redundant: $E(u|inc, inc^2) = 0$ is the same as $E(u|inc) = 0$. Nothing is wrong with putting $inc^2$ along with inc in the expectation when stating the assumption, but $E(u|inc) = 0$ is more concise.

3.1b The Model with k Independent Variables

Once we are in the context of multiple regression, there is no need to stop with two independent variables. Multiple regression analysis allows many observed factors to affect $y$.
In the wage example, we might also include amount of job training, years of tenure with the current employer, measures of ability, and even demographic variables like the number of siblings or mother's education. In the school funding example, additional variables might include measures of teacher quality and school size.

The general multiple linear regression (MLR) model (also called the multiple regression model) can be written in the population as
$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \cdots + \beta_k x_k + u, \qquad (3.6)$$
where $\beta_0$ is the intercept, $\beta_1$ is the parameter associated with $x_1$, $\beta_2$ is the parameter associated with $x_2$, and so on. Since there are $k$ independent variables and an intercept, equation (3.6) contains $k + 1$ (unknown) population parameters. For shorthand purposes, we will sometimes refer to the parameters other than the intercept as slope parameters, even though this is not always literally what they are. [See equation (3.4), where neither $\beta_1$ nor $\beta_2$ is itself a slope, but together they determine the slope of the relationship between consumption and income.]

The terminology for multiple regression is similar to that for simple regression and is given in Table 3.1. Just as in simple regression, the variable $u$ is the error term or disturbance. It contains factors other than $x_1, x_2, \ldots, x_k$ that affect $y$. No matter how many explanatory variables we include in our model, there will always be factors we cannot include, and these are collectively contained in $u$.

Table 3.1 Terminology for Multiple Regression

$y$                     $x_1, x_2, \ldots, x_k$
Dependent variable      Independent variables
Explained variable      Explanatory variables
Response variable       Control variables
Predicted variable      Predictor variables
Regressand              Regressors

When applying the general multiple regression model, we must know how to interpret the parameters. We will get plenty of practice now and in subsequent chapters, but it is useful at this point to be reminded of some things we already know. Suppose that CEO salary (salary) is related to firm sales (sales) and CEO tenure (ceoten) with the firm by
$$\log(salary) = \beta_0 + \beta_1 \log(sales) + \beta_2 ceoten + \beta_3 ceoten^2 + u. \qquad (3.7)$$
This fits into the multiple regression model (with $k = 3$) by defining $y = \log(salary)$, $x_1 = \log(sales)$, $x_2 = ceoten$, and $x_3 = ceoten^2$. As we know from Chapter 2, the parameter $\beta_1$ is the ceteris paribus elasticity of salary with respect to sales. If $\beta_3 = 0$, then $100\beta_2$ is approximately the ceteris paribus percentage increase in salary when ceoten increases by one year. When $\beta_3 \neq 0$, the effect of ceoten on salary is more complicated. We will postpone a detailed treatment of general models with quadratics until Chapter 6.

Equation (3.7) provides an important reminder about multiple regression analysis. The term "linear" in a multiple linear regression model means that equation (3.6) is linear in the parameters, $\beta_j$. Equation (3.7) is an example of a multiple regression model that, while linear in the $\beta_j$, is a nonlinear relationship between salary and the variables sales and ceoten. Many applications of multiple linear regression involve nonlinear relationships among the underlying variables.
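Because (3.7) is linear in the parameters, it can be estimated by ordinary OLS once the transformed regressors are constructed. The sketch below makes that concrete with simulated data; all numbers are hypothetical, and NumPy's least squares routine stands in for whatever econometrics package one prefers.

```python
# Estimating a model like (3.7): build log(sales), ceoten, ceoten^2 and run OLS.
import numpy as np

rng = np.random.default_rng(1)
n = 200
sales = rng.lognormal(mean=7.0, sigma=1.0, size=n)
ceoten = rng.integers(0, 30, size=n).astype(float)
u = rng.normal(0, 0.3, size=n)
logsalary = 4.5 + 0.25 * np.log(sales) + 0.02 * ceoten - 0.0005 * ceoten**2 + u

X = np.column_stack([np.ones(n), np.log(sales), ceoten, ceoten**2])
betahat, *_ = np.linalg.lstsq(X, logsalary, rcond=None)
print(betahat)    # estimates of beta0, beta1, beta2, beta3
```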
The key assumption for the general multiple regression model is easy to state in terms of a conditional expectation:
$$E(u|x_1, x_2, \ldots, x_k) = 0. \qquad (3.8)$$
At a minimum, equation (3.8) requires that all factors in the unobserved error term be uncorrelated with the explanatory variables. It also means that we have correctly accounted for the functional relationships between the explained and explanatory variables. Any problem that causes $u$ to be correlated with any of the independent variables causes (3.8) to fail. In Section 3.3, we will show that assumption (3.8) implies that OLS is unbiased and will derive the bias that arises when a key variable has been omitted from the equation. In Chapters 15 and 16, we will study other reasons that might cause (3.8) to fail and show what can be done in cases where it does fail.

3.2 Mechanics and Interpretation of Ordinary Least Squares

We now summarize some computational and algebraic features of the method of ordinary least squares as it applies to a particular set of data. We also discuss how to interpret the estimated equation.

3.2a Obtaining the OLS Estimates

We first consider estimating the model with two independent variables. The estimated OLS equation is written in a form similar to the simple regression case:
$$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x_1 + \hat{\beta}_2 x_2, \qquad (3.9)$$
where $\hat{\beta}_0$ is the estimate of $\beta_0$, $\hat{\beta}_1$ is the estimate of $\beta_1$, and $\hat{\beta}_2$ is the estimate of $\beta_2$.

But how do we obtain $\hat{\beta}_0$, $\hat{\beta}_1$, and $\hat{\beta}_2$? The method of ordinary least squares chooses the estimates to minimize the sum of squared residuals. That is, given $n$ observations on $y$, $x_1$, and $x_2$, $\{(x_{i1}, x_{i2}, y_i): i = 1, 2, \ldots, n\}$, the estimates $\hat{\beta}_0$, $\hat{\beta}_1$, and $\hat{\beta}_2$ are chosen simultaneously to make
$$\sum_{i=1}^n (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_{i1} - \hat{\beta}_2 x_{i2})^2 \qquad (3.10)$$
as small as possible.

To understand what OLS is doing, it is important to master the meaning of the indexing of the independent variables in (3.10). The independent variables have two subscripts here: $i$, followed by either 1 or 2. The $i$ subscript refers to the observation number. Thus, the sum in (3.10) is over all $i = 1$ to $n$ observations. The second index is simply a method of distinguishing between different independent variables. In the example relating wage to educ and exper, $x_{i1} = educ_i$ is education for person $i$ in the sample and $x_{i2} = exper_i$ is experience for person $i$. The sum of squared residuals in equation (3.10) is $\sum_{i=1}^n (wage_i - \hat{\beta}_0 - \hat{\beta}_1 educ_i - \hat{\beta}_2 exper_i)^2$. In what follows, the $i$ subscript is reserved for indexing the observation number. If we write $x_{ij}$, then this means the $i$th observation on the $j$th independent variable. (Some authors prefer to switch the order of the observation number and the variable number, so that $x_{1i}$ is observation $i$ on variable one. But this is just a matter of notational taste.)

In the general case with $k$ independent variables, we seek estimates $\hat{\beta}_0, \hat{\beta}_1, \ldots, \hat{\beta}_k$ in the equation
$$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x_1 + \hat{\beta}_2 x_2 + \cdots + \hat{\beta}_k x_k. \qquad (3.11)$$
The OLS estimates, $k + 1$ of them, are chosen to minimize the sum of squared residuals:
$$\sum_{i=1}^n (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_{i1} - \cdots - \hat{\beta}_k x_{ik})^2. \qquad (3.12)$$
This minimization problem can be solved using multivariable calculus (see Appendix 3A). This leads to $k + 1$ linear equations in $k + 1$ unknowns $\hat{\beta}_0, \hat{\beta}_1, \ldots, \hat{\beta}_k$:
$$\sum_{i=1}^n (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_{i1} - \cdots - \hat{\beta}_k x_{ik}) = 0$$
$$\sum_{i=1}^n x_{i1}(y_i - \hat{\beta}_0 - \hat{\beta}_1 x_{i1} - \cdots - \hat{\beta}_k x_{ik}) = 0$$
$$\sum_{i=1}^n x_{i2}(y_i - \hat{\beta}_0 - \hat{\beta}_1 x_{i1} - \cdots - \hat{\beta}_k x_{ik}) = 0$$
$$\vdots$$
$$\sum_{i=1}^n x_{ik}(y_i - \hat{\beta}_0 - \hat{\beta}_1 x_{i1} - \cdots - \hat{\beta}_k x_{ik}) = 0 \qquad (3.13)$$

These are often called the OLS first order conditions. As with the simple regression model in Section 2.2, the OLS first order conditions can be obtained by the method of moments: under assumption (3.8), $E(u) = 0$ and $E(x_j u) = 0$, where $j = 1, 2, \ldots, k$. The equations in (3.13) are the sample counterparts of these population moments, although we have omitted the division by the sample size, $n$.

For even moderately sized $n$ and $k$, solving the equations in (3.13) by hand calculations is tedious. Nevertheless, modern computers running standard statistics and econometrics software can solve these equations with large $n$ and $k$ very quickly.

There is only one slight caveat: we must assume that the equations in (3.13) can be solved uniquely for the $\hat{\beta}_j$. For now, we just assume this, as it is usually the case in well-specified models. In Section 3.3, we state the assumption needed for unique OLS estimates to exist (see Assumption MLR.3).

As in simple regression analysis, equation (3.11) is called the OLS regression line or the sample regression function (SRF). We will call $\hat{\beta}_0$ the OLS intercept estimate and $\hat{\beta}_1, \ldots, \hat{\beta}_k$ the OLS slope estimates (corresponding to the independent variables $x_1, x_2, \ldots, x_k$).

To indicate that an OLS regression has been run, we will either write out equation (3.11) with $y$ and $x_1, \ldots, x_k$ replaced by their variable names (such as wage, educ, and exper), or we will say that "we ran an OLS regression of $y$ on $x_1, x_2, \ldots, x_k$" or that "we regressed $y$ on $x_1, x_2, \ldots, x_k$." These are shorthand for saying that the method of ordinary least squares was used to obtain the OLS equation (3.11). Unless explicitly stated otherwise, we always estimate an intercept along with the slopes.
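In matrix form, the system (3.13) says that the regressors are orthogonal to the residuals: stacking a column of ones and the $k$ regressors into a matrix $X$, the conditions are $X'(y - X\hat{\beta}) = 0$, so $\hat{\beta}$ solves the normal equations $X'X\hat{\beta} = X'y$. The following sketch (ours, with simulated data) solves these equations directly and confirms the first order conditions.

```python
import numpy as np

rng = np.random.default_rng(2)
n, k = 100, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])  # intercept + k regressors
y = X @ np.array([1.0, 0.5, -2.0]) + rng.normal(size=n)

bhat = np.linalg.solve(X.T @ X, X.T @ y)  # the k + 1 normal equations
print(bhat)
print(X.T @ (y - X @ bhat))               # first order conditions: ~ 0
```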
3.2b Interpreting the OLS Regression Equation

More important than the details underlying the computation of the $\hat{\beta}_j$ is the interpretation of the estimated equation. We begin with the case of two independent variables:
$$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x_1 + \hat{\beta}_2 x_2. \qquad (3.14)$$
The intercept $\hat{\beta}_0$ in equation (3.14) is the predicted value of $y$ when $x_1 = 0$ and $x_2 = 0$. Sometimes, setting $x_1$ and $x_2$ both equal to zero is an interesting scenario; in other cases, it will not make sense. Nevertheless, the intercept is always needed to obtain a prediction of $y$ from the OLS regression line, as (3.14) makes clear.

The estimates $\hat{\beta}_1$ and $\hat{\beta}_2$ have partial effect, or ceteris paribus, interpretations. From equation (3.14), we have
$$\Delta\hat{y} = \hat{\beta}_1\Delta x_1 + \hat{\beta}_2\Delta x_2,$$
so we can obtain the predicted change in $y$ given the changes in $x_1$ and $x_2$. (Note how the intercept has nothing to do with the changes in $y$.) In particular, when $x_2$ is held fixed, so that $\Delta x_2 = 0$, then
$$\Delta\hat{y} = \hat{\beta}_1\Delta x_1, \text{ holding } x_2 \text{ fixed}.$$
The key point is that, by including $x_2$ in our model, we obtain a coefficient on $x_1$ with a ceteris paribus interpretation. This is why multiple regression analysis is so useful. Similarly,
$$\Delta\hat{y} = \hat{\beta}_2\Delta x_2, \text{ holding } x_1 \text{ fixed}.$$

Example 3.1 (Determinants of College GPA): The variables in GPA1 include the college grade point average (colGPA), high school GPA (hsGPA), and achievement test score (ACT) for a sample of 141 students from a large university; both college and high school GPAs are on a four-point scale. We obtain the following OLS regression line to predict college GPA from high school GPA and achievement test score:
$$\widehat{colGPA} = 1.29 + .453\, hsGPA + .0094\, ACT, \quad n = 141. \qquad (3.15)$$
How do we interpret this equation? First, the intercept 1.29 is the predicted college GPA if hsGPA and ACT are both set as zero. Since no one who attends college has either a zero high school GPA or a zero on the achievement test, the intercept in this equation is not, by itself, meaningful.

More interesting estimates are the slope coefficients on hsGPA and ACT. As expected, there is a positive partial relationship between colGPA and hsGPA: holding ACT fixed, another point on hsGPA is associated with .453 of a point on the college GPA, or almost half a point. In other words, if we choose two students, A and B, and these students have the same ACT score, but the high school GPA of Student A is one point higher than the high school GPA of Student B, then we predict Student A to have a college GPA .453 higher than that of Student B. (This says nothing about any two actual people, but it is our best prediction.)

The sign on ACT implies that, while holding hsGPA fixed, a change in the ACT score of 10 points (a very large change, since the maximum ACT score is 36 and the average score in the sample is about 24, with a standard deviation less than three) affects colGPA by less than one-tenth of a point. This is a small effect, and it suggests that, once high school GPA is accounted for, the ACT score is not a strong predictor of college GPA. (Naturally, there are many other factors that contribute to GPA, but here we focus on statistics available for high school students.) Later, after we discuss statistical inference, we will show that not only is the coefficient on ACT practically small, it is also statistically insignificant.

If we focus on a simple regression analysis relating colGPA to ACT only, we obtain
$$\widehat{colGPA} = 2.40 + .0271\, ACT, \quad n = 141;$$
thus, the coefficient on ACT is almost three times as large as the estimate in (3.15). But this equation does not allow us to compare two people with the same high school GPA; it corresponds to a different experiment. We say more about the differences between multiple and simple regression later.

The case with more than two independent variables is similar. The OLS regression line is
$$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x_1 + \hat{\beta}_2 x_2 + \cdots + \hat{\beta}_k x_k. \qquad (3.16)$$
Written in terms of changes,
$$\Delta\hat{y} = \hat{\beta}_1\Delta x_1 + \hat{\beta}_2\Delta x_2 + \cdots + \hat{\beta}_k\Delta x_k. \qquad (3.17)$$
The coefficient on $x_1$ measures the change in $\hat{y}$ due to a one-unit increase in $x_1$, holding all other independent variables fixed. That is,
$$\Delta\hat{y} = \hat{\beta}_1\Delta x_1, \qquad (3.18)$$
holding $x_2, x_3, \ldots, x_k$ fixed. Thus, we have controlled for the variables $x_2, x_3, \ldots, x_k$ when estimating the effect of $x_1$ on $y$. The other coefficients have a similar interpretation. The following is an example with three independent variables.
Example 3.2 (Hourly Wage Equation): Using the 526 observations on workers in WAGE1, we include educ (years of education), exper (years of labor market experience), and tenure (years with the current employer) in an equation explaining log(wage). The estimated equation is
$$\widehat{\log(wage)} = .284 + .092\, educ + .0041\, exper + .022\, tenure, \quad n = 526. \qquad (3.19)$$
As in the simple regression case, the coefficients have a percentage interpretation. The only difference here is that they also have a ceteris paribus interpretation. The coefficient .092 means that, holding exper and tenure fixed, another year of education is predicted to increase log(wage) by .092, which translates into an approximate 9.2% [100(.092)] increase in wage. Alternatively, if we take two people with the same levels of experience and job tenure, the coefficient on educ is the proportionate difference in predicted wage when their education levels differ by one year. This measure of the return to education at least keeps two important productivity factors fixed; whether it is a good estimate of the ceteris paribus return to another year of education requires us to study the statistical properties of OLS (see Section 3.3).

3.2c On the Meaning of "Holding Other Factors Fixed" in Multiple Regression

The partial effect interpretation of slope coefficients in multiple regression analysis can cause some confusion, so we provide a further discussion now.

In Example 3.1, we observed that the coefficient on ACT measures the predicted difference in colGPA, holding hsGPA fixed. The power of multiple regression analysis is that it provides this ceteris paribus interpretation even though the data have not been collected in a ceteris paribus fashion. In giving the coefficient on ACT a partial effect interpretation, it may seem that we actually went out and sampled people with the same high school GPA but possibly with different ACT scores. This is not the case. The data are a random sample from a large university: there were no restrictions placed on the sample values of hsGPA or ACT in obtaining the data. Rarely do we have the luxury of holding certain variables fixed in obtaining our sample. If we could collect a sample of individuals with the same high school GPA, then we could perform a simple regression analysis relating colGPA to ACT. Multiple regression effectively allows us to mimic this situation without restricting the values of any independent variables.

The power of multiple regression analysis is that it allows us to do in nonexperimental environments what natural scientists are able to do in a controlled laboratory setting: keep other factors fixed.

3.2d Changing More Than One Independent Variable Simultaneously

Sometimes, we want to change more than one independent variable at the same time to find the resulting effect on the dependent variable. This is easily done using equation (3.17). For example, in equation (3.19), we can obtain the estimated effect on wage when an individual stays at the same firm for another year: exper (general workforce experience) and tenure both increase by one year. The total effect (holding educ fixed) is
$$\Delta\widehat{\log(wage)} = .0041\,\Delta exper + .022\,\Delta tenure = .0041 + .022 = .0261,$$
or about 2.6%. Since exper and tenure each increase by one year, we just add the coefficients on exper and tenure and multiply by 100 to turn the effect into a percentage.
3.2e OLS Fitted Values and Residuals

After obtaining the OLS regression line (3.11), we can obtain a fitted or predicted value for each observation. For observation $i$, the fitted value is simply
$$\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_{i1} + \hat{\beta}_2 x_{i2} + \cdots + \hat{\beta}_k x_{ik}, \qquad (3.20)$$
which is just the predicted value obtained by plugging the values of the independent variables for observation $i$ into equation (3.11). We should not forget about the intercept in obtaining the fitted values; otherwise, the answer can be very misleading. As an example, if in (3.15) $hsGPA_i = 3.5$ and $ACT_i = 24$, $\widehat{colGPA}_i = 1.29 + .453(3.5) + .0094(24) = 3.101$ (rounded to three places after the decimal).

Normally, the actual value $y_i$ for any observation $i$ will not equal the predicted value, $\hat{y}_i$: OLS minimizes the average squared prediction error, which says nothing about the prediction error for any particular observation. The residual for observation $i$ is defined just as in the simple regression case,
$$\hat{u}_i = y_i - \hat{y}_i. \qquad (3.21)$$
There is a residual for each observation. If $\hat{u}_i > 0$, then $\hat{y}_i$ is below $y_i$, which means that, for this observation, $y_i$ is underpredicted. If $\hat{u}_i < 0$, then $y_i < \hat{y}_i$, and $y_i$ is overpredicted.

Exploring Further 3.2: In Example 3.1, the OLS fitted line explaining college GPA in terms of high school GPA and ACT score is $\widehat{colGPA} = 1.29 + .453\, hsGPA + .0094\, ACT$. If the average high school GPA is about 3.4 and the average ACT score is about 24.2, what is the average college GPA in the sample?

The OLS fitted values and residuals have some important properties that are immediate extensions from the single variable case:
1. The sample average of the residuals is zero, and so $\bar{y} = \bar{\hat{y}}$.
2. The sample covariance between each independent variable and the OLS residuals is zero. Consequently, the sample covariance between the OLS fitted values and the OLS residuals is zero.
3. The point $(\bar{x}_1, \bar{x}_2, \ldots, \bar{x}_k, \bar{y})$ is always on the OLS regression line: $\bar{y} = \hat{\beta}_0 + \hat{\beta}_1\bar{x}_1 + \hat{\beta}_2\bar{x}_2 + \cdots + \hat{\beta}_k\bar{x}_k$.

The first two properties are immediate consequences of the set of equations used to obtain the OLS estimates. The first equation in (3.13) says that the sum of the residuals is zero. The remaining equations are of the form $\sum_{i=1}^n x_{ij}\hat{u}_i = 0$, which implies that each independent variable has zero sample covariance with $\hat{u}_i$. Property 3 follows immediately from property 1.
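These properties are easy to verify numerically. The sketch below (ours, with simulated data) estimates a two-regressor model and checks properties 1 and 2 directly; property 3 then follows from property 1.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 250
x1, x2 = rng.normal(size=n), rng.normal(size=n)
y = 1 + 0.5 * x1 - 1.5 * x2 + rng.normal(size=n)

X = np.column_stack([np.ones(n), x1, x2])
bhat = np.linalg.solve(X.T @ X, X.T @ y)
yhat = X @ bhat
uhat = y - yhat

print(uhat.mean())                            # property 1: ~ 0
print((x1 * uhat).sum(), (x2 * uhat).sum())   # property 2: ~ 0 for each regressor
print((yhat * uhat).sum())                    # so fitted values and residuals: ~ 0
print(y.mean(), yhat.mean())                  # and the two sample averages agree
```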
3.2f A "Partialling Out" Interpretation of Multiple Regression

When applying OLS, we do not need to know explicit formulas for the $\hat{\beta}_j$ that solve the system of equations in (3.13). Nevertheless, for certain derivations, we do need explicit formulas for the $\hat{\beta}_j$. These formulas also shed further light on the workings of OLS.

Consider again the case with $k = 2$ independent variables, $\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x_1 + \hat{\beta}_2 x_2$. For concreteness, we focus on $\hat{\beta}_1$. One way to express $\hat{\beta}_1$ is
$$\hat{\beta}_1 = \left(\sum_{i=1}^n \hat{r}_{i1} y_i\right) \bigg/ \left(\sum_{i=1}^n \hat{r}_{i1}^2\right), \qquad (3.22)$$
where the $\hat{r}_{i1}$ are the OLS residuals from a simple regression of $x_1$ on $x_2$, using the sample at hand. We regress our first independent variable, $x_1$, on our second independent variable, $x_2$, and then obtain the residuals ($y$ plays no role here). Equation (3.22) shows that we can then do a simple regression of $y$ on $\hat{r}_1$ to obtain $\hat{\beta}_1$. (Note that the residuals $\hat{r}_{i1}$ have a zero sample average, and so $\hat{\beta}_1$ is the usual slope estimate from simple regression.)

The representation in equation (3.22) gives another demonstration of $\hat{\beta}_1$'s partial effect interpretation. The residuals $\hat{r}_{i1}$ are the part of $x_{i1}$ that is uncorrelated with $x_{i2}$. Another way of saying this is that $\hat{r}_{i1}$ is $x_{i1}$ after the effects of $x_{i2}$ have been partialled out, or netted out. Thus, $\hat{\beta}_1$ measures the sample relationship between $y$ and $x_1$ after $x_2$ has been partialled out.

In simple regression analysis, there is no partialling out of other variables because no other variables are included in the regression. Computer Exercise C5 steps you through the partialling out process using the wage data from Example 3.2. For practical purposes, the important thing is that $\hat{\beta}_1$ in the equation $\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x_1 + \hat{\beta}_2 x_2$ measures the change in $y$ given a one-unit increase in $x_1$, holding $x_2$ fixed.

In the general model with $k$ explanatory variables, $\hat{\beta}_1$ can still be written as in equation (3.22), but the residuals $\hat{r}_{i1}$ come from the regression of $x_1$ on $x_2, \ldots, x_k$. Thus, $\hat{\beta}_1$ measures the effect of $x_1$ on $y$ after $x_2, \ldots, x_k$ have been partialled or netted out. In econometrics, the general partialling out result is usually called the Frisch-Waugh theorem. It has many uses in theoretical and applied econometrics. We will see applications to time series regressions in Chapter 10.
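The partialling out result is also easy to see in a small simulation. The sketch below (ours; the data are simulated and the numbers arbitrary) computes $\hat{\beta}_1$ both from the full multiple regression and from equation (3.22), regressing $y$ on the residuals of $x_1$ after $x_2$ has been netted out.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 300
x2 = rng.normal(size=n)
x1 = 0.6 * x2 + rng.normal(size=n)         # x1 and x2 are correlated
y = 2 + 1.0 * x1 - 0.5 * x2 + rng.normal(size=n)

X = np.column_stack([np.ones(n), x1, x2])
b_multiple = np.linalg.solve(X.T @ X, X.T @ y)[1]   # coefficient on x1

slope, intercept = np.polyfit(x2, x1, 1)   # simple regression of x1 on x2
r1 = x1 - (intercept + slope * x2)         # residuals r-hat_i1
b_partial = (r1 * y).sum() / (r1 ** 2).sum()        # equation (3.22)

print(b_multiple, b_partial)               # identical up to rounding
```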
3.2g Comparison of Simple and Multiple Regression Estimates

Two special cases exist in which the simple regression of $y$ on $x_1$ will produce the same OLS estimate on $x_1$ as the regression of $y$ on $x_1$ and $x_2$. To be more precise, write the simple regression of $y$ on $x_1$ as $\tilde{y} = \tilde{\beta}_0 + \tilde{\beta}_1 x_1$, and write the multiple regression as $\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x_1 + \hat{\beta}_2 x_2$. We know that the simple regression coefficient $\tilde{\beta}_1$ does not usually equal the multiple regression coefficient $\hat{\beta}_1$. It turns out there is a simple relationship between $\tilde{\beta}_1$ and $\hat{\beta}_1$, which allows for interesting comparisons between simple and multiple regression:
$$\tilde{\beta}_1 = \hat{\beta}_1 + \hat{\beta}_2\tilde{\delta}_1, \qquad (3.23)$$
where $\tilde{\delta}_1$ is the slope coefficient from the simple regression of $x_{i2}$ on $x_{i1}$, $i = 1, \ldots, n$. This equation shows how $\tilde{\beta}_1$ differs from the partial effect of $x_1$ on $\hat{y}$. The confounding term is the partial effect of $x_2$ on $\hat{y}$ times the slope in the sample regression of $x_2$ on $x_1$. (See Section 3A.4 in the chapter appendix for a more general verification.)

The relationship between $\tilde{\beta}_1$ and $\hat{\beta}_1$ also shows there are two distinct cases where they are equal:
1. The partial effect of $x_2$ on $\hat{y}$ is zero in the sample. That is, $\hat{\beta}_2 = 0$.
2. $x_1$ and $x_2$ are uncorrelated in the sample. That is, $\tilde{\delta}_1 = 0$.

Even though simple and multiple regression estimates are almost never identical, we can use the above formula to characterize why they might be either very different or quite similar. For example, if $\hat{\beta}_2$ is small, we might expect the multiple and simple regression estimates of $\beta_1$ to be similar. In Example 3.1, the sample correlation between hsGPA and ACT is about .346, which is a nontrivial correlation. But the coefficient on ACT is fairly small. It is not surprising to find that the simple regression of colGPA on hsGPA produces a slope estimate of .482, which is not much different from the estimate .453 in (3.15).

Example 3.3 (Participation in 401(k) Pension Plans): We use the data in 401K to estimate the effect of a plan's match rate (mrate) on the participation rate (prate) in its 401(k) pension plan. The match rate is the amount the firm contributes to a worker's fund for each dollar the worker contributes (up to some limit); thus, mrate = .75 means that the firm contributes 75¢ for each dollar contributed by the worker. The participation rate is the percentage of eligible workers having a 401(k) account. The variable age is the age of the 401(k) plan. There are 1,534 plans in the data set, the average prate is 87.36, the average mrate is .732, and the average age is 13.2.

Regressing prate on mrate, age gives
$$\widehat{prate} = 80.12 + 5.52\, mrate + .243\, age, \quad n = 1{,}534.$$
Thus, both mrate and age have the expected effects. What happens if we do not control for age? The estimated effect of age is not trivial, and so we might expect a large change in the estimated effect of mrate if age is dropped from the regression. However, the simple regression of prate on mrate yields
$$\widehat{prate} = 83.08 + 5.86\, mrate.$$
The simple regression estimate of the effect of mrate on prate is clearly different from the multiple regression estimate, but the difference is not very big. (The simple regression estimate is only about 6.2% larger than the multiple regression estimate.) This can be explained by the fact that the sample correlation between mrate and age is only .12.

In the case with $k$ independent variables, the simple regression of $y$ on $x_1$ and the multiple regression of $y$ on $x_1, x_2, \ldots, x_k$ produce an identical estimate of $x_1$ only if (1) the OLS coefficients on $x_2$ through $x_k$ are all zero or (2) $x_1$ is uncorrelated with each of $x_2, \ldots, x_k$. Neither of these is very likely in practice. But if the coefficients on $x_2$ through $x_k$ are small, or the sample correlations between $x_1$ and the other independent variables are insubstantial, then the simple and multiple regression estimates of the effect of $x_1$ on $y$ can be similar.

3.2h Goodness-of-Fit

As with simple regression, we can define the total sum of squares (SST), the explained sum of squares (SSE), and the residual sum of squares or sum of squared residuals (SSR) as
$$\text{SST} \equiv \sum_{i=1}^n (y_i - \bar{y})^2 \qquad (3.24)$$
$$\text{SSE} \equiv \sum_{i=1}^n (\hat{y}_i - \bar{y})^2 \qquad (3.25)$$
$$\text{SSR} \equiv \sum_{i=1}^n \hat{u}_i^2. \qquad (3.26)$$
Using the same argument as in the simple regression case, we can show that
$$\text{SST} = \text{SSE} + \text{SSR}. \qquad (3.27)$$
In other words, the total variation in $\{y_i\}$ is the sum of the total variations in $\{\hat{y}_i\}$ and in $\{\hat{u}_i\}$.

Assuming that the total variation in $y$ is nonzero, as is the case unless $y_i$ is constant in the sample, we can divide (3.27) by SST to get $\text{SSR}/\text{SST} + \text{SSE}/\text{SST} = 1$. Just as in the simple regression case, the R-squared is defined to be
$$R^2 \equiv \text{SSE}/\text{SST} = 1 - \text{SSR}/\text{SST}, \qquad (3.28)$$
and it is interpreted as the proportion of the sample variation in $y_i$ that is explained by the OLS regression line. By definition, $R^2$ is a number between zero and one.
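The sums of squares in (3.24) through (3.28) can be computed directly, as in the sketch below (ours, with simulated data); it checks the decomposition (3.27) and computes $R^2$ both ways.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 200
x1, x2 = rng.normal(size=n), rng.normal(size=n)
y = 1 + x1 + 0.5 * x2 + rng.normal(size=n)

X = np.column_stack([np.ones(n), x1, x2])
yhat = X @ np.linalg.solve(X.T @ X, X.T @ y)
uhat = y - yhat

sst = ((y - y.mean()) ** 2).sum()
sse = ((yhat - y.mean()) ** 2).sum()
ssr = (uhat ** 2).sum()

print(sst, sse + ssr)              # (3.27): SST = SSE + SSR
print(sse / sst, 1 - ssr / sst)    # (3.28): the two expressions for R-squared
```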
3-2h Goodness-of-Fit

As with simple regression, we can define the total sum of squares (SST), the explained sum of squares (SSE), and the residual sum of squares (or sum of squared residuals, SSR) as

$$\mathrm{SST} \equiv \sum_{i=1}^{n} (y_i - \bar{y})^2 \qquad (3.24)$$
$$\mathrm{SSE} \equiv \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2 \qquad (3.25)$$
$$\mathrm{SSR} \equiv \sum_{i=1}^{n} \hat{u}_i^2 \qquad (3.26)$$

Using the same argument as in the simple regression case, we can show that

$$\mathrm{SST} = \mathrm{SSE} + \mathrm{SSR}. \qquad (3.27)$$

In other words, the total variation in $\{y_i\}$ is the sum of the total variations in $\{\hat{y}_i\}$ and in $\{\hat{u}_i\}$.

Assuming that the total variation in $y$ is nonzero, as is the case unless $y_i$ is constant in the sample, we can divide (3.27) by SST to get $\mathrm{SSR}/\mathrm{SST} + \mathrm{SSE}/\mathrm{SST} = 1$. Just as in the simple regression case, the R-squared is defined to be

$$R^2 \equiv \mathrm{SSE}/\mathrm{SST} = 1 - \mathrm{SSR}/\mathrm{SST}, \qquad (3.28)$$

and it is interpreted as the proportion of the sample variation in $y_i$ that is explained by the OLS regression line. By definition, $R^2$ is a number between zero and one.

$R^2$ can also be shown to equal the squared correlation coefficient between the actual $y_i$ and the fitted values $\hat{y}_i$. That is,

$$R^2 = \frac{\left(\sum_{i=1}^{n} (y_i - \bar{y})(\hat{y}_i - \bar{\hat{y}})\right)^2}{\left(\sum_{i=1}^{n} (y_i - \bar{y})^2\right)\left(\sum_{i=1}^{n} (\hat{y}_i - \bar{\hat{y}})^2\right)}. \qquad (3.29)$$

We have put the average of the $\hat{y}_i$ in (3.29) to be true to the formula for a correlation coefficient; we know that this average equals $\bar{y}$ because the sample average of the residuals is zero and $y_i = \hat{y}_i + \hat{u}_i$.

Example 3.4 Determinants of College GPA

From the grade point average regression that we did earlier, the equation with $R^2$ is

$$\widehat{colGPA} = 1.29 + .453\,hsGPA + .0094\,ACT, \qquad n = 141, \quad R^2 = .176.$$

This means that hsGPA and ACT together explain about 17.6% of the variation in college GPA for this sample of students. This may not seem like a high percentage, but we must remember that there are many other factors, including family background, personality, quality of high school education, and affinity for college, that contribute to a student's college performance. If hsGPA and ACT explained almost all of the variation in colGPA, then performance in college would be preordained by high school performance!

An important fact about $R^2$ is that it never decreases, and it usually increases, when another independent variable is added to a regression and the same set of observations is used for both regressions. This algebraic fact follows because, by definition, the sum of squared residuals never increases when additional regressors are added to the model. For example, the last digit of one's social security number has nothing to do with one's hourly wage, but adding this digit to a wage equation will increase the $R^2$ (by a little, at least).

An important caveat to the previous assertion about R-squared is that it assumes we do not have missing data on the explanatory variables. If two regressions use different sets of observations, then, in general, we cannot tell how the R-squareds will compare, even if one regression uses a subset of regressors. For example, suppose we have a full set of data on the variables $y$, $x_1$, and $x_2$, but for some units in our sample, data are missing on $x_3$. Then we cannot say that the R-squared from regressing $y$ on $x_1$, $x_2$ will be less than that from regressing $y$ on $x_1$, $x_2$, and $x_3$: it could go either way. Missing data can be an important practical issue, and we will return to it in Chapter 9.

The fact that $R^2$ never decreases when any variable is added to a regression makes it a poor tool for deciding whether one variable or several variables should be added to a model. The factor that should determine whether an explanatory variable belongs in a model is whether the explanatory variable has a nonzero partial effect on $y$ in the population. We will show how to test this hypothesis in Chapter 4, when we cover statistical inference. We will also see that, when used properly, $R^2$ allows us to test a group of variables to see if it is important for explaining $y$. For now, we use it as a goodness-of-fit measure for a given model.
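The two ways of computing $R^2$, as $1 - \mathrm{SSR}/\mathrm{SST}$ in (3.28) and as the squared correlation in (3.29), give the same number whenever the regression includes an intercept. A brief sketch (simulated data, for illustration only):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 300
x1, x2 = rng.normal(size=n), rng.normal(size=n)
y = 1 + 0.7 * x1 - 0.3 * x2 + rng.normal(size=n)

X = np.column_stack([np.ones(n), x1, x2])
b = np.linalg.lstsq(X, y, rcond=None)[0]
yhat = X @ b

sst = np.sum((y - y.mean())**2)
ssr = np.sum((y - yhat)**2)
r2_from_ssr = 1 - ssr / sst                      # definition (3.28)
r2_from_corr = np.corrcoef(y, yhat)[0, 1]**2     # squared correlation, as in (3.29)
print(r2_from_ssr, r2_from_corr)                 # identical up to rounding
```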
Example 3.5 Explaining Arrest Records

CRIME1 contains data on arrests during the year 1986 and other information on 2,725 men born in either 1960 or 1961 in California. Each man in the sample was arrested at least once prior to 1986. The variable narr86 is the number of times the man was arrested during 1986: it is zero for most men in the sample (72.29%), and it varies from 0 to 12. (The percentage of men arrested once during 1986 was 20.51.) The variable pcnv is the proportion (not percentage) of arrests prior to 1986 that led to conviction, avgsen is average sentence length served for prior convictions (zero for most people), ptime86 is months spent in prison in 1986, and qemp86 is the number of quarters during which the man was employed in 1986 (from zero to four).

A linear model explaining arrests is

$$narr86 = \beta_0 + \beta_1 pcnv + \beta_2 avgsen + \beta_3 ptime86 + \beta_4 qemp86 + u,$$

where pcnv is a proxy for the likelihood of being convicted of a crime and avgsen is a measure of expected severity of punishment, if convicted. The variable ptime86 captures the incarcerative effects of crime: if an individual is in prison, he cannot be arrested for a crime outside of prison. Labor market opportunities are crudely captured by qemp86.

First, we estimate the model without the variable avgsen. We obtain

$$\widehat{narr86} = .712 - .150\,pcnv - .034\,ptime86 - .104\,qemp86, \qquad n = 2{,}725, \quad R^2 = .0413.$$

This equation says that, as a group, the three variables pcnv, ptime86, and qemp86 explain about 4.1% of the variation in narr86.

Each of the OLS slope coefficients has the anticipated sign. An increase in the proportion of convictions lowers the predicted number of arrests. If we increase pcnv by .50 (a large increase in the probability of conviction), then, holding the other factors fixed, $\Delta\widehat{narr86} = -.150(.50) = -.075$. This may seem unusual because an arrest cannot change by a fraction. But we can use this value to obtain the predicted change in expected arrests for a large group of men. For example, among 100 men, the predicted fall in arrests when pcnv increases by .50 is 7.5.

Similarly, a longer prison term leads to a lower predicted number of arrests. In fact, if ptime86 increases from 0 to 12, predicted arrests for a particular man fall by $.034(12) = .408$. Another quarter in which legal employment is reported lowers predicted arrests by .104, which would be 10.4 arrests among 100 men.

If avgsen is added to the model, we know that $R^2$ will increase. The estimated equation is

$$\widehat{narr86} = .707 - .151\,pcnv + .0074\,avgsen - .037\,ptime86 - .103\,qemp86, \qquad n = 2{,}725, \quad R^2 = .0422.$$

Thus, adding the average sentence variable increases $R^2$ from .0413 to .0422, a practically small effect. The sign of the coefficient on avgsen is also unexpected: it says that a longer average sentence length increases criminal activity.
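The ceteris paribus arithmetic in this example is worth automating when a model has several regressors. The sketch below (the slope coefficients are copied from the first estimated equation above; the helper function is ours, not from the text) computes the predicted change in the dependent variable for arbitrary changes in the regressors.

```python
# Slope coefficients from the first estimated equation in Example 3.5
coefs = {"pcnv": -0.150, "ptime86": -0.034, "qemp86": -0.104}

def predicted_change(coefs, changes):
    # Sum of beta_j * (change in x_j), holding all other regressors fixed
    return sum(coefs[name] * dx for name, dx in changes.items())

d1 = predicted_change(coefs, {"pcnv": 0.50})     # -0.075
print(d1, 100 * abs(d1))                         # about 7.5 fewer arrests per 100 men
print(predicted_change(coefs, {"ptime86": 12}))  # -0.408 for a 12-month prison spell
```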
Example 3.5 deserves a final word of caution. The fact that the four explanatory variables included in the second regression explain only about 4.2% of the variation in narr86 does not necessarily mean that the equation is useless. Even though these variables collectively do not explain much of the variation in arrests, it is still possible that the OLS estimates are reliable estimates of the ceteris paribus effects of each independent variable on narr86. As we will see, whether this is the case does not directly depend on the size of $R^2$. Generally, a low $R^2$ indicates that it is hard to predict individual outcomes on $y$ with much accuracy, something we study in more detail in Chapter 6. In the arrest example, the small $R^2$ reflects what we already suspect in the social sciences: it is generally very difficult to predict individual behavior.

3-2i Regression through the Origin

Sometimes, an economic theory or common sense suggests that $\beta_0$ should be zero, and so we should briefly mention OLS estimation when the intercept is zero. Specifically, we now seek an equation of the form

$$\tilde{y} = \tilde{\beta}_1 x_1 + \tilde{\beta}_2 x_2 + \cdots + \tilde{\beta}_k x_k, \qquad (3.30)$$

where the symbol "~" over the estimates is used to distinguish them from the OLS estimates obtained along with the intercept [as in (3.11)]. In (3.30), when $x_1 = 0, x_2 = 0, \ldots, x_k = 0$, the predicted value is zero. In this case, $\tilde{\beta}_1, \ldots, \tilde{\beta}_k$ are said to be the OLS estimates from the regression of $y$ on $x_1, x_2, \ldots, x_k$ through the origin.

The OLS estimates in (3.30), as always, minimize the sum of squared residuals, but with the intercept set at zero. You should be warned that the properties of OLS that we derived earlier no longer hold for regression through the origin. In particular, the OLS residuals no longer have a zero sample average. Further, if $R^2$ is defined as $1 - \mathrm{SSR}/\mathrm{SST}$, where SST is given in (3.24) and SSR is now $\sum_{i=1}^{n} (y_i - \tilde{\beta}_1 x_{i1} - \cdots - \tilde{\beta}_k x_{ik})^2$, then $R^2$ can actually be negative. This means that the sample average $\bar{y}$ "explains" more of the variation in the $y_i$ than the explanatory variables. Either we should include an intercept in the regression or conclude that the explanatory variables poorly explain $y$. To always have a nonnegative R-squared, some economists prefer to calculate $R^2$ as the squared correlation coefficient between the actual and fitted values of $y$, as in (3.29). (In this case, the average fitted value must be computed directly, since it no longer equals $\bar{y}$.) However, there is no set rule on computing R-squared for regression through the origin.

One serious drawback with regression through the origin is that, if the intercept $\beta_0$ in the population model is different from zero, then the OLS estimators of the slope parameters will be biased. The bias can be severe in some cases. The cost of estimating an intercept when $\beta_0$ is truly zero is that the variances of the OLS slope estimators are larger.
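The possibility of a negative R-squared is easy to demonstrate. In the sketch below (simulated data, for illustration), the population intercept is far from zero; forcing the fitted line through the origin then fits worse than the sample average $\bar{y}$ alone, and $1 - \mathrm{SSR}/\mathrm{SST}$ comes out negative.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200
x = rng.normal(size=n)
y = 5.0 + 0.1 * x + rng.normal(size=n)   # true intercept is 5, far from zero

b_origin = (x @ y) / (x @ x)             # through-the-origin slope estimate
resid = y - b_origin * x                 # these residuals do NOT average to zero
r2 = 1 - np.sum(resid**2) / np.sum((y - y.mean())**2)
print(r2)                                # strongly negative for this design
```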
3-3 The Expected Value of the OLS Estimators

We now turn to the statistical properties of OLS for estimating the parameters in an underlying population model. In this section, we derive the expected value of the OLS estimators. In particular, we state and discuss four assumptions, which are direct extensions of the simple regression model assumptions, under which the OLS estimators are unbiased for the population parameters. We also explicitly obtain the bias in OLS when an important variable has been omitted from the regression.

You should remember that statistical properties have nothing to do with a particular sample, but rather with the property of estimators when random sampling is done repeatedly. Thus, Sections 3-3, 3-4, and 3-5 are somewhat abstract. Although we give examples of deriving bias for particular models, it is not meaningful to talk about the statistical properties of a set of estimates obtained from a single sample.

The first assumption we make simply defines the multiple linear regression (MLR) model.

Assumption MLR.1 (Linear in Parameters)
The model in the population can be written as
$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k + u, \qquad (3.31)$$
where $\beta_0, \beta_1, \ldots, \beta_k$ are the unknown parameters (constants) of interest and $u$ is an unobserved random error or disturbance term.

Equation (3.31) formally states the population model, sometimes called the true model, to allow for the possibility that we might estimate a model that differs from (3.31). The key feature is that the model is linear in the parameters $\beta_0, \beta_1, \ldots, \beta_k$. As we know, (3.31) is quite flexible because $y$ and the independent variables can be arbitrary functions of the underlying variables of interest, such as natural logarithms and squares [see, for example, equation (3.7)].

Assumption MLR.2 (Random Sampling)
We have a random sample of $n$ observations, $\{(x_{i1}, x_{i2}, \ldots, x_{ik}, y_i): i = 1, 2, \ldots, n\}$, following the population model in Assumption MLR.1.

Sometimes, we need to write the equation for a particular observation $i$: for a randomly drawn observation from the population, we have

$$y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_k x_{ik} + u_i. \qquad (3.32)$$

Remember that $i$ refers to the observation, and the second subscript on $x$ is the variable number. For example, we can write a CEO salary equation for a particular CEO $i$ as

$$\log(salary_i) = \beta_0 + \beta_1 \log(sales_i) + \beta_2 ceoten_i + \beta_3 ceoten_i^2 + u_i. \qquad (3.33)$$

The term $u_i$ contains the unobserved factors for CEO $i$ that affect his or her salary. For applications, it is usually easiest to write the model in population form, as in (3.31). It contains less clutter and emphasizes the fact that we are interested in estimating a population relationship.

In light of model (3.31), the OLS estimators $\hat{\beta}_0, \hat{\beta}_1, \hat{\beta}_2, \ldots, \hat{\beta}_k$ from the regression of $y$ on $x_1, \ldots, x_k$ are now considered to be estimators of $\beta_0, \beta_1, \ldots, \beta_k$. In Section 3-2, we saw that OLS chooses the intercept and slope estimates for a particular sample so that the residuals average to zero and the sample correlation between each independent variable and the residuals is zero. Still, we did not include conditions under which the OLS estimates are well defined for a given sample. The next assumption fills that gap.

Assumption MLR.3 (No Perfect Collinearity)
In the sample (and therefore in the population), none of the independent variables is constant, and there are no exact linear relationships among the independent variables.

Assumption MLR.3 is more complicated than its counterpart for simple regression because we must now look at relationships between all independent variables. If an independent variable in (3.31) is an exact linear combination of the other independent variables, then we say the model suffers from perfect collinearity, and it cannot be estimated by OLS.
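Perfect collinearity shows up mechanically as a rank-deficient design matrix, so the OLS normal equations have no unique solution. A small sketch (simulated data; it anticipates the "income in dollars and in thousands of dollars" example discussed just below):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 100
inc_dollars = rng.uniform(20_000, 90_000, size=n)
inc_thousands = inc_dollars / 1000.0       # the same variable in different units

X = np.column_stack([np.ones(n), inc_dollars, inc_thousands])
print(np.linalg.matrix_rank(X))            # 2, not 3: one column is redundant
# X'X is singular, so there is no unique least squares solution;
# its condition number is effectively infinite
print(np.linalg.cond(X.T @ X))
```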
It is important to note that Assumption MLR.3 does allow the independent variables to be correlated; they just cannot be perfectly correlated. If we did not allow for any correlation among the independent variables, then multiple regression would be of very limited use for econometric analysis. For example, in the model relating test scores to educational expenditures and average family income,

$$avgscore = \beta_0 + \beta_1 expend + \beta_2 avginc + u,$$

we fully expect expend and avginc to be correlated: school districts with high average family incomes tend to spend more per student on education. In fact, the primary motivation for including avginc in the equation is that we suspect it is correlated with expend, and so we would like to hold it fixed in the analysis. Assumption MLR.3 only rules out perfect correlation between expend and avginc in our sample. We would be very unlucky to obtain a sample where per-student expenditures are perfectly correlated with average family income. But some correlation, perhaps a substantial amount, is expected and certainly allowed.

The simplest way that two independent variables can be perfectly correlated is when one variable is a constant multiple of another. This can happen when a researcher inadvertently puts the same variable measured in different units into a regression equation. For example, in estimating a relationship between consumption and income, it makes no sense to include as independent variables income measured in dollars as well as income measured in thousands of dollars. One of these is redundant. What sense would it make to hold income measured in dollars fixed while changing income measured in thousands of dollars?

We already know that different nonlinear functions of the same variable can appear among the regressors. For example, the model $cons = \beta_0 + \beta_1 inc + \beta_2 inc^2 + u$ does not violate Assumption MLR.3: even though $x_2 = inc^2$ is an exact function of $x_1 = inc$, $inc^2$ is not an exact linear function of $inc$. Including $inc^2$ in the model is a useful way to generalize functional form, unlike including income measured in dollars and in thousands of dollars.

Common sense tells us not to include the same explanatory variable measured in different units in the same regression equation. There are also more subtle ways that one independent variable can be a multiple of another. Suppose we would like to estimate an extension of a constant elasticity consumption function. It might seem natural to specify a model such as

$$\log(cons) = \beta_0 + \beta_1 \log(inc) + \beta_2 \log(inc^2) + u, \qquad (3.34)$$

where $x_1 = \log(inc)$ and $x_2 = \log(inc^2)$. Using the basic properties of the natural log (see Appendix A), $\log(inc^2) = 2\log(inc)$. That is, $x_2 = 2x_1$, and naturally this holds for all observations in the sample. This violates Assumption MLR.3. What we should do instead is include $[\log(inc)]^2$, not $\log(inc^2)$, along with $\log(inc)$. This is a sensible extension of the constant elasticity model, and we will see how to interpret such models in Chapter 6.

Another way that independent variables can be perfectly collinear is when one independent variable can be expressed as an exact linear function of two or more of the other independent variables. For example, suppose we want to estimate the effect of campaign spending on campaign outcomes. For simplicity, assume that each election has two candidates. Let voteA be the percentage of the vote for Candidate A, let expendA be campaign expenditures by Candidate A, let expendB be campaign expenditures by Candidate B, and let totexpend be total campaign expenditures; the latter three variables are all measured in dollars. It may seem natural to specify the model as

$$voteA = \beta_0 + \beta_1 expendA + \beta_2 expendB + \beta_3 totexpend + u, \qquad (3.35)$$

in order to isolate the effects of spending by each candidate and the total amount of spending. But this model violates Assumption MLR.3 because $x_3 = x_1 + x_2$ by definition.
Trying to interpret this equation in a ceteris paribus fashion reveals the problem. The parameter $\beta_1$ in equation (3.35) is supposed to measure the effect of increasing expenditures by Candidate A by one dollar on Candidate A's vote, holding Candidate B's spending and total spending fixed. This is nonsense, because if expendB and totexpend are held fixed, then we cannot increase expendA.

The solution to the perfect collinearity in (3.35) is simple: drop any one of the three variables from the model. We would probably drop totexpend, and then the coefficient on expendA would measure the effect of increasing expenditures by A on the percentage of the vote received by A, holding the spending by B fixed.

The prior examples show that Assumption MLR.3 can fail if we are not careful in specifying our model. Assumption MLR.3 also fails if the sample size, $n$, is too small in relation to the number of parameters being estimated. In the general regression model in equation (3.31), there are $k + 1$ parameters, and MLR.3 fails if $n < k + 1$. Intuitively, this makes sense: to estimate $k + 1$ parameters, we need at least $k + 1$ observations. Not surprisingly, it is better to have as many observations as possible, something we will see with our variance calculations in Section 3-4.

If the model is carefully specified and $n \geq k + 1$, Assumption MLR.3 can fail in rare cases due to bad luck in collecting the sample. For example, in a wage equation with education and experience as variables, it is possible that we could obtain a random sample where each individual has exactly twice as much education as years of experience. This scenario would cause Assumption MLR.3 to fail, but it can be considered very unlikely unless we have an extremely small sample size.

The final, and most important, assumption needed for unbiasedness is a direct extension of Assumption SLR.4.

Assumption MLR.4 (Zero Conditional Mean)
The error $u$ has an expected value of zero given any values of the independent variables. In other words,
$$E(u \mid x_1, x_2, \ldots, x_k) = 0. \qquad (3.36)$$

One way that Assumption MLR.4 can fail is if the functional relationship between the explained and explanatory variables is misspecified in equation (3.31): for example, if we forget to include the quadratic term $inc^2$ in the consumption function $cons = \beta_0 + \beta_1 inc + \beta_2 inc^2 + u$ when we estimate the model. Another functional form misspecification occurs when we use the level of a variable when the log of the variable is what actually shows up in the population model, or vice versa. For example, if the true model has log(wage) as the dependent variable but we use wage as the dependent variable in our regression analysis, then the estimators will be biased. Intuitively, this should be pretty clear. We will discuss ways of detecting functional form misspecification in Chapter 9.
Omitting an important factor that is correlated with any of $x_1, x_2, \ldots, x_k$ also causes Assumption MLR.4 to fail. With multiple regression analysis, we are able to include many factors among the explanatory variables, and omitted variables are less likely to be a problem in multiple regression analysis than in simple regression analysis. Nevertheless, in any application, there are always factors that, due to data limitations or ignorance, we will not be able to include. If we think these factors should be controlled for and they are correlated with one or more of the independent variables, then Assumption MLR.4 will be violated. We will derive this bias later.

There are other ways that $u$ can be correlated with an explanatory variable. In Chapters 9 and 15, we will discuss the problem of measurement error in an explanatory variable. In Chapter 16, we cover the conceptually more difficult problem in which one or more of the explanatory variables is determined jointly with $y$, as occurs when we view quantities and prices as being determined by the intersection of supply and demand curves. We must postpone our study of these problems until we have a firm grasp of multiple regression analysis under an ideal set of assumptions.

When Assumption MLR.4 holds, we often say that we have exogenous explanatory variables. If $x_j$ is correlated with $u$ for any reason, then $x_j$ is said to be an endogenous explanatory variable. The terms "exogenous" and "endogenous" originated in simultaneous equations analysis (see Chapter 16), but the term "endogenous explanatory variable" has evolved to cover any case in which an explanatory variable may be correlated with the error term.

Exploring Further 3.3
In the previous example, if we use as explanatory variables expendA, expendB, and shareA, where shareA = 100(expendA/totexpend) is the percentage share of total campaign expenditures made by Candidate A, does this violate Assumption MLR.3?

Before we show the unbiasedness of the OLS estimators under MLR.1 to MLR.4, a word of caution. Beginning students of econometrics sometimes confuse Assumptions MLR.3 and MLR.4, but they are quite different. Assumption MLR.3 rules out certain relationships among the independent or explanatory variables and has nothing to do with the error, $u$. You will know immediately when carrying out OLS estimation whether or not Assumption MLR.3 holds. On the other hand, Assumption MLR.4 (the much more important of the two) restricts the relationship between the unobserved factors in $u$ and the explanatory variables. Unfortunately, we will never know for sure whether the average value of the unobserved factors is unrelated to the explanatory variables. But this is the critical assumption.

We are now ready to show unbiasedness of OLS under the first four multiple regression assumptions. As in the simple regression case, the expectations are conditional on the values of the explanatory variables in the sample, something we show explicitly in Appendix 3A but not in the text.

Theorem 3.1 (Unbiasedness of OLS)
Under Assumptions MLR.1 through MLR.4,
$$E(\hat{\beta}_j) = \beta_j, \quad j = 0, 1, \ldots, k, \qquad (3.37)$$
for any values of the population parameter $\beta_j$. In other words, the OLS estimators are unbiased estimators of the population parameters.
In our previous empirical examples, Assumption MLR.3 has been satisfied (because we have been able to compute the OLS estimates). Furthermore, for the most part, the samples are randomly chosen from a well-defined population. If we believe that the specified models are correct under the key Assumption MLR.4, then we can conclude that OLS is unbiased in these examples.

Since we are approaching the point where we can use multiple regression in serious empirical work, it is useful to remember the meaning of unbiasedness. It is tempting, in examples such as the wage equation in (3.19), to say something like "9.2% is an unbiased estimate of the return to education." As we know, an estimate cannot be unbiased: an estimate is a fixed number, obtained from a particular sample, which usually is not equal to the population parameter. When we say that OLS is unbiased under Assumptions MLR.1 through MLR.4, we mean that the procedure by which the OLS estimates are obtained is unbiased when we view the procedure as being applied across all possible random samples. We hope that we have obtained a sample that gives us an estimate close to the population value, but, unfortunately, this cannot be assured. What is assured is that we have no reason to believe our estimate is more likely to be too big or more likely to be too small.

3-3a Including Irrelevant Variables in a Regression Model

One issue that we can dispense with fairly quickly is that of inclusion of an irrelevant variable, or overspecifying the model, in multiple regression analysis. This means that one (or more) of the independent variables is included in the model even though it has no partial effect on $y$ in the population. (That is, its population coefficient is zero.)

To illustrate the issue, suppose we specify the model as

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + u, \qquad (3.38)$$

and this model satisfies Assumptions MLR.1 through MLR.4. However, $x_3$ has no effect on $y$ after $x_1$ and $x_2$ have been controlled for, which means that $\beta_3 = 0$. The variable $x_3$ may or may not be correlated with $x_1$ or $x_2$; all that matters is that, once $x_1$ and $x_2$ are controlled for, $x_3$ has no effect on $y$. In terms of conditional expectations, $E(y \mid x_1, x_2, x_3) = E(y \mid x_1, x_2) = \beta_0 + \beta_1 x_1 + \beta_2 x_2$.

Because we do not know that $\beta_3 = 0$, we are inclined to estimate the equation including $x_3$:

$$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x_1 + \hat{\beta}_2 x_2 + \hat{\beta}_3 x_3. \qquad (3.39)$$

We have included the irrelevant variable, $x_3$, in our regression. What is the effect of including $x_3$ in (3.39) when its coefficient in the population model (3.38) is zero? In terms of the unbiasedness of $\hat{\beta}_1$ and $\hat{\beta}_2$, there is no effect. This conclusion requires no special derivation, as it follows immediately from Theorem 3.1. Remember, unbiasedness means $E(\hat{\beta}_j) = \beta_j$ for any value of $\beta_j$, including $\beta_j = 0$. Thus, we can conclude that $E(\hat{\beta}_0) = \beta_0$, $E(\hat{\beta}_1) = \beta_1$, $E(\hat{\beta}_2) = \beta_2$, and $E(\hat{\beta}_3) = 0$ (for any values of $\beta_0$, $\beta_1$, and $\beta_2$). Even though $\hat{\beta}_3$ itself will never be exactly zero, its average value across all random samples will be zero.

The conclusion of the preceding example is much more general: including one or more irrelevant variables in a multiple regression model, or overspecifying the model, does not affect the unbiasedness of the OLS estimators. Does this mean it is harmless to include irrelevant variables? No. As we will see in Section 3-4, including irrelevant variables can have undesirable effects on the variances of the OLS estimators.
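Unbiasedness is a statement about the average of an estimator across repeated random samples, and a small Monte Carlo experiment makes the point concrete. In the sketch below (simulated design, for illustration only), $x_3$ is irrelevant ($\beta_3 = 0$) but correlated with $x_1$; across many samples, $\hat{\beta}_1$ averages to its true value and $\hat{\beta}_3$ averages to zero.

```python
import numpy as np

rng = np.random.default_rng(5)
n, reps = 100, 2000
b1_draws, b3_draws = [], []
for _ in range(reps):
    x1 = rng.normal(size=n)
    x2 = rng.normal(size=n)
    x3 = 0.5 * x1 + rng.normal(size=n)             # irrelevant but correlated with x1
    y = 1 + 2 * x1 - 1 * x2 + rng.normal(size=n)   # beta3 = 0 in the population
    X = np.column_stack([np.ones(n), x1, x2, x3])
    b = np.linalg.lstsq(X, y, rcond=None)[0]
    b1_draws.append(b[1])
    b3_draws.append(b[3])

print(np.mean(b1_draws))   # close to 2: beta1-hat remains unbiased
print(np.mean(b3_draws))   # close to 0: beta3-hat averages to zero
```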
3-3b Omitted Variable Bias: The Simple Case

Now suppose that, rather than including an irrelevant variable, we omit a variable that actually belongs in the true (or population) model. This is often called the problem of excluding a relevant variable or underspecifying the model. We claimed in Chapter 2 and earlier in this chapter that this problem generally causes the OLS estimators to be biased. It is time to show this explicitly and, just as importantly, to derive the direction and size of the bias.

Deriving the bias caused by omitting an important variable is an example of misspecification analysis. We begin with the case where the true population model has two explanatory variables and an error term:

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + u, \qquad (3.40)$$

and we assume that this model satisfies Assumptions MLR.1 through MLR.4.

Suppose that our primary interest is in $\beta_1$, the partial effect of $x_1$ on $y$. For example, $y$ is hourly wage (or log of hourly wage), $x_1$ is education, and $x_2$ is a measure of innate ability. In order to get an unbiased estimator of $\beta_1$, we should run a regression of $y$ on $x_1$ and $x_2$ (which gives unbiased estimators of $\beta_0$, $\beta_1$, and $\beta_2$). However, due to our ignorance or data unavailability, we estimate the model by excluding $x_2$. In other words, we perform a simple regression of $y$ on $x_1$ only, obtaining the equation

$$\tilde{y} = \tilde{\beta}_0 + \tilde{\beta}_1 x_1. \qquad (3.41)$$

We use the symbol "~" rather than "^" to emphasize that $\tilde{\beta}_1$ comes from an underspecified model.

When first learning about the omitted variable problem, it can be difficult to distinguish between the underlying true model [(3.40) in this case] and the model that we actually estimate, which is captured by the regression in (3.41). It may seem silly to omit the variable $x_2$ if it belongs in the model, but often we have no choice. For example, suppose that wage is determined by

$$wage = \beta_0 + \beta_1 educ + \beta_2 abil + u. \qquad (3.42)$$

Since ability is not observed, we instead estimate the model $wage = \beta_0 + \beta_1 educ + v$, where $v = \beta_2 abil + u$. The estimator of $\beta_1$ from the simple regression of wage on educ is what we are calling $\tilde{\beta}_1$.

We derive the expected value of $\tilde{\beta}_1$ conditional on the sample values of $x_1$ and $x_2$. Deriving this expectation is not difficult because $\tilde{\beta}_1$ is just the OLS slope estimator from a simple regression, and we have already studied this estimator extensively in Chapter 2. The difference here is that we must analyze its properties when the simple regression model is misspecified due to an omitted variable.

As it turns out, we have done almost all of the work to derive the bias in the simple regression estimator $\tilde{\beta}_1$. From equation (3.23), we have the algebraic relationship $\tilde{\beta}_1 = \hat{\beta}_1 + \hat{\beta}_2 \tilde{\delta}_1$, where $\hat{\beta}_1$ and $\hat{\beta}_2$ are the slope estimators (if we could have them) from the multiple regression

$$y_i \text{ on } x_{i1}, x_{i2}, \quad i = 1, \ldots, n, \qquad (3.43)$$

and $\tilde{\delta}_1$ is the slope from the simple regression

$$x_{i2} \text{ on } x_{i1}, \quad i = 1, \ldots, n. \qquad (3.44)$$

Because $\tilde{\delta}_1$ depends only on the independent variables in the sample, we treat it as fixed (nonrandom) when computing $E(\tilde{\beta}_1)$.
Further, since the model in (3.40) satisfies Assumptions MLR.1 through MLR.4, we know that $\hat{\beta}_1$ and $\hat{\beta}_2$ would be unbiased for $\beta_1$ and $\beta_2$, respectively. Therefore,

$$E(\tilde{\beta}_1) = E(\hat{\beta}_1 + \hat{\beta}_2 \tilde{\delta}_1) = E(\hat{\beta}_1) + E(\hat{\beta}_2)\tilde{\delta}_1 = \beta_1 + \beta_2 \tilde{\delta}_1, \qquad (3.45)$$

which implies the bias in $\tilde{\beta}_1$ is

$$\mathrm{Bias}(\tilde{\beta}_1) = E(\tilde{\beta}_1) - \beta_1 = \beta_2 \tilde{\delta}_1. \qquad (3.46)$$

Because the bias in this case arises from omitting the explanatory variable $x_2$, the term on the right-hand side of equation (3.46) is often called the omitted variable bias.

From equation (3.46), we see that there are two cases where $\tilde{\beta}_1$ is unbiased. The first is pretty obvious: if $\beta_2 = 0$, so that $x_2$ does not appear in the true model (3.40), then $\tilde{\beta}_1$ is unbiased. We already know this from the simple regression analysis in Chapter 2. The second case is more interesting: if $\tilde{\delta}_1 = 0$, then $\tilde{\beta}_1$ is unbiased for $\beta_1$, even if $\beta_2 \neq 0$.

Because $\tilde{\delta}_1$ is the sample covariance between $x_1$ and $x_2$ over the sample variance of $x_1$, $\tilde{\delta}_1 = 0$ if, and only if, $x_1$ and $x_2$ are uncorrelated in the sample. Thus, we have the important conclusion that, if $x_1$ and $x_2$ are uncorrelated in the sample, then $\tilde{\beta}_1$ is unbiased. This is not surprising: in Section 3-2, we showed that the simple regression estimator $\tilde{\beta}_1$ and the multiple regression estimator $\hat{\beta}_1$ are the same when $x_1$ and $x_2$ are uncorrelated in the sample. [We can also show that $\tilde{\beta}_1$ is unbiased without conditioning on the $x_{i2}$ if $E(x_2 \mid x_1) = E(x_2)$; then, for estimating $\beta_1$, leaving $x_2$ in the error term does not violate the zero conditional mean assumption for the error, once we adjust the intercept.]

When $x_1$ and $x_2$ are correlated, $\tilde{\delta}_1$ has the same sign as the correlation between $x_1$ and $x_2$: $\tilde{\delta}_1 > 0$ if $x_1$ and $x_2$ are positively correlated, and $\tilde{\delta}_1 < 0$ if $x_1$ and $x_2$ are negatively correlated. The sign of the bias in $\tilde{\beta}_1$ depends on the signs of both $\beta_2$ and $\tilde{\delta}_1$ and is summarized in Table 3.2 for the four possible cases when there is bias.

Table 3.2 Summary of Bias in $\tilde{\beta}_1$ When $x_2$ Is Omitted in Estimating Equation (3.40)

|               | $\mathrm{Corr}(x_1, x_2) > 0$ | $\mathrm{Corr}(x_1, x_2) < 0$ |
|---------------|-------------------------------|-------------------------------|
| $\beta_2 > 0$ | Positive bias                 | Negative bias                 |
| $\beta_2 < 0$ | Negative bias                 | Positive bias                 |

Table 3.2 warrants careful study. For example, the bias in $\tilde{\beta}_1$ is positive if $\beta_2 > 0$ ($x_2$ has a positive effect on $y$) and $x_1$ and $x_2$ are positively correlated; the bias is negative if $\beta_2 > 0$ and $x_1$ and $x_2$ are negatively correlated; and so on.

Table 3.2 summarizes the direction of the bias, but the size of the bias is also very important. A small bias of either sign need not be a cause for concern. For example, if the return to education in the population is 8.6% and the bias in the OLS estimator is 0.1 (a tenth of one percentage point), then we would not be very concerned. On the other hand, a bias on the order of three percentage points would be much more serious. The size of the bias is determined by the sizes of $\beta_2$ and $\tilde{\delta}_1$.

In practice, since $\beta_2$ is an unknown population parameter, we cannot be certain whether $\beta_2$ is positive or negative. Nevertheless, we usually have a pretty good idea about the direction of the partial effect of $x_2$ on $y$. Further, even though the sign of the correlation between $x_1$ and $x_2$ cannot be known if $x_2$ is not observed, in many cases, we can make an educated guess about whether $x_1$ and $x_2$ are positively or negatively correlated.
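Equations (3.45) and (3.46) can be checked by simulation. In the sketch below (made-up parameter values, for illustration), $x_2$ is omitted even though $\beta_2 > 0$ and $\mathrm{Corr}(x_1, x_2) > 0$; Table 3.2 then predicts an upward bias, and the Monte Carlo average of $\tilde{\beta}_1$ indeed sits near $\beta_1 + \beta_2 \tilde{\delta}_1$.

```python
import numpy as np

rng = np.random.default_rng(6)
n, reps = 200, 2000
beta1, beta2 = 1.0, 0.8
tilde_draws = []
for _ in range(reps):
    x1 = rng.normal(size=n)
    x2 = 0.6 * x1 + rng.normal(size=n)    # positively correlated with x1
    y = beta1 * x1 + beta2 * x2 + rng.normal(size=n)
    X = np.column_stack([np.ones(n), x1]) # x2 is omitted from the regression
    tilde_draws.append(np.linalg.lstsq(X, y, rcond=None)[0][1])

# In this design the population analogue of delta1-tilde is 0.6, so the
# bias is roughly beta2 * 0.6 = 0.48: the average is near 1.48, not 1.0
print(np.mean(tilde_draws))
```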
In the wage equation (3.42), by definition, more ability leads to higher productivity and therefore higher wages: $\beta_2 > 0$. Also, there are reasons to believe that educ and abil are positively correlated: on average, individuals with more innate ability choose higher levels of education. Thus, the OLS estimates from the simple regression equation $wage = \beta_0 + \beta_1 educ + v$ are, on average, too large. This does not mean that the estimate obtained from our sample is too big. We can only say that, if we collect many random samples and obtain the simple regression estimates each time, then the average of these estimates will be greater than $\beta_1$.

Example 3.6 Hourly Wage Equation

Suppose the model $\log(wage) = \beta_0 + \beta_1 educ + \beta_2 abil + u$ satisfies Assumptions MLR.1 through MLR.4. The data set in WAGE1 does not contain data on ability, so we estimate $\beta_1$ from the simple regression

$$\widehat{\log(wage)} = .584 + .083\,educ, \qquad n = 526, \quad R^2 = .186. \qquad (3.47)$$

This is the result from only a single sample, so we cannot say that .083 is greater than $\beta_1$; the true return to education could be lower or higher than 8.3% (and we will never know for sure). Nevertheless, we know that the average of the estimates across all random samples would be too large.

As a second example, suppose that, at the elementary school level, the average score for students on a standardized exam is determined by

$$avgscore = \beta_0 + \beta_1 expend + \beta_2 povrate + u, \qquad (3.48)$$

where expend is expenditure per student and povrate is the poverty rate of the children in the school. (Using school district data, we only have observations on the percentage of students with a passing grade and per-student expenditures; we do not have information on poverty rates.) Thus, we estimate $\beta_1$ from the simple regression of avgscore on expend.

We can again obtain the likely bias in $\tilde{\beta}_1$. First, $\beta_2$ is probably negative: there is ample evidence that children living in poverty score lower, on average, on standardized tests. Second, the average expenditure per student is probably negatively correlated with the poverty rate: the higher the poverty rate, the lower the average per-student spending, so that $\mathrm{Corr}(x_1, x_2) < 0$. From Table 3.2, $\tilde{\beta}_1$ will have a positive bias. This observation has important implications. It could be that the true effect of spending is zero; that is, $\beta_1 = 0$. However, the simple regression estimate of $\beta_1$ will usually be greater than zero, and this could lead us to conclude that expenditures are important when they are not.

When reading and performing empirical work in economics, it is important to master the terminology associated with biased estimators. In the context of omitting a variable from model (3.40), if $E(\tilde{\beta}_1) > \beta_1$, then we say that $\tilde{\beta}_1$ has an upward bias. When $E(\tilde{\beta}_1) < \beta_1$, $\tilde{\beta}_1$ has a downward bias. These definitions are the same whether $\beta_1$ is positive or negative. The phrase biased toward zero refers to cases where $E(\tilde{\beta}_1)$ is closer to zero than is $\beta_1$. Therefore, if $\beta_1$ is positive, then $\tilde{\beta}_1$ is biased toward zero if it has a downward bias. On the other hand, if $\beta_1 < 0$, then $\tilde{\beta}_1$ is biased toward zero if it has an upward bias.

3-3c Omitted Variable Bias: More General Cases
Deriving the sign of omitted variable bias when there are multiple regressors in the estimated model is more difficult. We must remember that correlation between a single explanatory variable and the error generally results in all OLS estimators being biased. For example, suppose the population model

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + u \qquad (3.49)$$

satisfies Assumptions MLR.1 through MLR.4. But we omit $x_3$ and estimate the model as

$$\tilde{y} = \tilde{\beta}_0 + \tilde{\beta}_1 x_1 + \tilde{\beta}_2 x_2. \qquad (3.50)$$

Now, suppose that $x_2$ and $x_3$ are uncorrelated, but that $x_1$ is correlated with $x_3$. In other words, $x_1$ is correlated with the omitted variable, but $x_2$ is not. It is tempting to think that, while $\tilde{\beta}_1$ is probably biased based on the derivation in the previous subsection, $\tilde{\beta}_2$ is unbiased because $x_2$ is uncorrelated with $x_3$. Unfortunately, this is not generally the case: both $\tilde{\beta}_1$ and $\tilde{\beta}_2$ will normally be biased. The only exception to this is when $x_1$ and $x_2$ are also uncorrelated.

Even in the fairly simple model above, it can be difficult to obtain the direction of bias in $\tilde{\beta}_1$ and $\tilde{\beta}_2$. This is because $x_1$, $x_2$, and $x_3$ can all be pairwise correlated. Nevertheless, an approximation is often practically useful. If we assume that $x_1$ and $x_2$ are uncorrelated, then we can study the bias in $\tilde{\beta}_1$ as if $x_2$ were absent from both the population and the estimated models. In fact, when $x_1$ and $x_2$ are uncorrelated, it can be shown that

$$E(\tilde{\beta}_1) = \beta_1 + \beta_3 \frac{\sum_{i=1}^{n} (x_{i1} - \bar{x}_1) x_{i3}}{\sum_{i=1}^{n} (x_{i1} - \bar{x}_1)^2}.$$

This is just like equation (3.45), but $\beta_3$ replaces $\beta_2$, and $x_3$ replaces $x_2$ in regression (3.44). Therefore, the bias in $\tilde{\beta}_1$ is obtained by replacing $\beta_2$ with $\beta_3$ and $x_2$ with $x_3$ in Table 3.2. If $\beta_3 > 0$ and $\mathrm{Corr}(x_1, x_3) > 0$, the bias in $\tilde{\beta}_1$ is positive, and so on.

As an example, suppose we add exper to the wage model: $wage = \beta_0 + \beta_1 educ + \beta_2 exper + \beta_3 abil + u$. If abil is omitted from the model, the estimators of both $\beta_1$ and $\beta_2$ are biased, even if we assume exper is uncorrelated with abil. We are mostly interested in the return to education, so it would be nice if we could conclude that $\tilde{\beta}_1$ has an upward or a downward bias due to omitted ability. This conclusion is not possible without further assumptions. As an approximation, let us suppose that, in addition to exper and abil being uncorrelated, educ and exper are also uncorrelated. (In reality, they are somewhat negatively correlated.) Since $\beta_3 > 0$ and educ and abil are positively correlated, $\tilde{\beta}_1$ would have an upward bias, just as if exper were not in the model.

The reasoning used in the previous example is often followed as a rough guide for obtaining the likely bias in estimators in more complicated models. Usually, the focus is on the relationship between a particular explanatory variable, say $x_1$, and the key omitted factor. Strictly speaking, ignoring all other explanatory variables is a valid practice only when each one is uncorrelated with $x_1$, but it is still a useful guide. Appendix 3A contains a more careful analysis of omitted variable bias with multiple explanatory variables.

3-4 The Variance of the OLS Estimators

We now obtain the variance of the OLS estimators so that, in addition to knowing the central tendencies of the $\hat{\beta}_j$, we also have a measure of the spread in its sampling distribution. Before finding the variances, we add a homoskedasticity assumption, as in Chapter 2. We do this for two reasons.
First, the formulas are simplified by imposing the constant error variance assumption. Second, in Section 3-5, we will see that OLS has an important efficiency property if we add the homoskedasticity assumption.

In the multiple regression framework, homoskedasticity is stated as follows:

Assumption MLR.5 (Homoskedasticity)
The error $u$ has the same variance given any value of the explanatory variables. In other words,
$$\mathrm{Var}(u \mid x_1, \ldots, x_k) = \sigma^2.$$

Assumption MLR.5 means that the variance in the error term, $u$, conditional on the explanatory variables, is the same for all combinations of outcomes of the explanatory variables. If this assumption fails, then the model exhibits heteroskedasticity, just as in the two-variable case.

In the equation

$$wage = \beta_0 + \beta_1 educ + \beta_2 exper + \beta_3 tenure + u,$$

homoskedasticity requires that the variance of the unobserved error, $u$, does not depend on the levels of education, experience, or tenure. That is, $\mathrm{Var}(u \mid educ, exper, tenure) = \sigma^2$. If this variance changes with any of the three explanatory variables, then heteroskedasticity is present.

Assumptions MLR.1 through MLR.5 are collectively known as the Gauss-Markov assumptions (for cross-sectional regression). So far, our statements of the assumptions are suitable only when applied to cross-sectional analysis with random sampling. As we will see, the Gauss-Markov assumptions for time series analysis, and for other situations such as panel data analysis, are more difficult to state, although there are many similarities.

In the discussion that follows, we will use the symbol $\mathbf{x}$ to denote the set of all independent variables, $(x_1, \ldots, x_k)$. Thus, in the wage regression with educ, exper, and tenure as independent variables, $\mathbf{x} = (educ, exper, tenure)$. Then we can write Assumptions MLR.1 and MLR.4 as

$$E(y \mid \mathbf{x}) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k,$$

and Assumption MLR.5 is the same as $\mathrm{Var}(y \mid \mathbf{x}) = \sigma^2$. Stating the assumptions in this way clearly illustrates how Assumption MLR.5 differs greatly from Assumption MLR.4. Assumption MLR.4 says that the expected value of $y$, given $\mathbf{x}$, is linear in the parameters, but it certainly depends on $x_1, x_2, \ldots, x_k$. Assumption MLR.5 says that the variance of $y$, given $\mathbf{x}$, does not depend on the values of the independent variables.

We can now obtain the variances of the $\hat{\beta}_j$, where we again condition on the sample values of the independent variables. The proof is in the appendix to this chapter.

Theorem 3.2 (Sampling Variances of the OLS Slope Estimators)
Under Assumptions MLR.1 through MLR.5, conditional on the sample values of the independent variables,
$$\mathrm{Var}(\hat{\beta}_j) = \frac{\sigma^2}{\mathrm{SST}_j (1 - R_j^2)} \qquad (3.51)$$
for $j = 1, 2, \ldots, k$, where $\mathrm{SST}_j = \sum_{i=1}^{n} (x_{ij} - \bar{x}_j)^2$ is the total sample variation in $x_j$, and $R_j^2$ is the R-squared from regressing $x_j$ on all other independent variables (and including an intercept).
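Formula (3.51) is straightforward to compute from the data once an estimate of $\sigma^2$ is available. A sketch of the calculation (our own helper function, not from the text; the cross-check against the equivalent matrix formula is shown as a comment):

```python
import numpy as np

def var_beta_j(X, j, sigma2):
    # X: design matrix whose first column is the constant; j: a slope column index
    xj = X[:, j]
    others = np.delete(X, j, axis=1)           # all other columns, incl. the constant
    g = np.linalg.lstsq(others, xj, rcond=None)[0]
    ssr_aux = np.sum((xj - others @ g)**2)     # SSR from the auxiliary regression
    sst_j = np.sum((xj - xj.mean())**2)        # total sample variation in x_j
    r2_j = 1 - ssr_aux / sst_j                 # R_j^2 in (3.51)
    return sigma2 / (sst_j * (1 - r2_j))

# Equivalent matrix form for comparison:
#   sigma2 * np.linalg.inv(X.T @ X)[j, j]
```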
The careful reader may be wondering whether there is a simple formula for the variance of $\hat{\beta}_j$ where we do not condition on the sample outcomes of the explanatory variables. The answer is: none that is useful. The formula in (3.51) is a highly nonlinear function of the $x_{ij}$, making averaging out across the population distribution of the explanatory variables virtually impossible. Fortunately, for any practical purpose, equation (3.51) is what we want. Even when we turn to approximate, large-sample properties of OLS in Chapter 5, it turns out that (3.51) estimates the quantity we need for large-sample analysis, provided Assumptions MLR.1 through MLR.5 hold.

Before we study equation (3.51) in more detail, it is important to know that all of the Gauss-Markov assumptions are used in obtaining this formula. Whereas we did not need the homoskedasticity assumption to conclude that OLS is unbiased, we do need it to justify equation (3.51).

The size of $\mathrm{Var}(\hat{\beta}_j)$ is practically important. A larger variance means a less precise estimator, and this translates into larger confidence intervals and less accurate hypotheses tests (as we will see in Chapter 4). In the next subsection, we discuss the elements comprising (3.51).

3-4a The Components of the OLS Variances: Multicollinearity

Equation (3.51) shows that the variance of $\hat{\beta}_j$ depends on three factors: $\sigma^2$, $\mathrm{SST}_j$, and $R_j^2$. Remember that the index $j$ simply denotes any one of the independent variables (such as education or poverty rate). We now consider each of the factors affecting $\mathrm{Var}(\hat{\beta}_j)$ in turn.

The Error Variance, $\sigma^2$. From equation (3.51), a larger $\sigma^2$ means larger sampling variances for the OLS estimators. This is not at all surprising: more "noise" in the equation (a larger $\sigma^2$) makes it more difficult to estimate the partial effect of any of the independent variables on $y$, and this is reflected in higher variances for the OLS slope estimators. Because $\sigma^2$ is a feature of the population, it has nothing to do with the sample size. It is the one component of (3.51) that is unknown. We will see later how to obtain an unbiased estimator of $\sigma^2$. For a given dependent variable $y$, there is really only one way to reduce the error variance, and that is to add more explanatory variables to the equation (take some factors out of the error term). Unfortunately, it is not always possible to find additional legitimate factors that affect $y$.

The Total Sample Variation in $x_j$, $\mathrm{SST}_j$. From equation (3.51), we see that the larger the total variation in $x_j$ is, the smaller is $\mathrm{Var}(\hat{\beta}_j)$. Thus, everything else being equal, for estimating $\beta_j$, we prefer to have as much sample variation in $x_j$ as possible. We already discovered this in the simple regression case in Chapter 2. Although it is rarely possible for us to choose the sample values of the independent variables, there is a way to increase the sample variation in each of the independent variables: increase the sample size. In fact, when one randomly samples from a population, $\mathrm{SST}_j$ increases without bound as the sample size increases (roughly as a linear function of $n$). This is the component of the variance that systematically depends on the sample size.

When $\mathrm{SST}_j$ is small, $\mathrm{Var}(\hat{\beta}_j)$ can get very large, but a small $\mathrm{SST}_j$ is not a violation of Assumption MLR.3. Technically, as $\mathrm{SST}_j$ goes to zero, $\mathrm{Var}(\hat{\beta}_j)$ approaches infinity. The extreme case of no sample variation in $x_j$, $\mathrm{SST}_j = 0$, is not allowed by Assumption MLR.3.

The Linear Relationships among the Independent Variables, $R_j^2$. The term $R_j^2$ in equation (3.51) is the most difficult of the three components to understand. This term does not appear in simple regression analysis because there is only one independent variable in such cases. It is important to see that this R-squared is distinct from the R-squared in the regression of $y$ on $x_1, x_2, \ldots, x_k$:
$R_j^2$ is obtained from a regression involving only the independent variables in the original model, where $x_j$ plays the role of a dependent variable.

Consider first the $k = 2$ case: $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + u$. Then $\mathrm{Var}(\hat{\beta}_1) = \sigma^2/[\mathrm{SST}_1(1 - R_1^2)]$, where $R_1^2$ is the R-squared from the simple regression of $x_1$ on $x_2$ (and an intercept, as always). Because the R-squared measures goodness-of-fit, a value of $R_1^2$ close to one indicates that $x_2$ explains much of the variation in $x_1$ in the sample. This means that $x_1$ and $x_2$ are highly correlated.

As $R_1^2$ increases to one, $\mathrm{Var}(\hat{\beta}_1)$ gets larger and larger. Thus, a high degree of linear relationship between $x_1$ and $x_2$ can lead to large variances for the OLS slope estimators. (A similar argument applies to $\hat{\beta}_2$.) See Figure 3.1 for the relationship between $\mathrm{Var}(\hat{\beta}_1)$ and the R-squared from the regression of $x_1$ on $x_2$.

[Figure 3.1: $\mathrm{Var}(\hat{\beta}_1)$ as a function of $R_1^2$; with $\sigma^2$ and $\mathrm{SST}_1$ held fixed, the variance rises slowly at first and then explodes as $R_1^2$ approaches one.]

In the general case, $R_j^2$ is the proportion of the total variation in $x_j$ that can be explained by the other independent variables appearing in the equation. For a given $\sigma^2$ and $\mathrm{SST}_j$, the smallest $\mathrm{Var}(\hat{\beta}_j)$ is obtained when $R_j^2 = 0$, which happens if, and only if, $x_j$ has zero sample correlation with every other independent variable. This is the best case for estimating $\beta_j$, but it is rarely encountered.

The other extreme case, $R_j^2 = 1$, is ruled out by Assumption MLR.3, because $R_j^2 = 1$ means that, in the sample, $x_j$ is a perfect linear combination of some of the other independent variables in the regression. A more relevant case is when $R_j^2$ is "close" to one. From equation (3.51) and Figure 3.1, we see that this can cause $\mathrm{Var}(\hat{\beta}_j)$ to be large: $\mathrm{Var}(\hat{\beta}_j) \to \infty$ as $R_j^2 \to 1$. High (but not perfect) correlation between two or more independent variables is called multicollinearity.

Before we discuss the multicollinearity issue further, it is important to be very clear on one thing: a case where $R_j^2$ is close to one is not a violation of Assumption MLR.3.

Since multicollinearity violates none of our assumptions, the "problem" of multicollinearity is not really well defined. When we say that multicollinearity arises for estimating $\beta_j$ when $R_j^2$ is "close" to one, we put "close" in quotation marks because there is no absolute number that we can cite to conclude that multicollinearity is a problem. For example, $R_j^2 = .9$ means that 90% of the sample variation in $x_j$ can be explained by the other independent variables in the regression model. Unquestionably, this means that $x_j$ has a strong linear relationship to the other independent variables. But whether this translates into a $\mathrm{Var}(\hat{\beta}_j)$ that is too large to be useful depends on the sizes of $\sigma^2$ and $\mathrm{SST}_j$. As we will see in Chapter 4, for statistical inference, what ultimately matters is how big $\hat{\beta}_j$ is in relation to its standard deviation.
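A quick Monte Carlo experiment shows the variance inflation at work (simulated design, for illustration): the same model is estimated with $x_1$ and $x_2$ uncorrelated and then highly correlated, and the sampling standard deviation of $\hat{\beta}_1$ rises sharply in the second case.

```python
import numpy as np

rng = np.random.default_rng(7)
n, reps = 100, 2000
for rho in (0.0, 0.95):
    draws = []
    for _ in range(reps):
        x2 = rng.normal(size=n)
        x1 = rho * x2 + np.sqrt(1 - rho**2) * rng.normal(size=n)
        y = 1 + 1.0 * x1 + 1.0 * x2 + rng.normal(size=n)
        X = np.column_stack([np.ones(n), x1, x2])
        draws.append(np.linalg.lstsq(X, y, rcond=None)[0][1])
    # With rho = 0.95, R_1^2 is about .9 and the sd is roughly 3 times larger
    print(rho, np.std(draws))
```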
Just as a large value of $R_j^2$ can cause a large $\mathrm{Var}(\hat{\beta}_j)$, so can a small value of $\mathrm{SST}_j$. Therefore, a small sample size can lead to large sampling variances, too. Worrying about high degrees of correlation among the independent variables in the sample is really no different from worrying about a small sample size: both work to increase $\mathrm{Var}(\hat{\beta}_j)$. The famous University of Wisconsin econometrician Arthur Goldberger, reacting to econometricians' obsession with multicollinearity, has (tongue in cheek) coined the term micronumerosity, which he defines as the "problem of small sample size." [For an engaging discussion of multicollinearity and micronumerosity, see Goldberger (1991).]

Although the problem of multicollinearity cannot be clearly defined, one thing is clear: everything else being equal, for estimating $\beta_j$, it is better to have less correlation between $x_j$ and the other independent variables. This observation often leads to a discussion of how to "solve" the multicollinearity problem. In the social sciences, where we are usually passive collectors of data, there is no good way to reduce variances of unbiased estimators other than to collect more data. For a given data set, we can try dropping other independent variables from the model in an effort to reduce multicollinearity. Unfortunately, dropping a variable that belongs in the population model can lead to bias, as we saw in Section 3-3.

Perhaps an example at this point will help clarify some of the issues raised concerning multicollinearity. Suppose we are interested in estimating the effect of various school expenditure categories on student performance. It is likely that expenditures on teacher salaries, instructional materials, athletics, and so on are highly correlated: wealthier schools tend to spend more on everything, and poorer schools spend less on everything. Not surprisingly, it can be difficult to estimate the effect of any particular expenditure category on student performance when there is little variation in one category that cannot largely be explained by variations in the other expenditure categories (this leads to high $R_j^2$ for each of the expenditure variables). Such multicollinearity problems can be mitigated by collecting more data, but in a sense we have imposed the problem on ourselves: we are asking questions that may be too subtle for the available data to answer with any precision. We can probably do much better by changing the scope of the analysis and lumping all expenditure categories together, since we would no longer be trying to estimate the partial effect of each separate category.

Another important point is that a high degree of correlation between certain independent variables can be irrelevant as to how well we can estimate other parameters in the model. For example, consider a model with three independent variables, $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + u$, where $x_2$ and $x_3$ are highly correlated. Then $\mathrm{Var}(\hat{\beta}_2)$ and $\mathrm{Var}(\hat{\beta}_3)$ may be large. But the amount of correlation between $x_2$ and $x_3$ has no direct effect on $\mathrm{Var}(\hat{\beta}_1)$. In fact, if $x_1$ is uncorrelated with $x_2$ and $x_3$, then $R_1^2 = 0$ and $\mathrm{Var}(\hat{\beta}_1) = \sigma^2/\mathrm{SST}_1$, regardless of how much correlation there is between $x_2$ and $x_3$. If $\beta_1$ is the parameter of interest, we do not really care about the amount of correlation between $x_2$ and $x_3$.
The previous observation is important because economists often include many control variables in order to isolate the causal effect of a particular variable. For example, in looking at the relationship between loan approval rates and the percentage of minorities in a neighborhood, we might include variables like average income, average housing value, measures of creditworthiness, and so on, because these factors need to be accounted for in order to draw causal conclusions about discrimination. Income, housing prices, and creditworthiness are generally highly correlated with each other. But high correlations among these controls do not make it more difficult to determine the effects of discrimination.

Exploring Further 3.4
Suppose you postulate a model explaining final exam score in terms of class attendance. Thus, the dependent variable is final exam score, and the key explanatory variable is number of classes attended. To control for student abilities and efforts outside the classroom, you include among the explanatory variables cumulative GPA, SAT score, and measures of high school performance. Someone says, "You cannot hope to learn anything from this exercise because cumulative GPA, SAT score, and high school performance are likely to be highly collinear." What should be your response?

Some researchers find it useful to compute statistics intended to determine the severity of multicollinearity in a given application. Unfortunately, it is easy to misuse such statistics because, as we have discussed, we cannot specify how much correlation among explanatory variables is "too much." Some multicollinearity "diagnostics" are omnibus statistics in the sense that they detect a strong linear relationship among any subset of explanatory variables. For reasons that we just saw, such statistics are of questionable value because they might reveal a "problem" simply because two control variables, whose coefficients we do not care about, are highly correlated. [Probably the most common omnibus multicollinearity statistic is the so-called condition number, which is defined in terms of the full data matrix and is beyond the scope of this text. See, for example, Belsley, Kuh, and Welsh (1980).]

Somewhat more useful, but still prone to misuse, are statistics for individual coefficients. The most common of these is the variance inflation factor (VIF), which is obtained directly from equation (3.51). The VIF for slope coefficient $j$ is simply $\mathrm{VIF}_j = 1/(1 - R_j^2)$, precisely the term in $\mathrm{Var}(\hat{\beta}_j)$ that is determined by correlation between $x_j$ and the other explanatory variables. We can write $\mathrm{Var}(\hat{\beta}_j)$ in equation (3.51) as

$$\mathrm{Var}(\hat{\beta}_j) = \frac{\sigma^2}{\mathrm{SST}_j} \cdot \mathrm{VIF}_j,$$

which shows that $\mathrm{VIF}_j$ is the factor by which $\mathrm{Var}(\hat{\beta}_j)$ is higher because $x_j$ is not uncorrelated with the other explanatory variables.
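Because $\mathrm{VIF}_j$ comes straight from the auxiliary regression of $x_j$ on the other regressors, it is simple to compute by hand. A sketch (our own helper, not a particular package's API):

```python
import numpy as np

def vif(X, j):
    # X: design matrix whose first column is the constant; j: a slope column index
    xj = X[:, j]
    others = np.delete(X, j, axis=1)
    g = np.linalg.lstsq(others, xj, rcond=None)[0]
    r2_j = 1 - np.sum((xj - others @ g)**2) / np.sum((xj - xj.mean())**2)
    return 1.0 / (1.0 - r2_j)       # VIF_j = 1 / (1 - R_j^2)
```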
Because $\text{VIF}_j$ is a function of $R_j^2$ — indeed, Figure 3.1 is essentially a graph of $\text{VIF}_1$ — our previous discussion can be cast entirely in terms of the VIF. For example, if we had the choice, we would like $\text{VIF}_j$ to be smaller (other things equal). But we rarely have the choice. If we think certain explanatory variables need to be included in a regression to infer causality of $x_j$, then we are hesitant to drop them, and whether we think $\text{VIF}_j$ is "too high" cannot really affect that decision. If, say, our main interest is in the causal effect of $x_1$ on $y$, then we should ignore entirely the VIFs of other coefficients.

Finally, setting a cutoff value for VIF above which we conclude multicollinearity is a "problem" is arbitrary and not especially helpful. Sometimes the value 10 is chosen: if $\text{VIF}_j$ is above 10 (equivalently, $R_j^2$ is above .9), then we conclude that multicollinearity is a "problem" for estimating $\beta_j$. But a $\text{VIF}_j$ above 10 does not mean that the standard deviation of $\hat{\beta}_j$ is too large to be useful, because the standard deviation also depends on $\sigma$ and $\text{SST}_j$, and the latter can be increased by increasing the sample size. Therefore, just as with looking at the size of $R_j^2$ directly, looking at the size of $\text{VIF}_j$ is of limited use, although one might want to do so out of curiosity.

3-4b Variances in Misspecified Models

The choice of whether to include a particular variable in a regression model can be made by analyzing the tradeoff between bias and variance. In Section 3-3, we derived the bias induced by leaving out a relevant variable when the true model contains two explanatory variables. We continue the analysis of this model by comparing the variances of the OLS estimators.

Write the true population model, which satisfies the Gauss-Markov assumptions, as

$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + u.$

We consider two estimators of $\beta_1$. The estimator $\hat{\beta}_1$ comes from the multiple regression

$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x_1 + \hat{\beta}_2 x_2.$  (3.52)

In other words, we include $x_2$, along with $x_1$, in the regression model. The estimator $\tilde{\beta}_1$ is obtained by omitting $x_2$ from the model and running a simple regression of $y$ on $x_1$:

$\tilde{y} = \tilde{\beta}_0 + \tilde{\beta}_1 x_1.$  (3.53)

When $\beta_2 \neq 0$, equation (3.53) excludes a relevant variable from the model and, as we saw in Section 3-3, this induces a bias in $\tilde{\beta}_1$ (unless $x_1$ and $x_2$ are uncorrelated). On the other hand, $\hat{\beta}_1$ is unbiased for $\beta_1$ for any value of $\beta_2$, including $\beta_2 = 0$. It follows that, if bias is used as the only criterion, $\hat{\beta}_1$ is preferred to $\tilde{\beta}_1$.

The conclusion that $\hat{\beta}_1$ is always preferred to $\tilde{\beta}_1$ does not carry over when we bring variance into the picture. Conditioning on the values of $x_1$ and $x_2$ in the sample, we have, from (3.51),

$\text{Var}(\hat{\beta}_1) = \sigma^2 / [\text{SST}_1 (1 - R_1^2)],$  (3.54)

where $\text{SST}_1$ is the total variation in $x_1$, and $R_1^2$ is the R-squared from the regression of $x_1$ on $x_2$. Further, a simple modification of the proof in Chapter 2 for two-variable regression shows that

$\text{Var}(\tilde{\beta}_1) = \sigma^2 / \text{SST}_1.$  (3.55)

Comparing (3.55) to (3.54) shows that $\text{Var}(\tilde{\beta}_1)$ is always smaller than $\text{Var}(\hat{\beta}_1)$, unless $x_1$ and $x_2$ are uncorrelated in the sample, in which case the two estimators $\tilde{\beta}_1$ and $\hat{\beta}_1$ are the same. Assuming that $x_1$ and $x_2$ are not uncorrelated, we can draw the following conclusions:

1. When $\beta_2 \neq 0$, $\tilde{\beta}_1$ is biased, $\hat{\beta}_1$ is unbiased, and $\text{Var}(\tilde{\beta}_1) < \text{Var}(\hat{\beta}_1)$.
2. When $\beta_2 = 0$, $\tilde{\beta}_1$ and $\hat{\beta}_1$ are both unbiased, and $\text{Var}(\tilde{\beta}_1) < \text{Var}(\hat{\beta}_1)$.

From the second conclusion, it is clear that $\tilde{\beta}_1$ is preferred if $\beta_2 = 0$. Intuitively, if $x_2$ does not have a partial effect on $y$, then including it in the model can only exacerbate the multicollinearity problem, which leads to a less efficient estimator of $\beta_1$. A higher variance for the estimator of $\beta_1$ is the cost of including an irrelevant variable in a model.
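Conclusion 2 shows up clearly in a small simulation. In the sketch below (simulated data; parameter values hypothetical), $x_2$ is correlated with $x_1$ but $\beta_2 = 0$, so both estimators are unbiased while the regression that includes the irrelevant $x_2$ is noticeably noisier:

```python
# Sketch of conclusion 2: when beta2 = 0, including the irrelevant x2
# (correlated with x1) inflates the sampling variance of the estimator
# of beta1. Simulated data; not from the text.
import numpy as np

rng = np.random.default_rng(2)
n, reps, beta1 = 100, 5000, 1.0
short, long_ = [], []
for _ in range(reps):
    x1 = rng.normal(size=n)
    x2 = 0.9 * x1 + 0.3 * rng.normal(size=n)   # correlated with x1, but beta2 = 0
    y = 1 + beta1 * x1 + rng.normal(size=n)
    Xs = np.column_stack([np.ones(n), x1])      # omit x2
    Xl = np.column_stack([np.ones(n), x1, x2])  # include x2
    short.append(np.linalg.lstsq(Xs, y, rcond=None)[0][1])
    long_.append(np.linalg.lstsq(Xl, y, rcond=None)[0][1])

# Both are centered near 1.0, but the long regression is much noisier
# (here R_1^2 is about .9, so VIF_1 is about 10).
print(np.mean(short), np.std(short))
print(np.mean(long_), np.std(long_))
```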
The case where $\beta_2 \neq 0$ is more difficult. Leaving $x_2$ out of the model results in a biased estimator of $\beta_1$. Traditionally, econometricians have suggested comparing the likely size of the bias due to omitting $x_2$ with the reduction in the variance — summarized in the size of $R_1^2$ — to decide whether $x_2$ should be included. However, when $\beta_2 \neq 0$, there are two favorable reasons for including $x_2$ in the model. The most important of these is that any bias in $\tilde{\beta}_1$ does not shrink as the sample size grows; in fact, the bias does not necessarily follow any pattern. Therefore, we can usefully think of the bias as being roughly the same for any sample size. On the other hand, $\text{Var}(\tilde{\beta}_1)$ and $\text{Var}(\hat{\beta}_1)$ both shrink to zero as $n$ gets large, which means that the multicollinearity induced by adding $x_2$ becomes less important as the sample size grows. In large samples, we would prefer $\hat{\beta}_1$.

The other reason for favoring $\hat{\beta}_1$ is more subtle. The variance formula in (3.55) is conditional on the values of $x_{i1}$ and $x_{i2}$ in the sample, which provides the best scenario for $\tilde{\beta}_1$. When $\beta_2 \neq 0$, the variance of $\tilde{\beta}_1$ conditional only on $x_1$ is larger than that presented in (3.55). Intuitively, when $\beta_2 \neq 0$ and $x_2$ is excluded from the model, the error variance increases because the error effectively contains part of $x_2$. But the expression in equation (3.55) ignores the increase in the error variance because it treats both regressors as nonrandom. For practical purposes, the $\sigma^2$ term in equation (3.55) increases when $x_2$ is dropped from the equation. A full discussion of the proper conditioning argument when computing the OLS variances would lead us too far astray. Suffice it to say that equation (3.55) is too generous when it comes to measuring the precision of $\tilde{\beta}_1$. Fortunately, statistical packages report the proper variance estimator, and so we need not worry about the subtleties in the theoretical formulas. After reading the next subsection, you might want to study Problems 14 and 15 for further insight.

3-4c Estimating σ²: Standard Errors of the OLS Estimators

We now show how to choose an unbiased estimator of $\sigma^2$, which then allows us to obtain unbiased estimators of $\text{Var}(\hat{\beta}_j)$.

Because $\sigma^2 = \text{E}(u^2)$, an unbiased "estimator" of $\sigma^2$ is the sample average of the squared errors: $n^{-1}\sum_{i=1}^n u_i^2$. Unfortunately, this is not a true estimator because we do not observe the $u_i$. Nevertheless, recall that the errors can be written as $u_i = y_i - \beta_0 - \beta_1 x_{i1} - \beta_2 x_{i2} - \cdots - \beta_k x_{ik}$, and so the reason we do not observe the $u_i$ is that we do not know the $\beta_j$. When we replace each $\beta_j$ with its OLS estimator, we get the OLS residuals:

$\hat{u}_i = y_i - \hat{\beta}_0 - \hat{\beta}_1 x_{i1} - \hat{\beta}_2 x_{i2} - \cdots - \hat{\beta}_k x_{ik}.$

It seems natural to estimate $\sigma^2$ by replacing $u_i$ with the $\hat{u}_i$. In the simple regression case, we saw that this leads to a biased estimator. The unbiased estimator of $\sigma^2$ in the general multiple regression case is

$\hat{\sigma}^2 = \Big( \sum_{i=1}^n \hat{u}_i^2 \Big) \Big/ (n - k - 1) = \text{SSR}/(n - k - 1).$  (3.56)

We already encountered this estimator in the $k = 1$ case in simple regression.
The term $n - k - 1$ in (3.56) is the degrees of freedom (df) for the general OLS problem with $n$ observations and $k$ independent variables. Since there are $k + 1$ parameters in a regression model with $k$ independent variables and an intercept, we can write

$\text{df} = n - (k + 1) = (\text{number of observations}) - (\text{number of estimated parameters}).$  (3.57)

This is the easiest way to compute the degrees of freedom in a particular application: count the number of parameters, including the intercept, and subtract this amount from the number of observations. (In the rare case that an intercept is not estimated, the number of parameters decreases by one.)

Technically, the division by $n - k - 1$ in (3.56) comes from the fact that the expected value of the sum of squared residuals is $\text{E}(\text{SSR}) = (n - k - 1)\sigma^2$. Intuitively, we can figure out why the degrees of freedom adjustment is necessary by returning to the first order conditions for the OLS estimators. These can be written $\sum_{i=1}^n \hat{u}_i = 0$ and $\sum_{i=1}^n x_{ij}\hat{u}_i = 0$, where $j = 1, 2, \ldots, k$. Thus, in obtaining the OLS estimates, $k + 1$ restrictions are imposed on the OLS residuals. This means that, given $n - (k + 1)$ of the residuals, the remaining $k + 1$ residuals are known: there are only $n - (k + 1)$ degrees of freedom in the residuals. (This can be contrasted with the errors $u_i$, which have $n$ degrees of freedom in the sample.)

For reference, we summarize this discussion with Theorem 3.3. We proved this theorem for the case of simple regression analysis in Chapter 2 (see Theorem 2.3). (A general proof that requires matrix algebra is provided in Appendix E.)

Theorem 3.3 (Unbiased Estimation of σ²)
Under the Gauss-Markov assumptions MLR.1 through MLR.5, $\text{E}(\hat{\sigma}^2) = \sigma^2$.

The positive square root of $\hat{\sigma}^2$, denoted $\hat{\sigma}$, is called the standard error of the regression (SER). The SER is an estimator of the standard deviation of the error term. This estimate is usually reported by regression packages, although it is called different things by different packages. (In addition to SER, $\hat{\sigma}$ is also called the standard error of the estimate and the root mean squared error.)

Note that $\hat{\sigma}$ can either decrease or increase when another independent variable is added to a regression (for a given sample). This is because, although SSR must fall when another explanatory variable is added, the degrees of freedom also falls by one. Because SSR is in the numerator and df is in the denominator, we cannot tell beforehand which effect will dominate.

For constructing confidence intervals and conducting tests in Chapter 4, we will need to estimate the standard deviation of $\hat{\beta}_j$, which is just the square root of the variance:

$\text{sd}(\hat{\beta}_j) = \sigma \big/ [\text{SST}_j (1 - R_j^2)]^{1/2}.$

Since $\sigma$ is unknown, we replace it with its estimator, $\hat{\sigma}$. This gives us the standard error of $\hat{\beta}_j$:

$\text{se}(\hat{\beta}_j) = \hat{\sigma} \big/ [\text{SST}_j (1 - R_j^2)]^{1/2}.$  (3.58)

Just as the OLS estimates can be obtained for any given sample, so can the standard errors. Since $\text{se}(\hat{\beta}_j)$ depends on $\hat{\sigma}$, the standard error has a sampling distribution, which will play a role in Chapter 4.
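Equations (3.56) and (3.58) can be verified by hand. A minimal sketch on simulated data (not from the text): compute $\hat{\sigma}^2 = \text{SSR}/(n - k - 1)$ from the OLS residuals, then build $\text{se}(\hat{\beta}_1)$ from the auxiliary regression of $x_1$ on the other regressor, since $\text{SST}_1(1 - R_1^2)$ equals the sum of squared residuals from that regression:

```python
# Sketch: sigma_hat^2 = SSR/(n - k - 1) as in (3.56) and the standard
# error in (3.58), computed by hand. Simulated data; not from the text.
import numpy as np

rng = np.random.default_rng(3)
n, k = 200, 2
x1, x2 = rng.normal(size=n), rng.normal(size=n)
y = 1 + 0.5 * x1 - 0.3 * x2 + rng.normal(size=n)

X = np.column_stack([np.ones(n), x1, x2])
beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
resid = y - X @ beta_hat
sigma2_hat = resid @ resid / (n - k - 1)            # SSR / df, as in (3.56)

# se(beta1_hat) via (3.58): SST_1 and R_1^2 from regressing x1 on x2
Z = np.column_stack([np.ones(n), x2])
r1 = x1 - Z @ np.linalg.lstsq(Z, x1, rcond=None)[0]  # residuals of x1 on x2
sst1 = ((x1 - x1.mean()) ** 2).sum()
r2_1 = 1 - (r1 @ r1) / sst1
se_beta1 = np.sqrt(sigma2_hat / (sst1 * (1 - r2_1)))
print(beta_hat[1], se_beta1)
```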
We should emphasize one thing about standard errors. Because (3.58) is obtained directly from the variance formula in (3.51), and because (3.51) relies on the homoskedasticity Assumption MLR.5, it follows that the standard error formula in (3.58) is not a valid estimator of $\text{sd}(\hat{\beta}_j)$ if the errors exhibit heteroskedasticity. Thus, while the presence of heteroskedasticity does not cause bias in the $\hat{\beta}_j$, it does lead to bias in the usual formula for $\text{Var}(\hat{\beta}_j)$, which then invalidates the standard errors. This is important because any regression package computes (3.58) as the default standard error for each coefficient (with a somewhat different representation for the intercept). If we suspect heteroskedasticity, then the "usual" OLS standard errors are invalid, and some corrective action should be taken. We will see in Chapter 8 what methods are available for dealing with heteroskedasticity.

For some purposes it is helpful to write

$\text{se}(\hat{\beta}_j) = \frac{\hat{\sigma}}{\sqrt{n}\,\text{sd}(x_j)\sqrt{1 - R_j^2}},$  (3.59)

in which we take $\text{sd}(x_j) = \sqrt{n^{-1}\sum_{i=1}^n (x_{ij} - \bar{x}_j)^2}$ to be the sample standard deviation where the total sum of squares is divided by $n$ rather than $n - 1$. The importance of equation (3.59) is that it shows how the sample size, $n$, directly affects the standard errors. The other three terms in the formula — $\hat{\sigma}$, $\text{sd}(x_j)$, and $R_j^2$ — will change with different samples, but as $n$ gets large they settle down to constants. Therefore, we can see from equation (3.59) that the standard errors shrink to zero at the rate $1/\sqrt{n}$. This formula demonstrates the value of getting more data: the precision of the $\hat{\beta}_j$ increases as $n$ increases. (By contrast, recall that unbiasedness holds for any sample size subject to being able to compute the estimators.) We will talk more about large sample properties of OLS in Chapter 5.
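The $1/\sqrt{n}$ rate in (3.59) is visible even in a crude experiment; in the sketch below (simulated data; not from the text), each fourfold increase in $n$ roughly halves the standard error:

```python
# Sketch: standard errors shrink at rate 1/sqrt(n), as (3.59) shows.
# Simulated data; not from the text.
import numpy as np

rng = np.random.default_rng(4)

def se_slope(n):
    x = rng.normal(size=n)
    y = 1 + 0.5 * x + rng.normal(size=n)
    X = np.column_stack([np.ones(n), x])
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    resid = y - X @ beta
    sigma2 = resid @ resid / (n - 2)                # SSR / df
    return np.sqrt(sigma2 / ((x - x.mean()) ** 2).sum())

for n in (100, 400, 1600):
    print(n, se_slope(n))   # each fourfold increase in n roughly halves se
```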
3-5 Efficiency of OLS: The Gauss-Markov Theorem

In this section, we state and discuss the important Gauss-Markov Theorem, which justifies the use of the OLS method rather than using a variety of competing estimators. We know one justification for OLS already: under Assumptions MLR.1 through MLR.4, OLS is unbiased. However, there are many unbiased estimators of the $\beta_j$ under these assumptions (for example, see Problem 13). Might there be other unbiased estimators with variances smaller than the OLS estimators?

If we limit the class of competing estimators appropriately, then we can show that OLS is best within this class. Specifically, we will argue that, under Assumptions MLR.1 through MLR.5, the OLS estimator $\hat{\beta}_j$ for $\beta_j$ is the best linear unbiased estimator (BLUE). To state the theorem, we need to understand each component of the acronym "BLUE." First, we know what an estimator is: it is a rule that can be applied to any sample of data to produce an estimate. We also know what an unbiased estimator is: in the current context, an estimator, say $\tilde{\beta}_j$, of $\beta_j$ is an unbiased estimator of $\beta_j$ if $\text{E}(\tilde{\beta}_j) = \beta_j$ for any $\beta_0, \beta_1, \ldots, \beta_k$.

What about the meaning of the term "linear"? In the current context, an estimator $\tilde{\beta}_j$ of $\beta_j$ is linear if, and only if, it can be expressed as a linear function of the data on the dependent variable:

$\tilde{\beta}_j = \sum_{i=1}^n w_{ij} y_i,$  (3.60)

where each $w_{ij}$ can be a function of the sample values of all the independent variables. The OLS estimators are linear, as can be seen from equation (3.22).

Finally, how do we define "best"? For the current theorem, best is defined as having the smallest variance. Given two unbiased estimators, it is logical to prefer the one with the smallest variance (see Appendix C).

Now, let $\hat{\beta}_0, \hat{\beta}_1, \ldots, \hat{\beta}_k$ denote the OLS estimators in model (3.31) under Assumptions MLR.1 through MLR.5. The Gauss-Markov Theorem says that, for any estimator $\tilde{\beta}_j$ that is linear and unbiased, $\text{Var}(\hat{\beta}_j) \le \text{Var}(\tilde{\beta}_j)$, and the inequality is usually strict. In other words, in the class of linear unbiased estimators, OLS has the smallest variance (under the five Gauss-Markov assumptions). Actually, the theorem says more than this. If we want to estimate any linear function of the $\beta_j$, then the corresponding linear combination of the OLS estimators achieves the smallest variance among all linear unbiased estimators. We conclude with a theorem, which is proven in Appendix 3A.

Theorem 3.4 (Gauss-Markov Theorem)
Under Assumptions MLR.1 through MLR.5, $\hat{\beta}_0, \hat{\beta}_1, \ldots, \hat{\beta}_k$ are the best linear unbiased estimators (BLUEs) of $\beta_0, \beta_1, \ldots, \beta_k$, respectively.

It is because of this theorem that Assumptions MLR.1 through MLR.5 are known as the Gauss-Markov assumptions (for cross-sectional analysis).

The importance of the Gauss-Markov Theorem is that, when the standard set of assumptions holds, we need not look for alternative unbiased estimators of the form in (3.60): none will be better than OLS. Equivalently, if we are presented with an estimator that is both linear and unbiased, then we know that the variance of this estimator is at least as large as the OLS variance; no additional calculation is needed to show this.

For our purposes, Theorem 3.4 justifies the use of OLS to estimate multiple regression models. If any of the Gauss-Markov assumptions fail, then this theorem no longer holds. We already know that failure of the zero conditional mean assumption (Assumption MLR.4) causes OLS to be biased, so Theorem 3.4 also fails. We also know that heteroskedasticity (failure of Assumption MLR.5) does not cause OLS to be biased. However, OLS no longer has the smallest variance among linear unbiased estimators in the presence of heteroskedasticity. In Chapter 8, we analyze an estimator that improves upon OLS when we know the form of heteroskedasticity.
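One way to see the theorem at work is to pit OLS against another linear unbiased estimator. The sketch below (simulated data; not from the text) uses the estimator from Problem 13 with $z_i = g(x_i) = \log(x_i)$; it is unbiased under MLR.1 through MLR.4 but has a visibly larger sampling variance than OLS:

```python
# Sketch of the Gauss-Markov idea: the Problem 13 slope estimator
# beta_tilde = sum((z_i - z_bar) y_i) / sum((z_i - z_bar) x_i), with
# z_i = g(x_i), is linear and unbiased, but noisier than OLS.
# Simulated data; not from the text.
import numpy as np

rng = np.random.default_rng(5)
n, reps = 100, 5000
ols, alt = [], []
for _ in range(reps):
    x = rng.uniform(1, 5, size=n)
    y = 1 + 0.5 * x + rng.normal(size=n)
    z = np.log(x)                                   # g(x) = log(x)
    ols.append(((x - x.mean()) * y).sum() / ((x - x.mean()) * x).sum())
    alt.append(((z - z.mean()) * y).sum() / ((z - z.mean()) * x).sum())

print(np.mean(ols), np.std(ols))   # unbiased, smaller sd (BLUE)
print(np.mean(alt), np.std(alt))   # unbiased, but larger sd
```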
3-6 Some Comments on the Language of Multiple Regression Analysis

It is common for beginners, and not unheard of for experienced empirical researchers, to report that they "estimated an OLS model." While we can usually figure out what someone means by this statement, it is important to understand that it is wrong — on more than just an aesthetic level — and reflects a misunderstanding about the components of a multiple regression analysis.

The first thing to remember is that ordinary least squares (OLS) is an estimation method, not a model. A model describes an underlying population and depends on unknown parameters. The linear model that we have been studying in this chapter can be written — in the population — as

$y = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k + u,$  (3.61)

where the parameters are the $\beta_j$. Importantly, we can talk about the meaning of the $\beta_j$ without ever looking at data. It is true we cannot hope to learn much about the $\beta_j$ without data, but the interpretation of the $\beta_j$ is obtained from the linear model in equation (3.61).

Once we have a sample of data, we can estimate the parameters. While it is true that we have so far only discussed OLS as a possibility, there are actually many more ways to use the data than we can even list. We have focused on OLS due to its widespread use, which is justified by using the statistical considerations we covered previously in this chapter. But the various justifications for OLS rely on the assumptions we have made (MLR.1 through MLR.5). As we will see in later chapters, under different assumptions different estimation methods are preferred — even though our model can still be represented by equation (3.61). Just a few examples include weighted least squares in Chapter 8, least absolute deviations in Chapter 9, and instrumental variables in Chapter 15.

One might argue that the discussion here is overly pedantic, and that the phrase "estimating an OLS model" should be taken as a useful shorthand for "I estimated a linear model by OLS." This stance has some merit, but we must remember that we have studied the properties of the OLS estimators under different assumptions. For example, we know OLS is unbiased under the first four Gauss-Markov assumptions, but it has no special efficiency properties without Assumption MLR.5. We have also seen, through the study of the omitted variables problem, that OLS is biased if we do not have Assumption MLR.4. The problem with using imprecise language is that it leads to vagueness on the most important considerations: what assumptions are being made on the underlying linear model? The issue of the assumptions we are using is conceptually different from the estimator we wind up applying.

Ideally, one writes down an equation like (3.61), with variable names that are easy to decipher, such as

$math4 = \beta_0 + \beta_1 classize4 + \beta_2 math3 + \beta_3 \log(income) + \beta_4 motheduc + \beta_5 fatheduc + u,$  (3.62)

if we are trying to explain outcomes on a fourth-grade math test. Then, in the context of equation (3.62), one includes a discussion of whether it is reasonable to maintain Assumption MLR.4, focusing on the factors that might still be in $u$ and whether more complicated functional relationships are needed (a topic we study in detail in Chapter 6). Next, one describes the data source (which ideally is obtained via random sampling) as well as the OLS estimates obtained from the sample. A proper way to introduce a discussion of the estimates is to say "I estimated equation (3.62) by ordinary least squares. Under the assumption that no important variables have been omitted from the equation, and assuming random sampling, the OLS estimator of the class size effect, $\beta_1$, is unbiased. If the error term $u$ has constant variance, the OLS estimator is actually best linear unbiased." As we will see in Chapters 4 and 5, we can often say even more about OLS. Of course, one might want to admit that while controlling for third-grade math score, family income, and parents' education might account for important differences across students, it might not be enough — for example, $u$ can include motivation of the student or parents — in which case OLS might be biased.

A more subtle reason for being careful in distinguishing between an underlying population model and an estimation method used to estimate a model is that estimation methods such as OLS can be used essentially as an exercise in curve fitting or prediction, without explicitly worrying about an underlying model and the usual statistical properties of unbiasedness and efficiency. For example, we might just want to use OLS to estimate a line that allows us to predict future college GPA for a set of high school students with given characteristics.
Summary

1. The multiple regression model allows us to effectively hold other factors fixed while examining the effects of a particular independent variable on the dependent variable. It explicitly allows the independent variables to be correlated.
2. Although the model is linear in its parameters, it can be used to model nonlinear relationships by appropriately choosing the dependent and independent variables.
3. The method of ordinary least squares is easily applied to estimate the multiple regression model. Each slope estimate measures the partial effect of the corresponding independent variable on the dependent variable, holding all other independent variables fixed.
4. $R^2$ is the proportion of the sample variation in the dependent variable explained by the independent variables, and it serves as a goodness-of-fit measure. It is important not to put too much weight on the value of $R^2$ when evaluating econometric models.
5. Under the first four Gauss-Markov assumptions (MLR.1 through MLR.4), the OLS estimators are unbiased. This implies that including an irrelevant variable in a model has no effect on the unbiasedness of the intercept and other slope estimators. On the other hand, omitting a relevant variable causes OLS to be biased. In many circumstances, the direction of the bias can be determined.
6. Under the five Gauss-Markov assumptions, the variance of an OLS slope estimator is given by $\text{Var}(\hat{\beta}_j) = \sigma^2/[\text{SST}_j(1 - R_j^2)]$. As the error variance $\sigma^2$ increases, so does $\text{Var}(\hat{\beta}_j)$, while $\text{Var}(\hat{\beta}_j)$ decreases as the sample variation in $x_j$, $\text{SST}_j$, increases. The term $R_j^2$ measures the amount of collinearity between $x_j$ and the other explanatory variables. As $R_j^2$ approaches one, $\text{Var}(\hat{\beta}_j)$ is unbounded.
7. Adding an irrelevant variable to an equation generally increases the variances of the remaining OLS estimators because of multicollinearity.
8. Under the Gauss-Markov assumptions (MLR.1 through MLR.5), the OLS estimators are the best linear unbiased estimators (BLUEs).
9. Beginning in Chapter 4, we will use the standard errors of the OLS coefficients to compute confidence intervals for the population parameters and to obtain test statistics for testing hypotheses about the population parameters. Therefore, in reporting regression results we now include the standard errors along with the associated OLS estimates. In equation form, standard errors are usually put in parentheses below the OLS estimates, and the same convention is often used in tables of OLS output.

The Gauss-Markov Assumptions

The following is a summary of the five Gauss-Markov assumptions that we used in this chapter. Remember, the first four were used to establish unbiasedness of OLS, whereas the fifth was added to derive the usual variance formulas and to conclude that OLS is best linear unbiased.

Assumption MLR.1 (Linear in Parameters)
The model in the population can be written as
$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k + u,$
where $\beta_0, \beta_1, \ldots, \beta_k$ are the unknown parameters (constants) of interest and $u$ is an unobserved random error or disturbance term.
Assumption MLR.2 (Random Sampling)
We have a random sample of $n$ observations, $\{(x_{i1}, x_{i2}, \ldots, x_{ik}, y_i): i = 1, 2, \ldots, n\}$, following the population model in Assumption MLR.1.

Assumption MLR.3 (No Perfect Collinearity)
In the sample (and therefore in the population), none of the independent variables is constant, and there are no exact linear relationships among the independent variables.

Assumption MLR.4 (Zero Conditional Mean)
The error $u$ has an expected value of zero given any values of the independent variables. In other words, $\text{E}(u|x_1, x_2, \ldots, x_k) = 0$.

Assumption MLR.5 (Homoskedasticity)
The error $u$ has the same variance given any value of the explanatory variables. In other words, $\text{Var}(u|x_1, \ldots, x_k) = \sigma^2$.

Key Terms
Best Linear Unbiased Estimator (BLUE); Biased Toward Zero; Ceteris Paribus; Degrees of Freedom (df); Disturbance; Downward Bias; Endogenous Explanatory Variable; Error Term; Excluding a Relevant Variable; Exogenous Explanatory Variable; Explained Sum of Squares (SSE); First Order Conditions; Frisch-Waugh Theorem; Gauss-Markov Assumptions; Gauss-Markov Theorem; Inclusion of an Irrelevant Variable; Intercept; Micronumerosity; Misspecification Analysis; Multicollinearity; Multiple Linear Regression (MLR) Model; Multiple Regression Analysis; OLS Intercept Estimate; OLS Regression Line; OLS Slope Estimate; Omitted Variable Bias; Ordinary Least Squares; Overspecifying the Model; Partial Effect; Perfect Collinearity; Population Model; Residual; Residual Sum of Squares; Sample Regression Function (SRF); Slope Parameter; Standard Deviation of $\hat{\beta}_j$; Standard Error of $\hat{\beta}_j$; Standard Error of the Regression (SER); Sum of Squared Residuals (SSR); Total Sum of Squares (SST); True Model; Underspecifying the Model; Upward Bias; Variance Inflation Factor (VIF)

Problems

1. Using the data in GPA2 on 4,137 college students, the following equation was estimated by OLS:

$\widehat{colgpa} = 1.392 - .0135\, hsperc + .00148\, sat$
$n = 4{,}137,\ R^2 = .273,$

where colgpa is measured on a four-point scale, hsperc is the percentile in the high school graduating class (defined so that, for example, hsperc = 5 means the top 5% of the class), and sat is the combined math and verbal scores on the student achievement test.
(i) Why does it make sense for the coefficient on hsperc to be negative?
(ii) What is the predicted college GPA when hsperc = 20 and sat = 1,050?
(iii) Suppose that two high school graduates, A and B, graduated in the same percentile from high school, but Student A's SAT score was 140 points higher (about one standard deviation in the sample). What is the predicted difference in college GPA for these two students? Is the difference large?
(iv) Holding hsperc fixed, what difference in SAT scores leads to a predicted colgpa difference of .50, or one-half of a grade point? Comment on your answer.
2. The data in WAGE2 on working men was used to estimate the following equation:

$\widehat{educ} = 10.36 - .094\, sibs + .131\, meduc + .210\, feduc$
$n = 722,\ R^2 = .214,$

where educ is years of schooling, sibs is number of siblings, meduc is mother's years of schooling, and feduc is father's years of schooling.
(i) Does sibs have the expected effect? Explain. Holding meduc and feduc fixed, by how much does sibs have to increase to reduce predicted years of education by one year? (A noninteger answer is acceptable here.)
(ii) Discuss the interpretation of the coefficient on meduc.
(iii) Suppose that Man A has no siblings, and his mother and father each have 12 years of education. Man B has no siblings, and his mother and father each have 16 years of education. What is the predicted difference in years of education between B and A?

3. The following model is a simplified version of the multiple regression model used by Biddle and Hamermesh (1990) to study the tradeoff between time spent sleeping and working and to look at other factors affecting sleep:

$sleep = \beta_0 + \beta_1 totwrk + \beta_2 educ + \beta_3 age + u,$

where sleep and totwrk (total work) are measured in minutes per week and educ and age are measured in years. (See also Computer Exercise C3 in Chapter 2.)
(i) If adults trade off sleep for work, what is the sign of $\beta_1$?
(ii) What signs do you think $\beta_2$ and $\beta_3$ will have?
(iii) Using the data in SLEEP75, the estimated equation is

$\widehat{sleep} = 3{,}638.25 - .148\, totwrk - 11.13\, educ + 2.20\, age$
$n = 706,\ R^2 = .113.$

If someone works five more hours per week, by how many minutes is sleep predicted to fall? Is this a large tradeoff?
(iv) Discuss the sign and magnitude of the estimated coefficient on educ.
(v) Would you say totwrk, educ, and age explain much of the variation in sleep? What other factors might affect the time spent sleeping? Are these likely to be correlated with totwrk?

4. The median starting salary for new law school graduates is determined by

$\log(salary) = \beta_0 + \beta_1 LSAT + \beta_2 GPA + \beta_3 \log(libvol) + \beta_4 \log(cost) + \beta_5 rank + u,$

where LSAT is the median LSAT score for the graduating class, GPA is the median college GPA for the class, libvol is the number of volumes in the law school library, cost is the annual cost of attending law school, and rank is a law school ranking (with rank = 1 being the best).
(i) Explain why we expect $\beta_5 \le 0$.
(ii) What signs do you expect for the other slope parameters? Justify your answers.
(iii) Using the data in LAWSCH85, the estimated equation is

$\widehat{\log(salary)} = 8.34 + .0047\, LSAT + .248\, GPA + .095\, \log(libvol) + .038\, \log(cost) - .0033\, rank$
$n = 136,\ R^2 = .842.$

What is the predicted ceteris paribus difference in salary for schools with a median GPA different by one point? (Report your answer as a percentage.)
(iv) Interpret the coefficient on the variable log(libvol).
(v) Would you say it is better to attend a higher ranked law school? How much is a difference in ranking of 20 worth in terms of predicted starting salary?
5. In a study relating college grade point average to time spent in various activities, you distribute a survey to several students. The students are asked how many hours they spend each week in four activities: studying, sleeping, working, and leisure. Any activity is put into one of the four categories, so that for each student, the sum of hours in the four activities must be 168.
(i) In the model

$GPA = \beta_0 + \beta_1 study + \beta_2 sleep + \beta_3 work + \beta_4 leisure + u,$

does it make sense to hold sleep, work, and leisure fixed while changing study?
(ii) Explain why this model violates Assumption MLR.3.
(iii) How could you reformulate the model so that its parameters have a useful interpretation and it satisfies Assumption MLR.3?

6. Consider the multiple regression model containing three independent variables, under Assumptions MLR.1 through MLR.4:

$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + u.$

You are interested in estimating the sum of the parameters on $x_1$ and $x_2$; call this $\theta_1 = \beta_1 + \beta_2$.
(i) Show that $\hat{\theta}_1 = \hat{\beta}_1 + \hat{\beta}_2$ is an unbiased estimator of $\theta_1$.
(ii) Find $\text{Var}(\hat{\theta}_1)$ in terms of $\text{Var}(\hat{\beta}_1)$, $\text{Var}(\hat{\beta}_2)$, and $\text{Corr}(\hat{\beta}_1, \hat{\beta}_2)$.

7. Which of the following can cause OLS estimators to be biased?
(i) Heteroskedasticity.
(ii) Omitting an important variable.
(iii) A sample correlation coefficient of .95 between two independent variables both included in the model.

8. Suppose that average worker productivity at manufacturing firms (avgprod) depends on two factors, average hours of training (avgtrain) and average worker ability (avgabil):

$avgprod = \beta_0 + \beta_1 avgtrain + \beta_2 avgabil + u.$

Assume that this equation satisfies the Gauss-Markov assumptions. If grants have been given to firms whose workers have less than average ability, so that avgtrain and avgabil are negatively correlated, what is the likely bias in $\tilde{\beta}_1$ obtained from the simple regression of avgprod on avgtrain?

9. The following equation describes the median housing price in a community in terms of amount of pollution (nox, for nitrous oxide) and the average number of rooms in houses in the community (rooms):

$\log(price) = \beta_0 + \beta_1 \log(nox) + \beta_2 rooms + u.$

(i) What are the probable signs of $\beta_1$ and $\beta_2$? What is the interpretation of $\beta_1$? Explain.
(ii) Why might nox [or, more precisely, log(nox)] and rooms be negatively correlated? If this is the case, does the simple regression of log(price) on log(nox) produce an upward or a downward biased estimator of $\beta_1$?
(iii) Using the data in HPRICE2, the following equations were estimated:

$\widehat{\log(price)} = 11.71 - 1.043\, \log(nox),\quad n = 506,\ R^2 = .264.$
$\widehat{\log(price)} = 9.23 - .718\, \log(nox) + .306\, rooms,\quad n = 506,\ R^2 = .514.$

Is the relationship between the simple and multiple regression estimates of the elasticity of price with respect to nox what you would have predicted, given your answer in part (ii)? Does this mean that −.718 is definitely closer to the true elasticity than −1.043?
10. Suppose that you are interested in estimating the ceteris paribus relationship between $y$ and $x_1$. For this purpose, you can collect data on two control variables, $x_2$ and $x_3$. (For concreteness, you might think of $y$ as final exam score, $x_1$ as class attendance, $x_2$ as GPA up through the previous semester, and $x_3$ as SAT or ACT score.) Let $\tilde{\beta}_1$ be the simple regression estimate from $y$ on $x_1$ and let $\hat{\beta}_1$ be the multiple regression estimate from $y$ on $x_1, x_2, x_3$.
(i) If $x_1$ is highly correlated with $x_2$ and $x_3$ in the sample, and $x_2$ and $x_3$ have large partial effects on $y$, would you expect $\tilde{\beta}_1$ and $\hat{\beta}_1$ to be similar or very different? Explain.
(ii) If $x_1$ is almost uncorrelated with $x_2$ and $x_3$, but $x_2$ and $x_3$ are highly correlated, will $\tilde{\beta}_1$ and $\hat{\beta}_1$ tend to be similar or very different? Explain.
(iii) If $x_1$ is highly correlated with $x_2$ and $x_3$, and $x_2$ and $x_3$ have small partial effects on $y$, would you expect $\text{se}(\tilde{\beta}_1)$ or $\text{se}(\hat{\beta}_1)$ to be smaller? Explain.
(iv) If $x_1$ is almost uncorrelated with $x_2$ and $x_3$, $x_2$ and $x_3$ have large partial effects on $y$, and $x_2$ and $x_3$ are highly correlated, would you expect $\text{se}(\tilde{\beta}_1)$ or $\text{se}(\hat{\beta}_1)$ to be smaller? Explain.

11. Suppose that the population model determining $y$ is

$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + u,$

and this model satisfies Assumptions MLR.1 through MLR.4. However, we estimate the model that omits $x_3$. Let $\tilde{\beta}_0$, $\tilde{\beta}_1$, and $\tilde{\beta}_2$ be the OLS estimators from the regression of $y$ on $x_1$ and $x_2$. Show that the expected value of $\tilde{\beta}_1$ (given the values of the independent variables in the sample) is

$\text{E}(\tilde{\beta}_1) = \beta_1 + \beta_3 \frac{\sum_{i=1}^n \hat{r}_{i1} x_{i3}}{\sum_{i=1}^n \hat{r}_{i1}^2},$

where the $\hat{r}_{i1}$ are the OLS residuals from the regression of $x_1$ on $x_2$. [Hint: The formula for $\tilde{\beta}_1$ comes from equation (3.22). Plug $y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_3 x_{i3} + u_i$ into this equation. After some algebra, take the expectation treating $x_{i3}$ and $\hat{r}_{i1}$ as nonrandom.]

12. The following equation represents the effects of tax revenue mix on subsequent employment growth for the population of counties in the United States:

$growth = \beta_0 + \beta_1 shareP + \beta_2 shareI + \beta_3 shareS + \text{other factors},$

where growth is the percentage change in employment from 1980 to 1990, shareP is the share of property taxes in total tax revenue, shareI is the share of income tax revenues, and shareS is the share of sales tax revenues. All of these variables are measured in 1980. The omitted share, shareF, includes fees and miscellaneous taxes. By definition, the four shares add up to one. Other factors would include expenditures on education, infrastructure, and so on (all measured in 1980).
(i) Why must we omit one of the tax share variables from the equation?
(ii) Give a careful interpretation of $\beta_1$.

13. (i) Consider the simple regression model $y = \beta_0 + \beta_1 x + u$ under the first four Gauss-Markov assumptions. For some function $g(x)$, for example $g(x) = x^2$ or $g(x) = \log(1 + x^2)$, define $z_i = g(x_i)$. Define a slope estimator as

$\tilde{\beta}_1 = \Big( \sum_{i=1}^n (z_i - \bar{z}) y_i \Big) \Big/ \Big( \sum_{i=1}^n (z_i - \bar{z}) x_i \Big).$

Show that $\tilde{\beta}_1$ is linear and unbiased. [Remember, because $\text{E}(u|x) = 0$, you can treat both $x_i$ and $z_i$ as nonrandom in your derivation.]
(ii) Add the homoskedasticity assumption, MLR.5. Show that

$\text{Var}(\tilde{\beta}_1) = \sigma^2 \Big( \sum_{i=1}^n (z_i - \bar{z})^2 \Big) \Big/ \Big( \sum_{i=1}^n (z_i - \bar{z}) x_i \Big)^2.$

(iii) Show directly that, under the Gauss-Markov assumptions, $\text{Var}(\hat{\beta}_1) \le \text{Var}(\tilde{\beta}_1)$, where $\hat{\beta}_1$ is the OLS estimator. [Hint: The Cauchy-Schwartz inequality in Appendix B implies that

$\Big( n^{-1} \sum_{i=1}^n (z_i - \bar{z})(x_i - \bar{x}) \Big)^2 \le \Big( n^{-1} \sum_{i=1}^n (z_i - \bar{z})^2 \Big)\Big( n^{-1} \sum_{i=1}^n (x_i - \bar{x})^2 \Big);$

notice that we can drop $\bar{x}$ from the sample covariance.]
14. Suppose you have a sample of size $n$ on three variables, $y$, $x_1$, and $x_2$, and you are primarily interested in the effect of $x_1$ on $y$. Let $\tilde{\beta}_1$ be the coefficient on $x_1$ from the simple regression and $\hat{\beta}_1$ the coefficient on $x_1$ from the regression of $y$ on $x_1, x_2$. The standard errors reported by any regression package are

$\text{se}(\tilde{\beta}_1) = \frac{\tilde{\sigma}}{\sqrt{\text{SST}_1}},$
$\text{se}(\hat{\beta}_1) = \frac{\hat{\sigma}}{\sqrt{\text{SST}_1}} \cdot \sqrt{\text{VIF}_1},$

where $\tilde{\sigma}$ is the SER from the simple regression, $\hat{\sigma}$ is the SER from the multiple regression, $\text{VIF}_1 = 1/(1 - R_1^2)$, and $R_1^2$ is the R-squared from the regression of $x_1$ on $x_2$. Explain why $\text{se}(\tilde{\beta}_1)$ can be smaller or larger than $\text{se}(\hat{\beta}_1)$.

15. The following estimated equations use the data in MLB1, which contains information on major league baseball salaries. The dependent variable, lsalary, is the log of salary. The two explanatory variables are years in the major leagues (years) and runs batted in per year (rbisyr):

$\widehat{lsalary} = 12.373 + .1770\, years$
$\qquad\qquad\ \ (.098)\quad (.0132)$
$n = 353,\ \text{SSR} = 326.196,\ \text{SER} = .964,\ R^2 = .337.$

$\widehat{lsalary} = 11.861 + .0904\, years + .0302\, rbisyr$
$\qquad\qquad\ \ (.084)\quad (.0118)\quad\ \ (.0020)$
$n = 353,\ \text{SSR} = 198.475,\ \text{SER} = .753,\ R^2 = .597.$

(i) How many degrees of freedom are in each regression? How come the SER is smaller in the second regression than the first?
(ii) The sample correlation coefficient between years and rbisyr is about .487. Does this make sense? What is the variance inflation factor (there is only one) for the slope coefficients in the multiple regression? Would you say there is little, moderate, or strong collinearity between years and rbisyr?
(iii) How come the standard error for the coefficient on years in the multiple regression is lower than its counterpart in the simple regression?

16. The following equations were estimated using the data in LAWSCH85:

$\widehat{lsalary} = 9.90 - .0041\, rank + .294\, GPA$
$\qquad\qquad (.24)\quad (.0003)\quad (.069)$
$n = 142,\ R^2 = .8238.$

$\widehat{lsalary} = 9.86 - .0038\, rank + .295\, GPA + .00017\, age$
$\qquad\qquad (.29)\quad (.0004)\quad (.083)\quad (.00036)$
$n = 99,\ R^2 = .8036.$

How can it be that the R-squared is smaller when the variable age is added to the equation?

Computer Exercises

C1. A problem of interest to health officials (and others) is to determine the effects of smoking during pregnancy on infant health. One measure of infant health is birth weight; a birth weight that is too low can put an infant at risk for contracting various illnesses. Since factors other than cigarette smoking that affect birth weight are likely to be correlated with smoking, we should take those factors into account. For example, higher income generally results in access to better prenatal care, as well as better nutrition for the mother. An equation that recognizes this is

$bwght = \beta_0 + \beta_1 cigs + \beta_2 faminc + u.$

(i) What is the most likely sign for $\beta_2$?
(ii) Do you think cigs and faminc are likely to be correlated? Explain why the correlation might be positive or negative.
(iii) Now, estimate the equation with and without faminc, using the data in BWGHT. Report the results in equation form, including the sample size and R-squared. Discuss your results, focusing on whether adding faminc substantially changes the estimated effect of cigs on bwght.
C2. Use the data in HPRICE1 to estimate the model

$price = \beta_0 + \beta_1 sqrft + \beta_2 bdrms + u,$

where price is the house price measured in thousands of dollars.
(i) Write out the results in equation form.
(ii) What is the estimated increase in price for a house with one more bedroom, holding square footage constant?
(iii) What is the estimated increase in price for a house with an additional bedroom that is 140 square feet in size? Compare this to your answer in part (ii).
(iv) What percentage of the variation in price is explained by square footage and number of bedrooms?
(v) The first house in the sample has sqrft = 2,438 and bdrms = 4. Find the predicted selling price for this house from the OLS regression line.
(vi) The actual selling price of the first house in the sample was $300,000 (so price = 300). Find the residual for this house. Does it suggest that the buyer underpaid or overpaid for the house?

C3. The file CEOSAL2 contains data on 177 chief executive officers and can be used to examine the effects of firm performance on CEO salary.
(i) Estimate a model relating annual salary to firm sales and market value. Make the model of the constant elasticity variety for both independent variables. Write the results out in equation form.
(ii) Add profits to the model from part (i). Why can this variable not be included in logarithmic form? Would you say that these firm performance variables explain most of the variation in CEO salaries?
(iii) Add the variable ceoten to the model in part (ii). What is the estimated percentage return for another year of CEO tenure, holding other factors fixed?
(iv) Find the sample correlation coefficient between the variables log(mktval) and profits. Are these variables highly correlated? What does this say about the OLS estimators?

C4. Use the data in ATTEND for this exercise.
(i) Obtain the minimum, maximum, and average values for the variables atndrte, priGPA, and ACT.
(ii) Estimate the model

$atndrte = \beta_0 + \beta_1 priGPA + \beta_2 ACT + u,$

and write the results in equation form. Interpret the intercept. Does it have a useful meaning?
(iii) Discuss the estimated slope coefficients. Are there any surprises?
(iv) What is the predicted atndrte if priGPA = 3.65 and ACT = 20? What do you make of this result? Are there any students in the sample with these values of the explanatory variables?
(v) If Student A has priGPA = 3.1 and ACT = 21 and Student B has priGPA = 2.1 and ACT = 26, what is the predicted difference in their attendance rates?

C5. Confirm the partialling out interpretation of the OLS estimates by explicitly doing the partialling out for Example 3.2. This first requires regressing educ on exper and tenure and saving the residuals, $\hat{r}_1$. Then, regress log(wage) on $\hat{r}_1$. Compare the coefficient on $\hat{r}_1$ with the coefficient on educ in the regression of log(wage) on educ, exper, and tenure.

C6. Use the data set in WAGE2 for this problem. As usual, be sure all of the following regressions contain an intercept.
(i) Run a simple regression of IQ on educ to obtain the slope coefficient, say, $\tilde{\delta}_1$.
(ii) Run the simple regression of log(wage) on educ, and obtain the slope coefficient, $\tilde{\beta}_1$.
(iii) Run the multiple regression of log(wage) on educ and IQ, and obtain the slope coefficients, $\hat{\beta}_1$ and $\hat{\beta}_2$, respectively.
(iv) Verify that $\tilde{\beta}_1 = \hat{\beta}_1 + \hat{\beta}_2\tilde{\delta}_1$.
C7. Use the data in MEAP93 to answer this question.
(i) Estimate the model

$math10 = \beta_0 + \beta_1 \log(expend) + \beta_2 lnchprg + u,$

and report the results in the usual form, including the sample size and R-squared. Are the signs of the slope coefficients what you expected? Explain.
(ii) What do you make of the intercept you estimated in part (i)? In particular, does it make sense to set the two explanatory variables to zero? [Hint: Recall that log(1) = 0.]
(iii) Now run the simple regression of math10 on log(expend), and compare the slope coefficient with the estimate obtained in part (i). Is the estimated spending effect now larger or smaller than in part (i)?
(iv) Find the correlation between lexpend = log(expend) and lnchprg. Does its sign make sense to you?
(v) Use part (iv) to explain your findings in part (iii).

C8. Use the data in DISCRIM to answer this question. These are ZIP code-level data on prices for various items at fast-food restaurants, along with characteristics of the zip code population, in New Jersey and Pennsylvania. The idea is to see whether fast-food restaurants charge higher prices in areas with a larger concentration of blacks.
(i) Find the average values of prpblck and income in the sample, along with their standard deviations. What are the units of measurement of prpblck and income?
(ii) Consider a model to explain the price of soda, psoda, in terms of the proportion of the population that is black and median income:

$psoda = \beta_0 + \beta_1 prpblck + \beta_2 income + u.$

Estimate this model by OLS and report the results in equation form, including the sample size and R-squared. (Do not use scientific notation when reporting the estimates.) Interpret the coefficient on prpblck. Do you think it is economically large?
(iii) Compare the estimate from part (ii) with the simple regression estimate from psoda on prpblck. Is the discrimination effect larger or smaller when you control for income?
(iv) A model with a constant price elasticity with respect to income may be more appropriate. Report estimates of the model

$\log(psoda) = \beta_0 + \beta_1 prpblck + \beta_2 \log(income) + u.$

If prpblck increases by .20 (20 percentage points), what is the estimated percentage change in psoda? (Hint: The answer is 2.xx, where you fill in the "xx.")
(v) Now add the variable prppov to the regression in part (iv). What happens to $\hat{\beta}_{prpblck}$?
(vi) Find the correlation between log(income) and prppov. Is it roughly what you expected?
(vii) Evaluate the following statement: "Because log(income) and prppov are so highly correlated, they have no business being in the same regression."

C9. Use the data in CHARITY to answer the following questions.
(i) Estimate the equation

$gift = \beta_0 + \beta_1 mailsyear + \beta_2 giftlast + \beta_3 propresp + u$

by OLS and report the results in the usual way, including the sample size and R-squared. How does the R-squared compare with that from the simple regression that omits giftlast and propresp?
(ii) Interpret the coefficient on mailsyear. Is it bigger or smaller than the corresponding simple regression coefficient?
(iii) Interpret the coefficient on propresp. Be careful to notice the units of measurement of propresp.
(iv) Now add the variable avggift to the equation. What happens to the estimated effect of mailsyear?
(v) In the equation from part (iv), what has happened to the coefficient on giftlast? What do you think is happening?

C10. Use the data in HTV to answer this question. The data set includes information on wages, education, parents' education, and several other variables for 1,230 working men in 1991.
(i) What is the range of the educ variable in the sample? What percentage of men completed twelfth grade but no higher grade? Do the men or their parents have, on average, higher levels of education?
(ii) Estimate the regression model

$educ = \beta_0 + \beta_1 motheduc + \beta_2 fatheduc + u$

by OLS and report the results in the usual form. How much sample variation in educ is explained by parents' education? Interpret the coefficient on motheduc.
(iii) Add the variable abil (a measure of cognitive ability) to the regression from part (ii), and report the results in equation form. Does "ability" help to explain variations in education, even after controlling for parents' education? Explain.
(iv) (Requires calculus) Now estimate an equation where abil appears in quadratic form:

$educ = \beta_0 + \beta_1 motheduc + \beta_2 fatheduc + \beta_3 abil + \beta_4 abil^2 + u.$

Using the estimates $\hat{\beta}_3$ and $\hat{\beta}_4$, use calculus to find the value of abil, call it abil*, where educ is minimized. (The other coefficients and values of parents' education variables have no effect; we are holding parents' education fixed.) Notice that abil is measured so that negative values are permissible. You might also verify that the second derivative is positive, so that you do indeed have a minimum.
(v) Argue that only a small fraction of men in the sample have "ability" less than the value calculated in part (iv). Why is this important?
(vi) If you have access to a statistical program that includes graphing capabilities, use the estimates in part (iv) to graph the relationship between the predicted education and abil. Set motheduc and fatheduc at their average values in the sample, 12.18 and 12.45, respectively.

C11. Use the data in MEAPSINGLE to study the effects of single-parent households on student math performance. These data are for a subset of schools in southeast Michigan for the year 2000. The socioeconomic variables are obtained at the ZIP code level (where ZIP code is assigned to schools based on their mailing addresses).
(i) Run the simple regression of math4 on pctsgle and report the results in the usual format. Interpret the slope coefficient. Does the effect of single parenthood seem large or small?
(ii) Add the variables lmedinc and free to the equation. What happens to the coefficient on pctsgle? Explain what is happening.
(iii) Find the sample correlation between lmedinc and free. Does it have the sign you expect?
(iv) Does the substantial correlation between lmedinc and free mean that you should drop one from the regression to better estimate the causal effect of single parenthood on student performance? Explain.
(v) Find the variance inflation factors (VIFs) for each of the explanatory variables appearing in the regression in part (ii). Which variable has the largest VIF? Does this knowledge affect the model you would use to study the causal effect of single parenthood on math performance?
C12. The data in ECONMATH contain grade point averages and standardized test scores, along with performance in an introductory economics course, for students at a large public university. The variable to be explained is score, the final score in the course measured as a percentage.
(i) How many students received a perfect score for the course? What was the average score? Find the means and standard deviations of actmth and acteng, and discuss how they compare.
(ii) Estimate a linear equation relating score to colgpa, actmth, and acteng, where colgpa is measured at the beginning of the term. Report the results in the usual form.
(iii) Would you say the math or English ACT score is a better predictor of performance in the economics course? Explain.
(iv) Discuss the size of the R-squared in the regression.

Appendix 3A

3A.1 Derivation of the First Order Conditions in Equation (3.13)
The analysis is very similar to the simple regression case. We must characterize the solutions to the problem

$\min_{b_0, b_1, \ldots, b_k} \sum_{i=1}^n (y_i - b_0 - b_1 x_{i1} - \cdots - b_k x_{ik})^2.$

Taking the partial derivatives with respect to each of the $b_j$ (see Appendix A), evaluating them at the solutions, and setting them equal to zero gives

$-2\sum_{i=1}^n (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_{i1} - \cdots - \hat{\beta}_k x_{ik}) = 0$
$-2\sum_{i=1}^n x_{ij}(y_i - \hat{\beta}_0 - \hat{\beta}_1 x_{i1} - \cdots - \hat{\beta}_k x_{ik}) = 0, \text{ for all } j = 1, \ldots, k.$

Canceling the $-2$ gives the first order conditions in (3.13).

3A.2 Derivation of Equation (3.22)
To derive (3.22), write $x_{i1}$ in terms of its fitted value and its residual from the regression of $x_1$ on $x_2, \ldots, x_k$: $x_{i1} = \hat{x}_{i1} + \hat{r}_{i1}$, for all $i = 1, \ldots, n$. Now, plug this into the second equation in (3.13):

$\sum_{i=1}^n (\hat{x}_{i1} + \hat{r}_{i1})(y_i - \hat{\beta}_0 - \hat{\beta}_1 x_{i1} - \cdots - \hat{\beta}_k x_{ik}) = 0.$  (3.63)

By the definition of the OLS residual $\hat{u}_i$, since $\hat{x}_{i1}$ is just a linear function of the explanatory variables $x_{i2}, \ldots, x_{ik}$, it follows that $\sum_{i=1}^n \hat{x}_{i1}\hat{u}_i = 0$. Therefore, equation (3.63) can be expressed as

$\sum_{i=1}^n \hat{r}_{i1}(y_i - \hat{\beta}_0 - \hat{\beta}_1 x_{i1} - \cdots - \hat{\beta}_k x_{ik}) = 0.$  (3.64)

Since the $\hat{r}_{i1}$ are the residuals from regressing $x_1$ on $x_2, \ldots, x_k$, $\sum_{i=1}^n x_{ij}\hat{r}_{i1} = 0$ for all $j = 2, \ldots, k$. Therefore, (3.64) is equivalent to $\sum_{i=1}^n \hat{r}_{i1}(y_i - \hat{\beta}_1 x_{i1}) = 0$. Finally, we use the fact that $\sum_{i=1}^n \hat{x}_{i1}\hat{r}_{i1} = 0$, which means that $\hat{\beta}_1$ solves

$\sum_{i=1}^n \hat{r}_{i1}(y_i - \hat{\beta}_1 \hat{r}_{i1}) = 0.$

Now, straightforward algebra gives (3.22), provided, of course, that $\sum_{i=1}^n \hat{r}_{i1}^2 > 0$; this is ensured by Assumption MLR.3.

3A.3 Proof of Theorem 3.1
We prove Theorem 3.1 for $\hat{\beta}_1$; the proof for the other slope parameters is virtually identical. (See Appendix E for a more succinct proof using matrices.) Under Assumption MLR.3, the OLS estimators exist, and we can write $\hat{\beta}_1$ as in (3.22). Under Assumption MLR.1, we can write $y_i$ as in (3.32); substitute this for $y_i$ in (3.22). Then, using $\sum_{i=1}^n \hat{r}_{i1} = 0$, $\sum_{i=1}^n x_{ij}\hat{r}_{i1} = 0$ for all $j = 2, \ldots, k$, and $\sum_{i=1}^n x_{i1}\hat{r}_{i1} = \sum_{i=1}^n \hat{r}_{i1}^2$, we have

$\hat{\beta}_1 = \beta_1 + \Big( \sum_{i=1}^n \hat{r}_{i1} u_i \Big) \Big/ \Big( \sum_{i=1}^n \hat{r}_{i1}^2 \Big).$  (3.65)

Now, under Assumptions MLR.2 and MLR.4, the expected value of each $u_i$, given all independent variables in the sample, is zero. Since the $\hat{r}_{i1}$ are just functions of the sample independent variables, it follows that

$\text{E}(\hat{\beta}_1|X) = \beta_1 + \Big( \sum_{i=1}^n \hat{r}_{i1}\text{E}(u_i|X) \Big) \Big/ \Big( \sum_{i=1}^n \hat{r}_{i1}^2 \Big) = \beta_1 + \Big( \sum_{i=1}^n \hat{r}_{i1} \cdot 0 \Big) \Big/ \Big( \sum_{i=1}^n \hat{r}_{i1}^2 \Big) = \beta_1,$

where $X$ denotes the data on all independent variables and $\text{E}(\hat{\beta}_1|X)$ is the expected value of $\hat{\beta}_1$, given $x_{i1}, \ldots, x_{ik}$, for all $i = 1, \ldots, n$. This completes the proof.
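The partialling-out representation in (3.22), derived in 3A.2, can be verified numerically. A minimal sketch on simulated data (not from the text):

```python
# Sketch verifying (3.22): beta1_hat equals sum(r1_i * y_i) / sum(r1_i^2),
# where r1 are the residuals from regressing x1 on the other regressors.
# Simulated data; not from the text.
import numpy as np

rng = np.random.default_rng(6)
n = 300
x1, x2, x3 = rng.normal(size=(3, n))
x2 = x2 + 0.5 * x1                                  # make regressors correlated
y = 1 + 0.7 * x1 - 0.4 * x2 + 0.2 * x3 + rng.normal(size=n)

X = np.column_stack([np.ones(n), x1, x2, x3])
beta_full = np.linalg.lstsq(X, y, rcond=None)[0]

Z = np.column_stack([np.ones(n), x2, x3])
r1 = x1 - Z @ np.linalg.lstsq(Z, x1, rcond=None)[0]  # partial out x2, x3
beta1_partial = (r1 @ y) / (r1 @ r1)

print(beta_full[1], beta1_partial)                   # identical up to rounding
```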
3A.4 General Omitted Variable Bias
We can derive the omitted variable bias in the general model in equation (3.31) under the first four Gauss-Markov assumptions. In particular, let the $\hat{\beta}_j$, $j = 0, 1, \ldots, k$, be the OLS estimators from the regression using the full set of explanatory variables. Let the $\tilde{\beta}_j$, $j = 0, 1, \ldots, k - 1$, be the OLS estimators from the regression that leaves out $x_k$. Let $\tilde{\delta}_j$, $j = 1, \ldots, k - 1$, be the slope coefficient on $x_j$ in the auxiliary regression of $x_{ik}$ on $x_{i1}, x_{i2}, \ldots, x_{i,k-1}$, $i = 1, \ldots, n$. A useful fact is that

$\tilde{\beta}_j = \hat{\beta}_j + \hat{\beta}_k \tilde{\delta}_j.$  (3.66)

This shows explicitly that, when we do not control for $x_k$ in the regression, the estimated partial effect of $x_j$ equals the partial effect when we include $x_k$, plus the partial effect of $x_k$ on $y$ times the partial relationship between the omitted variable, $x_k$, and $x_j$, $j = 1, \ldots, k - 1$.

Conditional on the entire set of explanatory variables, $X$, we know that the $\hat{\beta}_j$ are all unbiased for the corresponding $\beta_j$, $j = 1, \ldots, k$. Further, since $\tilde{\delta}_j$ is just a function of $X$, we have

$\text{E}(\tilde{\beta}_j|X) = \text{E}(\hat{\beta}_j|X) + \text{E}(\hat{\beta}_k|X)\tilde{\delta}_j = \beta_j + \beta_k\tilde{\delta}_j.$  (3.67)

Equation (3.67) shows that $\tilde{\beta}_j$ is biased for $\beta_j$ unless $\beta_k = 0$ — in which case $x_k$ has no partial effect in the population — or $\tilde{\delta}_j$ equals zero, which means that $x_{ik}$ and $x_{ij}$ are partially uncorrelated in the sample.

The key to obtaining equation (3.67) is equation (3.66). To show equation (3.66), we can use equation (3.22) a couple of times. For simplicity, we look at $j = 1$. Now, $\tilde{\beta}_1$ is the slope coefficient in the simple regression of $y_i$ on $\tilde{r}_{i1}$, $i = 1, \ldots, n$, where the $\tilde{r}_{i1}$ are the OLS residuals from the regression of $x_{i1}$ on $x_{i2}, x_{i3}, \ldots, x_{i,k-1}$. Consider the numerator of the expression for $\tilde{\beta}_1$: $\sum_{i=1}^n \tilde{r}_{i1} y_i$. But for each $i$, we can write $y_i = \hat{\beta}_0 + \hat{\beta}_1 x_{i1} + \cdots + \hat{\beta}_k x_{ik} + \hat{u}_i$ and plug in for $y_i$. Now, by properties of the OLS residuals, the $\tilde{r}_{i1}$ have zero sample average and are uncorrelated with $x_{i2}, x_{i3}, \ldots, x_{i,k-1}$ in the sample. Similarly, the $\hat{u}_i$ have zero sample average and zero sample correlation with $x_{i1}, x_{i2}, \ldots, x_{ik}$. It follows that the $\tilde{r}_{i1}$ and $\hat{u}_i$ are uncorrelated in the sample (since the $\tilde{r}_{i1}$ are just linear combinations of $x_{i1}, x_{i2}, \ldots, x_{i,k-1}$). So

$\sum_{i=1}^n \tilde{r}_{i1} y_i = \hat{\beta}_1 \Big( \sum_{i=1}^n \tilde{r}_{i1} x_{i1} \Big) + \hat{\beta}_k \Big( \sum_{i=1}^n \tilde{r}_{i1} x_{ik} \Big).$  (3.68)

Now, $\sum_{i=1}^n \tilde{r}_{i1} x_{i1} = \sum_{i=1}^n \tilde{r}_{i1}^2$, which is also the denominator of $\tilde{\beta}_1$. Therefore, we have shown that

$\tilde{\beta}_1 = \hat{\beta}_1 + \hat{\beta}_k \Big( \sum_{i=1}^n \tilde{r}_{i1} x_{ik} \Big) \Big/ \Big( \sum_{i=1}^n \tilde{r}_{i1}^2 \Big) = \hat{\beta}_1 + \hat{\beta}_k \tilde{\delta}_1.$

This is the relationship we wanted to show.

3A.5 Proof of Theorem 3.2
Again, we prove this for $j = 1$. Write $\hat{\beta}_1$ as in equation (3.65). Now, under MLR.5, $\text{Var}(u_i|X) = \sigma^2$, for all $i = 1, \ldots, n$. Under random sampling, the $u_i$ are independent, even conditional on $X$, and the $\hat{r}_{i1}$ are nonrandom conditional on $X$. Therefore,

$\text{Var}(\hat{\beta}_1|X) = \Big( \sum_{i=1}^n \hat{r}_{i1}^2 \text{Var}(u_i|X) \Big) \Big/ \Big( \sum_{i=1}^n \hat{r}_{i1}^2 \Big)^2 = \Big( \sum_{i=1}^n \hat{r}_{i1}^2 \sigma^2 \Big) \Big/ \Big( \sum_{i=1}^n \hat{r}_{i1}^2 \Big)^2 = \sigma^2 \Big/ \Big( \sum_{i=1}^n \hat{r}_{i1}^2 \Big).$

Now, since $\sum_{i=1}^n \hat{r}_{i1}^2$ is the sum of squared residuals from regressing $x_1$ on $x_2, \ldots, x_k$, $\sum_{i=1}^n \hat{r}_{i1}^2 = \text{SST}_1(1 - R_1^2)$. This completes the proof.
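Relation (3.66) is also easy to confirm numerically. A minimal sketch (simulated data; the helper function is illustrative, not from the text) with $k = 3$ and $x_3$ omitted:

```python
# Sketch verifying (3.66): with x3 omitted, the coefficient on x1 satisfies
# beta1_tilde = beta1_hat + beta3_hat * delta1_tilde, where delta1_tilde is
# the coefficient on x1 in the auxiliary regression of x3 on x1 and x2.
# Simulated data; not from the text.
import numpy as np

rng = np.random.default_rng(7)
n = 400
x1, x2 = rng.normal(size=(2, n))
x3 = 0.5 * x1 - 0.3 * x2 + rng.normal(size=n)
y = 1 + 0.8 * x1 + 0.4 * x2 - 0.6 * x3 + rng.normal(size=n)

def ols(cols, target):
    """OLS coefficients (intercept first) of target on the given columns."""
    X = np.column_stack([np.ones(len(target))] + list(cols))
    return np.linalg.lstsq(X, target, rcond=None)[0]

b_full = ols([x1, x2, x3], y)    # beta_hat: includes x3
b_short = ols([x1, x2], y)       # beta_tilde: omits x3
d = ols([x1, x2], x3)            # auxiliary regression of x3 on x1, x2

print(b_short[1], b_full[1] + b_full[3] * d[1])   # equal up to rounding
```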
3A.5 Proof of Theorem 3.2

Again, we prove this for $j = 1$. Write $\hat\beta_1$ as in equation (3.65). Now, under MLR.5, $\mathrm{Var}(u_i\mid\mathbf X) = \sigma^2$ for all $i = 1, \ldots, n$. Under random sampling, the $u_i$ are independent, even conditional on $\mathbf X$, and the $\hat r_{i1}$ are nonrandom conditional on $\mathbf X$. Therefore,

$$\mathrm{Var}(\hat\beta_1\mid\mathbf X) = \Big(\sum_{i=1}^n \hat r_{i1}^2\,\mathrm{Var}(u_i\mid\mathbf X)\Big)\Big/\Big(\sum_{i=1}^n \hat r_{i1}^2\Big)^2 = \sigma^2\Big(\sum_{i=1}^n \hat r_{i1}^2\Big)\Big/\Big(\sum_{i=1}^n \hat r_{i1}^2\Big)^2 = \sigma^2\Big/\Big(\sum_{i=1}^n \hat r_{i1}^2\Big).$$

Now, since $\sum_{i=1}^n \hat r_{i1}^2$ is the sum of squared residuals from regressing $x_1$ on $x_2, \ldots, x_k$, $\sum_{i=1}^n \hat r_{i1}^2 = \mathrm{SST}_1(1 - R_1^2)$. This completes the proof.

3A.6 Proof of Theorem 3.4

We show that, for any other linear unbiased estimator $\tilde\beta_1$ of $\beta_1$, $\mathrm{Var}(\tilde\beta_1) \ge \mathrm{Var}(\hat\beta_1)$, where $\hat\beta_1$ is the OLS estimator. (The focus on $j = 1$ is without loss of generality.) For $\tilde\beta_1$ as in equation (3.60), we can plug in for $y_i$ to obtain

$$\tilde\beta_1 = \beta_0\sum_{i=1}^n w_{i1} + \beta_1\sum_{i=1}^n w_{i1}x_{i1} + \beta_2\sum_{i=1}^n w_{i1}x_{i2} + \cdots + \beta_k\sum_{i=1}^n w_{i1}x_{ik} + \sum_{i=1}^n w_{i1}u_i.$$

Now, since the $w_{i1}$ are functions of the $x_{ij}$,

$$E(\tilde\beta_1\mid\mathbf X) = \beta_0\sum_{i=1}^n w_{i1} + \beta_1\sum_{i=1}^n w_{i1}x_{i1} + \cdots + \beta_k\sum_{i=1}^n w_{i1}x_{ik} + \sum_{i=1}^n w_{i1}E(u_i\mid\mathbf X) = \beta_0\sum_{i=1}^n w_{i1} + \beta_1\sum_{i=1}^n w_{i1}x_{i1} + \cdots + \beta_k\sum_{i=1}^n w_{i1}x_{ik},$$

because $E(u_i\mid\mathbf X) = 0$ for all $i = 1, \ldots, n$ under MLR.2 and MLR.4. Therefore, for $E(\tilde\beta_1\mid\mathbf X)$ to equal $\beta_1$ for any values of the parameters, we must have

$$\sum_{i=1}^n w_{i1} = 0,\qquad \sum_{i=1}^n w_{i1}x_{i1} = 1,\qquad \sum_{i=1}^n w_{i1}x_{ij} = 0,\ \ j = 2, \ldots, k. \qquad (3.69)$$

Now, let $\hat r_{i1}$ be the residuals from the regression of $x_{i1}$ on $x_{i2}, \ldots, x_{ik}$. Then, from (3.69), it follows that

$$\sum_{i=1}^n w_{i1}\hat r_{i1} = 1, \qquad (3.70)$$

because $x_{i1} = \hat x_{i1} + \hat r_{i1}$ and $\sum_{i=1}^n w_{i1}\hat x_{i1} = 0$. Now, consider the difference between $\mathrm{Var}(\tilde\beta_1\mid\mathbf X)$ and $\mathrm{Var}(\hat\beta_1\mid\mathbf X)$ under MLR.1 through MLR.5:

$$\sigma^2\sum_{i=1}^n w_{i1}^2 - \sigma^2\Big/\Big(\sum_{i=1}^n \hat r_{i1}^2\Big). \qquad (3.71)$$

Because of (3.70), we can write the difference in (3.71), without $\sigma^2$, as

$$\sum_{i=1}^n w_{i1}^2 - \Big(\sum_{i=1}^n w_{i1}\hat r_{i1}\Big)^2\Big/\Big(\sum_{i=1}^n \hat r_{i1}^2\Big). \qquad (3.72)$$

But (3.72) is simply

$$\sum_{i=1}^n (w_{i1} - \hat\gamma_1\hat r_{i1})^2, \qquad (3.73)$$

where $\hat\gamma_1 = \big(\sum_{i=1}^n w_{i1}\hat r_{i1}\big)\big/\big(\sum_{i=1}^n \hat r_{i1}^2\big)$, as can be seen by squaring each term in (3.73), summing, and then canceling terms. Because (3.73) is just the sum of squared residuals from the simple regression of $w_{i1}$ on $\hat r_{i1}$ (remember that the sample average of $\hat r_{i1}$ is zero), (3.73) must be nonnegative. This completes the proof.

Chapter 4
Multiple Regression Analysis: Inference

This chapter continues our treatment of multiple regression analysis. We now turn to the problem of testing hypotheses about the parameters in the population regression model. We begin in Section 4.1 by finding the distributions of the OLS estimators under the added assumption that the population error is normally distributed. Sections 4.2 and 4.3 cover hypothesis testing about individual parameters, while Section 4.4 discusses how to test a single hypothesis involving more than one parameter. We focus on testing multiple restrictions in Section 4.5 and pay particular attention to determining whether a group of independent variables can be omitted from a model.

4.1 Sampling Distributions of the OLS Estimators

Up to this point, we have formed a set of assumptions under which OLS is unbiased; we have also derived and discussed the bias caused by omitted variables. In Section 3.4, we obtained the variances of the OLS estimators under the Gauss-Markov assumptions. In Section 3.5, we showed that this variance is smallest among linear unbiased estimators.
Knowing the expected value and variance of the OLS estimators is useful for describing the precision of the OLS estimators. However, in order to perform statistical inference, we need to know more than just the first two moments of the $\hat\beta_j$; we need to know the full sampling distribution of the $\hat\beta_j$. Even under the Gauss-Markov assumptions, the distribution of $\hat\beta_j$ can have virtually any shape.

When we condition on the values of the independent variables in our sample, it is clear that the sampling distributions of the OLS estimators depend on the underlying distribution of the errors. To make the sampling distributions of the $\hat\beta_j$ tractable, we now assume that the unobserved error is normally distributed in the population. We call this the normality assumption.

Assumption MLR.6 (Normality). The population error $u$ is independent of the explanatory variables $x_1, x_2, \ldots, x_k$ and is normally distributed with zero mean and variance $\sigma^2$: $u \sim \mathrm{Normal}(0, \sigma^2)$.

Assumption MLR.6 is much stronger than any of our previous assumptions. In fact, since $u$ is independent of the $x_j$ under MLR.6, $E(u\mid x_1, \ldots, x_k) = E(u) = 0$ and $\mathrm{Var}(u\mid x_1, \ldots, x_k) = \mathrm{Var}(u) = \sigma^2$. Thus, if we make Assumption MLR.6, then we are necessarily assuming MLR.4 and MLR.5. To emphasize that we are assuming more than before, we will refer to the full set of Assumptions MLR.1 through MLR.6.

For cross-sectional regression applications, Assumptions MLR.1 through MLR.6 are called the classical linear model (CLM) assumptions. Thus, we will refer to the model under these six assumptions as the classical linear model. It is best to think of the CLM assumptions as containing all of the Gauss-Markov assumptions plus the assumption of a normally distributed error term.

Under the CLM assumptions, the OLS estimators $\hat\beta_0, \hat\beta_1, \ldots, \hat\beta_k$ have a stronger efficiency property than they would under the Gauss-Markov assumptions. It can be shown that the OLS estimators are the minimum variance unbiased estimators, which means that OLS has the smallest variance among unbiased estimators; we no longer have to restrict our comparison to estimators that are linear in the $y_i$. This property of OLS under the CLM assumptions is discussed further in Appendix E.

A succinct way to summarize the population assumptions of the CLM is

$$y\mid\mathbf x \sim \mathrm{Normal}(\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k,\ \sigma^2),$$

where $\mathbf x$ is again shorthand for $(x_1, \ldots, x_k)$. Thus, conditional on $\mathbf x$, $y$ has a normal distribution with mean linear in $x_1, \ldots, x_k$ and a constant variance. For a single independent variable $x$, this situation is shown in Figure 4.1.

[Figure 4.1: The homoskedastic normal distribution with a single explanatory variable. The conditional densities $f(y\mid x)$ at values $x_1$, $x_2$, $x_3$ are normal distributions centered on the line $E(y\mid x) = \beta_0 + \beta_1 x$.]
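Anticipating the results of this chapter, a small Monte Carlo exercise illustrates what Assumption MLR.6 buys us: with normal errors, the OLS slope estimator is itself normally distributed conditional on the regressors. This is a hedged sketch with simulated data; the design, sample size, and parameter values are arbitrary assumptions.

```python
import numpy as np
from scipy import stats

# Hold one design fixed and redraw normal errors many times
rng = np.random.default_rng(2)
n, reps = 50, 5000
x = rng.uniform(0, 10, size=n)             # fixed explanatory variable
X = np.column_stack([np.ones(n), x])
beta1_draws = np.empty(reps)
for r in range(reps):
    u = rng.normal(0.0, 2.0, size=n)       # normal errors (Assumption MLR.6)
    y = 1.0 + 0.5 * x + u
    beta1_draws[r] = np.linalg.lstsq(X, y, rcond=None)[0][1]

# A normality test should not reject: expect a large p-value
print(stats.normaltest(beta1_draws).pvalue)
```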
The argument justifying the normal distribution for the errors usually runs something like this: because $u$ is the sum of many different unobserved factors affecting $y$, we can invoke the central limit theorem (CLT; see Appendix C) to conclude that $u$ has an approximate normal distribution. This argument has some merit, but it is not without weaknesses. First, the factors in $u$ can have very different distributions in the population (for example, ability and quality of schooling in the error in a wage equation). Although the CLT can still hold in such cases, the normal approximation can be poor, depending on how many factors appear in $u$ and how different their distributions are.

A more serious problem with the CLT argument is that it assumes that all unobserved factors affect $y$ in a separate, additive fashion. Nothing guarantees that this is so. If $u$ is a complicated function of the unobserved factors, then the CLT argument does not really apply.

In any application, whether normality of $u$ can be assumed is really an empirical matter. For example, there is no theorem that says wage, conditional on educ, exper, and tenure, is normally distributed. If anything, simple reasoning suggests that the opposite is true: since wage can never be less than zero, it cannot, strictly speaking, have a normal distribution. Further, because there are minimum wage laws, some fraction of the population earns exactly the minimum wage, which also violates the normality assumption. Nevertheless, as a practical matter, we can ask whether the conditional wage distribution is "close" to being normal. Past empirical evidence suggests that normality is not a good assumption for wages.

Often, using a transformation, especially taking the log, yields a distribution that is closer to normal. For example, something like log(price) tends to have a distribution that looks more normal than the distribution of price. Again, this is an empirical issue. We will discuss the consequences of nonnormality for statistical inference in Chapter 5.

There are some applications where MLR.6 is clearly false, as can be demonstrated with simple introspection. Whenever $y$ takes on just a few values, it cannot have anything close to a normal distribution. The dependent variable in Example 3.5 provides a good example. The variable narr86, the number of times a young man was arrested in 1986, takes on a small range of integer values and is zero for most men. Thus, narr86 is far from being normally distributed. What can be done in these cases? As we will see in Chapter 5 (and this is important), nonnormality of the errors is not a serious problem with large sample sizes. For now, we just make the normality assumption.

Normality of the error term translates into normal sampling distributions of the OLS estimators:

Theorem 4.1 (Normal Sampling Distributions). Under the CLM assumptions MLR.1 through MLR.6, conditional on the sample values of the independent variables,

$$\hat\beta_j \sim \mathrm{Normal}\big[\beta_j,\ \mathrm{Var}(\hat\beta_j)\big], \qquad (4.1)$$

where $\mathrm{Var}(\hat\beta_j)$ was given in Chapter 3 [equation (3.51)]. Therefore,

$$(\hat\beta_j - \beta_j)/\mathrm{sd}(\hat\beta_j) \sim \mathrm{Normal}(0, 1).$$

The proof of (4.1) is not that difficult, given the properties of normally distributed random variables in Appendix B. Each $\hat\beta_j$ can be written as $\hat\beta_j = \beta_j + \sum_{i=1}^n w_{ij}u_i$, where $w_{ij} = \hat r_{ij}/\mathrm{SSR}_j$, $\hat r_{ij}$ is the $i$-th residual from the regression of $x_j$ on all the other independent variables, and $\mathrm{SSR}_j$ is the sum of squared residuals from this regression [see equation (3.62)]. Since the $w_{ij}$ depend only on the independent variables, they can be treated as nonrandom. Thus, $\hat\beta_j$ is just a linear combination of the errors in the sample, $\{u_i : i = 1, 2, \ldots, n\}$.
Under Assumption MLR.6 (and the random sampling Assumption MLR.2), the errors are independent, identically distributed $\mathrm{Normal}(0, \sigma^2)$ random variables. An important fact about independent normal random variables is that a linear combination of such random variables is normally distributed (see Appendix B). This basically completes the proof. In Section 3.3, we showed that $E(\hat\beta_j) = \beta_j$, and we derived $\mathrm{Var}(\hat\beta_j)$ in Section 3.4; there is no need to re-derive these facts. The second part of this theorem follows immediately from the fact that when we standardize a normal random variable by subtracting off its mean and dividing by its standard deviation, we end up with a standard normal random variable.

The conclusions of Theorem 4.1 can be strengthened. In addition to (4.1), any linear combination of the $\hat\beta_0, \hat\beta_1, \ldots, \hat\beta_k$ is also normally distributed, and any subset of the $\hat\beta_j$ has a joint normal distribution. These facts underlie the testing results in the remainder of this chapter. In Chapter 5, we will show that the normality of the OLS estimators is still approximately true in large samples even without normality of the errors.

4.2 Testing Hypotheses about a Single Population Parameter: The t Test

This section covers the very important topic of testing hypotheses about any single parameter in the population regression function. The population model can be written as

$$y = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k + u, \qquad (4.2)$$

and we assume that it satisfies the CLM assumptions. We know that OLS produces unbiased estimators of the $\beta_j$. In this section, we study how to test hypotheses about a particular $\beta_j$. For a full understanding of hypothesis testing, one must remember that the $\beta_j$ are unknown features of the population, and we will never know them with certainty. Nevertheless, we can hypothesize about the value of $\beta_j$ and then use statistical inference to test our hypothesis.

In order to construct hypothesis tests, we need the following result:

Theorem 4.2 (t Distribution for the Standardized Estimators). Under the CLM assumptions MLR.1 through MLR.6,

$$(\hat\beta_j - \beta_j)/\mathrm{se}(\hat\beta_j) \sim t_{n-k-1} = t_{df}, \qquad (4.3)$$

where $k + 1$ is the number of unknown parameters in the population model $y = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k + u$ ($k$ slope parameters and the intercept $\beta_0$) and $n - k - 1$ is the degrees of freedom (df).

This result differs from Theorem 4.1 in some notable respects. Theorem 4.1 showed that, under the CLM assumptions, $(\hat\beta_j - \beta_j)/\mathrm{sd}(\hat\beta_j) \sim \mathrm{Normal}(0, 1)$. The t distribution in (4.3) comes from the fact that the constant $\sigma$ in $\mathrm{sd}(\hat\beta_j)$ has been replaced with the random variable $\hat\sigma$. The proof that this leads to a t distribution with $n - k - 1$ degrees of freedom is difficult and not especially instructive. Essentially, the proof shows that (4.3) can be written as the ratio of the standard normal random variable $(\hat\beta_j - \beta_j)/\mathrm{sd}(\hat\beta_j)$ over the square root of $\hat\sigma^2/\sigma^2$. These random variables can be shown to be independent, and $(n - k - 1)\hat\sigma^2/\sigma^2 \sim \chi^2_{n-k-1}$. The result then follows from the definition of a t random variable (see Section B.5).

Theorem 4.2 is important in that it allows us to test hypotheses involving the $\beta_j$.
In most applications, our primary interest lies in testing the null hypothesis

$$H_0{:}\ \beta_j = 0, \qquad (4.4)$$

where $j$ corresponds to any of the $k$ independent variables. It is important to understand what (4.4) means and to be able to describe this hypothesis in simple language for a particular application. Since $\beta_j$ measures the partial effect of $x_j$ on (the expected value of) $y$, after controlling for all other independent variables, (4.4) means that, once $x_1, x_2, \ldots, x_{j-1}, x_{j+1}, \ldots, x_k$ have been accounted for, $x_j$ has no effect on the expected value of $y$. We cannot state the null hypothesis as "$x_j$ does have a partial effect on $y$" because this is true for any value of $\beta_j$ other than zero. Classical testing is suited for testing simple hypotheses like (4.4).

[Exploring Further 4.1] Suppose that $u$ is independent of the explanatory variables, and it takes on the values $-2$, $-1$, $0$, $1$, and $2$, each with probability 1/5. Does this violate the Gauss-Markov assumptions? Does this violate the CLM assumptions?

As an example, consider the wage equation

$$\log(wage) = \beta_0 + \beta_1 educ + \beta_2 exper + \beta_3 tenure + u.$$

The null hypothesis $H_0{:}\ \beta_2 = 0$ means that, once education and tenure have been accounted for, the number of years in the workforce (exper) has no effect on hourly wage. This is an economically interesting hypothesis. If it is true, it implies that a person's work history prior to the current employment does not affect wage. If $\beta_2 > 0$, then prior work experience contributes to productivity, and hence to wage.

You probably remember from your statistics course the rudiments of hypothesis testing for the mean from a normal population. (This is reviewed in Appendix C.) The mechanics of testing (4.4) in the multiple regression context are very similar. The hard part is obtaining the coefficient estimates, the standard errors, and the critical values, but most of this work is done automatically by econometrics software. Our job is to learn how regression output can be used to test hypotheses of interest.

The statistic we use to test (4.4), against any alternative, is called "the" t statistic or "the" t ratio of $\hat\beta_j$ and is defined as

$$t_{\hat\beta_j} \equiv \hat\beta_j/\mathrm{se}(\hat\beta_j). \qquad (4.5)$$

We have put "the" in quotation marks because, as we will see shortly, a more general form of the t statistic is needed for testing other hypotheses about $\beta_j$. For now, it is important to know that (4.5) is suitable only for testing (4.4). For particular applications, it is helpful to index t statistics using the name of the independent variable; for example, $t_{educ}$ would be the t statistic for $\hat\beta_{educ}$.

The t statistic for $\hat\beta_j$ is simple to compute given $\hat\beta_j$ and its standard error. In fact, most regression packages do the division for you and report the t statistic along with each coefficient and its standard error.
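The null distribution claimed by Theorem 4.2 can also be checked by simulation. The sketch below uses simulated data (all values are arbitrary assumptions): it builds the t ratio in (4.5) by hand and verifies that, when the null is true, the ratio exceeds the 95th percentile of the $t_{n-k-1}$ distribution about 5% of the time.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n, reps = 30, 5000
x = rng.normal(size=n)                      # one regressor, so k = 1
X = np.column_stack([np.ones(n), x])
XtX_inv = np.linalg.inv(X.T @ X)
t_draws = np.empty(reps)
for r in range(reps):
    y = 1.0 + 0.0 * x + rng.normal(size=n)  # H0: beta1 = 0 is true by construction
    b = XtX_inv @ X.T @ y                   # OLS estimates
    resid = y - X @ b
    sigma2_hat = resid @ resid / (n - 2)    # n - k - 1 = n - 2 degrees of freedom
    se_b1 = np.sqrt(sigma2_hat * XtX_inv[1, 1])
    t_draws[r] = b[1] / se_b1               # the t ratio of equation (4.5)

# Fraction exceeding the 95th percentile of t(28): should be close to .05
print(np.mean(t_draws > stats.t.ppf(0.95, n - 2)))
```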
Before discussing how to use (4.5) formally to test $H_0{:}\ \beta_j = 0$, it is useful to see why $t_{\hat\beta_j}$ has features that make it reasonable as a test statistic to detect $\beta_j \ne 0$. First, since $\mathrm{se}(\hat\beta_j)$ is always positive, $t_{\hat\beta_j}$ has the same sign as $\hat\beta_j$: if $\hat\beta_j$ is positive, then so is $t_{\hat\beta_j}$, and if $\hat\beta_j$ is negative, so is $t_{\hat\beta_j}$. Second, for a given value of $\mathrm{se}(\hat\beta_j)$, a larger value of $\hat\beta_j$ leads to larger values of $t_{\hat\beta_j}$; if $\hat\beta_j$ becomes more negative, so does $t_{\hat\beta_j}$.

Since we are testing $H_0{:}\ \beta_j = 0$, it is only natural to look at our unbiased estimator of $\beta_j$, $\hat\beta_j$, for guidance. In any interesting application, the point estimate $\hat\beta_j$ will never exactly be zero, whether or not $H_0$ is true. The question is: how far is $\hat\beta_j$ from zero? A sample value of $\hat\beta_j$ very far from zero provides evidence against $H_0{:}\ \beta_j = 0$. However, we must recognize that there is a sampling error in our estimate $\hat\beta_j$, so the size of $\hat\beta_j$ must be weighed against its sampling error. Since the standard error of $\hat\beta_j$ is an estimate of the standard deviation of $\hat\beta_j$, $t_{\hat\beta_j}$ measures how many estimated standard deviations $\hat\beta_j$ is away from zero. This is precisely what we do in testing whether the mean of a population is zero, using the standard t statistic from introductory statistics. Values of $t_{\hat\beta_j}$ sufficiently far from zero will result in a rejection of $H_0$. The precise rejection rule depends on the alternative hypothesis and the chosen significance level of the test.

Determining a rule for rejecting (4.4) at a given significance level (that is, the probability of rejecting $H_0$ when it is true) requires knowing the sampling distribution of $t_{\hat\beta_j}$ when $H_0$ is true. From Theorem 4.2, we know this to be $t_{n-k-1}$. This is the key theoretical result needed for testing (4.4).

Before proceeding, it is important to remember that we are testing hypotheses about the population parameters. We are not testing hypotheses about the estimates from a particular sample. Thus, it never makes sense to state a null hypothesis as "$H_0{:}\ \hat\beta_1 = 0$" or, even worse, as "$H_0{:}\ .237 = 0$" when the estimate of a parameter is .237 in the sample. We are testing whether the unknown population value, $\beta_1$, is zero.

Some treatments of regression analysis define the t statistic as the absolute value of (4.5), so that the t statistic is always positive. This practice has the drawback of making testing against one-sided alternatives clumsy. Throughout this text, the t statistic always has the same sign as the corresponding OLS coefficient estimate.

4.2a Testing against One-Sided Alternatives

To determine a rule for rejecting $H_0$, we need to decide on the relevant alternative hypothesis. First, consider a one-sided alternative of the form

$$H_1{:}\ \beta_j > 0. \qquad (4.6)$$

When we state the alternative as in equation (4.6), we are really saying that the null hypothesis is $H_0{:}\ \beta_j \le 0$. For example, if $\beta_j$ is the coefficient on education in a wage regression, we only care about detecting that $\beta_j$ is different from zero when $\beta_j$ is actually positive. You may remember from introductory statistics that the null value that is hardest to reject in favor of (4.6) is $\beta_j = 0$. In other words, if we reject the null $\beta_j = 0$, then we automatically reject $\beta_j < 0$. Therefore, it suffices to act as if we are testing $H_0{:}\ \beta_j = 0$ against $H_1{:}\ \beta_j > 0$, effectively ignoring $\beta_j < 0$, and that is the approach we take in this book.

How should we choose a rejection rule? We must first decide on a significance level ("level" for short), or the probability of rejecting $H_0$ when it is in fact true. For concreteness, suppose we have decided on a 5% significance level, as this is the most popular choice.
Thus, we are willing to mistakenly reject $H_0$ when it is true 5% of the time. Now, while $t_{\hat\beta_j}$ has a t distribution under $H_0$ (so that it has zero mean), under the alternative $\beta_j > 0$, the expected value of $t_{\hat\beta_j}$ is positive. Thus, we are looking for a sufficiently large positive value of $t_{\hat\beta_j}$ in order to reject $H_0{:}\ \beta_j = 0$ in favor of $H_1{:}\ \beta_j > 0$. Negative values of $t_{\hat\beta_j}$ provide no evidence in favor of $H_1$.

The definition of "sufficiently large," with a 5% significance level, is the 95th percentile in a t distribution with $n - k - 1$ degrees of freedom; denote this by $c$. In other words, the rejection rule is that $H_0$ is rejected in favor of $H_1$ at the 5% significance level if

$$t_{\hat\beta_j} > c. \qquad (4.7)$$

By our choice of the critical value $c$, rejection of $H_0$ will occur for 5% of all random samples when $H_0$ is true.

The rejection rule in (4.7) is an example of a one-tailed test. To obtain $c$, we only need the significance level and the degrees of freedom. For example, for a 5% level test and with $n - k - 1 = 28$ degrees of freedom, the critical value is $c = 1.701$. If $t_{\hat\beta_j} \le 1.701$, then we fail to reject $H_0$ in favor of (4.6) at the 5% level. Note that a negative value for $t_{\hat\beta_j}$, no matter how large in absolute value, leads to a failure in rejecting $H_0$ in favor of (4.6). (See Figure 4.2.)

The same procedure can be used with other significance levels. For a 10% level test and if df = 21, the critical value is $c = 1.323$. For a 1% significance level and if df = 21, $c = 2.518$. All of these critical values are obtained directly from Table G.2. You should note a pattern in the critical values: as the significance level falls, the critical value increases, so that we require a larger and larger value of $t_{\hat\beta_j}$ in order to reject $H_0$. Thus, if $H_0$ is rejected at, say, the 5% level, then it is automatically rejected at the 10% level as well. It makes no sense to reject the null hypothesis at, say, the 5% level and then to redo the test to determine the outcome at the 10% level.

As the degrees of freedom in the t distribution get large, the t distribution approaches the standard normal distribution. For example, when $n - k - 1 = 120$, the 5% critical value for the one-sided alternative (4.7) is 1.658, compared with the standard normal value of 1.645. These are close enough for practical purposes; for degrees of freedom greater than 120, one can use the standard normal critical values.
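In practice, critical values like those in Table G.2 come from the quantile function of the t distribution. A minimal sketch (using scipy, which is an implementation choice, not something the text prescribes) reproduces the values quoted above:

```python
from scipy import stats

# One-sided critical values c = the (1 - alpha) quantile of t with df degrees
# of freedom, as used in rejection rule (4.7)
print(stats.t.ppf(0.95, 28))    # 5% level, 28 df: about 1.701
print(stats.t.ppf(0.90, 21))    # 10% level, 21 df: about 1.323
print(stats.t.ppf(0.99, 21))    # 1% level, 21 df: about 2.518
print(stats.t.ppf(0.95, 120))   # 5% level, 120 df: about 1.658 (normal: 1.645)
```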
Example 4.1 (Hourly Wage Equation). Using the data in WAGE1 gives the estimated equation

$$\widehat{\log(wage)} = \underset{(.104)}{.284} + \underset{(.007)}{.092}\,educ + \underset{(.0017)}{.0041}\,exper + \underset{(.003)}{.022}\,tenure$$
$$n = 526,\ R^2 = .316,$$

where standard errors appear in parentheses below the estimated coefficients. (We will follow this convention throughout the text.) This equation can be used to test whether the return to exper, controlling for educ and tenure, is zero in the population, against the alternative that it is positive. Write this as $H_0{:}\ \beta_{exper} = 0$ versus $H_1{:}\ \beta_{exper} > 0$. In applications, indexing a parameter by its associated variable name is a nice way to label parameters, since the numerical indices that we use in the general model are arbitrary and can cause confusion. Remember that $\beta_{exper}$ denotes the unknown population parameter. It is nonsense to write "$H_0{:}\ .0041 = 0$" or "$H_0{:}\ \hat\beta_{exper} = 0$."

Since we have 522 degrees of freedom, we can use the standard normal critical values. The 5% critical value is 1.645, and the 1% critical value is 2.326. The t statistic for $\hat\beta_{exper}$ is

$$t_{exper} = .0041/.0017 \approx 2.41,$$

and so $\hat\beta_{exper}$, or exper, is statistically significant even at the 1% level. We also say that "$\hat\beta_{exper}$ is statistically greater than zero at the 1% significance level."

The estimated return for another year of experience, holding tenure and education fixed, is not especially large. For example, adding three more years increases $\log(wage)$ by $3(.0041) = .0123$, so wage is only about 1.2% higher. Nevertheless, we have persuasively shown that the partial effect of experience is positive in the population.

[Figure 4.2: 5% rejection rule for the alternative $H_1{:}\ \beta_j > 0$ with 28 df. The rejection region is the area of .05 to the right of the critical value 1.701.]

The one-sided alternative that the parameter is less than zero,

$$H_1{:}\ \beta_j < 0, \qquad (4.8)$$

also arises in applications. The rejection rule for alternative (4.8) is just the mirror image of the previous case. Now, the critical value comes from the left tail of the t distribution. In practice, it is easiest to think of the rejection rule as

$$t_{\hat\beta_j} < -c, \qquad (4.9)$$

where $c$ is the critical value for the alternative $H_1{:}\ \beta_j > 0$. For simplicity, we always assume $c$ is positive, since this is how critical values are reported in t tables, and so the critical value $-c$ is a negative number.

For example, if the significance level is 5% and the degrees of freedom is 18, then $c = 1.734$, and so $H_0{:}\ \beta_j = 0$ is rejected in favor of $H_1{:}\ \beta_j < 0$ at the 5% level if $t_{\hat\beta_j} < -1.734$. It is important to remember that, to reject $H_0$ against the negative alternative (4.8), we must get a negative t statistic. A positive t ratio, no matter how large, provides no evidence in favor of (4.8). The rejection rule is illustrated in Figure 4.3.

[Figure 4.3: 5% rejection rule for the alternative $H_1{:}\ \beta_j < 0$ with 18 df. The rejection region is the area of .05 to the left of the critical value $-1.734$.]

[Exploring Further 4.2] Let community loan approval rates be determined by $apprate = \beta_0 + \beta_1 percmin + \beta_2 avginc + \beta_3 avgwlth + \beta_4 avgdebt + u$, where percmin is the percentage minority in the community, avginc is average income, avgwlth is average wealth, and avgdebt is some measure of average debt obligations. How do you state the null hypothesis that there is no difference in loan rates across neighborhoods due to racial and ethnic composition, when average income, average wealth, and average debt have been controlled for? How do you state the alternative that there is discrimination against minorities in loan approval rates?
Example 4.2 (Student Performance and School Size). There is much interest in the effect of school size on student performance. (See, for example, The New York Times Magazine, 5/28/95.) One claim is that, everything else being equal, students at smaller schools fare better than those at larger schools. This hypothesis is assumed to be true even after accounting for differences in class sizes across schools.

The file MEAP93 contains data on 408 high schools in Michigan for the year 1993. We can use these data to test the null hypothesis that school size has no effect on standardized test scores against the alternative that size has a negative effect. Performance is measured by the percentage of students receiving a passing score on the Michigan Educational Assessment Program (MEAP) standardized tenth-grade math test (math10). School size is measured by student enrollment (enroll). The null hypothesis is $H_0{:}\ \beta_{enroll} = 0$, and the alternative is $H_1{:}\ \beta_{enroll} < 0$. For now, we will control for two other factors, average annual teacher compensation (totcomp) and the number of staff per one thousand students (staff). Teacher compensation is a measure of teacher quality, and staff size is a rough measure of how much attention students receive.

The estimated equation, with standard errors in parentheses, is

$$\widehat{math10} = \underset{(6.113)}{2.274} + \underset{(.00010)}{.00046}\,totcomp + \underset{(.040)}{.048}\,staff - \underset{(.00022)}{.00020}\,enroll$$
$$n = 408,\ R^2 = .0541.$$

The coefficient on enroll, $-.00020$, is in accordance with the conjecture that larger schools hamper performance: higher enrollment leads to a lower percentage of students with a passing tenth-grade math score. (The coefficients on totcomp and staff also have the signs we expect.) The fact that enroll has an estimated coefficient different from zero could just be due to sampling error; to be convinced of an effect, we need to conduct a t test.

Since $n - k - 1 = 408 - 4 = 404$, we use the standard normal critical value. At the 5% level, the critical value is $-1.65$; the t statistic on enroll must be less than $-1.65$ to reject $H_0$ at the 5% level. The t statistic on enroll is $-.00020/.00022 \approx -.91$, which is larger than $-1.65$: we fail to reject $H_0$ in favor of $H_1$ at the 5% level. In fact, the 15% critical value is $-1.04$, and since $-.91 > -1.04$, we fail to reject $H_0$ even at the 15% level. We conclude that enroll is not statistically significant at the 15% level.

The variable totcomp is statistically significant even at the 1% significance level because its t statistic is 4.6. On the other hand, the t statistic for staff is 1.2, and so we cannot reject $H_0{:}\ \beta_{staff} = 0$ against $H_1{:}\ \beta_{staff} > 0$ even at the 10% significance level. (The critical value is $c = 1.28$ from the standard normal distribution.)

To illustrate how changing functional form can affect our conclusions, we also estimate the model with all independent variables in logarithmic form. This allows, for example, the school size effect to diminish as school size increases. The estimated equation is

$$\widehat{math10} = -\underset{(48.70)}{207.66} + \underset{(4.06)}{21.16}\,\log(totcomp) + \underset{(4.19)}{3.98}\,\log(staff) - \underset{(0.69)}{1.29}\,\log(enroll)$$
$$n = 408,\ R^2 = .0654.$$

The t statistic on $\log(enroll)$ is about $-1.87$; since this is below the 5% critical value, $-1.65$, we reject $H_0{:}\ \beta_{\log(enroll)} = 0$ in favor of $H_1{:}\ \beta_{\log(enroll)} < 0$ at the 5% level.
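Regressions like those in Example 4.2 are straightforward to reproduce with standard software. The sketch below assumes the third-party `wooldridge` Python package and its `data()` loader to obtain MEAP93; the package name, loader call, and column names are assumptions about that package, not anything the text prescribes.

```python
import numpy as np
import statsmodels.formula.api as smf
import wooldridge  # assumed third-party package shipping the textbook data sets

df = wooldridge.data("meap93")

# Level form: math10 on totcomp, staff, enroll
level = smf.ols("math10 ~ totcomp + staff + enroll", data=df).fit()
# Log form: all explanatory variables in logarithms
logs = smf.ols("math10 ~ np.log(totcomp) + np.log(staff) + np.log(enroll)",
               data=df).fit()

print(level.tvalues["enroll"])           # roughly -0.91
print(logs.tvalues["np.log(enroll)"])    # roughly -1.87
```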
In Chapter 2, we encountered a model where the dependent variable appeared in its original form (called level form), while the independent variable appeared in log form (called a level-log model). The interpretation of the parameters is the same in the multiple regression context, except, of course, that we can give the parameters a ceteris paribus interpretation. Holding totcomp and staff fixed, we have $\Delta\widehat{math10} = -1.29[\Delta\log(enroll)]$, so that

$$\Delta\widehat{math10} \approx -(1.29/100)(\%\Delta enroll) \approx -.013(\%\Delta enroll).$$

Once again, we have used the fact that the change in $\log(enroll)$, when multiplied by 100, is approximately the percentage change in enroll. Thus, if enrollment is 10% higher at a school, $\widehat{math10}$ is predicted to be $.013(10) = .13$ percentage points lower (math10 is measured as a percentage).

Which model do we prefer: the one using the level of enroll or the one using $\log(enroll)$? In the level-level model, enrollment does not have a statistically significant effect, but in the level-log model it does. This translates into a higher R-squared for the level-log model, which means we explain more of the variation in math10 by using enroll in logarithmic form (6.5% to 5.4%). The level-log model is preferred because it more closely captures the relationship between math10 and enroll. We will say more about using R-squared to choose functional form in Chapter 6.

4.2b Two-Sided Alternatives

In applications, it is common to test the null hypothesis $H_0{:}\ \beta_j = 0$ against a two-sided alternative; that is,

$$H_1{:}\ \beta_j \ne 0. \qquad (4.10)$$

Under this alternative, $x_j$ has a ceteris paribus effect on $y$ without specifying whether the effect is positive or negative. This is the relevant alternative when the sign of $\beta_j$ is not well determined by theory (or common sense). Even when we know whether $\beta_j$ is positive or negative under the alternative, a two-sided test is often prudent. At a minimum, using a two-sided alternative prevents us from looking at the estimated equation and then basing the alternative on whether $\hat\beta_j$ is positive or negative. Using the regression estimates to help us formulate the null or alternative hypotheses is not allowed, because classical statistical inference presumes that we state the null and alternative about the population before looking at the data. For example, we should not first estimate the equation relating math performance to enrollment, note that the estimated effect is negative, and then decide the relevant alternative is $H_1{:}\ \beta_{enroll} < 0$.

When the alternative is two-sided, we are interested in the absolute value of the t statistic. The rejection rule for $H_0{:}\ \beta_j = 0$ against (4.10) is

$$|t_{\hat\beta_j}| > c, \qquad (4.11)$$

where $|\cdot|$ denotes absolute value and $c$ is an appropriately chosen critical value. To find $c$, we again specify a significance level, say 5%. For a two-tailed test, $c$ is chosen to make the area in each tail of the t distribution equal 2.5%. In other words, $c$ is the 97.5th percentile in the t distribution with $n - k - 1$ degrees of freedom.
When $n - k - 1 = 25$, the 5% critical value for a two-sided test is $c = 2.060$. Figure 4.4 provides an illustration of this distribution.

[Figure 4.4: 5% rejection rule for the alternative $H_1{:}\ \beta_j \ne 0$ with 25 df. The rejection regions are the areas of .025 to the left of $-2.06$ and to the right of 2.06.]

When a specific alternative is not stated, it is usually considered to be two-sided. In the remainder of this text, the default will be a two-sided alternative, and 5% will be the default significance level. When carrying out empirical econometric analysis, it is always a good idea to be explicit about the alternative and the significance level. If $H_0$ is rejected in favor of (4.10) at the 5% level, we usually say that "$x_j$ is statistically significant, or statistically different from zero, at the 5% level." If $H_0$ is not rejected, we say that "$x_j$ is statistically insignificant at the 5% level."

Example 4.3 (Determinants of College GPA). We use the data in GPA1 to estimate a model explaining college GPA (colGPA), with the average number of lectures missed per week (skipped) as an additional explanatory variable. The estimated model is

$$\widehat{colGPA} = \underset{(.33)}{1.39} + \underset{(.094)}{.412}\,hsGPA + \underset{(.011)}{.015}\,ACT - \underset{(.026)}{.083}\,skipped$$
$$n = 141,\ R^2 = .234.$$

We can easily compute t statistics to see which variables are statistically significant, using a two-sided alternative in each case. The 5% critical value is about 1.96, since the degrees of freedom ($141 - 4 = 137$) is large enough to use the standard normal approximation. The 1% critical value is about 2.58.

The t statistic on hsGPA is 4.38, which is significant at very small significance levels. Thus, we say that "hsGPA is statistically significant at any conventional significance level." The t statistic on ACT is 1.36, which is not statistically significant at the 10% level against a two-sided alternative. The coefficient on ACT is also practically small: a 10-point increase in ACT, which is large, is predicted to increase colGPA by only .15 points. Thus, the variable ACT is practically, as well as statistically, insignificant.

The coefficient on skipped has a t statistic of $-.083/.026 \approx -3.19$, so skipped is statistically significant at the 1% significance level ($3.19 > 2.58$). This coefficient means that another lecture missed per week lowers predicted colGPA by about .083. Thus, holding hsGPA and ACT fixed, the predicted difference in colGPA between a student who misses no lectures per week and a student who misses five lectures per week is about .42. Remember that this says nothing about specific students; rather, .42 is the estimated average across a subpopulation of students.

In this example, for each variable in the model, we could argue that a one-sided alternative is appropriate. The variables hsGPA and skipped are very significant using a two-tailed test and have the signs that we expect, so there is no reason to do a one-tailed test. On the other hand, against a one-sided alternative ($\beta_3 > 0$), ACT is significant at the 10% level but not at the 5% level. This does not change the fact that the coefficient on ACT is pretty small.
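The t ratios quoted in Example 4.3 are just the reported coefficients divided by their standard errors, per equation (4.5); a few lines of arithmetic reproduce them:

```python
# Coefficients and standard errors as reported in Example 4.3
coefs = {"hsGPA": (0.412, 0.094), "ACT": (0.015, 0.011), "skipped": (-0.083, 0.026)}
for name, (b, se) in coefs.items():
    print(f"{name}: t = {b / se:.2f}")
# hsGPA: t = 4.38, ACT: t = 1.36, skipped: t = -3.19
```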
4.2c Testing Other Hypotheses about $\beta_j$

Although $H_0{:}\ \beta_j = 0$ is the most common hypothesis, we sometimes want to test whether $\beta_j$ is equal to some other given constant. Two common examples are $\beta_j = 1$ and $\beta_j = -1$. Generally, if the null is stated as

$$H_0{:}\ \beta_j = a_j, \qquad (4.12)$$

where $a_j$ is our hypothesized value of $\beta_j$, then the appropriate t statistic is

$$t = (\hat\beta_j - a_j)/\mathrm{se}(\hat\beta_j).$$

As before, $t$ measures how many estimated standard deviations $\hat\beta_j$ is away from the hypothesized value of $\beta_j$. The general t statistic is usefully written as

$$t = \frac{\text{estimate} - \text{hypothesized value}}{\text{standard error}}. \qquad (4.13)$$

Under (4.12), this t statistic is distributed as $t_{n-k-1}$, from Theorem 4.2. The usual t statistic is obtained when $a_j = 0$.

We can use the general t statistic to test against one-sided or two-sided alternatives. For example, if the null and alternative hypotheses are $H_0{:}\ \beta_j = 1$ and $H_1{:}\ \beta_j > 1$, then we find the critical value for a one-sided alternative exactly as before: the difference is in how we compute the t statistic, not in how we obtain the appropriate $c$. We reject $H_0$ in favor of $H_1$ if $t > c$. In this case, we would say that "$\hat\beta_j$ is statistically greater than one" at the appropriate significance level.

Example 4.4 (Campus Crime and Enrollment). Consider a simple model relating the annual number of crimes on college campuses (crime) to student enrollment (enroll):

$$\log(crime) = \beta_0 + \beta_1\log(enroll) + u.$$

This is a constant elasticity model, where $\beta_1$ is the elasticity of crime with respect to enrollment. It is not much use to test $H_0{:}\ \beta_1 = 0$, as we expect the total number of crimes to increase as the size of the campus increases. A more interesting hypothesis to test would be that the elasticity of crime with respect to enrollment is one: $H_0{:}\ \beta_1 = 1$. This means that a 1% increase in enrollment leads to, on average, a 1% increase in crime. A noteworthy alternative is $H_1{:}\ \beta_1 > 1$, which implies that a 1% increase in enrollment increases campus crime by more than 1%. If $\beta_1 > 1$, then, in a relative sense (not just an absolute sense), crime is more of a problem on larger campuses. One way to see this is to take the exponential of the equation:

$$crime = \exp(\beta_0)\,enroll^{\beta_1}\exp(u).$$

(See Appendix A for properties of the natural logarithm and exponential functions.) For $\beta_0 = 0$ and $u = 0$, this equation is graphed in Figure 4.5 for $\beta_1 < 1$, $\beta_1 = 1$, and $\beta_1 > 1$.

We test $\beta_1 = 1$ against $\beta_1 > 1$ using data on 97 colleges and universities in the United States for the year 1992, contained in the data file CAMPUS. The data come from the FBI's Uniform Crime Reports; the average number of campus crimes in the sample is about 394, while the average enrollment is about 16,076. The estimated equation, with estimates and standard errors rounded to two decimal places, is

$$\widehat{\log(crime)} = -\underset{(1.03)}{6.63} + \underset{(0.11)}{1.27}\,\log(enroll)$$
$$n = 97,\ R^2 = .585. \qquad (4.14)$$

The estimated elasticity of crime with respect to enroll, 1.27, is in the direction of the alternative, $\beta_1 > 1$. But is there enough evidence to conclude that $\beta_1 > 1$? We need to be careful in testing this hypothesis, especially because the statistical output of standard regression packages is much more complex than the simplified output reported in equation (4.14). Our first instinct might be to construct "the" t statistic by taking the coefficient on $\log(enroll)$ and dividing it by its standard error, which is the t statistic reported by a regression package.
But this is the wrong statistic for testing $H_0{:}\ \beta_1 = 1$. The correct t statistic is obtained from (4.13): we subtract the hypothesized value, unity, from the estimate and divide the result by the standard error of $\hat\beta_1$:

$$t = (1.27 - 1)/.11 = .27/.11 \approx 2.45.$$

The one-sided 5% critical value for a t distribution with $97 - 2 = 95$ df is about 1.66 (using df = 120), so we clearly reject $\beta_1 = 1$ in favor of $\beta_1 > 1$ at the 5% level. In fact, the 1% critical value is about 2.37, and so we reject the null in favor of the alternative at even the 1% level.

We should keep in mind that this analysis holds no other factors constant, so the elasticity of 1.27 is not necessarily a good estimate of the ceteris paribus effect. It could be that larger enrollments are correlated with other factors that cause higher crime: larger schools might be located in higher crime areas. We could control for this by collecting data on crime rates in the local city.

[Figure 4.5: Graph of $crime = enroll^{\beta_1}$ for $\beta_1 < 1$, $\beta_1 = 1$, and $\beta_1 > 1$.]

For a two-sided alternative, for example $H_0{:}\ \beta_j = -1$, $H_1{:}\ \beta_j \ne -1$, we still compute the t statistic as in (4.13): $t = (\hat\beta_j + 1)/\mathrm{se}(\hat\beta_j)$. (Notice how subtracting $-1$ means adding 1.) The rejection rule is the usual one for a two-sided test: reject $H_0$ if $|t| > c$, where $c$ is a two-tailed critical value. If $H_0$ is rejected, we say that "$\hat\beta_j$ is statistically different from negative one" at the appropriate significance level.

Example 4.5 (Housing Prices and Air Pollution). For a sample of 506 communities in the Boston area, we estimate a model relating median housing price (price) in the community to various community characteristics: nox is the amount of nitrogen oxide in the air, in parts per million; dist is a weighted distance of the community from five employment centers, in miles; rooms is the average number of rooms in houses in the community; and stratio is the average student-teacher ratio of schools in the community. The population model is

$$\log(price) = \beta_0 + \beta_1\log(nox) + \beta_2\log(dist) + \beta_3 rooms + \beta_4 stratio + u.$$

Thus, $\beta_1$ is the elasticity of price with respect to nox. We wish to test $H_0{:}\ \beta_1 = -1$ against the alternative $H_1{:}\ \beta_1 \ne -1$. The t statistic for doing this test is $t = (\hat\beta_1 + 1)/\mathrm{se}(\hat\beta_1)$. Using the data in HPRICE2, the estimated model is

$$\widehat{\log(price)} = \underset{(0.32)}{11.08} - \underset{(.117)}{.954}\,\log(nox) - \underset{(.043)}{.134}\,\log(dist) + \underset{(.019)}{.255}\,rooms - \underset{(.006)}{.052}\,stratio$$
$$n = 506,\ R^2 = .581.$$

The slope estimates all have the anticipated signs. Each coefficient is statistically different from zero at very small significance levels, including the coefficient on $\log(nox)$. But we do not want to test that $\beta_1 = 0$. The null hypothesis of interest is $H_0{:}\ \beta_1 = -1$, with corresponding t statistic $(-.954 + 1)/.117 \approx .39$. There is little need to look in the t table for a critical value when the t statistic is this small: the estimated elasticity is not statistically different from $-1$, even at very large significance levels. Controlling for the factors we have included, there is little evidence that the elasticity is different from $-1$.
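The general t statistic in (4.13) is equally easy to compute by hand. The sketch below reproduces the tests from Examples 4.4 and 4.5 using the reported estimates and standard errors; the helper function name is ours, not the text's.

```python
from scipy import stats

def t_general(estimate, hypothesized, se):
    """General t statistic of equation (4.13)."""
    return (estimate - hypothesized) / se

# Example 4.4: H0: beta1 = 1 against H1: beta1 > 1 (crime elasticity), 95 df
t1 = t_general(1.27, 1.0, 0.11)
print(t1, stats.t.sf(t1, df=95))             # t ~ 2.45, one-sided p ~ .008

# Example 4.5: H0: beta1 = -1 against H1: beta1 != -1 (nox elasticity), 501 df
t2 = t_general(-0.954, -1.0, 0.117)
print(t2, 2 * stats.t.sf(abs(t2), df=501))   # t ~ .39, two-sided p ~ .69
```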
4.2d Computing p-Values for t Tests

So far, we have talked about how to test hypotheses using a classical approach: after stating the alternative hypothesis, we choose a significance level, which then determines a critical value. Once the critical value has been identified, the value of the t statistic is compared with the critical value, and the null is either rejected or not rejected at the given significance level.

Even after deciding on the appropriate alternative, there is a component of arbitrariness to the classical approach, which results from having to choose a significance level ahead of time. Different researchers prefer different significance levels, depending on the particular application; there is no "correct" significance level.

Committing to a significance level ahead of time can hide useful information about the outcome of a hypothesis test. For example, suppose that we wish to test the null hypothesis that a parameter is zero against a two-sided alternative, and with 40 degrees of freedom we obtain a t statistic equal to 1.85. The null hypothesis is not rejected at the 5% level, since the t statistic is less than the two-tailed critical value of $c = 2.021$. A researcher whose agenda is not to reject the null could simply report this outcome along with the estimate: "the null hypothesis is not rejected at the 5% level." Of course, if the t statistic, or the coefficient and its standard error, are reported, then we can also determine that the null hypothesis would be rejected at the 10% level, since the 10% critical value is $c = 1.684$.

Rather than testing at different significance levels, it is more informative to answer the following question: given the observed value of the t statistic, what is the smallest significance level at which the null hypothesis would be rejected? This level is known as the p-value for the test (see Appendix C). In the previous example, we know the p-value is greater than .05, since the null is not rejected at the 5% level, and we know that the p-value is less than .10, since the null is rejected at the 10% level. We obtain the actual p-value by computing the probability that a t random variable, with 40 df, is larger than 1.85 in absolute value. That is, the p-value is the significance level of the test when we use the value of the test statistic, 1.85 in the above example, as the critical value for the test. This p-value is shown in Figure 4.6.

Because a p-value is a probability, its value is always between zero and one. In order to compute p-values, we either need extremely detailed printed tables of the t distribution (which is not very practical) or a computer program that computes areas under the probability density function of the t distribution. Most modern regression packages have this capability. Some packages compute p-values routinely with each OLS regression, but only for certain hypotheses. If a regression package reports a p-value along with the standard OLS output, it is almost certainly the p-value for testing the null hypothesis $H_0{:}\ \beta_j = 0$ against the two-sided alternative. The p-value in this case is

$$P(|T| > |t|), \qquad (4.15)$$

where, for clarity, we let $T$ denote a t distributed random variable with $n - k - 1$ degrees of freedom and let $t$ denote the numerical value of the test statistic.
The p-value nicely summarizes the strength or weakness of the empirical evidence against the null hypothesis. Perhaps its most useful interpretation is the following: the p-value is the probability of observing a t statistic as extreme as we did if the null hypothesis is true. This means that small p-values are evidence against the null; large p-values provide little evidence against $H_0$. For example, if the p-value $= .50$ (reported always as a decimal, not a percentage), then we would observe a value of the t statistic as extreme as we did in 50% of all random samples when the null hypothesis is true; this is pretty weak evidence against $H_0$.

In the example with df = 40 and $t = 1.85$, the p-value is computed as

$$\text{p-value} = P(|T| > 1.85) = 2P(T > 1.85) = 2(.0359) = .0718,$$

where $P(T > 1.85)$ is the area to the right of 1.85 in a t distribution with 40 df. (This value was computed using the econometrics package Stata; it is not available in Table G.2.) This means that, if the null hypothesis is true, we would observe an absolute value of the t statistic as large as 1.85 about 7.2 percent of the time. This provides some evidence against the null hypothesis, but we would not reject the null at the 5% significance level.

[Figure 4.6: Obtaining the p-value against a two-sided alternative when $t = 1.85$ and df = 40. Each tail beyond $\pm 1.85$ has area .0359; the area between $-1.85$ and 1.85 is .9282.]

The previous example illustrates that once the p-value has been computed, a classical test can be carried out at any desired level. If $\alpha$ denotes the significance level of the test (in decimal form), then $H_0$ is rejected if p-value $< \alpha$; otherwise, $H_0$ is not rejected at the $100\cdot\alpha$% level.

Computing p-values for one-sided alternatives is also quite simple. Suppose, for example, that we test $H_0{:}\ \beta_j = 0$ against $H_1{:}\ \beta_j > 0$. If $\hat\beta_j < 0$, then computing a p-value is not important: we know that the p-value is greater than .50, which will never cause us to reject $H_0$ in favor of $H_1$. If $\hat\beta_j > 0$, then $t > 0$, and the p-value is just the probability that a t random variable with the appropriate df exceeds the value $t$. Some regression packages only compute p-values for two-sided alternatives. But it is simple to obtain the one-sided p-value: just divide the two-sided p-value by 2.

If the alternative is $H_1{:}\ \beta_j < 0$, it makes sense to compute a p-value if $\hat\beta_j < 0$ (and hence $t < 0$): p-value $= P(T < t) = P(T > |t|)$, because the t distribution is symmetric about zero. Again, this can be obtained as one-half of the p-value for the two-tailed test.

Because you will quickly become familiar with the magnitudes of t statistics that lead to statistical significance, especially for large sample sizes, it is not always crucial to report p-values for t statistics. But it does not hurt to report them. Further, when we discuss F testing in Section 4.5, we will see that it is important to compute p-values, because critical values for F tests are not so easily memorized.
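Modern software computes these tail areas directly. A minimal sketch reproducing the p-value of Figure 4.6 with scipy (an implementation choice; the text used Stata):

```python
from scipy import stats

# Two-sided p-value for t = 1.85 with 40 df, per equation (4.15)
t_stat, df = 1.85, 40
two_sided = 2 * stats.t.sf(abs(t_stat), df)
print(two_sided)        # about .0718, as in the text
# For a one-sided alternative in the direction of the estimate, halve it:
print(two_sided / 2)    # about .0359
```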
4.2e A Reminder on the Language of Classical Hypothesis Testing

When $H_0$ is not rejected, we prefer to use the language "we fail to reject $H_0$ at the x% level," rather than "$H_0$ is accepted at the x% level." We can use Example 4.5 to illustrate why the former statement is preferred. In this example, the estimated elasticity of price with respect to nox is $-.954$, and the t statistic for testing $H_0{:}\ \beta_{nox} = -1$ is $t = .39$; therefore, we cannot reject $H_0$. But there are many other values for $\beta_{nox}$ (more than we can count) that cannot be rejected. For example, the t statistic for $H_0{:}\ \beta_{nox} = -.9$ is $(-.954 + .9)/.117 \approx -.46$, and so this null is not rejected either. Clearly, $\beta_{nox} = -1$ and $\beta_{nox} = -.9$ cannot both be true, so it makes no sense to say that we "accept" either of these hypotheses. All we can say is that the data do not allow us to reject either of these hypotheses at the 5% significance level.

4.2f Economic, or Practical, versus Statistical Significance

Because we have emphasized statistical significance throughout this section, now is a good time to remember that we should pay attention to the magnitude of the coefficient estimates in addition to the size of the t statistics. The statistical significance of a variable $x_j$ is determined entirely by the size of $t_{\hat\beta_j}$, whereas the economic significance, or practical significance, of a variable is related to the size (and sign) of $\hat\beta_j$.

Recall that the t statistic for testing $H_0{:}\ \beta_j = 0$ is defined by dividing the estimate by its standard error: $t_{\hat\beta_j} = \hat\beta_j/\mathrm{se}(\hat\beta_j)$. Thus, $t_{\hat\beta_j}$ can indicate statistical significance either because $\hat\beta_j$ is "large" or because $\mathrm{se}(\hat\beta_j)$ is "small." It is important in practice to distinguish between these reasons for statistically significant t statistics. Too much focus on statistical significance can lead to the false conclusion that a variable is "important" for explaining $y$, even though its estimated effect is modest.

[Exploring Further 4.3] Suppose you estimate a regression model and obtain $\hat\beta_1 = .56$ and p-value $= .086$ for testing $H_0{:}\ \beta_1 = 0$ against $H_1{:}\ \beta_1 \ne 0$. What is the p-value for testing $H_0{:}\ \beta_1 = 0$ against $H_1{:}\ \beta_1 > 0$?

Example 4.6 (Participation Rates in 401(k) Plans). In Example 3.3, we used the data on 401(k) plans to estimate a model describing participation rates in terms of the firm's match rate and the age of the plan. We now include a measure of firm size, the total number of firm employees (totemp). The estimated equation is

$$\widehat{prate} = \underset{(0.78)}{80.29} + \underset{(0.52)}{5.44}\,mrate + \underset{(.045)}{.269}\,age - \underset{(.00004)}{.00013}\,totemp$$
$$n = 1{,}534,\ R^2 = .100.$$

The smallest t statistic in absolute value is that on the variable totemp: $t = -.00013/.00004 = -3.25$, and this is statistically significant at very small significance levels. (The two-tailed p-value for this t statistic is about .001.) Thus, all of the variables are statistically significant at rather small significance levels.

How big, in a practical sense, is the coefficient on totemp? Holding mrate and age fixed, if a firm grows by 10,000 employees, the participation rate falls by $10{,}000(.00013) = 1.3$ percentage points. This is a huge increase in the number of employees with only a modest effect on the participation rate. Thus, although firm size does affect the participation rate, the effect is not practically very large.
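The distinction Example 4.6 draws between statistical and practical significance is visible in two lines of arithmetic using the reported estimates:

```python
# Example 4.6: totemp is statistically but not practically significant
b_totemp, se_totemp = -0.00013, 0.00004
print(b_totemp / se_totemp)   # t statistic ~ -3.25: statistically significant
print(10_000 * b_totemp)      # 10,000 more employees: prate falls ~1.3 points
```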
The previous example shows that it is especially important to interpret the magnitude of the coefficient, in addition to looking at t statistics, when working with large samples. With large sample sizes, parameters can be estimated very precisely: standard errors are often quite small relative to the coefficient estimates, which usually results in statistical significance.

Some researchers insist on using smaller significance levels as the sample size increases, partly as a way to offset the fact that standard errors are getting smaller. For example, if we feel comfortable with a 5% level when $n$ is a few hundred, we might use the 1% level when $n$ is a few thousand. Using a smaller significance level means that economic and statistical significance are more likely to coincide, but there are no guarantees: in the previous example, even if we use a significance level as small as .1% (one-tenth of 1%), we would still conclude that totemp is statistically significant.

Many researchers are also willing to entertain larger significance levels in applications with small sample sizes, reflecting the fact that it is harder to find significance with smaller sample sizes. (Smaller sample sizes lead to less precise estimators, and the critical values are larger in magnitude, two factors that make it harder to find statistical significance.) Unfortunately, one's willingness to consider higher significance levels can depend on one's underlying agenda.

Example 4.7 (Effect of Job Training on Firm Scrap Rates). The scrap rate for a manufacturing firm is the number of defective items (products that must be discarded) out of every 100 produced. Thus, for a given number of items produced, a decrease in the scrap rate reflects higher worker productivity. We can use the scrap rate to measure the effect of worker training on productivity. Using the data in JTRAIN, but only for the year 1987 and for nonunionized firms, we obtain the following estimated equation:

$$\widehat{\log(scrap)} = \underset{(5.69)}{12.46} - \underset{(.023)}{.029}\,hrsemp - \underset{(.453)}{.962}\,\log(sales) + \underset{(.407)}{.761}\,\log(employ)$$
$$n = 29,\ R^2 = .262.$$

The variable hrsemp is annual hours of training per employee, sales is annual firm sales (in dollars), and employ is the number of firm employees. For 1987, the average scrap rate in the sample is about 4.6, and the average of hrsemp is about 8.9.

The main variable of interest is hrsemp. One more hour of training per employee lowers $\log(scrap)$ by .029, which means the scrap rate is about 2.9% lower. Thus, if hrsemp increases by 5 (each employee is trained 5 more hours per year), the scrap rate is estimated to fall by $5(2.9) = 14.5$%. This seems like a reasonably large effect, but whether the additional training is worthwhile to the firm depends on the cost of training and the benefits from a lower scrap rate. We do not have the numbers needed to do a cost-benefit analysis, but the estimated effect seems nontrivial.

What about the statistical significance of the training variable? The t statistic on hrsemp is $-.029/.023 \approx -1.26$, and now you probably recognize this as not being large enough in magnitude to conclude that hrsemp is statistically significant at the 5% level. In fact, with $29 - 4 = 25$ degrees of freedom, for the one-sided alternative $H_1{:}\ \beta_{hrsemp} < 0$, the 5% critical value is about $-1.71$. Thus, using a strict 5% level test, we must conclude that hrsemp is not statistically significant, even using a one-sided alternative.

Because the sample size is pretty small, we might be more liberal with the significance level. The 10% critical value is $-1.32$, and so hrsemp is almost significant against the one-sided alternative at the 10% level. The p-value is easily computed as $P(T_{25} < -1.26) = .110$. This may be a low enough p-value to conclude that the estimated effect of training is not just due to sampling error, but opinions would legitimately differ on whether a one-sided p-value of .11 is sufficiently small.
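The one-sided p-value and critical values cited in Example 4.7 can be reproduced from the t distribution with 25 df (scipy here is an implementation choice):

```python
from scipy import stats

# Example 4.7: t = -.029/.023 ~ -1.26 with 29 - 4 = 25 df
t_stat, df = -1.26, 25
print(stats.t.cdf(t_stat, df))   # one-sided p-value ~ .110
print(stats.t.ppf(0.05, df))     # 5% one-sided critical value ~ -1.71
print(stats.t.ppf(0.10, df))     # 10% one-sided critical value ~ -1.32
```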
Remember that large standard errors can also be a result of multicollinearity (high correlation among some of the independent variables), even if the sample size seems fairly large. As we discussed in Section 3.4, there is not much we can do about this problem other than to collect more data or change the scope of the analysis by dropping or combining certain independent variables. As in the case of a small sample size, it can be hard to precisely estimate partial effects when some of the explanatory variables are highly correlated. (Section 4.5 contains an example.)

We end this section with some guidelines for discussing the economic and statistical significance of a variable in a multiple regression model:

1. Check for statistical significance. If the variable is statistically significant, discuss the magnitude of the coefficient to get an idea of its practical or economic importance. This latter step can require some care, depending on how the independent and dependent variables appear in the equation. (In particular, what are the units of measurement? Do the variables appear in logarithmic form?)

2. If a variable is not statistically significant at the usual levels (10%, 5%, or 1%), you might still ask if the variable has the expected effect on y and whether that effect is practically large. If it is large, you should compute a p-value for the t statistic. For small sample sizes, you can sometimes make a case for p-values as large as .20 (but there are no hard rules). With large p-values, that is, small t statistics, we are treading on thin ice because the practically large estimates may be due to sampling error: a different random sample could result in a very different estimate.

3. It is common to find variables with small t statistics that have the "wrong" sign. For practical purposes, these can be ignored: we conclude that the variables are statistically insignificant. A significant variable that has the unexpected sign and a practically large effect is much more troubling and difficult to resolve. One must usually think more about the model and the nature of the data to solve such problems. Often, a counterintuitive, significant estimate results from the omission of a key variable or from one of the important problems we will discuss in Chapters 9 and 15.

4.3 Confidence Intervals

Under the CLM assumptions, we can easily construct a confidence interval (CI) for the population parameter βj. Confidence intervals are also called interval estimates because they provide a range of likely values for the population parameter, and not just a point estimate.
Using the fact that (β̂j − βj)/se(β̂j) has a t distribution with n − k − 1 degrees of freedom [see (4.3)], simple manipulation leads to a CI for the unknown βj: a 95% confidence interval, given by

β̂j ± c·se(β̂j),   (4.16)

where the constant c is the 97.5th percentile in a t_{n−k−1} distribution. More precisely, the lower and upper bounds of the confidence interval are given by β̲j ≡ β̂j − c·se(β̂j) and β̄j ≡ β̂j + c·se(β̂j), respectively.

At this point, it is useful to review the meaning of a confidence interval. If random samples were obtained over and over again, with β̲j and β̄j computed each time, then the (unknown) population value βj would lie in the interval (β̲j, β̄j) for 95% of the samples. Unfortunately, for the single sample that we use to construct the CI, we do not know whether βj is actually contained in the interval. We hope we have obtained a sample that is one of the 95% of all samples where the interval estimate contains βj, but we have no guarantee.

Constructing a confidence interval is very simple when using current computing technology. Three quantities are needed: β̂j, se(β̂j), and c. The coefficient estimate and its standard error are reported by any regression package. To obtain the value c, we must know the degrees of freedom, n − k − 1, and the level of confidence—95% in this case. Then, the value for c is obtained from the t_{n−k−1} distribution. As an example, for df = n − k − 1 = 25, a 95% confidence interval for any βj is given by [β̂j − 2.06·se(β̂j), β̂j + 2.06·se(β̂j)].

When n − k − 1 > 120, the t_{n−k−1} distribution is close enough to normal to use the 97.5th percentile in a standard normal distribution for constructing a 95% CI: β̂j ± 1.96·se(β̂j). In fact, when n − k − 1 > 50, the value of c is so close to 2 that we can use a simple rule of thumb for a 95% confidence interval: β̂j plus or minus two of its standard errors. For small degrees of freedom, the exact percentiles should be obtained from the t tables.

It is easy to construct confidence intervals for any other level of confidence. For example, a 90% CI is obtained by choosing c to be the 95th percentile in the t_{n−k−1} distribution. When df = n − k − 1 = 25, c = 1.71, and so the 90% CI is β̂j ± 1.71·se(β̂j), which is necessarily narrower than the 95% CI. For a 99% CI, c is the 99.5th percentile in the t25 distribution. When df = 25, the 99% CI is roughly β̂j ± 2.79·se(β̂j), which is inevitably wider than the 95% CI.

Many modern regression packages save us from doing any calculations by reporting a 95% CI along with each coefficient and its standard error.

Once a confidence interval is constructed, it is easy to carry out two-tailed hypothesis tests. If the null hypothesis is H0: βj = aj, then H0 is rejected against H1: βj ≠ aj at (say) the 5% significance level if, and only if, aj is not in the 95% confidence interval.
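Because c is just a percentile of the t_{n−k−1} distribution, a confidence interval of any level takes two lines of code. A minimal sketch in Python with scipy; the call on the last line uses the sales elasticity from Example 4.8 below:

from scipy import stats

def conf_int(b, se, df, level=.95):
    """CI for a single coefficient: b +/- c*se, with c from the t(df) distribution."""
    c = stats.t.ppf(1 - (1 - level) / 2, df)
    return b - c * se, b + c * se

# For df = 25, c is about 2.06 (95%), 1.71 (90%), and 2.79 (99%).
print(conf_int(1.084, .060, 29))   # about (.961, 1.21), as in Example 4.8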
Example 4.8 Model of R&D Expenditures

Economists studying industrial organization are interested in the relationship between firm size—often measured by annual sales—and spending on research and development (R&D). Typically, a constant elasticity model is used. One might also be interested in the ceteris paribus effect of the profit margin—that is, profits as a percentage of sales—on R&D spending. Using the data in RDCHEM on 32 U.S. firms in the chemical industry, we estimate the following equation (with standard errors in parentheses below the coefficients):

log(rd) = −4.38 + 1.084 log(sales) + .0217 profmarg
               (.47)     (.060)                  (.0128)
n = 32, R² = .918.

The estimated elasticity of R&D spending with respect to firm sales is 1.084, so that, holding profit margin fixed, a 1% increase in sales is associated with a 1.084% increase in R&D spending. (Incidentally, R&D and sales are both measured in millions of dollars, but their units of measurement have no effect on the elasticity estimate.) We can construct a 95% confidence interval for the sales elasticity once we note that the estimated model has n − k − 1 = 32 − 2 − 1 = 29 degrees of freedom. From Table G.2, we find the 97.5th percentile in a t29 distribution: c = 2.045. Thus, the 95% confidence interval for β_log(sales) is 1.084 ± .060(2.045), or about (.961, 1.21). That zero is well outside this interval is hardly surprising: we expect R&D spending to increase with firm size. More interesting is that unity is included in the 95% confidence interval for β_log(sales), which means that we cannot reject H0: β_log(sales) = 1 against H1: β_log(sales) ≠ 1 at the 5% significance level. In other words, the estimated R&D–sales elasticity is not statistically different from 1 at the 5% level. (The estimate is not practically different from 1, either.)

The estimated coefficient on profmarg is also positive, and the 95% confidence interval for the population parameter, β_profmarg, is .0217 ± .0128(2.045), or about (−.0045, .0479). In this case, zero is included in the 95% confidence interval, so we fail to reject H0: β_profmarg = 0 against H1: β_profmarg ≠ 0 at the 5% level. Nevertheless, the t statistic is about 1.70, which gives a two-sided p-value of about .10, and so we would conclude that profmarg is statistically significant at the 10% level against the two-sided alternative, or at the 5% level against the one-sided alternative H1: β_profmarg > 0. Plus, the economic size of the profit margin coefficient is not trivial: holding sales fixed, a one percentage point increase in profmarg is estimated to increase R&D spending by 100(.0217) ≈ 2.2%. A complete analysis of this example goes beyond simply stating whether a particular value, zero in this case, is or is not in the 95% confidence interval.

You should remember that a confidence interval is only as good as the underlying assumptions used to construct it. If we have omitted important factors that are correlated with the explanatory variables, then the coefficient estimates are not reliable: OLS is biased. If heteroskedasticity is present—for instance, in the previous example, if the variance of log(rd) depends on any of the explanatory variables—then the standard error is not valid as an estimate of sd(β̂j) (as we discussed in Section 3.4), and the confidence interval computed using these standard errors will not truly be a 95% CI. We have also used the normality assumption on the errors in obtaining these CIs, but, as we will see in Chapter 5, this is not as important for applications involving hundreds of observations.

4.4 Testing Hypotheses about a Single Linear Combination of the Parameters

The previous two sections have shown how to use classical hypothesis testing or confidence intervals to test hypotheses about a single βj at a time. In applications, we must often test hypotheses involving more than one of the population parameters.
In this section, we show how to test a single hypothesis involving more than one of the βj. Section 4.5 shows how to test multiple hypotheses.

To illustrate the general approach, we will consider a simple model to compare the returns to education at junior colleges and four-year colleges; for simplicity, we refer to the latter as "universities." [Kane and Rouse (1995) provide a detailed analysis of the returns to two- and four-year colleges.] The population includes working people with a high school degree, and the model is

log(wage) = β0 + β1 jc + β2 univ + β3 exper + u,   (4.17)

where
jc = number of years attending a two-year college,
univ = number of years at a four-year college,
exper = months in the workforce.

Note that any combination of junior college and four-year college is allowed, including jc = 0 and univ = 0. The hypothesis of interest is whether one year at a junior college is worth one year at a university; this is stated as

H0: β1 = β2.   (4.18)

Under H0, another year at a junior college and another year at a university lead to the same ceteris paribus percentage increase in wage. For the most part, the alternative of interest is one-sided: a year at a junior college is worth less than a year at a university. This is stated as

H1: β1 < β2.   (4.19)

The hypotheses in (4.18) and (4.19) concern two parameters, β1 and β2, a situation we have not faced yet. We cannot simply use the individual t statistics for β̂1 and β̂2 to test H0. However, conceptually, there is no difficulty in constructing a t statistic for testing (4.18). To do so, we rewrite the null and alternative as H0: β1 − β2 = 0 and H1: β1 − β2 < 0, respectively. The t statistic is based on whether the estimated difference β̂1 − β̂2 is sufficiently less than zero to warrant rejecting (4.18) in favor of (4.19). To account for the sampling error in our estimators, we standardize this difference by dividing by the standard error:

t = (β̂1 − β̂2)/se(β̂1 − β̂2).   (4.20)

Once we have the t statistic in (4.20), testing proceeds as before. We choose a significance level for the test and, based on the df, obtain a critical value. Because the alternative is of the form in (4.19), the rejection rule is of the form t < −c, where c is a positive value chosen from the appropriate t distribution. Or we compute the t statistic and then compute the p-value (see Section 4.2).

The only thing that makes testing the equality of two different parameters more difficult than testing about a single βj is obtaining the standard error in the denominator of (4.20). Obtaining the numerator is trivial once we have performed the OLS regression. Using the data in TWOYEAR, which comes from Kane and Rouse (1995), we estimate equation (4.17):

log(wage) = 1.472 + .0667 jc + .0769 univ + .0049 exper
                  (.021)   (.0068)      (.0023)        (.0002)
n = 6,763, R² = .222.   (4.21)

It is clear from (4.21) that jc and univ have both economically and statistically significant effects on wage. This is certainly of interest, but we are more concerned about testing whether the estimated difference in the coefficients is statistically significant. The difference is estimated as β̂1 − β̂2 = −.0102, so the return to a year at a junior college is about one percentage point less than a year at a university. Economically, this is not a trivial difference.
The difference of −.0102 is the numerator of the t statistic in (4.20). Unfortunately, the regression results in equation (4.21) do not contain enough information to obtain the standard error of β̂1 − β̂2. It might be tempting to claim that se(β̂1 − β̂2) = se(β̂1) − se(β̂2), but this is not true. In fact, if we reversed the roles of β̂1 and β̂2, we would wind up with a negative standard error of the difference using the difference in standard errors. Standard errors must always be positive because they are estimates of standard deviations. Although the standard error of the difference β̂1 − β̂2 certainly depends on se(β̂1) and se(β̂2), it does so in a somewhat complicated way. To find se(β̂1 − β̂2), we first obtain the variance of the difference. Using the results on variances in Appendix B, we have

Var(β̂1 − β̂2) = Var(β̂1) + Var(β̂2) − 2 Cov(β̂1, β̂2).   (4.22)

Observe carefully how the two variances are added together, and twice the covariance is then subtracted. The standard deviation of β̂1 − β̂2 is just the square root of (4.22) and, since [se(β̂1)]² is an unbiased estimator of Var(β̂1), and similarly for [se(β̂2)]², we have

se(β̂1 − β̂2) = {[se(β̂1)]² + [se(β̂2)]² − 2 s12}^{1/2},   (4.23)

where s12 denotes an estimate of Cov(β̂1, β̂2). We have not displayed a formula for Cov(β̂1, β̂2). Some regression packages have features that allow one to obtain s12, in which case one can compute the standard error in (4.23) and then the t statistic in (4.20). Appendix E shows how to use matrix algebra to obtain s12.
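If your software reports the estimated covariance matrix of the OLS estimators, (4.22) and (4.23) can be applied directly. Here is a sketch using Python with statsmodels; the file name twoyear.csv and the column names (lwage, jc, univ, exper) are assumptions for illustration, not from the text:

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("twoyear.csv")   # hypothetical file holding the TWOYEAR data
res = smf.ols("lwage ~ jc + univ + exper", data=df).fit()

V = res.cov_params()              # estimated variances and covariances of the estimators
var_diff = (V.loc["jc", "jc"] + V.loc["univ", "univ"]
            - 2 * V.loc["jc", "univ"])          # equation (4.22)
se_diff = np.sqrt(var_diff)                     # equation (4.23); s12 is V.loc["jc", "univ"]
t_stat = (res.params["jc"] - res.params["univ"]) / se_diff   # the t statistic (4.20)
print(se_diff, t_stat)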
Some of the more sophisticated econometrics programs include special commands that can be used for testing hypotheses about linear combinations. Here, we cover an approach that is simple to compute in virtually any statistical package. Rather than trying to compute se(β̂1 − β̂2) from (4.23), it is much easier to estimate a different model that directly delivers the standard error of interest. Define a new parameter as the difference between β1 and β2: θ1 = β1 − β2. Then, we want to test

H0: θ1 = 0 against H1: θ1 < 0.   (4.24)

The t statistic in (4.20), in terms of θ̂1, is just t = θ̂1/se(θ̂1). The challenge is finding se(θ̂1). We can do this by rewriting the model so that θ1 appears directly as the coefficient on one of the independent variables. Because θ1 = β1 − β2, we can also write β1 = θ1 + β2. Plugging this into (4.17) and rearranging gives the equation

log(wage) = β0 + (θ1 + β2) jc + β2 univ + β3 exper + u
                 = β0 + θ1 jc + β2 (jc + univ) + β3 exper + u.   (4.25)

The key insight is that the parameter we are interested in testing hypotheses about, θ1, now multiplies the variable jc. The intercept is still β0, and exper still shows up as being multiplied by β3. More importantly, there is a new variable multiplying β2, namely jc + univ. Thus, if we want to directly estimate θ1 and obtain the standard error of θ̂1, then we must construct the new variable jc + univ and include it in the regression model in place of univ. In this example, the new variable has a natural interpretation: it is total years of college, so define totcoll = jc + univ and write (4.25) as

log(wage) = β0 + θ1 jc + β2 totcoll + β3 exper + u.   (4.26)

The parameter β1 has disappeared from the model, while θ1 appears explicitly. This model is really just a different way of writing the original model. The only reason we have defined this new model is that, when we estimate it, the coefficient on jc is θ̂1 and, more importantly, se(θ̂1) is reported along with the estimate. The t statistic that we want is the one reported by any regression package on the variable jc (not the variable totcoll).

When we do this with the 6,763 observations used earlier, the result is

log(wage) = 1.472 − .0102 jc + .0769 totcoll + .0049 exper
                  (.021)   (.0069)     (.0023)             (.0002)
n = 6,763, R² = .222.   (4.27)

The only number in this equation that we could not get from (4.21) is the standard error for the estimate −.0102, which is .0069. The t statistic for testing (4.18) is −.0102/.0069 = −1.48. Against the one-sided alternative (4.19), the p-value is about .070, so there is some, but not strong, evidence against (4.18).

The intercept and slope estimate on exper, along with their standard errors, are the same as in (4.21). This fact must be true, and it provides one way of checking whether the transformed equation has been properly estimated. The coefficient on the new variable, totcoll, is the same as the coefficient on univ in (4.21), and the standard error is also the same. We know that this must happen by comparing (4.17) and (4.25).

It is quite simple to compute a 95% confidence interval for θ1 = β1 − β2. Using the standard normal approximation, the CI is obtained as usual: θ̂1 ± 1.96 se(θ̂1), which in this case leads to −.0102 ± .0135.

The strategy of rewriting the model so that it contains the parameter of interest works in all cases and is easy to implement. (See Computer Exercises C1 and C3 for other examples.)
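A sketch of the reparameterization in (4.26), under the same data assumptions as the previous sketch: generate totcoll and read θ̂1 and se(θ̂1) directly off the coefficient on jc:

import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("twoyear.csv")   # same hypothetical file as above
df["totcoll"] = df["jc"] + df["univ"]
res2 = smf.ols("lwage ~ jc + totcoll + exper", data=df).fit()

theta1 = res2.params["jc"]    # about -.0102
se_theta1 = res2.bse["jc"]    # about .0069
t_stat = theta1 / se_theta1   # about -1.48
print(theta1, se_theta1, t_stat)

No covariance matrix is needed; the regression output itself delivers the standard error of the linear combination.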
4.5 Testing Multiple Linear Restrictions: The F Test

The t statistic associated with any OLS coefficient can be used to test whether the corresponding unknown parameter in the population is equal to any given constant (which is usually, but not always, zero). We have just shown how to test hypotheses about a single linear combination of the βj by rearranging the equation and running a regression using transformed variables. But so far, we have only covered hypotheses involving a single restriction. Frequently, we wish to test multiple hypotheses about the underlying parameters β0, β1, …, βk. We begin with the leading case of testing whether a set of independent variables has no partial effect on a dependent variable.

4.5a Testing Exclusion Restrictions

We already know how to test whether a particular variable has no partial effect on the dependent variable: use the t statistic. Now, we want to test whether a group of variables has no effect on the dependent variable. More precisely, the null hypothesis is that a set of variables has no effect on y, once another set of variables has been controlled.

As an illustration of why testing significance of a group of variables is useful, we consider the following model that explains major league baseball players' salaries:

log(salary) = β0 + β1 years + β2 gamesyr + β3 bavg + β4 hrunsyr + β5 rbisyr + u,   (4.28)

where salary is the 1993 total salary, years is years in the league, gamesyr is average games played per year, bavg is career batting average (for example, bavg = 250), hrunsyr is home runs per year, and rbisyr is runs batted in per year. Suppose we want to test the null hypothesis that, once years in the league and games per year have been controlled for, the statistics measuring performance—bavg, hrunsyr, and rbisyr—have no effect on salary. Essentially, the null hypothesis states that productivity as measured by baseball statistics has no effect on salary.

In terms of the parameters of the model, the null hypothesis is stated as

H0: β3 = 0, β4 = 0, β5 = 0.   (4.29)

The null (4.29) constitutes three exclusion restrictions: if (4.29) is true, then bavg, hrunsyr, and rbisyr have no effect on log(salary) after years and gamesyr have been controlled for and therefore should be excluded from the model. This is an example of a set of multiple restrictions because we are putting more than one restriction on the parameters in (4.28); we will see more general examples of multiple restrictions later. A test of multiple restrictions is called a multiple hypotheses test or a joint hypotheses test.

What should be the alternative to (4.29)? If what we have in mind is that "performance statistics matter, even after controlling for years in the league and games per year," then the appropriate alternative is simply

H1: H0 is not true.   (4.30)

The alternative (4.30) holds if at least one of β3, β4, or β5 is different from zero. (Any or all could be different from zero.) The test we study here is constructed to detect any violation of H0. It is also valid when the alternative is something like H1: β3 > 0, or β4 > 0, or β5 > 0, but it will not be the best possible test under such alternatives. We do not have the space or statistical background necessary to cover tests that have more power under multiple one-sided alternatives.

How should we proceed in testing (4.29) against (4.30)? It is tempting to test (4.29) by using the t statistics on the variables bavg, hrunsyr, and rbisyr to determine whether each variable is individually significant. This option is not appropriate. A particular t statistic tests a hypothesis that puts no restrictions on the other parameters. Besides, we would have three outcomes to contend with—one for each t statistic. What would constitute rejection of (4.29) at, say, the 5% level? Should all three, or only one of the three, t statistics be required to be significant at the 5% level? These are hard questions, and fortunately we do not have to answer them. Furthermore, using separate t statistics to test a multiple hypothesis like (4.29) can be very misleading. We need a way to test the exclusion restrictions jointly.

To illustrate these issues, we estimate equation (4.28) using the data in MLB1. This gives

log(salary) = 11.19 + .0689 years + .0126 gamesyr
                    (.29)     (.0121)          (.0026)
                    + .00098 bavg + .0144 hrunsyr + .0108 rbisyr   (4.31)
                    (.00110)           (.0161)              (.0072)
n = 353, SSR = 183.186, R² = .6278,

where SSR is the sum of squared residuals. (We will use this later.) We have left several terms after the decimal in SSR and R-squared to facilitate future comparisons.
Equation (4.31) reveals that, whereas years and gamesyr are statistically significant, none of the variables bavg, hrunsyr, and rbisyr has a statistically significant t statistic against a two-sided alternative, at the 5% significance level. (The t statistic on rbisyr is the closest to being significant; its two-sided p-value is .134.) Thus, based on the three t statistics, it appears that we cannot reject H0.

This conclusion turns out to be wrong. To see this, we must derive a test of multiple restrictions whose distribution is known and tabulated. The sum of squared residuals now turns out to provide a very convenient basis for testing multiple hypotheses. We will also show how the R-squared can be used in the special case of testing for exclusion restrictions.

Knowing the sum of squared residuals in (4.31) tells us nothing about the truth of the hypothesis in (4.29). However, the factor that will tell us something is how much the SSR increases when we drop the variables bavg, hrunsyr, and rbisyr from the model. Remember that, because the OLS estimates are chosen to minimize the sum of squared residuals, the SSR always increases when variables are dropped from the model; this is an algebraic fact. The question is whether this increase is large enough, relative to the SSR in the model with all of the variables, to warrant rejecting the null hypothesis.

The model without the three variables in question is simply

log(salary) = β0 + β1 years + β2 gamesyr + u.   (4.32)

In the context of hypothesis testing, equation (4.32) is the restricted model for testing (4.29); model (4.28) is called the unrestricted model. The restricted model always has fewer parameters than the unrestricted model.

When we estimate the restricted model using the data in MLB1, we obtain

log(salary) = 11.22 + .0713 years + .0202 gamesyr
                    (.11)     (.0125)          (.0013)
n = 353, SSR = 198.311, R² = .5971.   (4.33)

As we surmised, the SSR from (4.33) is greater than the SSR from (4.31), and the R-squared from the restricted model is less than the R-squared from the unrestricted model. What we need to decide is whether the increase in the SSR in going from the unrestricted model to the restricted model (183.186 to 198.311) is large enough to warrant rejection of (4.29). As with all testing, the answer depends on the significance level of the test. But we cannot carry out the test at a chosen significance level until we have a statistic whose distribution is known, and can be tabulated, under H0. Thus, we need a way to combine the information in the two SSRs to obtain a test statistic with a known distribution under H0.
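Both models are easy to estimate with any regression package; in Python with statsmodels, the sum of squared residuals is available as the ssr attribute of the fitted results. A sketch, assuming the MLB1 data sit in a hypothetical file mlb1.csv with a log-salary column lsalary (the names are assumptions):

import pandas as pd
import statsmodels.formula.api as smf

mlb = pd.read_csv("mlb1.csv")   # hypothetical file holding the MLB1 data

ur = smf.ols("lsalary ~ years + gamesyr + bavg + hrunsyr + rbisyr", data=mlb).fit()
r = smf.ols("lsalary ~ years + gamesyr", data=mlb).fit()

print(ur.ssr, r.ssr)   # about 183.186 and 198.311, as in (4.31) and (4.33)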
Because it is no more difficult, we might as well derive the test for the general case. Write the unrestricted model with k independent variables as

y = β0 + β1x1 + … + βk xk + u;   (4.34)

the number of parameters in the unrestricted model is k + 1. (Remember to add one for the intercept.) Suppose that we have q exclusion restrictions to test: that is, the null hypothesis states that q of the variables in (4.34) have zero coefficients. For notational simplicity, assume that it is the last q variables in the list of independent variables: x_{k−q+1}, …, x_k. (The order of the variables, of course, is arbitrary and unimportant.) The null hypothesis is stated as

H0: β_{k−q+1} = 0, …, β_k = 0,   (4.35)

which puts q exclusion restrictions on the model (4.34). The alternative to (4.35) is simply that it is false; this means that at least one of the parameters listed in (4.35) is different from zero. When we impose the restrictions under H0, we are left with the restricted model:

y = β0 + β1x1 + … + β_{k−q} x_{k−q} + u.   (4.36)

In this subsection, we assume that both the unrestricted and restricted models contain an intercept, since that is the case most widely encountered in practice.

Now, for the test statistic itself. Earlier, we suggested that looking at the relative increase in the SSR when moving from the unrestricted to the restricted model should be informative for testing the hypothesis (4.35). The F statistic (or F ratio) is defined by

F ≡ [(SSR_r − SSR_ur)/q] / [SSR_ur/(n − k − 1)],   (4.37)

where SSR_r is the sum of squared residuals from the restricted model and SSR_ur is the sum of squared residuals from the unrestricted model.

You should immediately notice that, since SSR_r can be no smaller than SSR_ur, the F statistic is always nonnegative (and almost always strictly positive). Thus, if you compute a negative F statistic, then something is wrong; the order of the SSRs in the numerator of F has usually been reversed. Also, the SSR in the denominator of F is the SSR from the unrestricted model. The easiest way to remember where the SSRs appear is to think of F as measuring the relative increase in SSR when moving from the unrestricted to the restricted model.

The difference in SSRs in the numerator of F is divided by q, which is the number of restrictions imposed in moving from the unrestricted to the restricted model (q independent variables are dropped). Therefore, we can write

q = numerator degrees of freedom = df_r − df_ur,   (4.38)

which also shows that q is the difference in degrees of freedom between the restricted and unrestricted models. (Recall that df = number of observations − number of estimated parameters.)

Exploring Further 4.4
Consider relating individual performance on a standardized test, score, to a variety of other variables. School factors include average class size, per-student expenditures, average teacher compensation, and total school enrollment. Other variables specific to the student are family income, mother's education, father's education, and number of siblings. The model is
score = β0 + β1 classize + β2 expend + β3 tchcomp + β4 enroll + β5 faminc + β6 motheduc + β7 fatheduc + β8 siblings + u.
State the null hypothesis that student-specific variables have no effect on standardized test performance once school-related factors have been controlled for. What are k and q for this example? Write down the restricted version of the model.

Since the restricted model has fewer parameters—and each model is estimated using the same n observations—df_r is always greater than df_ur. The SSR in the denominator of F is divided by the degrees of freedom in the unrestricted model:
n − k − 1 = denominator degrees of freedom = df_ur.   (4.39)

In fact, the denominator of F is just the unbiased estimator of σ² = Var(u) in the unrestricted model.

In a particular application, computing the F statistic is easier than wading through the somewhat cumbersome notation used to describe the general case. We first obtain the degrees of freedom in the unrestricted model, df_ur. Then, we count how many variables are excluded in the restricted model; this is q. The SSRs are reported with every OLS regression, and so forming the F statistic is simple.

In the major league baseball salary regression, n = 353, and the full model (4.28) contains six parameters. Thus, n − k − 1 = df_ur = 353 − 6 = 347. The restricted model (4.32) contains three fewer independent variables than (4.28), and so q = 3. Thus, we have all of the ingredients to compute the F statistic; we hold off doing so until we know what to do with it.

To use the F statistic, we must know its sampling distribution under the null in order to choose critical values and rejection rules. It can be shown that, under H0 (and assuming the CLM assumptions hold), F is distributed as an F random variable with (q, n − k − 1) degrees of freedom. We write this as F ~ F_{q, n−k−1}. The distribution of F_{q, n−k−1} is readily tabulated and available in statistical tables (see Table G.3) and, even more importantly, in statistical software.

We will not derive the F distribution because the mathematics is very involved. Basically, it can be shown that equation (4.37) is actually the ratio of two independent chi-square random variables, divided by their respective degrees of freedom. The numerator chi-square random variable has q degrees of freedom, and the chi-square in the denominator has n − k − 1 degrees of freedom. This is the definition of an F distributed random variable (see Appendix B).

It is pretty clear from the definition of F that we will reject H0 in favor of H1 when F is sufficiently "large." How large depends on our chosen significance level. Suppose that we have decided on a 5% level test. Let c be the 95th percentile in the F_{q, n−k−1} distribution. This critical value depends on q (the numerator df) and n − k − 1 (the denominator df). It is important to keep the numerator and denominator degrees of freedom straight. The 10%, 5%, and 1% critical values for the F distribution are given in Table G.3. The rejection rule is simple. Once c has been obtained, we reject H0 in favor of H1 at the chosen significance level if

F > c.   (4.40)

With a 5% significance level, q = 3, and n − k − 1 = 60, the critical value is c = 2.76. We would reject H0 at the 5% level if the computed value of the F statistic exceeds 2.76. The 5% critical value and rejection region are shown in Figure 4.7. For the same degrees of freedom, the 1% critical value is 4.13.

[Figure 4.7: The 5% critical value and rejection region in an F(3, 60) distribution. The area under the density to the right of 2.76—the rejection region—is .05; the area to the left is .95.]

In most applications, the numerator degrees of freedom (q) will be notably smaller than the denominator degrees of freedom (n − k − 1). Applications where n − k − 1 is small are unlikely to be successful because the parameters in the unrestricted model will probably not be precisely estimated. When the denominator df reaches about 120, the F distribution is no longer sensitive to it. (This is entirely analogous to the t distribution being well approximated by the standard normal distribution as the df gets large.) Thus, there is an entry in the table for the denominator df = ∞, and this is what we use with large samples (because n − k − 1 is then large). A similar statement holds for a very large numerator df, but this rarely occurs in applications.
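Critical values can be pulled from statistical software instead of Table G.3. For example, the two critical values just quoted are percentiles of the F(3, 60) distribution; in Python with scipy:

from scipy import stats

print(stats.f.ppf(.95, 3, 60))   # 95th percentile: about 2.76
print(stats.f.ppf(.99, 3, 60))   # 99th percentile: about 4.13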
If H0 is rejected, then we say that x_{k−q+1}, …, x_k are jointly statistically significant (or just jointly significant) at the appropriate significance level. This test alone does not allow us to say which of the variables has a partial effect on y; they may all affect y, or maybe only one affects y. If the null is not rejected, then the variables are jointly insignificant, which often justifies dropping them from the model.

For the major league baseball example with three numerator degrees of freedom and 347 denominator degrees of freedom, the 5% critical value is 2.60, and the 1% critical value is 3.78. We reject H0 at the 1% level if F is above 3.78; we reject at the 5% level if F is above 2.60.

We are now in a position to test the hypothesis that we began this section with: after controlling for years and gamesyr, the variables bavg, hrunsyr, and rbisyr have no effect on players' salaries. In practice, it is easiest to first compute (SSR_r − SSR_ur)/SSR_ur and to multiply the result by (n − k − 1)/q; the reason the formula is stated as in (4.37) is that it makes it easier to keep the numerator and denominator degrees of freedom straight. Using the SSRs in (4.31) and (4.33), we have

F = [(198.311 − 183.186)/183.186](347/3) ≈ 9.55.

This number is well above the 1% critical value in the F distribution with 3 and 347 degrees of freedom, and so we soundly reject the hypothesis that bavg, hrunsyr, and rbisyr have no effect on salary.

The outcome of the joint test may seem surprising in light of the insignificant t statistics for the three variables. What is happening is that the two variables hrunsyr and rbisyr are highly correlated, and this multicollinearity makes it difficult to uncover the partial effect of each variable; this is reflected in the individual t statistics. The F statistic tests whether these variables (including bavg) are jointly significant, and multicollinearity between hrunsyr and rbisyr is much less relevant for testing this hypothesis. In Problem 16, you are asked to reestimate the model while dropping rbisyr, in which case hrunsyr becomes very significant. The same is true for rbisyr when hrunsyr is dropped from the model.
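The F statistic and its p-value for the baseball example can be computed directly from the two SSRs; a minimal sketch with scipy:

from scipy import stats

ssr_r, ssr_ur = 198.311, 183.186
q, df_ur = 3, 347

F = ((ssr_r - ssr_ur) / ssr_ur) * (df_ur / q)   # about 9.55
p_value = stats.f.sf(F, q, df_ur)               # far below .01
print(F, p_value)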
The F statistic is often useful for testing exclusion of a group of variables when the variables in the group are highly correlated. For example, suppose we want to test whether firm performance affects the salaries of chief executive officers. There are many ways to measure firm performance, and it probably would not be clear ahead of time which measures would be most important. Since measures of firm performance are likely to be highly correlated, hoping to find individually significant measures might be asking too much due to multicollinearity. But an F test can be used to determine whether, as a group, the firm performance variables affect salary.

4.5b Relationship between F and t Statistics

We have seen in this section how the F statistic can be used to test whether a group of variables should be included in a model. What happens if we apply the F statistic to the case of testing significance of a single independent variable? This case is certainly not ruled out by the previous development. For example, we can take the null to be H0: βk = 0 and q = 1 (to test the single exclusion restriction that xk can be excluded from the model). From Section 4.2, we know that the t statistic on β̂k can be used to test this hypothesis. The question, then, is: do we have two separate ways of testing hypotheses about a single coefficient? The answer is no. It can be shown that the F statistic for testing exclusion of a single variable is equal to the square of the corresponding t statistic. Since t²_{n−k−1} has an F_{1, n−k−1} distribution, the two approaches lead to exactly the same outcome, provided that the alternative is two-sided. The t statistic is more flexible for testing a single hypothesis because it can be directly used to test against one-sided alternatives. Since t statistics are also easier to obtain than F statistics, there is really no reason to use an F statistic to test hypotheses about a single parameter.

We have already seen in the salary regressions for major league baseball players that two (or more) variables that each have insignificant t statistics can be jointly very significant. It is also possible that, in a group of several explanatory variables, one variable has a significant t statistic but the group of variables is jointly insignificant at the usual significance levels. What should we make of this kind of outcome? For concreteness, suppose that in a model with many explanatory variables we cannot reject the null hypothesis that β1, β2, β3, β4, and β5 are all equal to zero at the 5% level, yet the t statistic for β̂1 is significant at the 5% level. Logically, we cannot have β1 ≠ 0 but also have β1, β2, β3, β4, and β5 all equal to zero! But as a matter of testing, it is possible that we can group a bunch of insignificant variables with a significant variable and conclude that the entire set of variables is jointly insignificant. (Such possible conflicts between a t test and a joint F test give another example of why we should not "accept" null hypotheses; we should only fail to reject them.) The F statistic is intended to detect whether a set of coefficients is different from zero, but it is never the best test for determining whether a single coefficient is different from zero. The t test is best suited for testing a single hypothesis. (In statistical terms, an F statistic for joint restrictions including β1 = 0 will have less power for detecting β1 ≠ 0 than the usual t statistic. See Section C.6 in Appendix C for a discussion of the power of a test.)
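The algebraic link between the two tests is easy to verify numerically: the square of a t_{df} percentile matches the corresponding F(1, df) percentile. For example, with df = 60:

from scipy import stats

df = 60
print(stats.t.ppf(.975, df) ** 2)   # about 4.00
print(stats.f.ppf(.95, 1, df))      # the same number: t(df) squared is F(1, df)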
Unfortunately, the fact that we can sometimes hide a statistically significant variable along with some insignificant variables could lead to abuse if regression results are not carefully reported. For example, suppose that, in a study of the determinants of loan-acceptance rates at the city level, x1 is the fraction of black households in the city. Suppose that the variables x2, x3, x4, and x5 are the fractions of households headed by different age groups. In explaining loan rates, we would include measures of income, wealth, credit ratings, and so on. Suppose that age of household head has no effect on loan approval rates, once other variables are controlled for. Even if race has a marginally significant effect, it is possible that the race and age variables could be jointly insignificant. Someone wanting to conclude that race is not a factor could simply report something like "Race and age variables were added to the equation, but they were jointly insignificant at the 5% level." Hopefully, peer review prevents these kinds of misleading conclusions, but you should be aware that such outcomes are possible.

Often, when a variable is very statistically significant and it is tested jointly with another set of variables, the set will be jointly significant. In such cases, there is no logical inconsistency in rejecting both null hypotheses.

4.5c The R-Squared Form of the F Statistic

For testing exclusion restrictions, it is often more convenient to have a form of the F statistic that can be computed using the R-squareds from the restricted and unrestricted models. One reason for this is that the R-squared is always between zero and one, whereas the SSRs can be very large depending on the unit of measurement of y, making the calculation based on the SSRs tedious. Using the fact that SSR_r = SST(1 − R²_r) and SSR_ur = SST(1 − R²_ur), we can substitute into (4.37) to obtain

F = [(R²_ur − R²_r)/q] / [(1 − R²_ur)/(n − k − 1)] = [(R²_ur − R²_r)/q] / [(1 − R²_ur)/df_ur]   (4.41)

(note that the SST terms cancel everywhere). This is called the R-squared form of the F statistic. [At this point, you should be cautioned that, although equation (4.41) is very convenient for testing exclusion restrictions, it cannot be applied for testing all linear restrictions. As we will see when we discuss testing general linear restrictions, the sum of squared residuals form of the F statistic is sometimes needed.]

Because the R-squared is reported with almost all regressions (whereas the SSR is not), it is easy to use the R-squareds from the unrestricted and restricted models to test for exclusion of some variables. Particular attention should be paid to the order of the R-squareds in the numerator: the unrestricted R-squared comes first [contrast this with the SSRs in (4.37)]. Because R²_ur > R²_r, this shows again that F will always be positive.

In using the R-squared form of the test for excluding a set of variables, it is important to not square the R-squared before plugging it into formula (4.41); the squaring has already been done. All regressions report R², and these numbers are plugged directly into (4.41). For the baseball salary example, we can use (4.41) to obtain the F statistic:

F = [(.6278 − .5971)/3] / [(1 − .6278)/347] ≈ 9.54,

which is very close to what we obtained before. (The difference is due to rounding error.)
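The R-squared form is one line of arithmetic; a sketch reproducing the baseball calculation:

r2_ur, r2_r = .6278, .5971
q, df_ur = 3, 347

F = ((r2_ur - r2_r) / q) / ((1 - r2_ur) / df_ur)
print(F)   # about 9.54, matching the SSR version up to rounding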
Example 4.9 Parents' Education in a Birth Weight Equation

As another example of computing an F statistic, consider the following model to explain child birth weight in terms of various factors:

bwght = β0 + β1 cigs + β2 parity + β3 faminc + β4 motheduc + β5 fatheduc + u,   (4.42)

where
bwght = birth weight, in pounds,
cigs = average number of cigarettes the mother smoked per day during pregnancy,
parity = the birth order of this child,
faminc = annual family income,
motheduc = years of schooling for the mother,
fatheduc = years of schooling for the father.

Let us test the null hypothesis that, after controlling for cigs, parity, and faminc, parents' education has no effect on birth weight. This is stated as

H0: β4 = 0, β5 = 0,

and so there are q = 2 exclusion restrictions to be tested. There are k + 1 = 6 parameters in the unrestricted model (4.42), so the df in the unrestricted model is n − 6, where n is the sample size.

We will test this hypothesis using the data in BWGHT. This data set contains information on 1,388 births, but we must be careful in counting the observations used in testing the null hypothesis. It turns out that information on at least one of the variables motheduc and fatheduc is missing for 197 births in the sample; these observations cannot be included when estimating the unrestricted model. Thus, we really have n = 1,191 observations, and so there are 1,191 − 6 = 1,185 df in the unrestricted model. We must be sure to use these same 1,191 observations when estimating the restricted model (not the full 1,388 observations that are available). Generally, when estimating the restricted model to compute an F test, we must use the same observations used to estimate the unrestricted model; otherwise, the test is not valid. When there are no missing data, this will not be an issue.

The numerator df is 2, and the denominator df is 1,185; from Table G.3, the 5% critical value is c = 3.0. Rather than report the complete results, for brevity we present only the R-squareds. The R-squared for the full model turns out to be R²_ur = .0387. When motheduc and fatheduc are dropped from the regression, the R-squared falls to R²_r = .0364. Thus, the F statistic is F = [(.0387 − .0364)/(1 − .0387)](1,185/2) = 1.42; since this is well below the 5% critical value, we fail to reject H0. In other words, motheduc and fatheduc are jointly insignificant in the birth weight equation.

Most statistical packages these days have built-in commands for testing multiple hypotheses after OLS estimation, and so one need not worry about making the mistake of running the two regressions on different data sets. Typically, the commands are applied after estimation of the unrestricted model, which means the smaller subset of data is used whenever there are missing values on some variables. [Formulas for computing the F statistic using matrix algebra—see Appendix E—do not require estimation of the restricted model.]
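The numbers in Example 4.9 are straightforward to reproduce with scipy; the p-value computed here (about .24) matches the .238 reported in the next subsection, up to rounding in the R-squareds:

from scipy import stats

r2_ur, r2_r = .0387, .0364
q, df_ur = 2, 1185

F = ((r2_ur - r2_r) / q) / ((1 - r2_ur) / df_ur)   # about 1.42
c = stats.f.ppf(.95, q, df_ur)                     # 5% critical value, about 3.0
p_value = stats.f.sf(F, q, df_ur)                  # about .24
print(F, c, p_value)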
4.5d Computing p-Values for F Tests

For reporting the outcomes of F tests, p-values are especially useful. Since the F distribution depends on the numerator and denominator df, it is difficult to get a feel for how strong or weak the evidence is against the null hypothesis simply by looking at the value of the F statistic and one or two critical values.

In the F testing context, the p-value is defined as

p-value = P(ℱ > F),   (4.43)

where, for emphasis, we let ℱ denote an F random variable with (q, n − k − 1) degrees of freedom, and F is the actual value of the test statistic. The p-value still has the same interpretation as it did for t statistics: it is the probability of observing a value of F at least as large as we did, given that the null hypothesis is true. A small p-value is evidence against H0. For example, p-value = .016 means that the chance of observing a value of F as large as we did when the null hypothesis was true is only 1.6%; we usually reject H0 in such cases. If the p-value = .314, then the chance of observing a value of the F statistic as large as we did under the null hypothesis is 31.4%. Most would find this to be pretty weak evidence against H0.

Exploring Further 4.5
The data in ATTEND were used to estimate the two equations

atndrte = 47.13 + 13.37 priGPA
               (2.87)    (1.09)
n = 680, R² = .183,

and

atndrte = 75.70 + 17.26 priGPA − 1.72 ACT
               (3.88)    (1.08)              ( ? )
n = 680, R² = .291,

where, as always, standard errors are in parentheses; the standard error for ACT is missing in the second equation. What is the t statistic for the coefficient on ACT? (Hint: First compute the F statistic for significance of ACT.)

As with t testing, once the p-value has been computed, the F test can be carried out at any significance level. For example, if the p-value = .024, we reject H0 at the 5% significance level but not at the 1% level. The p-value for the F test in Example 4.9 is .238, and so the null hypothesis that β_motheduc and β_fatheduc are both zero is not rejected at even the 20% significance level.

Many econometrics packages have a built-in feature for testing multiple exclusion restrictions. These packages have several advantages over calculating the statistics by hand: we will less likely make a mistake, p-values are computed automatically, and the problem of missing data (as in Example 4.9) is handled without any additional work on our part.

4.5e The F Statistic for Overall Significance of a Regression

A special set of exclusion restrictions is routinely tested by most regression packages. These restrictions have the same interpretation, regardless of the model. In the model with k independent variables, we can write the null hypothesis as

H0: x1, x2, …, xk do not help to explain y.

This null hypothesis is, in a way, very pessimistic. It states that none of the explanatory variables has an effect on y. Stated in terms of the parameters, the null is that all slope parameters are zero:

H0: β1 = β2 = … = βk = 0,   (4.44)

and the alternative is that at least one of the βj is different from zero. Another useful way of stating the null is that H0: E(y|x1, x2, …, xk) = E(y), so that knowing the values of x1, x2, …, xk does not affect the expected value of y.

There are k restrictions in (4.44), and when we impose them, we get the restricted model

y = β0 + u;   (4.45)

all independent variables have been dropped from the equation. Now, the R-squared from estimating (4.45) is zero; none of the variation in y is being explained because there are no explanatory variables. Therefore, the F statistic for testing (4.44) can be written as

F = [R²/k] / [(1 − R²)/(n − k − 1)],   (4.46)

where R² is just the usual R-squared from the regression of y on x1, x2, …, xk.
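Equation (4.46) needs only the R-squared, n, and k; a small helper makes the point:

def overall_F(r2, n, k):
    """F statistic (4.46) for H0: all k slope coefficients are zero."""
    return (r2 / k) / ((1 - r2) / (n - k - 1))

# The birth weight equation of Example 4.9: R-squared = .0387, n = 1,191, k = 5
print(overall_F(.0387, 1191, 5))   # about 9.5 (the text below reports about 9.55)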
Most regression packages report the F statistic in (4.46) automatically, which makes it tempting to use this statistic to test general exclusion restrictions. You must avoid this temptation. The F statistic in (4.41) is used for general exclusion restrictions; it depends on the R-squareds from the restricted and unrestricted models. The special form of (4.46) is valid only for testing joint exclusion of all independent variables. This is sometimes called determining the overall significance of the regression.

If we fail to reject (4.44), then there is no evidence that any of the independent variables help to explain y. This usually means that we must look for other variables to explain y. For Example 4.9, the F statistic for testing (4.44) is about 9.55 with k = 5 and n − k − 1 = 1,185 df. The p-value is zero to four places after the decimal point, so that (4.44) is rejected very strongly. Thus, we conclude that the variables in the bwght equation do explain some variation in bwght. The amount explained is not large: only 3.87%. But the seemingly small R-squared results in a highly significant F statistic. That is why we must compute the F statistic to test for joint significance and not just look at the size of the R-squared.

Occasionally, the F statistic for the hypothesis that all independent variables are jointly insignificant is the focus of a study. Problem 10 asks you to use stock return data to test whether stock returns over a four-year horizon are predictable based on information known only at the beginning of the period. Under the efficient markets hypothesis, the returns should not be predictable; the null hypothesis is precisely (4.44).

4.5f Testing General Linear Restrictions

Testing exclusion restrictions is by far the most important application of F statistics. Sometimes, however, the restrictions implied by a theory are more complicated than just excluding some independent variables. It is still straightforward to use the F statistic for testing.

As an example, consider the following equation:

log(price) = β0 + β1 log(assess) + β2 log(lotsize) + β3 log(sqrft) + β4 bdrms + u,   (4.47)

where
price = house price,
assess = the assessed housing value (before the house was sold),
lotsize = size of the lot, in square feet,
sqrft = square footage,
bdrms = number of bedrooms.

Now, suppose we would like to test whether the assessed housing price is a rational valuation. If this is the case, then a 1% change in assess should be associated with a 1% change in price; that is, β1 = 1. In addition, lotsize, sqrft, and bdrms should not help to explain log(price), once the assessed value has been controlled for. Together, these hypotheses can be stated as

H0: β1 = 1, β2 = 0, β3 = 0, β4 = 0.   (4.48)

Four restrictions have to be tested; three are exclusion restrictions, but β1 = 1 is not. How can we test this hypothesis using the F statistic?

As in the exclusion restriction case, we estimate the unrestricted model, (4.47) in this case, and then impose the restrictions in (4.48) to obtain the restricted model. It is the second step that can be a little tricky. But all we do is plug in the restrictions. If we write (4.47) as

y = β0 + β1x1 + β2x2 + β3x3 + β4x4 + u,   (4.49)
then the restricted model is y = β0 + x1 + u. Now, to impose the restriction that the coefficient on x1 is unity, we must estimate the following model:

y − x1 = β0 + u.   (4.50)

This is just a model with an intercept (β0) but with a different dependent variable than in (4.49). The procedure for computing the F statistic is the same: estimate (4.50), obtain the SSR (SSR_r), and use this with the unrestricted SSR from (4.49) in the F statistic (4.37). We are testing q = 4 restrictions, and there are n − 5 df in the unrestricted model. The F statistic is simply [(SSR_r − SSR_ur)/SSR_ur][(n − 5)/4].

Before illustrating this test using a data set, we must emphasize one point: we cannot use the R-squared form of the F statistic for this example because the dependent variable in (4.50) is different from the one in (4.49). This means the total sum of squares from the two regressions will be different, and (4.41) is no longer equivalent to (4.37). As a general rule, the SSR form of the F statistic should be used if a different dependent variable is needed in running the restricted regression.

The estimated unrestricted model using the data in HPRICE1 is

log(price) = .264 + 1.043 log(assess) + .0074 log(lotsize)
                  (.570)  (.151)                    (.0386)
                  − .1032 log(sqrft) + .0338 bdrms
                  (.1384)                  (.0221)
n = 88, SSR = 1.822, R² = .773.

If we use separate t statistics to test each hypothesis in (4.48), we fail to reject each one. But rationality of the assessment is a joint hypothesis, so we should test the restrictions jointly. The SSR from the restricted model turns out to be SSR_r = 1.880, and so the F statistic is [(1.880 − 1.822)/1.822](83/4) = .661. The 5% critical value in an F distribution with (4, 83) df is about 2.50, and so we fail to reject H0. There is essentially no evidence against the hypothesis that the assessed values are rational.
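The restricted regression in (4.50) is just an intercept-only regression with a constructed dependent variable. A sketch with statsmodels, assuming the HPRICE1 data are in a hypothetical file hprice1.csv:

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

hp = pd.read_csv("hprice1.csv")   # hypothetical file holding the HPRICE1 data

ur = smf.ols("np.log(price) ~ np.log(assess) + np.log(lotsize)"
             " + np.log(sqrft) + bdrms", data=hp).fit()

hp["y_restricted"] = np.log(hp["price"]) - np.log(hp["assess"])
r = smf.ols("y_restricted ~ 1", data=hp).fit()   # intercept-only model (4.50)

q, df_ur = 4, 88 - 5
F = ((r.ssr - ur.ssr) / ur.ssr) * (df_ur / q)
print(F)   # about .66, well below the 5% critical value of about 2.50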
4.6 Reporting Regression Results

We end this chapter by providing a few guidelines on how to report multiple regression results for relatively complicated empirical projects. This should help you to read published works in the applied social sciences, while also preparing you to write your own empirical papers. We will expand on this topic in the remainder of the text by reporting results from various examples, but many of the key points can be made now.

Naturally, the estimated OLS coefficients should always be reported. For the key variables in an analysis, you should interpret the estimated coefficients (which often requires knowing the units of measurement of the variables). For example, is an estimate an elasticity, or does it have some other interpretation that needs explanation? The economic or practical importance of the estimates of the key variables should be discussed.

The standard errors should always be included along with the estimated coefficients. Some authors prefer to report the t statistics rather than the standard errors (and sometimes just the absolute value of the t statistics). Although nothing is really wrong with this, there is some preference for reporting standard errors. First, it forces us to think carefully about the null hypothesis being tested; the null is not always that the population parameter is zero. Second, having standard errors makes it easier to compute confidence intervals.

The R-squared from the regression should always be included. We have seen that, in addition to providing a goodness-of-fit measure, it makes calculation of F statistics for exclusion restrictions simple. Reporting the sum of squared residuals and the standard error of the regression is sometimes a good idea, but it is not crucial. The number of observations used in estimating any equation should appear near the estimated equation.

If only a couple of models are being estimated, the results can be summarized in equation form, as we have done up to this point. However, in many papers, several equations are estimated with many different sets of independent variables. We may estimate the same equation for different groups of people, or even have equations explaining different dependent variables. In such cases, it is better to summarize the results in one or more tables. The dependent variable should be indicated clearly in the table, and the independent variables should be listed in the first column. Standard errors (or t statistics) can be put in parentheses below the estimates.

Example 4.10: Salary-Pension Tradeoff for Teachers

Let totcomp denote average total annual compensation for a teacher, including salary and all fringe benefits (pension, health insurance, and so on). Extending the standard wage equation, total compensation should be a function of productivity and perhaps other characteristics. As is standard, we use logarithmic form:

    log(totcomp) = f(productivity characteristics, other factors),

where f(·) is some function (unspecified, for now). Write

    totcomp = salary + benefits = salary(1 + benefits/salary).

This equation shows that total compensation is the product of two terms: salary and 1 + b/s, where b/s is shorthand for the benefits-to-salary ratio. Taking the log of this equation gives log(totcomp) = log(salary) + log(1 + b/s). Now, for small b/s, log(1 + b/s) ≈ b/s; we will use this approximation. This leads to the econometric model

    log(salary) = β0 + β1(b/s) + other factors.

Testing the salary-benefits tradeoff then is the same as a test of H0: β1 = −1 against H1: β1 ≠ −1.

We use the data in MEAP93 to test this hypothesis. These data are averaged at the school level, and we do not observe very many other factors that could affect total compensation. We will include controls for size of the school (enroll), staff per thousand students (staff), and measures such as the school dropout and graduation rates. The average b/s in the sample is about .205, and the largest value is .450.

The estimated equations are given in Table 4.1, where standard errors are given in parentheses below the coefficient estimates. The key variable is bs, the benefits-salary ratio. From the first column in Table 4.1, we see that, without controlling for any other factors, the OLS coefficient for bs is −.825. The t statistic for testing the null hypothesis H0: β1 = −1 is t = (−.825 + 1)/.200 = .875, and so the simple regression fails to reject H0.
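Because packages report t statistics only for the null that a coefficient is zero, the test of H0: β1 = −1 has to be computed from the coefficient and its standard error. Here is a hedged sketch, assuming a fitted statsmodels result res whose benefits-salary regressor is named bs (an illustrative name):

    # A small sketch of testing H0: beta_1 = -1 (rather than 0), assuming
    # a fitted statsmodels result `res` from regressing log(salary) on bs;
    # the variable name "bs" is illustrative.
    from scipy import stats

    b1 = res.params["bs"]
    se1 = res.bse["bs"]
    t = (b1 - (-1.0)) / se1                    # e.g., (-.825 + 1)/.200 = .875
    p = 2 * stats.t.sf(abs(t), res.df_resid)   # two-sided p-value
    print(t, p)

    # Equivalently, statsmodels can carry out the test directly:
    print(res.t_test("bs = -1"))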
After adding controls for school size and staff size (which roughly captures the number of students taught by each teacher), the estimate of the bs coefficient becomes −.605. Now the test of β1 = −1 gives a t statistic of about 2.39; thus, H0 is rejected at the 5% level against a two-sided alternative. The variables log(enroll) and log(staff) are very statistically significant.

Exploring Further 4.6: How does adding droprate and gradrate affect the estimate of the salary-benefits tradeoff? Are these variables jointly significant at the 5% level? What about the 10% level?

Table 4.1 Testing the Salary-Benefits Tradeoff
Dependent Variable: log(salary)

    Independent Variables    (1)         (2)         (3)
    bs                       −.825       −.605       −.589
                             (.200)      (.165)      (.165)
    log(enroll)              —           .0874       .0881
                                         (.0073)     (.0073)
    log(staff)               —           −.222       −.218
                                         (.050)      (.050)
    droprate                 —           —           −.00028
                                                     (.00161)
    gradrate                 —           —           .00097
                                                     (.00066)
    intercept                10.523      10.884      10.738
                             (0.042)     (0.252)     (0.258)
    Observations             408         408         408
    R-squared                .040        .353        .361
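Tables like Table 4.1 can be assembled mechanically. The sketch below assumes a DataFrame df with MEAP93-style columns (salary, bs, enroll, staff, droprate, gradrate) — illustrative names — and uses statsmodels' summary_col helper to line up the three specifications in columns with standard errors in parentheses.

    # A sketch of building a Table 4.1-style summary, assuming a DataFrame
    # `df` with MEAP93-style columns; the column names are illustrative.
    import numpy as np
    import statsmodels.formula.api as smf
    from statsmodels.iolib.summary2 import summary_col

    m1 = smf.ols("np.log(salary) ~ bs", data=df).fit()
    m2 = smf.ols("np.log(salary) ~ bs + np.log(enroll) + np.log(staff)",
                 data=df).fit()
    m3 = smf.ols("np.log(salary) ~ bs + np.log(enroll) + np.log(staff)"
                 " + droprate + gradrate", data=df).fit()

    # One column per specification, standard errors in parentheses
    table = summary_col([m1, m2, m3], stars=False,
                        info_dict={"Observations": lambda m: f"{int(m.nobs)}",
                                   "R-squared": lambda m: f"{m.rsquared:.3f}"})
    print(table)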
Summary

In this chapter, we have covered the very important topic of statistical inference, which allows us to infer something about the population model from a random sample. We summarize the main points:

1. Under the classical linear model assumptions MLR.1 through MLR.6, the OLS estimators are normally distributed.
2. Under the CLM assumptions, the t statistics have t distributions under the null hypothesis.
3. We use t statistics to test hypotheses about a single parameter against one- or two-sided alternatives, using one- or two-tailed tests, respectively. The most common null hypothesis is H0: βj = 0, but we sometimes want to test other values of βj under H0.
4. In classical hypothesis testing, we first choose a significance level, which, along with the df and alternative hypothesis, determines the critical value against which we compare the t statistic. It is more informative to compute the p-value for a t test—the smallest significance level for which the null hypothesis is rejected—so that the hypothesis can be tested at any significance level.
5. Under the CLM assumptions, confidence intervals can be constructed for each βj. These CIs can be used to test any null hypothesis concerning βj against a two-sided alternative.
6. Single hypothesis tests concerning more than one βj can always be tested by rewriting the model to contain the parameter of interest. Then, a standard t statistic can be used.
7. The F statistic is used to test multiple exclusion restrictions, and there are two equivalent forms of the test. One is based on the SSRs from the restricted and unrestricted models. A more convenient form is based on the R-squareds from the two models.
8. When computing an F statistic, the numerator df is the number of restrictions being tested, while the denominator df is the degrees of freedom in the unrestricted model.
9. The alternative for F testing is two-sided. In the classical approach, we specify a significance level which, along with the numerator df and the denominator df, determines the critical value. The null hypothesis is rejected when the statistic, F, exceeds the critical value, c. Alternatively, we can compute a p-value to summarize the evidence against H0.
10. General multiple linear restrictions can be tested using the sum of squared residuals form of the F statistic.
11. The F statistic for the overall significance of a regression tests the null hypothesis that all slope parameters are zero, with the intercept unrestricted. Under H0, the explanatory variables have no effect on the expected value of y.
12. When data are missing on one or more explanatory variables, one must be careful when computing F statistics by hand, that is, using either the sum of squared residuals or R-squareds from the two regressions. Whenever possible it is best to leave the calculations to statistical packages that have built-in commands, which work with or without missing data.

The Classical Linear Model Assumptions

Now is a good time to review the full set of classical linear model (CLM) assumptions for cross-sectional regression. Following each assumption is a comment about its role in multiple regression analysis.

Assumption MLR.1 (Linear in Parameters)
The model in the population can be written as

    y = β0 + β1x1 + β2x2 + … + βkxk + u,

where β0, β1, …, βk are the unknown parameters (constants) of interest and u is an unobserved random error or disturbance term.

Assumption MLR.1 describes the population relationship we hope to estimate, and explicitly sets out the βj—the ceteris paribus population effects of the xj on y—as the parameters of interest.

Assumption MLR.2 (Random Sampling)
We have a random sample of n observations, {(xi1, xi2, …, xik, yi): i = 1, …, n}, following the population model in Assumption MLR.1.

This random sampling assumption means that we have data that can be used to estimate the βj, and that the data have been chosen to be representative of the population described in Assumption MLR.1.

Assumption MLR.3 (No Perfect Collinearity)
In the sample (and therefore in the population), none of the independent variables is constant, and there are no exact linear relationships among the independent variables.

Once we have a sample of data, we need to know that we can use the data to compute the OLS estimates, the β̂j. This is the role of Assumption MLR.3: if we have sample variation in each independent variable and no exact linear relationships among the independent variables, we can compute the β̂j.

Assumption MLR.4 (Zero Conditional Mean)
The error u has an expected value of zero given any values of the explanatory variables. In other words, E(u|x1, x2, …, xk) = 0.

As we discussed in the text, assuming that the unobserved factors are, on average, unrelated to the explanatory variables is key to deriving the first statistical property of each OLS estimator: its unbiasedness for the corresponding population parameter. Of course, all of the previous assumptions are used to show unbiasedness.

Assumption MLR.5 (Homoskedasticity)
The error u has the same variance given any values of the explanatory variables. In other words, Var(u|x1, x2, …, xk) = σ².

Compared with Assumption MLR.4, the homoskedasticity assumption is of secondary importance; in particular, Assumption MLR.5 has no bearing on the unbiasedness of the β̂j. Still, homoskedasticity has two important implications: (1) we can derive formulas for the sampling variances whose components are easy to characterize; (2) we can conclude, under the Gauss-Markov assumptions (MLR.1 through MLR.5), that the OLS estimators have smallest variance among all linear unbiased estimators.
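Implication (1) can be checked by simulation: under MLR.1 through MLR.5, the usual OLS standard error should track the actual sampling variation of β̂1 across repeated samples. A minimal sketch with a made-up data-generating process:

    # A minimal simulation sketch (all numbers are made up) illustrating
    # implication (1): under MLR.1-MLR.5 the usual OLS standard error is
    # estimating the true sampling variation of beta_1-hat.
    import numpy as np

    rng = np.random.default_rng(0)
    n, reps = 200, 2000
    b1_hats, reported_se = [], []

    for _ in range(reps):
        x1 = rng.normal(size=n)
        x2 = 0.5 * x1 + rng.normal(size=n)      # correlated regressors
        u = rng.normal(scale=2.0, size=n)       # homoskedastic error
        y = 1.0 + 0.5 * x1 - 1.0 * x2 + u
        X = np.column_stack([np.ones(n), x1, x2])
        beta = np.linalg.lstsq(X, y, rcond=None)[0]
        resid = y - X @ beta
        sigma2 = resid @ resid / (n - 3)
        cov = sigma2 * np.linalg.inv(X.T @ X)
        b1_hats.append(beta[1])
        reported_se.append(np.sqrt(cov[1, 1]))

    print(np.std(b1_hats), np.mean(reported_se))  # should be close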
Assumption MLR.6 (Normality)
The population error u is independent of the explanatory variables x1, x2, …, xk and is normally distributed with zero mean and variance σ²: u ~ Normal(0, σ²).

In this chapter, we added Assumption MLR.6 to obtain the exact sampling distributions of t statistics and F statistics, so that we can carry out exact hypotheses tests. In the next chapter, we will see that MLR.6 can be dropped if we have a reasonably large sample size. Assumption MLR.6 does imply a stronger efficiency property of OLS: the OLS estimators have smallest variance among all unbiased estimators; the comparison group is no longer restricted to estimators linear in the {yi: i = 1, 2, …, n}.

Key Terms

Alternative Hypothesis; Classical Linear Model; Classical Linear Model (CLM) Assumptions; Confidence Interval (CI); Critical Value; Denominator Degrees of Freedom; Economic Significance; Exclusion Restrictions; F Statistic; Joint Hypotheses Test; Jointly Insignificant; Jointly Statistically Significant; Minimum Variance Unbiased Estimators; Multiple Hypotheses Test; Multiple Restrictions; Normality Assumption; Null Hypothesis; Numerator Degrees of Freedom; One-Sided Alternative; One-Tailed Test; Overall Significance of the Regression; p-Value; Practical Significance; R-squared Form of the F Statistic; Rejection Rule; Restricted Model; Significance Level; Statistically Insignificant; Statistically Significant; t Ratio; t Statistic; Two-Sided Alternative; Two-Tailed Test; Unrestricted Model

Problems

1. Which of the following can cause the usual OLS t statistics to be invalid (that is, not to have t distributions under H0)?
(i) Heteroskedasticity.
(ii) A sample correlation coefficient of .95 between two independent variables that are in the model.
(iii) Omitting an important explanatory variable.

2. Consider an equation to explain salaries of CEOs in terms of annual firm sales, return on equity (roe, in percentage form), and return on the firm's stock (ros, in percentage form):

    log(salary) = β0 + β1log(sales) + β2roe + β3ros + u.

(i) In terms of the model parameters, state the null hypothesis that, after controlling for sales and roe, ros has no effect on CEO salary. State the alternative that better stock market performance increases a CEO's salary.
(ii) Using the data in CEOSAL1, the following equation was obtained by OLS:

    log(salary) = 4.32 + .280 log(sales) + .0174 roe + .00024 ros
                  (.32)  (.035)           (.0041)     (.00054)
    n = 209, R² = .283.

By what percentage is salary predicted to increase if ros increases by 50 points? Does ros have a practically large effect on salary?
(iii) Test the null hypothesis that ros has no effect on salary against the alternative that ros has a positive effect. Carry out the test at the 10% significance level.
(iv) Would you include ros in a final model explaining CEO compensation in terms of firm performance? Explain.
3. The variable rdintens is expenditures on research and development (R&D) as a percentage of sales. Sales are measured in millions of dollars. The variable profmarg is profits as a percentage of sales. Using the data in RDCHEM for 32 firms in the chemical industry, the following equation is estimated:

    rdintens = .472 + .321 log(sales) + .050 profmarg
               (1.369) (.216)          (.046)
    n = 32, R² = .099.

(i) Interpret the coefficient on log(sales). In particular, if sales increases by 10%, what is the estimated percentage point change in rdintens? Is this an economically large effect?
(ii) Test the hypothesis that R&D intensity does not change with sales against the alternative that it does increase with sales. Do the test at the 5% and 10% levels.
(iii) Interpret the coefficient on profmarg. Is it economically large?
(iv) Does profmarg have a statistically significant effect on rdintens?

4. Are rent rates influenced by the student population in a college town? Let rent be the average monthly rent paid on rental units in a college town in the United States. Let pop denote the total city population, avginc the average city income, and pctstu the student population as a percentage of the total population. One model to test for a relationship is

    log(rent) = β0 + β1log(pop) + β2log(avginc) + β3pctstu + u.

(i) State the null hypothesis that size of the student body relative to the population has no ceteris paribus effect on monthly rents. State the alternative that there is an effect.
(ii) What signs do you expect for β1 and β2?
(iii) The equation estimated using 1990 data from RENTAL for 64 college towns is

    log(rent) = .043 + .066 log(pop) + .507 log(avginc) + .0056 pctstu
                (.844) (.039)         (.081)             (.0017)
    n = 64, R² = .458.

What is wrong with the statement: "A 10% increase in population is associated with about a 6.6% increase in rent"?
(iv) Test the hypothesis stated in part (i) at the 1% level.

5. Consider the estimated equation from Example 4.3, which can be used to study the effects of skipping class on college GPA:

    colGPA = 1.39 + .412 hsGPA + .015 ACT − .083 skipped
             (.33)  (.094)       (.011)     (.026)
    n = 141, R² = .234.

(i) Using the standard normal approximation, find the 95% confidence interval for βhsGPA.
(ii) Can you reject the hypothesis H0: βhsGPA = .4 against the two-sided alternative at the 5% level?
(iii) Can you reject the hypothesis H0: βhsGPA = 1 against the two-sided alternative at the 5% level?

6. In Section 4.5, we used as an example testing the rationality of assessments of housing prices. There, we used a log-log model in price and assess [see equation (4.47)]. Here, we use a level-level formulation.
(i) In the simple regression model

    price = β0 + β1assess + u,

the assessment is rational if β1 = 1 and β0 = 0. The estimated equation is

    price = −14.47 + .976 assess
            (16.27)  (.049)
    n = 88, SSR = 165,644.51, R² = .820.

First, test the hypothesis H0: β0 = 0 against the two-sided alternative. Then, test H0: β1 = 1 against the two-sided alternative. What do you conclude?
(ii) To test the joint hypothesis that β0 = 0 and β1 = 1, we need the SSR in the restricted model. This amounts to computing Σᵢ₌₁ⁿ (priceᵢ − assessᵢ)², where n = 88, since the residuals in the restricted model are just priceᵢ − assessᵢ.
No estimation is needed for the restricted model, because both parameters are specified under H0. This turns out to yield SSR = 209,448.99. Carry out the F test for the joint hypothesis.
(iii) Now, test H0: β2 = 0, β3 = 0, and β4 = 0 in the model

    price = β0 + β1assess + β2lotsize + β3sqrft + β4bdrms + u.

The R-squared from estimating this model using the same 88 houses is .829.
(iv) If the variance of price changes with assess, lotsize, sqrft, or bdrms, what can you say about the F test from part (iii)?

7. In Example 4.7, we used data on nonunionized manufacturing firms to estimate the relationship between the scrap rate and other firm characteristics. We now look at this example more closely and use all available firms.
(i) The population model estimated in Example 4.7 can be written as

    log(scrap) = β0 + β1hrsemp + β2log(sales) + β3log(employ) + u.

Using the 43 observations available for 1987, the estimated equation is

    log(scrap) = 11.74 − .042 hrsemp − .951 log(sales) + .992 log(employ)
                 (4.57)  (.019)        (.370)            (.360)
    n = 43, R² = .310.

Compare this equation to that estimated using only the 29 nonunionized firms in the sample.
(ii) Show that the population model can also be written as

    log(scrap) = β0 + β1hrsemp + β2log(sales/employ) + θ3log(employ) + u,

where θ3 = β2 + β3. [Hint: Recall that log(x2/x3) = log(x2) − log(x3).] Interpret the hypothesis H0: θ3 = 0.
(iii) When the equation from part (ii) is estimated, we obtain

    log(scrap) = 11.74 − .042 hrsemp − .951 log(sales/employ) + .041 log(employ)
                 (4.57)  (.019)        (.370)                   (.205)
    n = 43, R² = .310.

Controlling for worker training and for the sales-to-employee ratio, do bigger firms have larger statistically significant scrap rates?
(iv) Test the hypothesis that a 1% increase in sales/employ is associated with a 1% drop in the scrap rate.

8. Consider the multiple regression model with three independent variables, under the classical linear model assumptions MLR.1 through MLR.6:

    y = β0 + β1x1 + β2x2 + β3x3 + u.

You would like to test the null hypothesis H0: β1 − 3β2 = 1.
(i) Let β̂1 and β̂2 denote the OLS estimators of β1 and β2. Find Var(β̂1 − 3β̂2) in terms of the variances of β̂1 and β̂2 and the covariance between them. What is the standard error of β̂1 − 3β̂2?
(ii) Write the t statistic for testing H0: β1 − 3β2 = 1.
(iii) Define θ1 = β1 − 3β2 and θ̂1 = β̂1 − 3β̂2. Write a regression equation involving β0, θ1, β2, and β3 that allows you to directly obtain θ̂1 and its standard error.

9. In Problem 3 in Chapter 3, we estimated the equation

    sleep = 3,638.25 − .148 totwrk − 11.13 educ + 2.20 age
            (112.28)   (.017)        (5.88)       (1.45)
    n = 706, R² = .113,

where we now report standard errors along with the estimates.
(i) Is either educ or age individually significant at the 5% level against a two-sided alternative? Show your work.
(ii) Dropping educ and age from the equation gives

    sleep = 3,586.38 − .151 totwrk
            (38.91)    (.017)
    n = 706, R² = .103.

Are educ and age jointly significant in the original equation at the 5% level? Justify your answer.
(iii) Does including educ and age in the model greatly affect the estimated tradeoff between sleeping and working?
(iv) Suppose that the sleep equation contains heteroskedasticity. What does this mean about the tests computed in parts (i) and (ii)?
10. Regression analysis can be used to test whether the market efficiently uses information in valuing stocks. For concreteness, let return be the total return from holding a firm's stock over the four-year period from the end of 1990 to the end of 1994. The efficient markets hypothesis says that these returns should not be systematically related to information known in 1990. If firm characteristics known at the beginning of the period help to predict stock returns, then we could use this information in choosing stocks. For 1990, let dkr be a firm's debt to capital ratio, let eps denote the earnings per share, let netinc denote net income, and let salary denote total compensation for the CEO.
(i) Using the data in RETURN, the following equation was estimated:

    return = −14.37 + .321 dkr + .043 eps − .0051 netinc + .0035 salary
             (6.89)   (.201)     (.078)     (.0047)        (.0022)
    n = 142, R² = .0395.

Test whether the explanatory variables are jointly significant at the 5% level. Is any explanatory variable individually significant?
(ii) Now, reestimate the model using the log form for netinc and salary:

    return = −36.30 + .327 dkr + .069 eps − 4.74 log(netinc) + 7.24 log(salary)
             (39.37)  (.203)     (.080)     (3.39)             (6.31)
    n = 142, R² = .0330.

Do any of your conclusions from part (i) change?
(iii) In this sample, some firms have zero debt and others have negative earnings. Should we try to use log(dkr) or log(eps) in the model to see if these improve the fit? Explain.
(iv) Overall, is the evidence for predictability of stock returns strong or weak?

11. The following table was created using the data in CEOSAL2, where standard errors are in parentheses below the coefficients:

Dependent Variable: log(salary)

    Independent Variables   (1)        (2)        (3)
    log(sales)              .224       .158       .188
                            (.027)     (.040)     (.040)
    log(mktval)             —          .112       .100
                                       (.050)     (.049)
    profmarg                —          −.0023     −.0022
                                       (.0022)    (.0021)
    ceoten                  —          —          .0171
                                                  (.0055)
    comten                  —          —          −.0092
                                                  (.0033)
    intercept               4.94       4.62       4.57
                            (0.20)     (0.25)     (0.25)
    Observations            177        177        177
    R-squared               .281       .304       .353

The variable mktval is market value of the firm, profmarg is profit as a percentage of sales, ceoten is years as CEO with the current company, and comten is total years with the company.
(i) Comment on the effect of profmarg on CEO salary.
(ii) Does market value have a significant effect? Explain.
(iii) Interpret the coefficients on ceoten and comten. Are these explanatory variables statistically significant?
(iv) What do you make of the fact that longer tenure with the company, holding the other factors fixed, is associated with a lower salary?
12. The following analysis was obtained using data in MEAP93, which contains school-level pass rates (as a percent) on a tenth-grade math test.
(i) The variable expend is expenditures per student, in dollars, and math10 is the pass rate on the exam. The following simple regression relates math10 to lexpend = log(expend):

    math10 = −69.34 + 11.16 lexpend
             (25.53)  (3.17)
    n = 408, R² = .0297.

Interpret the coefficient on lexpend. In particular, if expend increases by 10%, what is the estimated percentage point change in math10? What do you make of the large negative intercept estimate? (The minimum value of lexpend is 8.11 and its average value is 8.37.)
(ii) Does the small R-squared in part (i) imply that spending is correlated with other factors affecting math10? Explain. Would you expect the R-squared to be much higher if expenditures were randomly assigned to schools—that is, independent of other school and student characteristics—rather than having the school districts determine spending?
(iii) When log of enrollment and the percent of students eligible for the federal free lunch program are included, the estimated equation becomes

    math10 = −23.14 + 7.75 lexpend − 1.26 lenroll − .324 lnchprg
             (24.99)  (3.04)         (.58)          (.036)
    n = 408, R² = .1893.

Comment on what happens to the coefficient on lexpend. Is the spending coefficient still statistically different from zero?
(iv) What do you make of the R-squared in part (iii)? What are some other factors that could be used to explain math10 at the school level?

13. The data in MEAPSINGLE were used to estimate the following equations relating school-level performance on a fourth-grade math test to socioeconomic characteristics of students attending school. The variable free, measured at the school level, is the percentage of students eligible for the federal free lunch program. The variable medinc is median income in the ZIP code, and pctsgle is percent of students not living with two parents (also measured at the ZIP code level). See also Computer Exercise C11 in Chapter 3.

    math4 = 96.77 − .833 pctsgle
            (1.60)  (.071)
    n = 299, R² = .380.

    math4 = 93.00 − .275 pctsgle − .402 free
            (1.63)  (.117)         (.070)
    n = 299, R² = .459.

    math4 = 24.49 − .274 pctsgle − .422 free − .752 lmedinc + 9.01 lexppp
            (59.24) (.161)         (.071)      (5.358)        (4.04)
    n = 299, R² = .472.

    math4 = 17.52 − .259 pctsgle − .420 free + 8.80 lexppp
            (32.25) (.117)         (.070)      (3.76)
    n = 299, R² = .472.

(i) Interpret the coefficient on the variable pctsgle in the first equation. Comment on what happens when free is added as an explanatory variable.
(ii) Does expenditure per pupil, entered in logarithmic form, have a statistically significant effect on performance? How big is the estimated effect?
(iii) If you had to choose among the four equations as your best estimate of the effect of pctsgle and obtain a 95% confidence interval of βpctsgle, which would you choose? Why?
Computer Exercises

C1. The following model can be used to study whether campaign expenditures affect election outcomes:

    voteA = β0 + β1log(expendA) + β2log(expendB) + β3prtystrA + u,

where voteA is the percentage of the vote received by Candidate A, expendA and expendB are campaign expenditures by Candidates A and B, and prtystrA is a measure of party strength for Candidate A (the percentage of the most recent presidential vote that went to A's party).
(i) What is the interpretation of β1?
(ii) In terms of the parameters, state the null hypothesis that a 1% increase in A's expenditures is offset by a 1% increase in B's expenditures.
(iii) Estimate the given model using the data in VOTE1 and report the results in usual form. Do A's expenditures affect the outcome? What about B's expenditures? Can you use these results to test the hypothesis in part (ii)?
(iv) Estimate a model that directly gives the t statistic for testing the hypothesis in part (ii). What do you conclude? (Use a two-sided alternative.)

C2. Use the data in LAWSCH85 for this exercise.
(i) Using the same model as in Problem 4 in Chapter 3, state and test the null hypothesis that the rank of law schools has no ceteris paribus effect on median starting salary.
(ii) Are features of the incoming class of students—namely, LSAT and GPA—individually or jointly significant for explaining salary? (Be sure to account for missing data on LSAT and GPA.)
(iii) Test whether the size of the entering class (clsize) or the size of the faculty (faculty) needs to be added to this equation; carry out a single test. (Be careful to account for missing data on clsize and faculty.)
(iv) What factors might influence the rank of the law school that are not included in the salary regression?

C3. Refer to Computer Exercise C2 in Chapter 3. Now, use the log of the housing price as the dependent variable:

    log(price) = β0 + β1sqrft + β2bdrms + u.

(i) You are interested in estimating and obtaining a confidence interval for the percentage change in price when a 150-square-foot bedroom is added to a house. In decimal form, this is θ1 = 150β1 + β2. Use the data in HPRICE1 to estimate θ1.
(ii) Write β2 in terms of θ1 and β1 and plug this into the log(price) equation.
(iii) Use part (ii) to obtain a standard error for θ̂1 and use this standard error to construct a 95% confidence interval.

C4. In Example 4.9, the restricted version of the model can be estimated using all 1,388 observations in the sample. Compute the R-squared from the regression of bwght on cigs, parity, and faminc using all observations. Compare this to the R-squared reported for the restricted model in Example 4.9.

C5. Use the data in MLB1 for this exercise.
(i) Use the model estimated in equation (4.31) and drop the variable rbisyr. What happens to the statistical significance of hrunsyr? What about the size of the coefficient on hrunsyr?
(ii) Add the variables runsyr (runs per year), fldperc (fielding percentage), and sbasesyr (stolen bases per year) to the model from part (i). Which of these factors are individually significant?
(iii) In the model from part (ii), test the joint significance of bavg, fldperc, and sbasesyr.

C6. Use the data in WAGE2 for this exercise.
(i) Consider the standard wage equation

    log(wage) = β0 + β1educ + β2exper + β3tenure + u.

State the null hypothesis that another year of general workforce experience has the same effect on log(wage) as another year of tenure with the current employer.
(ii) Test the null hypothesis in part (i) against a two-sided alternative, at the 5% significance level, by constructing a 95% confidence interval. What do you conclude?

C7. Refer to the example used in Section 4.4. You will use the data set TWOYEAR.
(i) The variable phsrank is the person's high school percentile. (A higher number is better. For example, 90 means you are ranked better than 90 percent of your graduating class.) Find the smallest, largest, and average phsrank in the sample.
(ii) Add phsrank to equation (4.26) and report the OLS estimates in the usual form. Is phsrank statistically significant? How much is 10 percentage points of high school rank worth in terms of wage?
(iii) Does adding phsrank to (4.26) substantively change the conclusions on the returns to two- and four-year colleges? Explain.
(iv) The data set contains a variable called id. Explain why, if you add id to equation (4.17) or (4.26), you expect it to be statistically insignificant. What is the two-sided p-value?

C8. The data set 401KSUBS contains information on net financial wealth (nettfa), age of the survey respondent (age), annual family income (inc), family size (fsize), and participation in certain pension plans for people in the United States. The wealth and income variables are both recorded in thousands of dollars. For this question, use only the data for single-person households (so fsize = 1).
(i) How many single-person households are there in the data set?
(ii) Use OLS to estimate the model

    nettfa = β0 + β1inc + β2age + u,

and report the results using the usual format. Be sure to use only the single-person households in the sample. Interpret the slope coefficients. Are there any surprises in the slope estimates?
(iii) Does the intercept from the regression in part (ii) have an interesting meaning? Explain.
(iv) Find the p-value for the test H0: β2 = 1 against H1: β2 < 1. Do you reject H0 at the 1% significance level?
(v) If you do a simple regression of nettfa on inc, is the estimated coefficient on inc much different from the estimate in part (ii)? Why or why not?

C9. Use the data in DISCRIM to answer this question. (See also Computer Exercise C8 in Chapter 3.)
(i) Use OLS to estimate the model

    log(psoda) = β0 + β1prpblck + β2log(income) + β3prppov + u,

and report the results in the usual form. Is β̂1 statistically different from zero at the 5% level against a two-sided alternative? What about at the 1% level?
(ii) What is the correlation between log(income) and prppov? Is each variable statistically significant in any case? Report the two-sided p-values.
(iii) To the regression in part (i), add the variable log(hseval). Interpret its coefficient and report the two-sided p-value for H0: βlog(hseval) = 0.
(iv) In the regression in part (iii), what happens to the individual statistical significance of log(income) and prppov? Are these variables jointly significant? (Compute a p-value.) What do you make of your answers?
(v) Given the results of the previous regressions, which one would you report as most reliable in determining whether the racial makeup of a zip code influences local fast-food prices?
C10. Use the data in ELEM94_95 to answer this question. The findings can be compared with those in Table 4.1. The dependent variable lavgsal is the log of average teacher salary and bs is the ratio of average benefits to average salary (by school).
(i) Run the simple regression of lavgsal on bs. Is the estimated slope statistically different from zero? Is it statistically different from −1?
(ii) Add the variables lenrol and lstaff to the regression from part (i). What happens to the coefficient on bs? How does the situation compare with that in Table 4.1?
(iii) How come the standard error on the bs coefficient is smaller in part (ii) than in part (i)? (Hint: What happens to the error variance versus multicollinearity when lenrol and lstaff are added?)
(iv) How come the coefficient on lstaff is negative? Is it large in magnitude?
(v) Now, add the variable lunch to the regression. Holding other factors fixed, are teachers being compensated for teaching students from disadvantaged backgrounds? Explain.
(vi) Overall, is the pattern of results that you find with ELEM94_95 consistent with the pattern in Table 4.1?

C11. Use the data in HTV to answer this question. See also Computer Exercise C10 in Chapter 3.
(i) Estimate the regression model

    educ = β0 + β1motheduc + β2fatheduc + β3abil + β4abil² + u

by OLS and report the results in the usual form. Test the null hypothesis that educ is linearly related to abil against the alternative that the relationship is quadratic.
(ii) Using the equation in part (i), test H0: β1 = β2 against a two-sided alternative. What is the p-value of the test?
(iii) Add the two college tuition variables to the regression from part (i) and determine whether they are jointly statistically significant.
(iv) What is the correlation between tuit17 and tuit18? Explain why using the average of the tuition over the two years might be preferred to adding each separately. What happens when you do use the average?
(v) Do the findings for the average tuition variable in part (iv) make sense when interpreted causally? What might be going on?

C12. Use the data in ECONMATH to answer the following questions.
(i) Estimate a model explaining colgpa in terms of hsgpa, actmth, and acteng. Report the results in the usual form. Are all explanatory variables statistically significant?
(ii) Consider an increase in hsgpa of one standard deviation, about .343. By how much does colgpa increase, holding actmth and acteng fixed? About how many standard deviations would actmth have to increase to change colgpa by the same amount as a one standard deviation increase in hsgpa? Comment.
(iii) Test the null hypothesis that actmth and acteng have the same effect (in the population) against a two-sided alternative. Report the p-value and describe your conclusions.
(iv) Suppose the college admissions officer wants you to use the data on the variables in part (i) to create an equation that explains at least 50 percent of the variation in colgpa. What would you tell the officer?

Chapter 5
Multiple Regression Analysis: OLS Asymptotics

In Chapters 3 and 4, we covered what are called finite sample, small sample, or exact properties of the OLS estimators in the population model

    y = β0 + β1x1 + β2x2 + … + βkxk + u.   (5.1)

For example, the unbiasedness of OLS (derived in Chapter 3) under the first four Gauss-Markov assumptions is a finite sample property because it holds for any sample size n (subject to the mild restriction that n must be at least as large as the total number of parameters in the regression model, k + 1).
Similarly, the fact that OLS is the best linear unbiased estimator under the full set of Gauss-Markov assumptions (MLR.1 through MLR.5) is a finite sample property.

In Chapter 4, we added the classical linear model Assumption MLR.6, which states that the error term u is normally distributed and independent of the explanatory variables. This allowed us to derive the exact sampling distributions of the OLS estimators (conditional on the explanatory variables in the sample). In particular, Theorem 4.1 showed that the OLS estimators have normal sampling distributions, which led directly to the t and F distributions for t and F statistics. If the error is not normally distributed, the distribution of a t statistic is not exactly t, and an F statistic does not have an exact F distribution for any sample size.

In addition to finite sample properties, it is important to know the asymptotic properties or large sample properties of estimators and test statistics. These properties are not defined for a particular sample size; rather, they are defined as the sample size grows without bound. Fortunately, under the assumptions we have made, OLS has satisfactory large sample properties. One practically important finding is that even without the normality assumption (Assumption MLR.6), t and F statistics have approximately t and F distributions, at least in large sample sizes. We discuss this in more detail in Section 5.2, after we cover the consistency of OLS in Section 5.1. Because the material in this chapter is more difficult to understand and because one can conduct empirical work without a deep understanding of its contents, this chapter may be skipped. However, we will necessarily refer to large sample properties of OLS when we study discrete response variables in Chapter 7, relax the homoskedasticity assumption in Chapter 8, and delve into estimation with time series data in Part 2. Furthermore, virtually all advanced econometric methods derive their justification using large-sample analysis, so readers who will continue into Part 3 should be familiar with the contents of this chapter.

5.1 Consistency

Unbiasedness of estimators, although important, cannot always be obtained. For example, as we discussed in Chapter 3, the standard error of the regression, σ̂, is not an unbiased estimator for σ, the standard deviation of the error u in a multiple regression model. Although the OLS estimators are unbiased under MLR.1 through MLR.4, in Chapter 11 we will find that there are time series regressions where the OLS estimators are not unbiased. Further, in Part 3 of the text, we encounter several other estimators that are biased yet useful.

Although not all useful estimators are unbiased, virtually all economists agree that consistency is a minimal requirement for an estimator. The Nobel Prize-winning econometrician Clive W. J. Granger once remarked, "If you can't get it right as n goes to infinity,
you shouldn't be in this business." The implication is that, if your estimator of a particular population parameter is not consistent, then you are wasting your time.

There are a few different ways to describe consistency. Formal definitions and results are given in Appendix C; here, we focus on an intuitive understanding. For concreteness, let β̂j be the OLS estimator of βj for some j. For each n, β̂j has a probability distribution (representing its possible values in different random samples of size n). Because β̂j is unbiased under Assumptions MLR.1 through MLR.4, this distribution has mean value βj. If this estimator is consistent, then the distribution of β̂j becomes more and more tightly distributed around βj as the sample size grows. As n tends to infinity, the distribution of β̂j collapses to the single point βj. In effect, this means that we can make our estimator arbitrarily close to βj if we can collect as much data as we want. This convergence is illustrated in Figure 5.1.

Naturally, for any application we have a fixed sample size, which is a major reason an asymptotic property such as consistency can be difficult to grasp. Consistency involves a thought experiment about what would happen as the sample size gets large (while, at the same time, we obtain numerous random samples for each sample size). If obtaining more and more data does not generally get us closer to the parameter value of interest, then we are using a poor estimation procedure.

Conveniently, the same set of assumptions implies both unbiasedness and consistency of OLS. We summarize with a theorem.

Theorem 5.1 (Consistency of OLS)
Under Assumptions MLR.1 through MLR.4, the OLS estimator β̂j is consistent for βj, for all j = 0, 1, …, k.

[Figure 5.1: Sampling distributions of β̂1 for sample sizes n1 < n2 < n3.]

A general proof of this result is most easily developed using the matrix algebra methods described in Appendices D and E. But we can prove Theorem 5.1 without difficulty in the case of the simple regression model. We focus on the slope estimator, β̂1.

The proof starts out the same as the proof of unbiasedness: we write down the formula for β̂1, and then plug in yi = β0 + β1xi1 + ui:

    β̂1 = [Σᵢ₌₁ⁿ (xi1 − x̄1)yi] / [Σᵢ₌₁ⁿ (xi1 − x̄1)²]
        = β1 + [n⁻¹ Σᵢ₌₁ⁿ (xi1 − x̄1)ui] / [n⁻¹ Σᵢ₌₁ⁿ (xi1 − x̄1)²],   (5.2)

where dividing both the numerator and denominator by n does not change the expression but allows us to directly apply the law of large numbers. When we apply the law of large numbers to the averages in the second part of equation (5.2), we conclude that the numerator and denominator converge in probability to the population quantities, Cov(x1, u) and Var(x1), respectively. Provided that Var(x1) ≠ 0—which is assumed in MLR.3—we can use the properties of probability limits (see Appendix C) to get

    plim β̂1 = β1 + Cov(x1, u)/Var(x1) = β1,   (5.3)

because Cov(x1, u) = 0. We have used the fact, discussed in Chapters 2 and 3, that E(u|x1) = 0 (Assumption MLR.4) implies that x1 and u are uncorrelated (have zero covariance).
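Theorem 5.1 is easy to visualize by simulation. The following sketch (with a made-up population, β1 = 2, and a non-normal error) mimics Figure 5.1: as n grows, the spread of the β̂1 estimates across repeated samples shrinks toward zero.

    # A small simulation sketch of Theorem 5.1 (all numbers made up): the
    # sampling distribution of beta_1-hat tightens around beta_1 = 2 as n
    # grows, mimicking Figure 5.1.
    import numpy as np

    rng = np.random.default_rng(42)
    beta0, beta1 = 1.0, 2.0

    for n in (25, 400, 10000):
        estimates = []
        for _ in range(1000):
            x = rng.normal(size=n)
            u = rng.standard_t(df=5, size=n)   # non-normal error is fine
            y = beta0 + beta1 * x + u
            b1 = np.cov(x, y, bias=True)[0, 1] / np.var(x)  # OLS slope
            estimates.append(b1)
        print(n, np.mean(estimates), np.std(estimates))
    # The standard deviation of the estimates shrinks toward zero as n grows.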
As a technical matter, to ensure that the probability limits exist, we should assume that Var(x1) < ∞ and Var(u) < ∞ (which means that their probability distributions are not too spread out), but we will not worry about cases where these assumptions might fail. Further, we could—and, in an advanced treatment of econometrics, we would—explicitly relax Assumption MLR.3 to rule out only perfect collinearity in the population. As stated, Assumption MLR.3 also disallows perfect collinearity among the regressors in the sample we have at hand. Technically, for the thought experiment we can show consistency with no perfect collinearity in the population, allowing for the unlucky possibility that we draw a data set that does exhibit perfect collinearity. From a practical perspective the distinction is unimportant, as we cannot compute the OLS estimates for our sample if MLR.3 fails.

The previous arguments, and equation (5.3) in particular, show that OLS is consistent in the simple regression case if we assume only zero correlation. This is also true in the general case. We now state this as an assumption.

Assumption MLR.4′ (Zero Mean and Zero Correlation)
E(u) = 0 and Cov(xj, u) = 0, for j = 1, 2, …, k.

Assumption MLR.4′ is weaker than Assumption MLR.4 in the sense that the latter implies the former. One way to characterize the zero conditional mean assumption, E(u|x1, …, xk) = 0, is that any function of the explanatory variables is uncorrelated with u. Assumption MLR.4′ requires only that each xj is uncorrelated with u (and that u has a zero mean in the population). In Chapter 2, we actually motivated the OLS estimator for simple regression using Assumption MLR.4′, and the first order conditions for OLS in the multiple regression case, given in equation (3.13), are simply the sample analogs of the population zero correlation assumptions (and zero mean assumption). Therefore, in some ways, Assumption MLR.4′ is more natural an assumption because it leads directly to the OLS estimates.

Further, when we think about violations of Assumption MLR.4, we usually think in terms of Cov(xj, u) ≠ 0 for some j. So how come we have used Assumption MLR.4 until now? There are two reasons, both of which we have touched on earlier. First, OLS turns out to be biased (but consistent) under Assumption MLR.4′ if E(u|x1, …, xk) depends on any of the xj. Because we have previously focused on finite sample, or exact, sampling properties of the OLS estimators, we have needed the stronger zero conditional mean assumption.

Second, and probably more important, is that the zero conditional mean assumption means that we have properly modeled the population regression function (PRF). That is, under Assumption MLR.4 we can write

    E(y|x1, …, xk) = β0 + β1x1 + … + βkxk,

and so we can obtain the partial effects of the explanatory variables on the average, or expected, value of y. If we instead only assume Assumption MLR.4′, β0 + β1x1 + … + βkxk need not represent the PRF, and we face the possibility that some nonlinear functions of the xj, such as x²j, could be correlated with the error u. A situation like this means that we have neglected nonlinearities in the model that could help us better explain y; if we knew that, we would usually include such nonlinear functions. In other words, most of the time we hope to get a good estimate of the PRF, and so the zero conditional mean assumption is natural. Nevertheless, the weaker zero correlation assumption turns out to be useful in interpreting OLS estimation of a linear model as providing the best linear approximation to the PRF. It is also used in more advanced settings, such as in Chapter 15, where we have no interest in modeling a PRF. For further discussion of this somewhat subtle point, see Wooldridge (2010, Chapter 4).
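The "best linear approximation" interpretation can be illustrated with a made-up example: when the true PRF contains a neglected quadratic term, the slope from regressing y on x alone converges to the linear projection slope, Cov(x, y)/Var(x), which may or may not equal the coefficient on x in the PRF.

    # A simulation sketch (made-up DGP) of the "best linear approximation"
    # point: the true PRF is quadratic, E(y|x) = x + x^2, but we regress y
    # on x alone. OLS does not estimate the PRF; it converges to the slope
    # of the best linear approximation, Cov(x, y)/Var(x).
    import numpy as np

    rng = np.random.default_rng(1)
    n = 1_000_000                       # large n to mimic the plim
    x = rng.normal(size=n)              # symmetric x: E(x^3) = 0
    y = x + x**2 + rng.normal(size=n)   # neglected nonlinearity: x^2

    slope = np.cov(x, y, bias=True)[0, 1] / np.var(x)
    print(slope)  # close to 1 here because Cov(x, x^2) = E(x^3) = 0

    # With a skewed x, Cov(x, x^2) != 0 and the linear projection slope
    # differs from the coefficient on x in the PRF:
    x = rng.exponential(size=n)
    y = x + x**2 + rng.normal(size=n)
    print(np.cov(x, y, bias=True)[0, 1] / np.var(x))  # well above 1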
5.1a Deriving the Inconsistency in OLS

Just as failure of E(u|x1, …, xk) = 0 causes bias in the OLS estimators, correlation between u and any of x1, x2, …, xk generally causes all of the OLS estimators to be inconsistent. This simple but important observation is often summarized as: if the error is correlated with any of the independent variables, then OLS is biased and inconsistent. This is very unfortunate because it means that any bias persists as the sample size grows.

In the simple regression case, we can obtain the inconsistency from the first part of equation (5.3), which holds whether or not u and x1 are uncorrelated. The inconsistency in β̂1 (sometimes loosely called the asymptotic bias) is

    plim β̂1 − β1 = Cov(x1, u)/Var(x1).   (5.4)

Because Var(x1) > 0, the inconsistency in β̂1 is positive if x1 and u are positively correlated, and the inconsistency is negative if x1 and u are negatively correlated. If the covariance between x1 and u is small relative to the variance in x1, the inconsistency can be negligible; unfortunately, we cannot even estimate how big the covariance is because u is unobserved.

We can use (5.4) to derive the asymptotic analog of the omitted variable bias (see Table 3.2 in Chapter 3). Suppose the true model,

    y = β0 + β1x1 + β2x2 + v,

satisfies the first four Gauss-Markov assumptions. Then v has a zero mean and is uncorrelated with x1 and x2. If β̂0, β̂1, and β̂2 denote the OLS estimators from the regression of y on x1 and x2, then Theorem 5.1 implies that these estimators are consistent. If we omit x2 from the regression and do the simple regression of y on x1, then u = β2x2 + v. Let β̃1 denote the simple regression slope estimator. Then

    plim β̃1 = β1 + β2δ1,   (5.5)

where

    δ1 = Cov(x1, x2)/Var(x1).   (5.6)

Thus, for practical purposes, we can view the inconsistency as being the same as the bias. The difference is that the inconsistency is expressed in terms of the population variance of x1 and the population covariance between x1 and x2, while the bias is based on their sample counterparts (because we condition on the values of x1 and x2 in the sample).

If x1 and x2 are uncorrelated (in the population), then δ1 = 0, and β̃1 is a consistent estimator of β1 (although not necessarily unbiased). If x2 has a positive partial effect on y, so that β2 > 0, and x1 and x2 are positively correlated, so that δ1 > 0, then the inconsistency in β̃1 is positive, and so on. We can obtain the direction of the inconsistency (or asymptotic bias) from Table 3.2. If the covariance between x1 and x2 is small relative to the variance of x1, the inconsistency can be small.
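A short simulation makes (5.5) and (5.6) concrete (all parameter values are made up): the short-regression slope settles down at β1 + β2δ1, not β1, and more data only sharpens the wrong answer.

    # A simulation sketch of (5.5)-(5.6): omitting x2 leaves the short
    # regression slope converging to beta1 + beta2*delta1, no matter how
    # large n gets.
    import numpy as np

    rng = np.random.default_rng(7)
    beta1, beta2 = 1.0, 0.5

    for n in (100, 10_000, 1_000_000):
        x1 = rng.normal(size=n)
        x2 = 0.8 * x1 + rng.normal(size=n)  # delta1 = Cov(x1,x2)/Var(x1) = 0.8
        v = rng.normal(size=n)
        y = beta1 * x1 + beta2 * x2 + v
        b1_short = np.cov(x1, y, bias=True)[0, 1] / np.var(x1)
        print(n, b1_short)
    # Each estimate is near beta1 + beta2*delta1 = 1.0 + 0.5*0.8 = 1.4,
    # not 1.0.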
Example 5.1: Housing Prices and Distance from an Incinerator

Let y denote the price of a house (price), let x1 denote the distance from the house to a new trash incinerator (distance), and let x2 denote the quality of the house (quality). The variable quality is left vague so that it can include things like size of the house and lot, number of bedrooms and bathrooms, and intangibles such as attractiveness of the neighborhood. If the incinerator depresses house prices, then β1 should be positive: everything else being equal, a house that is farther away from the incinerator is worth more. By definition, β2 is positive since higher quality houses sell for more, other factors being equal. If the incinerator was built farther away, on average, from better homes, then distance and quality are positively correlated, and so δ1 > 0. A simple regression of price on distance [or log(price) on log(distance)] will tend to overestimate the effect of the incinerator: β1 + β2δ1 > β1.

An important point about inconsistency in OLS estimators is that, by definition, the problem does not go away by adding more observations to the sample. If anything, the problem gets worse with more data: the OLS estimator gets closer and closer to β1 + β2δ1 as the sample size grows.

Exploring Further 5.1: Suppose that the model

    score = β0 + β1skipped + β2priGPA + u

satisfies the first four Gauss-Markov assumptions, where score is score on a final exam, skipped is number of classes skipped, and priGPA is GPA prior to the current semester. If β̃1 is from the simple regression of score on skipped, what is the direction of the asymptotic bias in β̃1?

Deriving the sign and magnitude of the inconsistency in the general k regressor case is harder, just as deriving the bias is more difficult. We need to remember that if we have the model in equation (5.1) where, say, x1 is correlated with u but the other independent variables are uncorrelated with u, all of the OLS estimators are generally inconsistent. For example, in the k = 2 case,

    y = β0 + β1x1 + β2x2 + u,

suppose that x2 and u are uncorrelated but x1 and u are correlated. Then the OLS estimators β̂1 and β̂2 will generally both be inconsistent. (The intercept will also be inconsistent.) The inconsistency in β̂2 arises when x1 and x2 are correlated, as is usually the case. If x1 and x2 are uncorrelated, then any correlation between x1 and u does not result in the inconsistency of β̂2: plim β̂2 = β2. Further, the inconsistency in β̂1 is the same as in (5.4). The same statement holds in the general case: if x1 is correlated with u, but x1 and u are uncorrelated with the other independent variables, then only β̂1 is inconsistent, and the inconsistency is given by (5.4). The general case is very similar to the omitted variable case in Section 3A.4 of Appendix 3A.

5.2 Asymptotic Normality and Large Sample Inference

Consistency of an estimator is an important property, but it alone does not allow us to perform statistical inference. Simply knowing that the estimator is getting closer to the population value as the sample size grows does not allow us to test hypotheses about the parameters. For testing, we need the sampling distribution of the OLS estimators. Under the classical linear model assumptions MLR.1 through MLR.6, Theorem 4.1 shows that the sampling distributions are normal. This result is the basis for deriving the t and F distributions that we use so often in applied econometrics.
The exact normality of the OLS estimators hinges crucially on the normality of the distribution of the error, u, in the population. If the errors u1, u2, …, un are random draws from some distribution other than the normal, the β̂j will not be normally distributed, which means that the t statistics will not have t distributions and the F statistics will not have F distributions. This is a potentially serious problem because our inference hinges on being able to obtain critical values or p-values from the t or F distributions.

Recall that Assumption MLR.6 is equivalent to saying that the distribution of y given x1, x2, …, xk is normal. Because y is observed and u is not, in a particular application it is much easier to think about whether the distribution of y is likely to be normal. In fact, we have already seen a few examples where y definitely cannot have a conditional normal distribution. A normally distributed random variable is symmetrically distributed about its mean, it can take on any positive or negative value, and more than 95% of the area under the distribution is within two standard deviations.

In Example 3.5, we estimated a model explaining the number of arrests of young men during a particular year (narr86). In the population, most men are not arrested during the year, and the vast majority are arrested one time at the most. (In the sample of 2,725 men in the data set CRIME1, fewer than 8% were arrested more than once during 1986.) Because narr86 takes on only two values for 92% of the sample, it cannot be close to being normally distributed in the population.

In Example 4.6, we estimated a model explaining participation percentages (prate) in 401(k) pension plans. The frequency distribution (also called a histogram) in Figure 5.2 shows that the distribution of prate is heavily skewed to the right, rather than being normally distributed. In fact, over 40% of the observations on prate are at the value 100, indicating 100% participation. This violates the normality assumption even conditional on the explanatory variables.

We know that normality plays no role in the unbiasedness of OLS, nor does it affect the conclusion that OLS is the best linear unbiased estimator under the Gauss-Markov assumptions. But exact inference based on t and F statistics requires MLR.6. Does this mean that, in our prior analysis of prate in Example 4.6, we must abandon the t statistics for determining which variables are statistically significant? Fortunately, the answer to this question is no. Even though the yi are not from a normal distribution, we can use the central limit theorem from Appendix C to conclude that the OLS estimators satisfy asymptotic normality, which means they are approximately normally distributed in large enough sample sizes.
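Here is a simulation sketch of this claim, with a made-up, heavily skewed error distribution: the usual t statistic rejects a true null at close to its nominal rate once n is reasonably large, even though u is far from normal.

    # A simulation sketch of asymptotic normality (made-up DGP): even with
    # a heavily skewed error, the standardized OLS slope is approximately
    # standard normal once n is large.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(3)
    n, reps = 500, 5000
    tstats = []

    for _ in range(reps):
        x = rng.normal(size=n)
        u = rng.exponential(size=n) - 1.0      # skewed error, mean zero
        y = 2.0 + 1.0 * x + u
        X = np.column_stack([np.ones(n), x])
        b = np.linalg.lstsq(X, y, rcond=None)[0]
        resid = y - X @ b
        s2 = resid @ resid / (n - 2)
        se = np.sqrt(s2 * np.linalg.inv(X.T @ X)[1, 1])
        tstats.append((b[1] - 1.0) / se)

    # Compare tail frequencies with the normal benchmark of 5%:
    print(np.mean(np.abs(tstats) > stats.norm.ppf(0.975)))  # near 0.05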
[Figure 5.2: Histogram of *prate* using the data in 401K. Horizontal axis: participation rate, in percentage form (0 to 100); vertical axis: proportion in cell.]

Theorem 5.2 (Asymptotic Normality of OLS). Under the Gauss-Markov Assumptions MLR.1 through MLR.5:

(i) $\sqrt{n}(\hat{\beta}_j - \beta_j) \overset{a}{\sim} \text{Normal}(0, \sigma^2/a_j^2)$, where $\sigma^2/a_j^2 > 0$ is the asymptotic variance of $\sqrt{n}(\hat{\beta}_j - \beta_j)$; for the slope coefficients, $a_j^2 = \text{plim}\big(n^{-1}\sum_{i=1}^n \hat{r}_{ij}^2\big)$, where the $\hat{r}_{ij}$ are the residuals from regressing $x_j$ on the other independent variables. We say that $\hat{\beta}_j$ is asymptotically normally distributed (see Appendix C);

(ii) $\hat{\sigma}^2$ is a consistent estimator of $\sigma^2 = \text{Var}(u)$;

(iii) For each $j$,

$$(\hat{\beta}_j - \beta_j)/\text{sd}(\hat{\beta}_j) \overset{a}{\sim} \text{Normal}(0,1) \quad \text{and} \quad (\hat{\beta}_j - \beta_j)/\text{se}(\hat{\beta}_j) \overset{a}{\sim} \text{Normal}(0,1), \qquad (5.7)$$

where $\text{se}(\hat{\beta}_j)$ is the usual OLS standard error.

The proof of asymptotic normality is somewhat complicated and is sketched in the appendix for the simple regression case. Part (ii) follows from the law of large numbers, and part (iii) follows from parts (i) and (ii) and the asymptotic properties discussed in Appendix C.

Theorem 5.2 is useful because the normality Assumption MLR.6 has been dropped; the only restriction on the distribution of the error is that it has finite variance, something we will always assume. We have also assumed zero conditional mean (MLR.4) and homoskedasticity of $u$ (MLR.5).

In trying to understand the meaning of Theorem 5.2, it is important to keep separate the notions of the population distribution of the error term, $u$, and the sampling distributions of the $\hat{\beta}_j$ as the sample size grows. A common mistake is to think that something is happening to the distribution of $u$ (namely, that it is getting closer to normal) as the sample size grows. But remember that the population distribution is immutable and has nothing to do with the sample size. For example, we previously discussed *narr86*, the number of times a young man is arrested during the year 1986. The nature of this variable (it takes on small, nonnegative integer values) is fixed in the population. Whether we sample 10 men or 1,000 men from this population obviously has no effect on the population distribution.

What Theorem 5.2 says is that, regardless of the population distribution of $u$, the OLS estimators, when properly standardized, have approximate standard normal distributions. This approximation comes about by the central limit theorem because the OLS estimators involve, in a complicated way, the use of sample averages. Effectively, the sequence of distributions of averages of the underlying errors is approaching normality for virtually any population distribution.

Notice how the standardized $\hat{\beta}_j$ has an asymptotic standard normal distribution whether we divide the difference $\hat{\beta}_j - \beta_j$ by $\text{sd}(\hat{\beta}_j)$ (which we do not observe because it depends on $\sigma$) or by $\text{se}(\hat{\beta}_j)$ (which we can compute from our data because it depends on $\hat{\sigma}$). In other words, from an asymptotic point of view it does not matter that we have to replace $\sigma$ with $\hat{\sigma}$. Of course, replacing $\sigma$ with $\hat{\sigma}$ affects the exact distribution of the standardized $\hat{\beta}_j$. We just saw in Chapter 4 that, under the classical linear model assumptions, $(\hat{\beta}_j - \beta_j)/\text{sd}(\hat{\beta}_j)$ has an exact Normal(0,1) distribution and $(\hat{\beta}_j - \beta_j)/\text{se}(\hat{\beta}_j)$ has an exact $t_{n-k-1}$ distribution.

How should we use the result in equation (5.7)? One consequence might seem to be that, if we are going to appeal to large-sample analysis, we should now use the standard normal distribution for inference rather than the $t$ distribution. But from a practical perspective it is just as legitimate to write
$$(\hat{\beta}_j - \beta_j)/\text{se}(\hat{\beta}_j) \overset{a}{\sim} t_{n-k-1} = t_{df}, \qquad (5.8)$$

because $t_{df}$ approaches the Normal(0,1) distribution as $df$ gets large. Because we know, under the CLM assumptions, the $t_{n-k-1}$ distribution holds exactly, it makes sense to treat $(\hat{\beta}_j - \beta_j)/\text{se}(\hat{\beta}_j)$ as a $t_{n-k-1}$ random variable generally, even when MLR.6 does not hold.

Equation (5.8) tells us that $t$ testing and the construction of confidence intervals are carried out exactly as under the classical linear model assumptions. This means that our analysis of dependent variables like *prate* and *narr86* does not have to change at all if the Gauss-Markov assumptions hold: in both cases, we have at least 1,500 observations, which is certainly enough to justify the approximation of the central limit theorem.

If the sample size is not very large, then the $t$ distribution can be a poor approximation to the distribution of the $t$ statistics when $u$ is not normally distributed. Unfortunately, there are no general prescriptions on how big the sample size must be before the approximation is good enough. Some econometricians think that $n = 30$ is satisfactory, but this cannot be sufficient for all possible distributions of $u$. Depending on the distribution of $u$, more observations may be necessary before the central limit theorem delivers a useful approximation. Further, the quality of the approximation depends not just on $n$, but on the $df$, $n - k - 1$: with more independent variables in the model, a larger sample size is usually needed to use the $t$ approximation. Methods for inference with small degrees of freedom and nonnormal errors are outside the scope of this text. We will simply use the $t$ statistics, as we always have, without worrying about the normality assumption.

It is very important to see that Theorem 5.2 does require the homoskedasticity assumption (along with the zero conditional mean assumption). If $\text{Var}(y|\mathbf{x})$ is not constant, the usual $t$ statistics and confidence intervals are invalid no matter how large the sample size is; the central limit theorem does not bail us out when it comes to heteroskedasticity. For this reason, we devote all of Chapter 8 to discussing what can be done in the presence of heteroskedasticity.

One conclusion of Theorem 5.2 is that $\hat{\sigma}^2$ is a consistent estimator of $\sigma^2$; we already know from Theorem 3.3 that $\hat{\sigma}^2$ is unbiased for $\sigma^2$ under the Gauss-Markov assumptions. The consistency implies that $\hat{\sigma}$ is a consistent estimator of $\sigma$, which is important in establishing the asymptotic normality result in equation (5.7). Remember that $\hat{\sigma}$ appears in the standard error for each $\hat{\beta}_j$. In fact, the estimated variance of $\hat{\beta}_j$ is

$$\widehat{\text{Var}}(\hat{\beta}_j) = \frac{\hat{\sigma}^2}{\text{SST}_j(1 - R_j^2)}, \qquad (5.9)$$

where $\text{SST}_j$ is the total sum of squares of $x_j$ in the sample and $R_j^2$ is the R-squared from regressing $x_j$ on all of the other independent variables.
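The practical content of Theorem 5.2 and equation (5.8) can be checked by simulation. The sketch below (not from the text; the centered exponential error and all parameter values are illustrative assumptions) draws markedly skewed errors and shows that the $t$ ratio $(\hat{\beta}_1 - \beta_1)/\text{se}(\hat{\beta}_1)$ nonetheless behaves like a standard normal draw in large samples.

```python
import numpy as np

# Illustrative check of Theorem 5.2: with highly skewed (centered
# exponential) errors, the t ratio is approximately standard normal.
rng = np.random.default_rng(1)
b0, b1, n, reps = 1.0, 2.0, 1000, 5000
tstats = np.empty(reps)

for r in range(reps):
    x = rng.normal(size=n)
    u = rng.exponential(scale=1.0, size=n) - 1.0   # mean-zero, skewed error
    y = b0 + b1 * x + u
    xc = x - x.mean()
    b1_hat = (xc * y).sum() / (xc * xc).sum()
    resid = y - y.mean() - b1_hat * xc             # OLS residuals
    sigma2_hat = (resid ** 2).sum() / (n - 2)
    se_b1 = np.sqrt(sigma2_hat / (xc * xc).sum())
    tstats[r] = (b1_hat - b1) / se_b1

# Under asymptotic normality, about 5% should exceed 1.96 in absolute value.
print("mean:", tstats.mean().round(3), "sd:", tstats.std().round(3))
print("P(|t| > 1.96):", (np.abs(tstats) > 1.96).mean().round(3))
```

The rejection rate lands near .05 even though the errors are far from normal, which is exactly what equation (5.8) licenses in practice.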
In Section 3.4, we studied each component of (5.9), which we will now expound on in the context of asymptotic analysis. As the sample size grows, $\hat{\sigma}^2$ converges in probability to the constant $\sigma^2$. Further, $R_j^2$ approaches a number strictly between zero and unity, so that $1 - R_j^2$ converges to some number between zero and one. The sample variance of $x_j$ is $\text{SST}_j/n$, and so $\text{SST}_j/n$ converges to $\text{Var}(x_j)$ as the sample size grows. This means that $\text{SST}_j$ grows at approximately the same rate as the sample size: $\text{SST}_j \approx n\sigma_j^2$, where $\sigma_j^2$ is the population variance of $x_j$. When we combine these facts, we find that $\widehat{\text{Var}}(\hat{\beta}_j)$ shrinks to zero at the rate of $1/n$; this is why larger sample sizes are better.

When $u$ is not normally distributed, the square root of (5.9) is sometimes called the asymptotic standard error, and $t$ statistics are called asymptotic $t$ statistics. Because these are the same quantities we dealt with in Chapter 4, we will just call them standard errors and $t$ statistics, with the understanding that sometimes they have only large-sample justification. A similar comment holds for an asymptotic confidence interval constructed from the asymptotic standard error.

Exploring Further 5.2: In a regression model with a large sample size, what is an approximate 95% confidence interval for $\beta_j$ under MLR.1 through MLR.5? We call this an asymptotic confidence interval.

Using the preceding argument about the estimated variance, we can write

$$\text{se}(\hat{\beta}_j) \approx c_j/\sqrt{n}, \qquad (5.10)$$

where $c_j$ is a positive constant that does not depend on the sample size. In fact, the constant $c_j$ can be shown to be

$$c_j = \frac{\sigma}{\sigma_j\sqrt{1 - \rho_j^2}},$$

where $\sigma = \text{sd}(u)$, $\sigma_j = \text{sd}(x_j)$, and $\rho_j^2$ is the population R-squared from regressing $x_j$ on the other explanatory variables. Just like studying equation (5.9) to see which variables affect $\widehat{\text{Var}}(\hat{\beta}_j)$ under the Gauss-Markov assumptions, we can use this expression for $c_j$ to study the impact of a larger error standard deviation ($\sigma$), more population variation in $x_j$ ($\sigma_j$), and multicollinearity in the population ($\rho_j^2$). Equation (5.10) is only an approximation, but it is a useful rule of thumb: standard errors can be expected to shrink at a rate that is the inverse of the square root of the sample size.

Example 5.2 Standard Errors in a Birth Weight Equation

We use the data in BWGHT to estimate a relationship where log of birth weight is the dependent variable and cigarettes smoked per day (*cigs*) and log of family income are independent variables. The total number of observations is 1,388. Using the first half of the observations (694), the standard error for $\hat{\beta}_{cigs}$ is about .0013. The standard error using all of the observations is about .00086. The ratio of the latter standard error to the former is $.00086/.0013 \approx .662$. This is pretty close to $\sqrt{694/1388} \approx .707$, the ratio obtained from the approximation in (5.10). In other words, equation (5.10) implies that the standard error using the larger sample size should be about 70.7% of the standard error using the smaller sample. This percentage is pretty close to the 66.2% we actually compute from the ratio of the standard errors.
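A rough replication of the Example 5.2 calculation is sketched below. Because availability of the BWGHT file cannot be assumed here, the data are simulated (all coefficient values and distributions are made-up stand-ins, and statsmodels is assumed to be installed), so only the qualitative pattern, the $\sqrt{n}$ shrinkage of the standard error from equation (5.10), should carry over.

```python
import numpy as np
import statsmodels.api as sm

# Illustrative check of equation (5.10): se(b_j) ~ c_j / sqrt(n).
# Simulated data stand in for BWGHT; the se ratio should be near
# sqrt(694/1388), roughly .707.
rng = np.random.default_rng(2)
n = 1388
cigs = rng.poisson(2.0, size=n).astype(float)
lfaminc = rng.normal(3.0, 0.6, size=n)
lbwght = 4.7 - 0.004 * cigs + 0.02 * lfaminc + rng.normal(0, 0.19, size=n)

X = sm.add_constant(np.column_stack([cigs, lfaminc]))
se_half = sm.OLS(lbwght[: n // 2], X[: n // 2]).fit().bse[1]  # first 694 obs
se_full = sm.OLS(lbwght, X).fit().bse[1]                      # all 1,388 obs
print("se ratio:", round(se_full / se_half, 3),
      "vs sqrt(n1/n2):", round(np.sqrt((n // 2) / n), 3))
```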
The asymptotic normality of the OLS estimators also implies that the $F$ statistics have approximate $F$ distributions in large sample sizes. Thus, for testing exclusion restrictions or other multiple hypotheses, nothing changes from what we have done before.

5.2a Other Large Sample Tests: The Lagrange Multiplier Statistic

Once we enter the realm of asymptotic analysis, other test statistics can be used for hypothesis testing. For most purposes, there is little reason to go beyond the usual $t$ and $F$ statistics: as we just saw, these statistics have large sample justification without the normality assumption. Nevertheless, sometimes it is useful to have other ways to test multiple exclusion restrictions, and we now cover the Lagrange multiplier (LM) statistic, which has achieved some popularity in modern econometrics.

The name "Lagrange multiplier statistic" comes from constrained optimization, a topic beyond the scope of this text [see Davidson and MacKinnon (1993)]. The name "score statistic," which also comes from optimization using calculus, is used as well. Fortunately, in the linear regression framework, it is simple to motivate the LM statistic without delving into complicated mathematics.

The form of the LM statistic we derive here relies on the Gauss-Markov assumptions, the same assumptions that justify the $F$ statistic in large samples. We do not need the normality assumption.

To derive the LM statistic, consider the usual multiple regression model with $k$ independent variables:

$$y = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k + u. \qquad (5.11)$$

We would like to test whether, say, the last $q$ of these variables all have zero population parameters: the null hypothesis is

$$H_0\colon \beta_{k-q+1} = 0, \ldots, \beta_k = 0, \qquad (5.12)$$

which puts $q$ exclusion restrictions on the model (5.11). As with $F$ testing, the alternative to (5.12) is that at least one of the parameters is different from zero.

The LM statistic requires estimation of the restricted model only. Thus, assume that we have run the regression

$$y = \tilde{\beta}_0 + \tilde{\beta}_1 x_1 + \cdots + \tilde{\beta}_{k-q} x_{k-q} + \tilde{u}, \qquad (5.13)$$

where the tilde indicates that the estimates are from the restricted model. In particular, $\tilde{u}$ indicates the residuals from the restricted model. (As always, this is just shorthand to indicate that we obtain the restricted residual for each observation in the sample.)

If the omitted variables $x_{k-q+1}$ through $x_k$ truly have zero population coefficients, then, at least approximately, $\tilde{u}$ should be uncorrelated with each of these variables in the sample. This suggests running a regression of these residuals on those independent variables excluded under $H_0$, which is almost what the LM test does. However, it turns out that, to get a usable test statistic, we must include all of the independent variables in the regression. (We must include all regressors because, in general, the omitted regressors in the restricted model are correlated with the regressors that appear in the restricted model.) Thus, we run the regression of

$$\tilde{u} \text{ on } x_1, x_2, \ldots, x_k. \qquad (5.14)$$

This is an example of an auxiliary regression, a regression that is used to compute a test statistic but whose coefficients are not of direct interest.
How can we use the regression output from (5.14) to test (5.12)? If (5.12) is true, the R-squared from (5.14) should be "close" to zero, subject to sampling error, because $\tilde{u}$ will be approximately uncorrelated with all the independent variables. The question, as always with hypothesis testing, is how to determine when the statistic is large enough to reject the null hypothesis at a chosen significance level. It turns out that, under the null hypothesis, the sample size multiplied by the usual R-squared from the auxiliary regression (5.14) is distributed asymptotically as a chi-square random variable with $q$ degrees of freedom. This leads to a simple procedure for testing the joint significance of a set of $q$ independent variables.

The Lagrange Multiplier Statistic for q Exclusion Restrictions:

(i) Regress $y$ on the restricted set of independent variables and save the residuals, $\tilde{u}$.

(ii) Regress $\tilde{u}$ on all of the independent variables and obtain the R-squared, say, $R_u^2$ (to distinguish it from the R-squareds obtained with $y$ as the dependent variable).

(iii) Compute $LM = nR_u^2$ [the sample size times the R-squared obtained from step (ii)].

(iv) Compare $LM$ to the appropriate critical value, $c$, in a $\chi_q^2$ distribution; if $LM > c$, the null hypothesis is rejected. Even better, obtain the $p$-value as the probability that a $\chi_q^2$ random variable exceeds the value of the test statistic. If the $p$-value is less than the desired significance level, then $H_0$ is rejected. If not, we fail to reject $H_0$. The rejection rule is essentially the same as for $F$ testing.

Because of its form, the LM statistic is sometimes referred to as the n-R-squared statistic. Unlike with the $F$ statistic, the degrees of freedom in the unrestricted model plays no role in carrying out the LM test. All that matters is the number of restrictions being tested ($q$), the size of the auxiliary R-squared ($R_u^2$), and the sample size ($n$). The $df$ in the unrestricted model plays no role because of the asymptotic nature of the LM statistic. But we must be sure to multiply $R_u^2$ by the sample size to obtain $LM$; a seemingly low value of the R-squared can still lead to joint significance if $n$ is large. (A code sketch of this four-step procedure appears after the following word of caution.)

Before giving an example, a word of caution is in order. If in step (i) we mistakenly regress $y$ on all of the independent variables and obtain the residuals from this unrestricted regression to be used in step (ii), we do not get an interesting statistic: the resulting R-squared will be exactly zero! This is because OLS chooses the estimates so that the residuals are uncorrelated, in samples, with all included independent variables [see the equations in (3.13)]. Thus, we can only test (5.12) by regressing the restricted residuals on all of the independent variables. (Regressing the restricted residuals on the restricted set of independent variables will also produce $R^2 = 0$.)
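The following sketch walks through steps (i) through (iv) on simulated data (the data, variable names, and the use of statsmodels/scipy are assumptions for illustration, not part of the text).

```python
import numpy as np
import statsmodels.api as sm
from scipy import stats

# Illustrative LM test of H0: beta3 = beta4 = 0 (q = 2 restrictions),
# following steps (i)-(iv).  Data are simulated with H0 true.
rng = np.random.default_rng(3)
n = 2000
X = rng.normal(size=(n, 4))                                    # x1..x4
y = 1.0 + 0.5 * X[:, 0] - 0.3 * X[:, 1] + rng.normal(size=n)

# (i) restricted regression: y on x1, x2 only; save the residuals
u_tilde = sm.OLS(y, sm.add_constant(X[:, :2])).fit().resid
# (ii) auxiliary regression: residuals on ALL the regressors
r2_u = sm.OLS(u_tilde, sm.add_constant(X)).fit().rsquared
# (iii) LM = n * R^2_u
LM = n * r2_u
# (iv) compare with the chi-square(q) distribution
q = 2
print("LM =", round(LM, 3), " p-value =", round(stats.chi2.sf(LM, q), 3))
```

With the null imposed in the simulated data, the $p$-value is typically well above conventional significance levels, so the test correctly fails to reject.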
Example 5.3 Economic Model of Crime

We illustrate the LM test by using a slight extension of the crime model from Example 3.5:

$$narr86 = \beta_0 + \beta_1 pcnv + \beta_2 avgsen + \beta_3 tottime + \beta_4 ptime86 + \beta_5 qemp86 + u,$$

where *narr86* is the number of times a man was arrested, *pcnv* is the proportion of prior arrests leading to conviction, *avgsen* is average sentence served from past convictions, *tottime* is total time the man has spent in prison prior to 1986 since reaching the age of 18, *ptime86* is months spent in prison in 1986, and *qemp86* is the number of quarters in 1986 during which the man was legally employed.

We use the LM statistic to test the null hypothesis that *avgsen* and *tottime* have no effect on *narr86* once the other factors have been controlled for. In step (i), we estimate the restricted model by regressing *narr86* on *pcnv*, *ptime86*, and *qemp86*; the variables *avgsen* and *tottime* are excluded from this regression. We obtain the residuals $\tilde{u}$ from this regression, 2,725 of them. Next, we run the regression of

$$\tilde{u} \text{ on } pcnv, ptime86, qemp86, avgsen, \text{ and } tottime; \qquad (5.15)$$

as always, the order in which we list the independent variables is irrelevant. This second regression produces $R_u^2$, which turns out to be about .0015. This may seem small, but we must multiply it by $n$ to get the LM statistic: $LM = 2{,}725(.0015) \approx 4.09$. The 10% critical value in a chi-square distribution with two degrees of freedom is about 4.61 (rounded to two decimal places; see Table G.4). Thus, we fail to reject the null hypothesis that $\beta_{avgsen} = 0$ and $\beta_{tottime} = 0$ at the 10% level. The $p$-value is $P(\chi_2^2 > 4.09) \approx .129$, so we would reject $H_0$ at the 15% level.

As a comparison, the $F$ test for joint significance of *avgsen* and *tottime* yields a $p$-value of about .131, which is pretty close to that obtained using the LM statistic. This is not surprising since, asymptotically, the two statistics have the same probability of Type I error. (That is, they reject the null hypothesis with the same frequency when the null is true.)

As the previous example suggests, with a large sample, we rarely see important discrepancies between the outcomes of LM and $F$ tests. We will use the $F$ statistic for the most part because it is computed routinely by most regression packages. But you should be aware of the LM statistic as it is used in applied work.

One final comment on the LM statistic. As with the $F$ statistic, we must be sure to use the same observations in steps (i) and (ii). If data are missing for some of the independent variables that are excluded under the null hypothesis, the residuals from step (i) should be obtained from a regression on the reduced data set.
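The chi-square quantities quoted in Example 5.3 can be reproduced directly (a tiny sketch; scipy is assumed available, and the LM value 4.09 is taken from the example):

```python
from scipy import stats

# Chi-square(2) quantities from Example 5.3:
print(round(stats.chi2.ppf(0.90, 2), 2))   # 10% critical value: 4.61
print(round(stats.chi2.sf(4.09, 2), 3))    # P(chi2_2 > 4.09): about .129
```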
5.3 Asymptotic Efficiency of OLS

We know that, under the Gauss-Markov assumptions, the OLS estimators are best linear unbiased. OLS is also asymptotically efficient among a certain class of estimators under the Gauss-Markov assumptions. A general treatment requires matrix algebra and advanced asymptotic analysis. First, we describe the result in the simple regression case.

In the model

$$y = \beta_0 + \beta_1 x + u, \qquad (5.16)$$

$u$ has a zero conditional mean under MLR.4: $E(u|x) = 0$. This opens up a variety of consistent estimators for $\beta_0$ and $\beta_1$; as usual, we focus on the slope parameter, $\beta_1$. Let $g(x)$ be any function of $x$; for example, $g(x) = x^2$ or $g(x) = 1/(1 + |x|)$. Then $u$ is uncorrelated with $g(x)$ (see Property CE.5 in Appendix B). Let $z_i = g(x_i)$ for all observations $i$. Then the estimator

$$\tilde{\beta}_1 = \left(\sum_{i=1}^n (z_i - \bar{z})y_i\right)\bigg/\left(\sum_{i=1}^n (z_i - \bar{z})x_i\right) \qquad (5.17)$$

is consistent for $\beta_1$, provided $g(x)$ and $x$ are correlated. (Remember, it is possible that $g(x)$ and $x$ are uncorrelated because correlation measures linear dependence.) To see this, we can plug in $y_i = \beta_0 + \beta_1 x_i + u_i$ and write $\tilde{\beta}_1$ as

$$\tilde{\beta}_1 = \beta_1 + \left(n^{-1}\sum_{i=1}^n (z_i - \bar{z})u_i\right)\bigg/\left(n^{-1}\sum_{i=1}^n (z_i - \bar{z})x_i\right). \qquad (5.18)$$

Now, we can apply the law of large numbers to the numerator and denominator, which converge in probability to $\text{Cov}(z,u)$ and $\text{Cov}(z,x)$, respectively. Provided that $\text{Cov}(z,x) \neq 0$, so that $z$ and $x$ are correlated, we have

$$\text{plim}\,\tilde{\beta}_1 = \beta_1 + \text{Cov}(z,u)/\text{Cov}(z,x) = \beta_1,$$

because $\text{Cov}(z,u) = 0$ under MLR.4.

It is more difficult to show that $\tilde{\beta}_1$ is asymptotically normal. Nevertheless, using arguments similar to those in the appendix, it can be shown that $\sqrt{n}(\tilde{\beta}_1 - \beta_1)$ is asymptotically normal with mean zero and asymptotic variance $\sigma^2\text{Var}(z)/[\text{Cov}(z,x)]^2$. The asymptotic variance of the OLS estimator is obtained when $z = x$, in which case $\text{Cov}(z,x) = \text{Cov}(x,x) = \text{Var}(x)$. Therefore, the asymptotic variance of $\sqrt{n}(\hat{\beta}_1 - \beta_1)$, where $\hat{\beta}_1$ is the OLS estimator, is $\sigma^2\text{Var}(x)/[\text{Var}(x)]^2 = \sigma^2/\text{Var}(x)$. Now, the Cauchy-Schwartz inequality (see Appendix B.4) implies that $[\text{Cov}(z,x)]^2 \leq \text{Var}(z)\text{Var}(x)$, which implies that the asymptotic variance of $\sqrt{n}(\hat{\beta}_1 - \beta_1)$ is no larger than that of $\sqrt{n}(\tilde{\beta}_1 - \beta_1)$. We have shown in the simple regression case that, under the Gauss-Markov assumptions, the OLS estimator has a smaller asymptotic variance than any estimator of the form (5.17). [The estimator in (5.17) is an example of an instrumental variables estimator, which we will study extensively in Chapter 15.] If the homoskedasticity assumption fails, then there are estimators of the form (5.17) that have a smaller asymptotic variance than OLS. We will see this in Chapter 8.

The general case is similar but much more difficult mathematically. In the $k$ regressor case, the class of consistent estimators is obtained by generalizing the OLS first order conditions:

$$\sum_{i=1}^n g_j(\mathbf{x}_i)(y_i - \tilde{\beta}_0 - \tilde{\beta}_1 x_{i1} - \cdots - \tilde{\beta}_k x_{ik}) = 0, \quad j = 0, 1, \ldots, k, \qquad (5.19)$$

where $g_j(\mathbf{x}_i)$ denotes any function of all explanatory variables for observation $i$. As can be seen by comparing (5.19) with the OLS first order conditions in (3.13), we obtain the OLS estimators when $g_0(\mathbf{x}_i) = 1$ and $g_j(\mathbf{x}_i) = x_{ij}$ for $j = 1, 2, \ldots, k$. The class of estimators in (5.19) is infinite, because we can use any functions of the $x_{ij}$ that we want.

Theorem 5.3 (Asymptotic Efficiency of OLS). Under the Gauss-Markov assumptions, let $\tilde{\beta}_j$ denote estimators that solve equations of the form (5.19) and let $\hat{\beta}_j$ denote the OLS estimators. Then, for $j = 0, 1, 2, \ldots, k$, the OLS estimators have the smallest asymptotic variances: $\text{Avar}\,\sqrt{n}(\hat{\beta}_j - \beta_j) \leq \text{Avar}\,\sqrt{n}(\tilde{\beta}_j - \beta_j)$.

Proving consistency of the estimators in (5.19), let alone showing they are asymptotically normal, is mathematically difficult [see Wooldridge (2010, Chapter 5)]. The simulation sketch below illustrates the efficiency ranking in the simple regression case.
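This sketch compares the sampling variability of OLS with that of the alternative estimator (5.17) using $z = g(x) = x^2$ (all values and distributions are illustrative assumptions; $x$ is drawn asymmetrically so that $\text{Cov}(z,x) \neq 0$ and (5.17) is consistent):

```python
import numpy as np

# Illustrating Theorem 5.3: OLS slope versus the consistent
# alternative (5.17) with z = g(x) = x^2.
rng = np.random.default_rng(4)
b0, b1, n, reps = 1.0, 0.5, 500, 5000
ols_est, alt_est = np.empty(reps), np.empty(reps)

for r in range(reps):
    x = rng.exponential(size=n)              # skewed, so Cov(x^2, x) != 0
    y = b0 + b1 * x + rng.normal(size=n)
    z = x ** 2
    xc, zc = x - x.mean(), z - z.mean()
    ols_est[r] = (xc * y).sum() / (xc * x).sum()   # OLS (z = x)
    alt_est[r] = (zc * y).sum() / (zc * x).sum()   # estimator (5.17)

print("sampling sd, OLS:    ", ols_est.std().round(4))
print("sampling sd, z = x^2:", alt_est.std().round(4))  # noticeably larger
```

Both estimators center on $\beta_1 = 0.5$, but the OLS slope shows the smaller spread across replications, matching the Cauchy-Schwartz argument above.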
Summary

The claims underlying the material in this chapter are fairly technical, but their practical implications are straightforward. We have shown that the first four Gauss-Markov assumptions imply that OLS is consistent. Furthermore, all of the methods of testing and constructing confidence intervals that we learned in Chapter 4 are approximately valid without assuming that the errors are drawn from a normal distribution (equivalently, without assuming that the distribution of $y$ given the explanatory variables is normal). This means that we can apply OLS and use previous methods for an array of applications where the dependent variable is not even approximately normally distributed. We also showed that the LM statistic can be used instead of the $F$ statistic for testing exclusion restrictions.

Before leaving this chapter, we should note that examples such as Example 5.3 may very well have problems that do require special attention. For a variable such as *narr86*, which is zero or one for most men in the population, a linear model may not be able to adequately capture the functional relationship between *narr86* and the explanatory variables. Moreover, even if a linear model does describe the expected value of arrests, heteroskedasticity might be a problem. Problems such as these are not mitigated as the sample size grows, and we will return to them in later chapters.

Key Terms

Asymptotic Bias; Asymptotic Confidence Interval; Asymptotic Normality; Asymptotic Properties; Asymptotic Standard Error; Asymptotic t Statistics; Asymptotic Variance; Asymptotically Efficient; Auxiliary Regression; Consistency; Inconsistency; Lagrange Multiplier (LM) Statistic; Large Sample Properties; n-R-Squared Statistic; Score Statistic

Problems

1. In the simple regression model under MLR.1 through MLR.4, we argued that the slope estimator, $\hat{\beta}_1$, is consistent for $\beta_1$. Using $\hat{\beta}_0 = \bar{y} - \hat{\beta}_1\bar{x}_1$, show that $\text{plim}\,\hat{\beta}_0 = \beta_0$. [You need to use the consistency of $\hat{\beta}_1$ and the law of large numbers, along with the fact that $\beta_0 = E(y) - \beta_1 E(x_1)$.]

2. Suppose that the model

$$pctstck = \beta_0 + \beta_1 funds + \beta_2 risktol + u$$

satisfies the first four Gauss-Markov assumptions, where *pctstck* is the percentage of a worker's pension invested in the stock market, *funds* is the number of mutual funds that the worker can choose from, and *risktol* is some measure of risk tolerance (larger *risktol* means the person has a higher tolerance for risk). If *funds* and *risktol* are positively correlated, what is the inconsistency in $\tilde{\beta}_1$, the slope coefficient in the simple regression of *pctstck* on *funds*?

3. The data set SMOKE contains information on smoking behavior and other variables for a random sample of single adults from the United States. The variable *cigs* is the (average) number of cigarettes smoked per day. Do you think *cigs* has a normal distribution in the U.S. adult population? Explain.

4. In the simple regression model (5.16), under the first four Gauss-Markov assumptions, we showed that estimators of the form (5.17) are consistent for the slope, $\beta_1$. Given such an estimator, define an estimator of $\beta_0$ by $\tilde{\beta}_0 = \bar{y} - \tilde{\beta}_1\bar{x}$. Show that $\text{plim}\,\tilde{\beta}_0 = \beta_0$.

5. The following histogram was created using the variable *score* in the data file ECONMATH. Thirty bins were used to create the histogram, and the height of each cell is the proportion of observations falling within the corresponding interval. The best-fitting normal distribution, that is, using the sample mean and sample standard deviation, has been superimposed on the histogram.

[Histogram of *score*; horizontal axis: course score, in percentage form (20 to 100); vertical axis: proportion in cell; best-fitting normal density overlaid.]
(i) If you use the normal distribution to estimate the probability that *score* exceeds 100, would the answer be zero? Why does your answer contradict the assumption of a normal distribution for *score*?

(ii) Explain what is happening in the left tail of the histogram. Does the normal distribution fit well in the left tail?

Computer Exercises

C1 Use the data in WAGE1 for this exercise.

(i) Estimate the equation

$$wage = \beta_0 + \beta_1 educ + \beta_2 exper + \beta_3 tenure + u.$$

Save the residuals and plot a histogram.

(ii) Repeat part (i), but with log(*wage*) as the dependent variable.

(iii) Would you say that Assumption MLR.6 is closer to being satisfied for the level-level model or the log-level model?

C2 Use the data in GPA2 for this exercise.

(i) Using all 4,137 observations, estimate the equation

$$colgpa = \beta_0 + \beta_1 hsperc + \beta_2 sat + u$$

and report the results in standard form.

(ii) Reestimate the equation in part (i), using the first 2,070 observations.

(iii) Find the ratio of the standard errors on *hsperc* from parts (i) and (ii). Compare this with the result from (5.10).

C3 In equation (4.42) of Chapter 4, using the data set BWGHT, compute the LM statistic for testing whether *motheduc* and *fatheduc* are jointly significant. In obtaining the residuals for the restricted model, be sure that the restricted model is estimated using only those observations for which all variables in the unrestricted model are available (see Example 4.9).

C4 Several statistics are commonly used to detect nonnormality in underlying population distributions. Here we will study one that measures the amount of skewness in a distribution. Recall that any normally distributed random variable is symmetric about its mean; therefore, if we standardize a symmetrically distributed random variable, say $z = (y - \mu_y)/\sigma_y$, where $\mu_y = E(y)$ and $\sigma_y = \text{sd}(y)$, then $z$ has mean zero, variance one, and $E(z^3) = 0$. Given a sample of data $\{y_i : i = 1, \ldots, n\}$, we can standardize $y_i$ in the sample by using $z_i = (y_i - \hat{\mu}_y)/\hat{\sigma}_y$, where $\hat{\mu}_y$ is the sample mean and $\hat{\sigma}_y$ is the sample standard deviation. (We ignore the fact that these are estimates based on the sample.) A sample statistic that measures skewness is $n^{-1}\sum_{i=1}^n z_i^3$, or where $n$ is replaced with $n - 1$ as a degrees-of-freedom adjustment. If $y$ has a normal distribution in the population, the skewness measure in the sample for the standardized values should not differ significantly from zero. (A computational sketch of this measure appears after this exercise.)

(i) First use the data set 401KSUBS, keeping only observations with *fsize* = 1. Find the skewness measure for *inc*. Do the same for log(*inc*). Which variable has more skewness and therefore seems less likely to be normally distributed?

(ii) Next use BWGHT2. Find the skewness measures for *bwght* and log(*bwght*). What do you conclude?

(iii) Evaluate the following statement: "The logarithmic transformation always makes a positive variable look more normally distributed."

(iv) If we are interested in the normality assumption in the context of regression, should we be evaluating the unconditional distributions of $y$ and log($y$)? Explain.
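The skewness statistic defined in C4 is a one-liner to compute. A minimal helper is sketched below; the lognormal "income" variable is a made-up illustration, not one of the exercise's data sets, so it only previews the qualitative pattern the exercise asks about.

```python
import numpy as np

def skewness(y):
    """Sample skewness: the average cubed z-score, as described in C4."""
    z = (y - y.mean()) / y.std()
    return (z ** 3).mean()

# Illustration with simulated data: a right-skewed positive variable
# has large positive skewness; its log is much closer to zero.
rng = np.random.default_rng(5)
inc = np.exp(rng.normal(10.0, 0.7, size=2000))   # lognormal stand-in
print(round(skewness(inc), 2), round(skewness(np.log(inc)), 2))
```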
C5 Consider the analysis in Computer Exercise C11 in Chapter 4, using the data in HTV, where *educ* is the dependent variable in a regression.

(i) How many different values are taken on by *educ* in the sample? Does *educ* have a continuous distribution?

(ii) Plot a histogram of *educ* with a normal distribution overlay. Does the distribution of *educ* appear anything close to normal?

(iii) Which of the CLM assumptions seems clearly violated in the model

$$educ = \beta_0 + \beta_1 motheduc + \beta_2 fatheduc + \beta_3 abil + \beta_4 abil^2 + u?$$

How does this violation change the statistical inference procedures carried out in Computer Exercise C11 in Chapter 4?

C6 Use the data in ECONMATH to answer this question.

(i) Logically, what are the smallest and largest values that can be taken on by the variable *score*? What are the smallest and largest values in the sample?

(ii) Consider the linear model

$$score = \beta_0 + \beta_1 colgpa + \beta_2 actmth + \beta_3 acteng + u.$$

Why cannot Assumption MLR.6 hold for the error term $u$? What consequences does this have for using the usual $t$ statistic to test $H_0\colon \beta_3 = 0$?

(iii) Estimate the model from part (ii) and obtain the $t$ statistic and associated $p$-value for testing $H_0\colon \beta_3 = 0$. How would you defend your findings to someone who makes the following statement: "You cannot trust that $p$-value because clearly the error term in the equation cannot have a normal distribution."

Appendix 5A

Asymptotic Normality of OLS

We sketch a proof of the asymptotic normality of OLS [Theorem 5.2(i)] in the simple regression case. Write the simple regression model as in equation (5.16). Then, by the usual algebra of simple regression, we can write

$$\sqrt{n}(\hat{\beta}_1 - \beta_1) = (1/s_x^2)\left[n^{-1/2}\sum_{i=1}^n (x_i - \bar{x})u_i\right],$$

where we use $s_x^2$ to denote the sample variance of $\{x_i : i = 1, 2, \ldots, n\}$. By the law of large numbers (see Appendix C), $s_x^2 \overset{p}{\to} \sigma_x^2 = \text{Var}(x)$. Assumption MLR.3 rules out perfect collinearity, which means that $\text{Var}(x) > 0$ ($x_i$ varies in the sample, and therefore $x$ is not constant in the population). Next,

$$n^{-1/2}\sum_{i=1}^n (x_i - \bar{x})u_i = n^{-1/2}\sum_{i=1}^n (x_i - \mu)u_i + (\mu - \bar{x})\left[n^{-1/2}\sum_{i=1}^n u_i\right],$$

where $\mu = E(x)$ is the population mean of $x$. Now $\{u_i\}$ is a sequence of i.i.d. random variables with mean zero and variance $\sigma^2$, and so $n^{-1/2}\sum_{i=1}^n u_i$ converges to the Normal(0, $\sigma^2$) distribution as $n \to \infty$; this is just the central limit theorem from Appendix C. By the law of large numbers, $\text{plim}(\mu - \bar{x}) = 0$. A standard result in asymptotic theory is that if $\text{plim}(w_n) = 0$ and $z_n$ has an asymptotic normal distribution, then $\text{plim}(w_n z_n) = 0$ [see Wooldridge (2010, Chapter 3) for more discussion]. This implies that $(\mu - \bar{x})\left[n^{-1/2}\sum_{i=1}^n u_i\right]$ has zero plim. Next, $\{(x_i - \mu)u_i : i = 1, 2, \ldots\}$ is a sequence of i.i.d. random variables with mean zero, because $u$ and $x$ are uncorrelated under Assumption MLR.4, and variance $\sigma^2\sigma_x^2$ by the homoskedasticity Assumption MLR.5. Therefore, $n^{-1/2}\sum_{i=1}^n (x_i - \mu)u_i$ has an asymptotic Normal(0, $\sigma^2\sigma_x^2$) distribution. We just showed that the difference between $n^{-1/2}\sum_{i=1}^n (x_i - \bar{x})u_i$ and $n^{-1/2}\sum_{i=1}^n (x_i - \mu)u_i$ has zero plim. A result in asymptotic theory is that if $z_n$ has an asymptotic normal distribution and $\text{plim}(v_n - z_n) = 0$, then $v_n$ has the same asymptotic normal distribution as $z_n$. It follows that $n^{-1/2}\sum_{i=1}^n (x_i - \bar{x})u_i$ also has an asymptotic Normal(0, $\sigma^2\sigma_x^2$) distribution.
Putting all of the pieces together gives

$$\sqrt{n}(\hat{\beta}_1 - \beta_1) = (1/\sigma_x^2)\left[n^{-1/2}\sum_{i=1}^n (x_i - \bar{x})u_i\right] + \left[(1/s_x^2) - (1/\sigma_x^2)\right]\left[n^{-1/2}\sum_{i=1}^n (x_i - \bar{x})u_i\right],$$

and since $\text{plim}(1/s_x^2) = 1/\sigma_x^2$, the second term has zero plim. Therefore, the asymptotic distribution of $\sqrt{n}(\hat{\beta}_1 - \beta_1)$ is Normal$\big(0, \{\sigma^2\sigma_x^2\}/\{\sigma_x^2\}^2\big)$ = Normal$(0, \sigma^2/\sigma_x^2)$. This completes the proof in the simple regression case, as $a_1^2 = \sigma_x^2$ in this case. See Wooldridge (2010, Chapter 4) for the general case.

Chapter 6

Multiple Regression Analysis: Further Issues

This chapter brings together several issues in multiple regression analysis that we could not conveniently cover in earlier chapters. These topics are not as fundamental as the material in Chapters 3 and 4, but they are important for applying multiple regression to a broad range of empirical problems.

6.1 Effects of Data Scaling on OLS Statistics

In Chapter 2 on bivariate regression, we briefly discussed the effects of changing the units of measurement on the OLS intercept and slope estimates. We also showed that changing the units of measurement did not affect R-squared. We now return to the issue of data scaling and examine the effects of rescaling the dependent or independent variables on standard errors, $t$ statistics, $F$ statistics, and confidence intervals.

We will discover that everything we expect to happen does happen. When variables are rescaled, the coefficients, standard errors, confidence intervals, $t$ statistics, and $F$ statistics change in ways that preserve all measured effects and testing outcomes. Although this is no great surprise (in fact, we would be very worried if it were not the case), it is useful to see what occurs explicitly. Often, data scaling is used for cosmetic purposes, such as to reduce the number of zeros after a decimal point in an estimated coefficient. By judiciously choosing units of measurement, we can improve the appearance of an estimated equation while changing nothing that is essential.

We could treat this problem in a general way, but it is much better illustrated with examples. Likewise, there is little value here in introducing an abstract notation.

We begin with an equation relating infant birth weight to cigarette smoking and family income:

$$\widehat{bwght} = \hat{\beta}_0 + \hat{\beta}_1 cigs + \hat{\beta}_2 faminc, \qquad (6.1)$$

where *bwght* is child birth weight, in ounces; *cigs* is the number of cigarettes smoked by the mother while pregnant, per day; and *faminc* is annual family income, in thousands of dollars.
The estimates of this equation, obtained using the data in BWGHT, are given in the first column of Table 6.1. Standard errors are listed in parentheses. The estimate on *cigs* says that if a woman smoked five more cigarettes per day, birth weight is predicted to be about $.4634(5) = 2.317$ ounces less. The $t$ statistic on *cigs* is $-5.06$, so the variable is very statistically significant.

Now, suppose that we decide to measure birth weight in pounds, rather than in ounces. Let $bwghtlbs = bwght/16$ be birth weight in pounds. What happens to our OLS statistics if we use this as the dependent variable in our equation? It is easy to find the effect on the coefficient estimates by simple manipulation of equation (6.1). Divide this entire equation by 16:

$$\widehat{bwght}/16 = \hat{\beta}_0/16 + (\hat{\beta}_1/16)cigs + (\hat{\beta}_2/16)faminc.$$

Since the left-hand side is birth weight in pounds, it follows that each new coefficient will be the corresponding old coefficient divided by 16. To verify this, the regression of *bwghtlbs* on *cigs* and *faminc* is reported in column (2) of Table 6.1. Up to the reported digits (and any digits beyond), the intercept and slopes in column (2) are just those in column (1) divided by 16. For example, the coefficient on *cigs* is now $-.0289$; this means that if *cigs* were higher by five, birth weight would be $.0289(5) = .1445$ pounds lower. In terms of ounces, we have $.1445(16) = 2.312$, which is slightly different from the 2.317 we obtained earlier due to rounding error. The point is, once the effects are transformed into the same units, we get exactly the same answer, regardless of how the dependent variable is measured.

What about statistical significance? As we expect, changing the dependent variable from ounces to pounds has no effect on how statistically important the independent variables are. The standard errors in column (2) are 16 times smaller than those in column (1). A few quick calculations show that the $t$ statistics in column (2) are indeed identical to the $t$ statistics in column (1).

Table 6.1 Effects of Data Scaling (standard errors in parentheses)

Dependent Variable:   (1) bwght          (2) bwghtlbs       (3) bwght
cigs                  -.4634  (.0916)    -.0289  (.0057)        --
packs                     --                 --              -9.268  (1.832)
faminc                 .0927  (.0292)     .0058  (.0018)      .0927  (.0292)
intercept            116.974  (1.049)    7.3109  (.0656)    116.974  (1.049)
Observations           1,388              1,388               1,388
R-Squared               .0298              .0298               .0298
SSR                  557,485.51          2,177.6778          557,485.51
SER                    20.063              1.2539              20.063

The endpoints for the confidence intervals in column (2) are just the endpoints in column (1) divided by 16. (This is because the CIs change by the same factor as the standard errors.) [Remember that the 95% CI here is $\hat{\beta}_j \pm 1.96\,\text{se}(\hat{\beta}_j)$.]

In terms of goodness-of-fit, the R-squareds from the two regressions are identical, as should be the case. Notice that the sum of squared residuals, SSR, and the standard error of the regression, SER, do differ across equations. These differences are easily explained. Let $\hat{u}_i$ denote the residual for observation $i$ in the original equation (6.1). Then the residual when *bwghtlbs* is the dependent variable is simply $\hat{u}_i/16$. Thus, the squared residual in the second equation is $(\hat{u}_i/16)^2 = \hat{u}_i^2/256$. This is why the sum of squared residuals in column (2) is equal to the SSR in column (1) divided by 256.

Since $\text{SER} = \hat{\sigma} = \sqrt{\text{SSR}/(n - k - 1)} = \sqrt{\text{SSR}/1385}$, the SER in column (2) is 16 times smaller than that in column (1). Another way to think about this is that the error in the equation with *bwghtlbs* as the dependent variable has a standard deviation 16 times smaller than the standard deviation of the original error. This does not mean that we have reduced the error by changing how birth weight is measured; the smaller SER simply reflects a difference in units of measurement. (The code sketch below verifies these invariance claims on simulated data.)
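The following minimal sketch confirms the rescaling algebra numerically. The data are simulated stand-ins for BWGHT (the coefficient values and distributions are made up, and statsmodels is assumed available), so the exact numbers differ from Table 6.1; the invariance pattern is what carries over.

```python
import numpy as np
import statsmodels.api as sm

# Rescaling check: dividing y by 16 divides each coefficient and
# standard error by 16, leaving t statistics and R-squared unchanged.
rng = np.random.default_rng(6)
n = 1388
cigs = rng.poisson(2.0, size=n).astype(float)
faminc = rng.gamma(2.0, 15.0, size=n)
bwght = 117 - 0.46 * cigs + 0.09 * faminc + rng.normal(0, 20, size=n)

X = sm.add_constant(np.column_stack([cigs, faminc]))
oz = sm.OLS(bwght, X).fit()        # ounces
lb = sm.OLS(bwght / 16, X).fit()   # pounds

print(np.allclose(oz.params / 16, lb.params))   # True: coefficients scale
print(np.allclose(oz.tvalues, lb.tvalues))      # True: t stats unchanged
print(np.allclose(oz.rsquared, lb.rsquared))    # True: R-squared unchanged
```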
Next, let us return the dependent variable to its original units: *bwght* is measured in ounces. Instead, let us change the unit of measurement of one of the independent variables, *cigs*. Define *packs* to be the number of packs of cigarettes smoked per day. Thus, $packs = cigs/20$. What happens to the coefficients and other OLS statistics now? Well, we can write

$$\widehat{bwght} = \hat{\beta}_0 + (20\hat{\beta}_1)(cigs/20) + \hat{\beta}_2 faminc = \hat{\beta}_0 + (20\hat{\beta}_1)packs + \hat{\beta}_2 faminc.$$

Thus, the intercept and slope coefficient on *faminc* are unchanged, but the coefficient on *packs* is 20 times that on *cigs*. This is intuitively appealing. The results from the regression of *bwght* on *packs* and *faminc* are in column (3) of Table 6.1. (Incidentally, remember that it would make no sense to include both *cigs* and *packs* in the same equation; this would induce perfect multicollinearity and would have no interesting meaning.)

Other than the coefficient on *packs*, there is one other statistic in column (3) that differs from that in column (1): the standard error on *packs* is 20 times larger than that on *cigs* in column (1). This means that the $t$ statistic for testing the significance of cigarette smoking is the same whether we measure smoking in terms of cigarettes or packs. This is only natural.

The previous example spells out most of the possibilities that arise when the dependent and independent variables are rescaled. Rescaling is often done with dollar amounts in economics, especially when the dollar amounts are very large.

In Chapter 2, we argued that if the dependent variable appears in logarithmic form, changing the unit of measurement does not affect the slope coefficient. The same is true here: changing the unit of measurement of the dependent variable, when it appears in logarithmic form, does not affect any of the slope estimates. This follows from the simple fact that $\log(c_1 y_i) = \log(c_1) + \log(y_i)$ for any constant $c_1 > 0$. The new intercept will be $\log(c_1) + \hat{\beta}_0$. Similarly, changing the unit of measurement of any $x_j$, where $\log(x_j)$ appears in the regression, only affects the intercept. This corresponds to what we know about percentage changes and, in particular, elasticities: they are invariant to the units of measurement of either $y$ or the $x_j$. For example, if we had specified the dependent variable in (6.1) to be log(*bwght*), estimated the equation, and then reestimated it with log(*bwghtlbs*) as the dependent variable, the coefficients on *cigs* and *faminc* would be the same in both regressions; only the intercept would be different.

Exploring Further 6.1: In the original birth weight equation (6.1), suppose that *faminc* is measured in dollars rather than in thousands of dollars. Thus, define the variable $fincdol = 1000 \cdot faminc$. How will the OLS statistics change when *fincdol* is substituted for *faminc*? For the purpose of presenting the regression results, do you think it is better to measure income in dollars or in thousands of dollars?
6.1a Beta Coefficients

Sometimes, in econometric applications, a key variable is measured on a scale that is difficult to interpret. Labor economists often include test scores in wage equations, and the scale on which these tests are scored is often arbitrary and not easy to interpret (at least for economists!). In almost all cases, we are interested in how a particular individual's score compares with the population. Thus, instead of asking about the effect on hourly wage if, say, a test score is 10 points higher, it makes more sense to ask what happens when the test score is one standard deviation higher.

Nothing prevents us from seeing what happens to the dependent variable when an independent variable in an estimated model increases by a certain number of standard deviations, assuming that we have obtained the sample standard deviation of the independent variable (which is easy in most regression packages). This is often a good idea. So, for example, when we look at the effect of a standardized test score, such as the SAT score, on college GPA, we can find the standard deviation of SAT and see what happens when the SAT score increases by one or two standard deviations.

Sometimes, it is useful to obtain regression results when all variables involved, the dependent as well as all the independent variables, have been standardized. A variable is standardized in the sample by subtracting off its mean and dividing by its standard deviation (see Appendix C). This means that we compute the z-score for every variable in the sample. Then, we run a regression using the z-scores.

Why is standardization useful? It is easiest to start with the original OLS equation, with the variables in their original forms:

$$y_i = \hat{\beta}_0 + \hat{\beta}_1 x_{i1} + \hat{\beta}_2 x_{i2} + \cdots + \hat{\beta}_k x_{ik} + \hat{u}_i. \qquad (6.2)$$

We have included the observation subscript $i$ to emphasize that our standardization is applied to all sample values. Now, if we average (6.2), use the fact that the $\hat{u}_i$ have a zero sample average, and subtract the result from (6.2), we get

$$y_i - \bar{y} = \hat{\beta}_1(x_{i1} - \bar{x}_1) + \hat{\beta}_2(x_{i2} - \bar{x}_2) + \cdots + \hat{\beta}_k(x_{ik} - \bar{x}_k) + \hat{u}_i.$$

Now, let $\hat{\sigma}_y$ be the sample standard deviation for the dependent variable, let $\hat{\sigma}_1$ be the sample sd for $x_1$, let $\hat{\sigma}_2$ be the sample sd for $x_2$, and so on. Then, simple algebra gives the equation

$$(y_i - \bar{y})/\hat{\sigma}_y = (\hat{\sigma}_1/\hat{\sigma}_y)\hat{\beta}_1\left[(x_{i1} - \bar{x}_1)/\hat{\sigma}_1\right] + \cdots + (\hat{\sigma}_k/\hat{\sigma}_y)\hat{\beta}_k\left[(x_{ik} - \bar{x}_k)/\hat{\sigma}_k\right] + (\hat{u}_i/\hat{\sigma}_y). \qquad (6.3)$$

Each variable in (6.3) has been standardized by replacing it with its z-score, and this has resulted in new slope coefficients. For example, the slope coefficient on $(x_{i1} - \bar{x}_1)/\hat{\sigma}_1$ is $(\hat{\sigma}_1/\hat{\sigma}_y)\hat{\beta}_1$. This is simply the original coefficient, $\hat{\beta}_1$, multiplied by the ratio of the standard deviation of $x_1$ to the standard deviation of $y$. The intercept has dropped out altogether.

It is useful to rewrite (6.3), dropping the $i$ subscript, as

$$z_y = \hat{b}_1 z_1 + \hat{b}_2 z_2 + \cdots + \hat{b}_k z_k + \text{error}, \qquad (6.4)$$

where $z_y$ denotes the z-score of $y$, $z_1$ is the z-score of $x_1$, and so on. The new coefficients are

$$\hat{b}_j = (\hat{\sigma}_j/\hat{\sigma}_y)\hat{\beta}_j \quad \text{for } j = 1, \ldots, k. \qquad (6.5)$$

These $\hat{b}_j$ are traditionally called standardized coefficients or beta coefficients. (The latter name is more common, which is unfortunate because we have been using beta hat to denote the usual OLS estimates.)
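The equivalence between the z-score regression (6.4) and the rescaling formula (6.5) is easy to verify numerically. The sketch below uses simulated data (all values are illustrative; statsmodels is assumed available):

```python
import numpy as np
import statsmodels.api as sm

# Two equivalent routes to beta coefficients:
# (a) regress the z-score of y on the z-scores of the x's (the
#     intercept is zero, so none is needed), or
# (b) scale the usual OLS slopes by sd(x_j)/sd(y), as in (6.5).
rng = np.random.default_rng(7)
n = 500
X = rng.normal(size=(n, 3)) * np.array([1.0, 5.0, 0.2])  # mixed scales
y = 2.0 + X @ np.array([0.8, 0.1, 3.0]) + rng.normal(size=n)

zX = (X - X.mean(axis=0)) / X.std(axis=0)
zy = (y - y.mean()) / y.std()
betas_a = sm.OLS(zy, zX).fit().params                     # route (a)

slopes = sm.OLS(y, sm.add_constant(X)).fit().params[1:]
betas_b = slopes * X.std(axis=0) / y.std()                # route (b)
print(np.allclose(betas_a, betas_b))                      # True
```

Note how the raw slopes (0.8, 0.1, 3.0) are not comparable across the differently scaled regressors, while the beta coefficients are.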
Beta coefficients receive their interesting meaning from equation (6.4): if $x_1$ increases by one standard deviation, then $\hat{y}$ changes by $\hat{b}_1$ standard deviations. Thus, we are measuring effects not in terms of the original units of $y$ or the $x_j$, but in standard deviation units. Because it makes the scale of the regressors irrelevant, this equation puts the explanatory variables on equal footing. In a standard OLS equation, it is not possible to simply look at the size of different coefficients and conclude that the explanatory variable with the largest coefficient is "the most important." We just saw that the magnitudes of coefficients can be changed at will by changing the units of measurement of the $x_j$. But, when each $x_j$ has been standardized, comparing the magnitudes of the resulting beta coefficients is more compelling. When the regression equation has only a single explanatory variable, $x_1$, its standardized coefficient is simply the sample correlation coefficient between $y$ and $x_1$, which means it must lie in the range $-1$ to 1.

Even in situations where the coefficients are easily interpretable, say, the dependent variable and independent variables of interest are in logarithmic form, so the OLS coefficients of interest are estimated elasticities, there is still room for computing beta coefficients. Although elasticities are free of units of measurement, a change in a particular explanatory variable by, say, 10% may represent a larger or smaller change over a variable's range than changing another explanatory variable by 10%. For example, in a state with wide income variation but relatively little variation in spending per student, it might not make much sense to compare performance elasticities with respect to income and spending. Comparing beta coefficient magnitudes can be helpful.

To obtain the beta coefficients, we can always standardize $y, x_1, \ldots, x_k$ and then run the OLS regression of the z-score of $y$ on the z-scores of $x_1, \ldots, x_k$, where it is not necessary to include an intercept, as it will be zero. This can be tedious with many independent variables. Many regression packages provide beta coefficients via a simple command. The following example illustrates the use of beta coefficients.

Example 6.1 Effects of Pollution on Housing Prices

We use the data from Example 4.5 (in the file HPRICE2) to illustrate the use of beta coefficients. Recall that the key independent variable is *nox*, a measure of the nitrogen oxide in the air over each community. One way to understand the size of the pollution effect, without getting into the science underlying nitrogen oxide's effect on air quality, is to compute beta coefficients. (An alternative approach is contained in Example 4.5: we obtained a price elasticity with respect to *nox* by using *price* and *nox* in logarithmic form.)

The population equation is the level-level model

$$price = \beta_0 + \beta_1 nox + \beta_2 crime + \beta_3 rooms + \beta_4 dist + \beta_5 stratio + u,$$

where all the variables except *crime* were defined in Example 4.5; *crime* is the number of reported crimes per capita.

The beta coefficients are reported in the following equation (so each variable has been converted to its z-score):

$$\widehat{zprice} = -.340\,znox - .143\,zcrime + .514\,zrooms - .235\,zdist - .270\,zstratio.$$

This equation shows that a one standard deviation increase in *nox* decreases *price* by .34 standard deviation; a one standard deviation increase in *crime* reduces *price* by .14 standard deviation. Thus, the same relative movement of pollution in the population has a larger effect on housing prices than
crime does. Size of the house, as measured by number of rooms (*rooms*), has the largest standardized effect. If we want to know the effects of each independent variable on the dollar value of median house price, we should use the unstandardized variables. Whether we use standardized or unstandardized variables does not affect statistical significance: the $t$ statistics are the same in both cases.

6.2 More on Functional Form

In several previous examples, we have encountered the most popular device in econometrics for allowing nonlinear relationships between the explained and explanatory variables: using logarithms for the dependent or independent variables. We have also seen models containing quadratics in some explanatory variables, but we have yet to provide a systematic treatment of them. In this section, we cover some variations and extensions on functional forms that often arise in applied work.

6.2a More on Using Logarithmic Functional Forms

We begin by reviewing how to interpret the parameters in the model

$$\log(price) = \beta_0 + \beta_1\log(nox) + \beta_2 rooms + u, \qquad (6.6)$$

where these variables are taken from Example 4.5. Recall that, throughout the text, log($x$) is the natural log of $x$. The coefficient $\beta_1$ is the elasticity of *price* with respect to *nox* (pollution). The coefficient $\beta_2$ is the change in log(*price*) when $\Delta rooms = 1$; as we have seen many times, when multiplied by 100, this is the approximate percentage change in *price*. Recall that $100 \cdot \beta_2$ is sometimes called the semi-elasticity of *price* with respect to *rooms*.

When estimated using the data in HPRICE2, we obtain

$$\widehat{\log(price)} = 9.23 - .718\,\log(nox) + .306\,rooms \qquad (6.7)$$
$$\qquad\quad (.19) \quad (.066) \qquad\ (.019)$$
$$n = 506,\quad R^2 = .514.$$

Thus, when *nox* increases by 1%, *price* falls by .718%, holding only *rooms* fixed. When *rooms* increases by one, *price* increases by approximately $100(.306) = 30.6\%$.

The estimate that one more room increases *price* by about 30.6% turns out to be somewhat inaccurate for this application. The approximation error occurs because, as the change in log($y$) becomes larger and larger, the approximation $\%\Delta y \approx 100 \cdot \Delta\log(y)$ becomes more and more inaccurate. Fortunately, a simple calculation is available to compute the exact percentage change.

To describe the procedure, we consider the general estimated model

$$\widehat{\log(y)} = \hat{\beta}_0 + \hat{\beta}_1\log(x_1) + \hat{\beta}_2 x_2.$$

(Adding additional independent variables does not change the procedure.) Now, fixing $x_1$, we have $\Delta\widehat{\log(y)} = \hat{\beta}_2\Delta x_2$. Using simple algebraic properties of the exponential and logarithmic functions gives the exact percentage change in the predicted $y$ as

$$\%\Delta\hat{y} = 100 \cdot \left[\exp(\hat{\beta}_2\Delta x_2) - 1\right], \qquad (6.8)$$

where the multiplication by 100 turns the proportionate change into a percentage change. When $\Delta x_2 = 1$,

$$\%\Delta\hat{y} = 100 \cdot \left[\exp(\hat{\beta}_2) - 1\right]. \qquad (6.9)$$

Applied to the housing price example with $x_2 = rooms$ and $\hat{\beta}_2 = .306$, $\%\Delta\widehat{price} = 100[\exp(.306) - 1] = 35.8\%$, which is notably larger than the approximate percentage change, 30.6%, obtained directly from (6.7). (Incidentally, this is not an unbiased estimator because exp() is a nonlinear function; it is, however, a consistent estimator of $100[\exp(\beta_2) - 1]$. This is because the probability limit passes through continuous functions, while the expected value operator does not.)
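The exact-percentage-change calculation in (6.8) and (6.9) is a one-liner; here it is applied to the *rooms* coefficient from (6.7) (only numpy is assumed):

```python
import numpy as np

# Exact percentage change implied by a log-level coefficient,
# equation (6.8): %dy = 100 * (exp(b * dx) - 1).
b_rooms = 0.306
print(round(100 * (np.exp(b_rooms * 1) - 1), 1))    #  35.8: one more room
print(round(100 * (np.exp(b_rooms * -1) - 1), 1))   # -26.4: one fewer room
# The simple approximation, 100*b = 30.6, lies between the two magnitudes.
```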
The adjustment in equation (6.8) is not as crucial for small percentage changes. For example, when we include the student-teacher ratio in equation (6.7), its estimated coefficient is $-.052$, which means that if *stratio* increases by one, *price* decreases by approximately 5.2%. The exact proportionate change is $\exp(-.052) - 1 \approx -.051$, or $-5.1\%$. On the other hand, if we increase *stratio* by five, then the approximate percentage change in price is $-26\%$, while the exact change obtained from equation (6.8) is $100[\exp(-.26) - 1] \approx -22.9\%$.

The logarithmic approximation to percentage changes has an advantage that justifies its reporting even when the percentage change is large. To describe this advantage, consider again the effect on *price* of changing the number of rooms by one. The logarithmic approximation is just the coefficient on *rooms* in equation (6.7) multiplied by 100, namely, 30.6%. We also computed an estimate of the exact percentage change for increasing the number of rooms by one as 35.8%. But what if we want to estimate the percentage change for decreasing the number of rooms by one? In equation (6.8) we take $\Delta x_2 = -1$ and $\hat{\beta}_2 = .306$, and so $\%\Delta\widehat{price} = 100[\exp(-.306) - 1] = -26.4$, or a drop of 26.4%. Notice that the approximation based on using the coefficient on *rooms* is between 26.4 and 35.8, an outcome that always occurs. In other words, simply using the coefficient (multiplied by 100) gives us an estimate that is always between the absolute value of the estimates for an increase and a decrease. If we are specifically interested in an increase or a decrease, we can use the calculation based on equation (6.8).

The point just made about computing percentage changes is essentially the one made in introductory economics when it comes to computing, say, price elasticities of demand based on large price changes: the result depends on whether we use the beginning or ending price and quantity in computing the percentage changes. Using the logarithmic approximation is similar in spirit to calculating an arc elasticity of demand, where the averages of prices and quantities are used in the denominators in computing the percentage changes.

We have seen that using natural logs leads to coefficients with appealing interpretations, and we can be ignorant about the units of measurement of variables appearing in logarithmic form because the slope coefficients are invariant to rescalings. There are several other reasons logs are used so much in applied work. First, when $y > 0$, models using log($y$) as the dependent variable often satisfy the CLM assumptions more closely than models using the level of $y$. Strictly positive variables often have conditional distributions that are heteroskedastic or skewed; taking the log can mitigate, if not eliminate, both problems.

Another potential benefit of using logs is that taking the log of a variable often narrows its range. This is particularly true of variables that can be large monetary values, such as firms' annual sales or baseball players' salaries. Population variables also tend to vary widely. Narrowing the range of the
We have seen that using natural logs leads to coefficients with appealing interpretations, and we can be ignorant about the units of measurement of variables appearing in logarithmic form because the slope coefficients are invariant to rescalings. There are several other reasons logs are used so much in applied work. First, when y > 0, models using log(y) as the dependent variable often satisfy the CLM assumptions more closely than models using the level of y. Strictly positive variables often have conditional distributions that are heteroskedastic or skewed; taking the log can mitigate, if not eliminate, both problems.

Another potential benefit of using logs is that taking the log of a variable often narrows its range. This is particularly true of variables that can be large monetary values, such as firms' annual sales or baseball players' salaries. Population variables also tend to vary widely. Narrowing the range of the dependent and independent variables can make OLS estimates less sensitive to outlying (or extreme) values; we take up the issue of outlying observations in Chapter 9.

However, one must not indiscriminately use the logarithmic transformation, because in some cases it can actually create extreme values. An example is when a variable y is between zero and one (such as a proportion) and takes on values close to zero. In this case, log(y), which is necessarily negative, can be very large in magnitude, whereas the original variable y is bounded between zero and one.

There are some standard rules of thumb for taking logs, although none is written in stone. When a variable is a positive dollar amount, the log is often taken. We have seen this for variables such as wages, salaries, firm sales, and firm market value. Variables such as population, total number of employees, and school enrollment often appear in logarithmic form; these have the common feature of being large integer values. Variables that are measured in years (such as education, experience, tenure, age, and so on) usually appear in their original form. A variable that is a proportion or a percent (such as the unemployment rate, the participation rate in a pension plan, the percentage of students passing a standardized exam, or the arrest rate on reported crimes) can appear in either original or logarithmic form, although there is a tendency to use them in level forms. This is because any regression coefficients involving the original variable, whether it is the dependent or independent variable, will have a percentage point change interpretation. (See Appendix A for a review of the distinction between a percentage change and a percentage point change.) If we use, say, log(unem) in a regression, where unem is the percentage of unemployed individuals, we must be very careful to distinguish between a percentage point change and a percentage change. Remember, if unem goes from 8 to 9, this is an increase of one percentage point, but a 12.5% increase from the initial unemployment level. Using the log means that we are looking at the percentage change in the unemployment rate: log(9) − log(8) ≈ .118, or 11.8%, which is the logarithmic approximation to the actual 12.5% increase.
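A two-line check of the unemployment numbers, in the same spirit as the sketch above:

```python
import numpy as np

# unem rising from 8 to 9 is one percentage point but a 12.5% increase;
# the log difference gives the 11.8% approximation quoted in the text.
print(100 * (9 - 8) / 8)               # 12.5
print(100 * (np.log(9) - np.log(8)))   # about 11.8
```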
One limitation of the log is that it cannot be used if a variable takes on zero or negative values. In cases where a variable y is nonnegative but can take on the value 0, log(1 + y) is sometimes used. The percentage change interpretations are often closely preserved, except for changes beginning at y = 0 (where the percentage change is not even defined). Generally, using log(1 + y) and then interpreting the estimates as if the variable were log(y) is acceptable when the data on y contain relatively few zeros. An example might be where y is hours of training per employee for the population of manufacturing firms, if a large fraction of firms provides training to at least one worker. Technically, however, log(1 + y) cannot be normally distributed (although it might be less heteroskedastic than y). Useful, albeit more advanced, alternatives are the Tobit and Poisson models in Chapter 17.

One drawback to using a dependent variable in logarithmic form is that it is more difficult to predict the original variable. The original model allows us to predict log(y), not y. Nevertheless, it is fairly easy to turn a prediction for log(y) into a prediction for y (see Section 6.4). A related point is that it is not legitimate to compare R-squareds from models where y is the dependent variable in one case and log(y) is the dependent variable in the other. These measures explain variations in different variables. We discuss how to compute comparable goodness-of-fit measures in Section 6.4.

6.2b Models with Quadratics

Quadratic functions are also used quite often in applied economics to capture decreasing or increasing marginal effects. (You may want to review properties of quadratic functions in Appendix A.) In the simplest case, y depends on a single observed factor x, but it does so in a quadratic fashion:

y = β₀ + β₁x + β₂x² + u.

For example, take y = wage and x = exper. As we discussed in Chapter 3, this model falls outside of simple regression analysis but is easily handled with multiple regression. It is important to remember that β₁ does not measure the change in y with respect to x; it makes no sense to hold x² fixed while changing x. If we write the estimated equation as

ŷ = β̂₀ + β̂₁x + β̂₂x²,    (6.10)

then we have the approximation

Δŷ ≈ (β̂₁ + 2β̂₂x)Δx, so Δŷ/Δx ≈ β̂₁ + 2β̂₂x.    (6.11)

Exploring Further 6.2: Suppose that the annual number of drunk driving arrests is determined by log(arrests) = β₀ + β₁log(pop) + β₂age16_25 + other factors, where age16_25 is the proportion of the population between 16 and 25 years of age. Show that β₂ has the following (ceteris paribus) interpretation: it is the percentage change in arrests when the percentage of the people aged 16 to 25 increases by one percentage point.

This says that the slope of the relationship between x and y depends on the value of x; the estimated slope is β̂₁ + 2β̂₂x. If we plug in x = 0, we see that β̂₁ can be interpreted as the approximate slope in going from x = 0 to x = 1. After that, the second term, 2β̂₂x, must be accounted for.

If we are only interested in computing the predicted change in y given a starting value for x and a change in x, we could use (6.10) directly: there is no reason to use the calculus approximation at all. However, we are usually more interested in quickly summarizing the effect of x on y, and the interpretation of β̂₁ and β̂₂ in equation (6.11) provides that summary. Typically, we might plug in the average value of x in the sample, or some other interesting values, such as the median or the lower and upper quartile values.

In many applications, β̂₁ is positive and β̂₂ is negative. For example, using the wage data in WAGE1, we obtain

wage = 3.73 + .298 exper − .0061 exper²
       (.35)  (.041)       (.0009)                    (6.12)
n = 526, R² = .093.

This estimated equation implies that exper has a diminishing effect on wage. The first year of experience is worth roughly 30¢ per hour ($.298). The second year of experience is worth less: about .298 − 2(.0061)(1) ≈ .286, or 28.6¢, according to the approximation in (6.11) with x = 1.
In going from 10 to 11 years of experience, wage is predicted to increase by about .298 − 2(.0061)(10) = .176, or 17.6¢. And so on.

When the coefficient on x is positive and the coefficient on x² is negative, the quadratic has a parabolic shape. There is always a positive value of x where the effect of x on y is zero; before this point, x has a positive effect on y; after this point, x has a negative effect on y. In practice, it can be important to know where this turning point is.

In the estimated equation (6.10) with β̂₁ > 0 and β̂₂ < 0, the turning point (or maximum of the function) is always achieved at the coefficient on x over twice the absolute value of the coefficient on x²:

x* = |β̂₁/(2β̂₂)|.    (6.13)

In the wage example, x* = exper* is .298/[2(.0061)] ≈ 24.4. (Note how we just drop the minus sign on −.0061 in doing this calculation.) This quadratic relationship is illustrated in Figure 6.1.

[Figure 6.1: Quadratic relationship between wage and exper. The fitted wage rises from 3.73 at exper = 0 to a maximum of about 7.37 at exper = 24.4.]

In the wage equation (6.12), the return to experience becomes zero at about 24.4 years. What should we make of this? There are at least three possible explanations. First, it may be that few people in the sample have more than 24 years of experience, and so the part of the curve to the right of 24 can be ignored. The cost of using a quadratic to capture diminishing effects is that the quadratic must eventually turn around. If this point is beyond all but a small percentage of the people in the sample, then this is not of much concern. But in the data set WAGE1, about 28% of the people in the sample have more than 24 years of experience; this is too high a percentage to ignore.

It is possible that the return to exper really becomes negative at some point, but it is hard to believe that this happens at 24 years of experience. A more likely possibility is that the estimated effect of exper on wage is biased, because we have controlled for no other factors, or because the functional relationship between wage and exper in equation (6.12) is not entirely correct. Computer Exercise C2 asks you to explore this possibility by controlling for education, in addition to using log(wage) as the dependent variable.
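The slope and turning-point calculations for (6.12) are easy to reproduce. In this sketch, the coefficient values are the reported estimates; the function name is illustrative:

```python
# Marginal effect of experience, equation (6.11), and the turning point,
# equation (6.13), using the estimates from equation (6.12).
b1, b2 = 0.298, -0.0061

def slope(x):
    return b1 + 2 * b2 * x   # d(wage)/d(exper) at experience level x

print(round(slope(0), 3))    # 0.298: the first year is worth about 30 cents
print(round(slope(10), 3))   # 0.176: the 11th year is worth about 17.6 cents
print(abs(b1 / (2 * b2)))    # about 24.4 years, where the effect hits zero
```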
When a model has a dependent variable in logarithmic form and an explanatory variable entering as a quadratic, some care is needed in reporting the partial effects. The following example also shows that the quadratic can have a U-shape, rather than a parabolic shape. A U-shape arises in equation (6.10) when β̂₁ is negative and β̂₂ is positive; this captures an increasing effect of x on y.

Example 6.2: Effects of Pollution on Housing Prices

We modify the housing price model from Example 4.5 to include a quadratic term in rooms:

log(price) = β₀ + β₁log(nox) + β₂log(dist) + β₃rooms + β₄rooms² + β₅stratio + u.    (6.14)

The model estimated using the data in HPRICE2 is

log(price) = 13.39 − .902 log(nox) − .087 log(dist)
             (.57)  (.115)           (.043)
           − .545 rooms + .062 rooms² − .048 stratio
             (.165)        (.013)        (.006)
n = 506, R² = .603.

The quadratic term rooms² has a t statistic of about 4.77, and so it is very statistically significant. But what about interpreting the effect of rooms on log(price)? Initially, the effect appears to be strange. Because the coefficient on rooms is negative and the coefficient on rooms² is positive, this equation literally implies that, at low values of rooms, an additional room has a negative effect on log(price). At some point, the effect becomes positive, and the quadratic shape means that the semi-elasticity of price with respect to rooms is increasing as rooms increases. This situation is shown in Figure 6.2.

[Figure 6.2: log(price) as a quadratic function of rooms; the turnaround of the U-shape is at rooms = 4.4.]

We obtain the turnaround value of rooms using equation (6.13) (even though β̂₁ is negative and β̂₂ is positive). The absolute value of the coefficient on rooms, .545, divided by twice the coefficient on rooms², .062, gives rooms* = .545/[2(.062)] ≈ 4.4; this point is labeled in Figure 6.2.

Do we really believe that starting at three rooms and increasing to four rooms actually reduces a house's expected value? Probably not. It turns out that only five of the 506 communities in the sample have houses averaging 4.4 rooms or less, about 1% of the sample. This is so small that the quadratic to the left of 4.4 can, for practical purposes, be ignored. To the right of 4.4, we see that adding another room has an increasing effect on the percentage change in price:

Δlog(price) ≈ [−.545 + 2(.062)rooms]Δrooms,

and so

%Δprice ≈ 100·[−.545 + 2(.062)rooms]Δrooms = (−54.5 + 12.4 rooms)Δrooms.

Thus, an increase in rooms from, say, five to six increases price by about −54.5 + 12.4(5) = 7.5%; the increase from six to seven increases price by roughly −54.5 + 12.4(6) = 19.9%. This is a very strong increasing effect.

The strong increasing effect of rooms on log(price) in this example illustrates an important lesson: one cannot simply look at the coefficient on the quadratic term (in this case, .062) and declare that it is too small to bother with, based only on its magnitude. In many applications with quadratics, the coefficient on the squared variable has one or more zeros after the decimal point; after all, this coefficient measures how the slope is changing as x (rooms) changes. A seemingly small coefficient can have practically important consequences, as we just saw. As a general rule, one must compute the partial effect and see how it varies with x to determine if the quadratic term is practically important. In doing so, it is useful to compare the changing slope implied by the quadratic model with the constant slope obtained from the model with only a linear term. If we drop rooms² from the equation, the coefficient on rooms becomes about .255, which implies that each additional room, starting from any number of rooms, increases median price by about 25.5%. This is very different from the quadratic model, where the effect becomes 25.5% at rooms ≈ 6.45 but changes rapidly as rooms gets smaller or larger. For example, at rooms = 7, the return to the next room is about 32.3%.
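The varying semi-elasticity is easy to trace out numerically. A minimal sketch using the estimates from Example 6.2 (the function name is illustrative):

```python
# Percentage effect on price of one more room, from the quadratic model in
# Example 6.2: 100 * d log(price)/d rooms = -54.5 + 12.4*rooms.
b_rooms, b_rooms2 = -0.545, 0.062

def pct_effect(rooms):
    return 100 * (b_rooms + 2 * b_rooms2 * rooms)

for r in (4.4, 5, 6, 7):
    print(r, round(pct_effect(r), 1))
# 4.4 -> ~0 (the turnaround), 5 -> 7.5, 6 -> 19.9, 7 -> 32.3
```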
What happens generally if the coefficients on the level and squared terms have the same sign (either both positive or both negative) and the explanatory variable is necessarily nonnegative (as in the case of rooms or exper)? In either case, there is no turning point for values x > 0. For example, if β₁ and β₂ are both positive, the smallest expected value of y is at x = 0, and increases in x always have a positive and increasing effect on y. (This is also true if β₁ = 0 and β₂ > 0, which means that the partial effect is zero at x = 0 and increasing as x increases.) Similarly, if β₁ and β₂ are both negative, the largest expected value of y is at x = 0, and increases in x have a negative effect on y, with the magnitude of the effect increasing as x gets larger.

The general formula for the turning point of any quadratic is x* = −β̂₁/(2β̂₂), which leads to a positive value if β̂₁ and β̂₂ have opposite signs and a negative value when β̂₁ and β̂₂ have the same sign. Knowing this simple formula is useful in cases where x may take on both positive and negative values; one can compute the turning point and see if it makes sense, taking into account the range of x in the sample.

There are many other possibilities for using quadratics along with logarithms. For example, an extension of (6.14) that allows a nonconstant elasticity between price and nox is

log(price) = β₀ + β₁log(nox) + β₂[log(nox)]² + β₃crime + β₄rooms + β₅rooms² + β₆stratio + u.    (6.15)

If β₂ = 0, then β₁ is the elasticity of price with respect to nox. Otherwise, this elasticity depends on the level of nox. To see this, we can combine the arguments for the partial effects in the quadratic and logarithmic models to show that

%Δprice ≈ [β₁ + 2β₂log(nox)]%Δnox;    (6.16)

therefore, the elasticity of price with respect to nox is β₁ + 2β₂log(nox), so that it depends on log(nox).

Finally, other polynomial terms can be included in regression models. Certainly, the quadratic is seen most often, but a cubic and even a quartic term appear now and then. An often reasonable functional form for a total cost function is

cost = β₀ + β₁quantity + β₂quantity² + β₃quantity³ + u.

Estimating such a model causes no complications. Interpreting the parameters is more involved (though straightforward using calculus); we do not study these models further.

6.2c Models with Interaction Terms

Sometimes, it is natural for the partial effect, elasticity, or semi-elasticity of the dependent variable with respect to an explanatory variable to depend on the magnitude of yet another explanatory variable. For example, in the model

price = β₀ + β₁sqrft + β₂bdrms + β₃sqrft·bdrms + β₄bthrms + u,

the partial effect of bdrms on price (holding all other variables fixed) is

Δprice/Δbdrms = β₂ + β₃sqrft.    (6.17)

If β₃ > 0, then (6.17) implies that an additional bedroom yields a higher increase in housing price for larger houses. In other words, there is an interaction effect between square footage and number of bedrooms. In summarizing the effect of bdrms on price, we must evaluate (6.17) at interesting values of sqrft, such as the mean value, or the lower and upper quartiles in the sample. Whether or not β̂₃ is zero is something we can easily test.
The parameters on the original variables can be tricky to interpret when we include an interaction term. For example, in the previous housing price equation, equation (6.17) shows that β₂ is the effect of bdrms on price for a home with zero square feet. This effect is clearly not of much interest. Instead, we must be careful to put interesting values of sqrft, such as the mean or median values in the sample, into the estimated version of equation (6.17).

Often, it is useful to reparameterize a model so that the coefficients on the original variables have an interesting meaning. Consider a model with two explanatory variables and an interaction:

y = β₀ + β₁x₁ + β₂x₂ + β₃x₁x₂ + u.

As just mentioned, β₂ is the partial effect of x₂ on y when x₁ = 0. Often, this is not of interest. Instead, we can reparameterize the model as

y = α₀ + δ₁x₁ + δ₂x₂ + β₃(x₁ − μ₁)(x₂ − μ₂) + u,

where μ₁ is the population mean of x₁ and μ₂ is the population mean of x₂. We can easily see that now the coefficient on x₂, δ₂, is the partial effect of x₂ on y at the mean value of x₁. (By multiplying out the interaction in the second equation and comparing the coefficients, we can easily show that δ₂ = β₂ + β₃μ₁. The parameter δ₁ has a similar interpretation.) Therefore, if we subtract the means of the variables (in practice, these would typically be the sample means) before creating the interaction term, the coefficients on the original variables have a useful interpretation. Plus, we immediately obtain standard errors for the partial effects at the mean values. Nothing prevents us from replacing μ₁ or μ₂ with other values of the explanatory variables that may be of interest.
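This centering device is straightforward to carry out in practice. Below is a minimal statsmodels sketch on synthetic data; the data-generating values are arbitrary stand-ins, and only the centering trick matters:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic sample; the coefficients used to generate y are arbitrary.
rng = np.random.default_rng(0)
df = pd.DataFrame({"x1": rng.normal(5, 1, 200), "x2": rng.normal(3, 1, 200)})
df["y"] = (1 + 0.5 * df["x1"] + 0.8 * df["x2"]
           + 0.2 * df["x1"] * df["x2"] + rng.normal(0, 1, 200))

# Center the variables before forming the interaction, so the coefficients
# on x1 and x2 are partial effects at the sample means, with the correct
# standard errors reported automatically.
df["x1_c"] = df["x1"] - df["x1"].mean()
df["x2_c"] = df["x2"] - df["x2"].mean()
res = smf.ols("y ~ x1 + x2 + x1_c:x2_c", data=df).fit()
print(res.params)
print(res.bse)
```

Note that the interaction coefficient itself is unchanged by the centering; only the coefficients (and standard errors) on the level terms are affected.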
The following example illustrates how we can use interaction terms.

Example 6.3: Effects of Attendance on Final Exam Performance

A model to explain the standardized outcome on a final exam (stndfnl) in terms of percentage of classes attended, prior college grade point average, and ACT score is

stndfnl = β₀ + β₁atndrte + β₂priGPA + β₃ACT + β₄priGPA² + β₅ACT² + β₆priGPA·atndrte + u.    (6.18)

(We use the standardized exam score for the reasons discussed in Section 6.1: it is easier to interpret a student's performance relative to the rest of the class.) In addition to quadratics in priGPA and ACT, this model includes an interaction between priGPA and the attendance rate. The idea is that class attendance might have a different effect for students who have performed differently in the past, as measured by priGPA. We are interested in the effects of attendance on final exam score: Δstndfnl/Δatndrte = β₁ + β₆priGPA.

Using the 680 observations in ATTEND, for students in a course on microeconomic principles, the estimated equation is

stndfnl = 2.05 − .0067 atndrte − 1.63 priGPA − .128 ACT
          (1.36) (.0102)          (.48)          (.098)
        + .296 priGPA² + .0045 ACT² + .0056 priGPA·atndrte
          (.101)           (.0022)      (.0043)                  (6.19)
n = 680, R² = .229, R̄² = .222.

We must interpret this equation with extreme care. If we simply look at the coefficient on atndrte, we will incorrectly conclude that attendance has a negative effect on final exam score. But this coefficient supposedly measures the effect when priGPA = 0, which is not interesting (in this sample, the smallest prior GPA is about .86). We must also take care not to look separately at the estimates of β₁ and β₆ and conclude that, because each t statistic is insignificant, we cannot reject H₀: β₁ = 0, β₆ = 0. In fact, the p-value for the F test of this joint hypothesis is .014, so we certainly reject H₀ at the 5% level. This is a good example of where looking at separate t statistics when testing a joint hypothesis can lead one far astray.

How should we estimate the partial effect of atndrte on stndfnl? We must plug in interesting values of priGPA to obtain the partial effect. The mean value of priGPA in the sample is 2.59, so at the mean priGPA, the effect of atndrte on stndfnl is −.0067 + .0056(2.59) ≈ .0078. What does this mean? Because atndrte is measured as a percentage, it means that a 10 percentage point increase in atndrte increases stndfnl by .078 standard deviations from the mean final exam score.

How can we tell whether the estimate .0078 is statistically different from zero? We need to rerun the regression, where we replace priGPA·atndrte with (priGPA − 2.59)·atndrte. This gives, as the new coefficient on atndrte, the estimated effect at priGPA = 2.59, along with its standard error; nothing else in the regression changes. (We described this device in Section 4.4.) Running this new regression gives the standard error of β̂₁ + β̂₆(2.59) = .0078 as .0026, which yields t = .0078/.0026 = 3. Therefore, at the average priGPA, we conclude that attendance has a statistically significant positive effect on final exam score.

Things are even more complicated for finding the effect of priGPA on stndfnl, because of the quadratic term priGPA². To find the effect at the mean value of priGPA and the mean attendance rate, 82, we would replace priGPA² with (priGPA − 2.59)² and priGPA·atndrte with priGPA·(atndrte − 82). The coefficient on priGPA becomes the partial effect at the mean values, and we would have its standard error. (See Computer Exercise C7.)
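The at-the-mean calculation in Example 6.3 is simple arithmetic on the reported estimates. A quick check, using the standard error obtained from the recentered regression described above:

```python
# Partial effect of atndrte at the mean priGPA (2.59), Example 6.3.
b1, b6 = -0.0067, 0.0056
effect = b1 + b6 * 2.59
print(round(effect, 4))        # about 0.0078

se = 0.0026                    # from the recentered regression in the text
print(round(effect / se, 1))   # t statistic of about 3
```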
6.2d Computing Average Partial Effects

The hallmark of models with quadratics, interactions, and other nonlinear functional forms is that the partial effects depend on the values of one or more explanatory variables. For example, we just saw in Example 6.3 that the effect of atndrte depends on the value of priGPA. It is easy to see that the partial effect of priGPA in equation (6.18) is

β₂ + 2β₄priGPA + β₆atndrte

(something that can be verified with simple calculus or just by combining the quadratic and interaction formulas). The embellishments in equation (6.18) can be useful for seeing how the strength of associations between stndfnl and each explanatory variable changes with the values of all explanatory variables. The flexibility afforded by a model such as (6.18) does have a cost: it is tricky to describe the partial effects of the explanatory variables on stndfnl with a single number. Often, one wants a single value to describe the relationship between the dependent variable y and each explanatory variable.

One popular summary measure is the average partial effect (APE), also called the average marginal effect. The idea behind the APE is simple for models such as (6.18). After computing the partial effect and plugging in the estimated parameters, we average the partial effects for each unit across the sample. So, the estimated partial effect of atndrte on stndfnl for unit i is β̂₁ + β̂₆priGPAᵢ.

Exploring Further 6.3: If we add the term β₇ACT·atndrte to equation (6.18), what is the partial effect of atndrte on stndfnl?

We do not want to report this partial effect for each of the 680 students in our sample. Instead, we average these partial effects to obtain

APE_atndrte = β̂₁ + β̂₆·priGPA̅,

where priGPA̅ is the sample average of priGPA. The single number APE_atndrte is the estimated APE. The APE of priGPA is only a little more complicated:

APE_priGPA = β̂₂ + 2β̂₄·priGPA̅ + β̂₆·atndrte̅.

Both APE_atndrte and APE_priGPA tell us the size of the partial effects on average.

The centering of explanatory variables about their sample averages before creating quadratics or interactions forces the coefficients on the levels to be the APEs. This can be cumbersome in complicated models. Fortunately, some commonly used regression packages compute APEs with a simple command after OLS estimation. Just as importantly, proper standard errors are computed using the fact that an APE is a linear combination of the OLS coefficients. For example, the APEs and their standard errors for models with both quadratics and interactions, as in Example 6.3, are easy to obtain. APEs are also useful in models that are inherently nonlinear in parameters, which we treat in Chapter 17. At that point, we will revisit the definition and calculation of APEs.
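Because each partial effect here is linear in the variables, averaging the unit-level effects is the same as plugging in the sample means. A minimal sketch with the estimates from equation (6.19) and the sample means quoted in the text (2.59 for priGPA, 82 for atndrte):

```python
# APEs for model (6.18), using the estimates in equation (6.19).
b1, b2, b4, b6 = -0.0067, -1.63, 0.296, 0.0056
mean_priGPA, mean_atndrte = 2.59, 82

ape_atndrte = b1 + b6 * mean_priGPA
ape_priGPA = b2 + 2 * b4 * mean_priGPA + b6 * mean_atndrte
print(round(ape_atndrte, 4))   # about 0.0078
print(round(ape_priGPA, 3))    # about 0.362
```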
6.3 More on Goodness-of-Fit and Selection of Regressors

Until now, we have not focused much on the size of R² in evaluating our regression models, primarily because beginning students tend to put too much weight on R-squared. As we will see shortly, choosing a set of explanatory variables based on the size of the R-squared can lead to nonsensical models. In Chapter 10, we will discover that R-squareds obtained from time series regressions can be artificially high and can result in misleading conclusions.

Nothing about the classical linear model assumptions requires that R² be above any particular value; R² is simply an estimate of how much variation in y is explained by x₁, x₂, …, xₖ in the population. We have seen several regressions that have had pretty small R-squareds. Although this means that we have not accounted for several factors that affect y, this does not mean that the factors in u are correlated with the independent variables. The zero conditional mean assumption MLR.4 is what determines whether we get unbiased estimators of the ceteris paribus effects of the independent variables, and the size of the R-squared has no direct bearing on this.

A small R-squared does imply that the error variance is large relative to the variance of y, which means we may have a hard time precisely estimating the βⱼ. But remember, we saw in Section 3.4 that a large error variance can be offset by a large sample size: if we have enough data, we may be able to precisely estimate the partial effects even though we have not controlled for many unobserved factors. Whether or not we can get precise enough estimates depends on the application. For example, suppose that some incoming students at a large university are randomly given grants to buy computer equipment. If the amount of the grant is truly randomly determined, we can estimate the ceteris paribus effect of the grant amount on subsequent college grade point average by using simple regression analysis. (Because of random assignment, all of the other factors that affect GPA would be uncorrelated with the amount of the grant.) It seems likely that the grant amount would explain little of the variation in GPA, so the R-squared from such a regression would probably be very small. But if we have a large sample size, we still might get a reasonably precise estimate of the effect of the grant.

Another good illustration of where poor explanatory power has nothing to do with unbiased estimation of the βⱼ is given by analyzing the data set APPLE. Unlike the other data sets we have used, the key explanatory variables in APPLE were set experimentally, that is, without regard to other factors that might affect the dependent variable. The variable we would like to explain, ecolbs, is the (hypothetical) pounds of ecologically friendly ("ecolabeled") apples a family would demand. Each family (actually, family head) was presented with a description of ecolabeled apples, along with prices of regular apples (regprc) and prices of the hypothetical ecolabeled apples (ecoprc). Because the price pairs were randomly assigned to each family, they are unrelated to other observed factors (such as family income) and unobserved factors (such as desire for a clean environment). Therefore, the regression of ecolbs on ecoprc, regprc (across all samples generated in this way) produces unbiased estimators of the price effects. Nevertheless, the R-squared from the regression is only .0364: the price variables explain only about 3.6% of the total variation in ecolbs. So, here is a case where we explain very little of the variation in y, yet we are in the rare situation of knowing that the data have been generated so that unbiased estimation of the βⱼ is possible. (Incidentally, adding observed family characteristics has a very small effect on explanatory power. See Computer Exercise C11.)

Remember, though, that the relative change in the R-squared when variables are added to an equation is very useful: the F statistic in (4.41) for testing the joint significance crucially depends on the difference in R-squareds between the unrestricted and restricted models.

As we will see in Section 6.4, an important consequence of a low R-squared is that prediction is difficult. Because most of the variation in y is explained by unobserved factors (or at least factors we do not include in our model), we will generally have a hard time using the OLS equation to predict individual future outcomes on y, given a set of values for the explanatory variables. In fact, the low R-squared means that we would have a hard time predicting y even if we knew the βⱼ, the population coefficients. Fundamentally, most of the factors that explain y are unaccounted for in the explanatory variables, making prediction difficult.
6.3a Adjusted R-Squared

Most regression packages will report, along with the R-squared, a statistic called the adjusted R-squared. Because the adjusted R-squared is reported in much applied work, and because it has some useful features, we cover it in this subsection.

To see how the usual R-squared might be adjusted, it is usefully written as

R² = 1 − (SSR/n)/(SST/n),    (6.20)

where SSR is the sum of squared residuals and SST is the total sum of squares; compared with equation (3.28), all we have done is divide both SSR and SST by n. This expression reveals what R² is actually estimating. Define σ²_y as the population variance of y and let σ²_u denote the population variance of the error term, u. (Until now, we have used σ² to denote σ²_u, but it is helpful to be more specific here.) The population R-squared is defined as ρ² = 1 − σ²_u/σ²_y; this is the proportion of the variation in y in the population explained by the independent variables. This is what R² is supposed to be estimating.

R² estimates σ²_u by SSR/n, which we know to be biased. So why not replace SSR/n with SSR/(n − k − 1)? Also, we can use SST/(n − 1) in place of SST/n, as the former is the unbiased estimator of σ²_y. Using these estimators, we arrive at the adjusted R-squared:

R̄² = 1 − [SSR/(n − k − 1)]/[SST/(n − 1)]
   = 1 − σ̂²/[SST/(n − 1)],    (6.21)

because σ̂² = SSR/(n − k − 1). Because of the notation used to denote the adjusted R-squared, it is sometimes called R-bar squared.

The adjusted R-squared is sometimes called the corrected R-squared, but this is not a good name because it implies that R̄² is somehow better than R² as an estimator of the population R-squared. Unfortunately, R̄² is not generally known to be a better estimator. It is tempting to think that R̄² corrects the bias in R² for estimating the population R-squared, ρ², but it does not: the ratio of two unbiased estimators is not an unbiased estimator.
variable or set of variables belongs in a model gives us a different answer than standard t or F testing because a t or F statistic of unity is not statistically significant at traditional significance levels It is sometimes useful to have a formula for R2 in terms of R2 Simple algebra gives R2 5 1 2 11 2 R22 1n 2 121n 2 k 2 12 622 For example if R2 5 30 n 5 51 and k 5 10 then R2 5 1 2 70150240 5 125 Thus for small n and large k R2 can be substantially below R2 In fact if the usual Rsquared is small and n k 1 is small R2 can actually be negative For example you can plug in R2 5 10 n 5 51 and k 5 10 to verify that R2 5 2125 A negative R2 indicates a very poor model fit relative to the number of degrees of freedom The adjusted Rsquared is sometimes reported along with the usual Rsquared in regressions and sometimes R2 is reported in place of R2 It is important to remember that it is R2 not R2 that appears in the F statistic in 441 The same formula with R2 r and R2 ur is not valid 63b Using Adjusted RSquared to Choose between Nonnested Models In Section 45 we learned how to compute an F statistic for testing the joint significance of a group of variables this allows us to decide at a particular significance level whether at least one variable in the group affects the dependent variable This test does not allow us to decide which of the variables has an effect In some cases we want to choose a model without redundant independent variables and the adjusted Rsquared can help with this In the major league baseball salary example in Section 45 we saw that neither hrunsyr nor rbisyr was individually significant These two variables are highly correlated so we might want to choose between the models log1salary2 5 b0 1 b1years 1 b2gamesyr 1 b3bavg 1 b4hrunsyr 1 u and log1salary2 5 b0 1 b1years 1 b2gamesyr 1 b3bavg 1 b4rbisyr 1 u These two equations are nonnested models because neither equation is a special case of the other The F statistics we studied in Chapter 4 only allow us to test nested models one model the restricted model is a special case of the other model the unrestricted model See equations 432 and 428 for examples of restricted and unrestricted models One possibility is to create a composite model that contains all explanatory variables from the original models and then to test each model against the general model using the F test The problem with this process is that either both models might Copyright 2016 Cengage Learning All Rights Reserved May not be copied scanned or duplicated in whole or in part Due to electronic rights some third party content may be suppressed from the eBook andor eChapters Editorial review has deemed that any suppressed content does not materially affect the overall learning experience Cengage Learning reserves the right to remove additional content at any time if subsequent rights restrictions require it CHAPTER 6 Multiple Regression Analysis Further Issues 183 be rejected or neither model might be rejected as happens with the major league baseball salary example in Section 45 Thus it does not always provide a way to distinguish between models with nonnested regressors In the baseball player salary regression using the data in MLB1 R2 for the regression containing hrunsyr is 6211 and R2 for the regression containing rbisyr is 6226 Thus based on the adjusted Rsquared there is a very slight preference for the model with rbisyr But the difference is practi cally very small and we might obtain a different answer by controlling for some of the variables in Computer 
6.3b Using Adjusted R-Squared to Choose between Nonnested Models

In Section 4.5, we learned how to compute an F statistic for testing the joint significance of a group of variables; this allows us to decide, at a particular significance level, whether at least one variable in the group affects the dependent variable. This test does not allow us to decide which of the variables has an effect. In some cases, we want to choose a model without redundant independent variables, and the adjusted R-squared can help with this.

In the major league baseball salary example in Section 4.5, we saw that neither hrunsyr nor rbisyr was individually significant. These two variables are highly correlated, so we might want to choose between the models

log(salary) = β₀ + β₁years + β₂gamesyr + β₃bavg + β₄hrunsyr + u

and

log(salary) = β₀ + β₁years + β₂gamesyr + β₃bavg + β₄rbisyr + u.

These two equations are nonnested models, because neither equation is a special case of the other. The F statistics we studied in Chapter 4 only allow us to test nested models: one model (the restricted model) is a special case of the other model (the unrestricted model). See equations (4.32) and (4.28) for examples of restricted and unrestricted models. One possibility is to create a composite model that contains all explanatory variables from the original models and then to test each model against the general model using the F test. The problem with this process is that either both models might be rejected, or neither model might be rejected (as happens with the major league baseball salary example in Section 4.5). Thus, it does not always provide a way to distinguish between models with nonnested regressors.

In the baseball player salary regression, using the data in MLB1, R̄² for the regression containing hrunsyr is .6211, and R̄² for the regression containing rbisyr is .6226. Thus, based on the adjusted R-squared, there is a very slight preference for the model with rbisyr. But the difference is practically very small, and we might obtain a different answer by controlling for some of the variables in Computer Exercise C5 in Chapter 4. (Because both nonnested models contain five parameters, the usual R-squared can be used to draw the same conclusion.)

Comparing R̄² to choose among different nonnested sets of independent variables can be valuable when these variables represent different functional forms. Consider two models relating R&D intensity to firm sales:

rdintens = β₀ + β₁log(sales) + u    (6.23)

rdintens = β₀ + β₁sales + β₂sales² + u.    (6.24)

The first model captures a diminishing return by including sales in logarithmic form; the second model does this by using a quadratic. Thus, the second model contains one more parameter than the first.

When equation (6.23) is estimated using the 32 observations on chemical firms in RDCHEM, R² is .061, and R² for equation (6.24) is .148. Therefore, it appears that the quadratic fits much better. But a comparison of the usual R-squareds is unfair to the first model, because it contains one fewer parameter than (6.24). That is, (6.23) is a more parsimonious model than (6.24). Everything else being equal, simpler models are better. Since the usual R-squared does not penalize more complicated models, it is better to use R̄². The R̄² for (6.23) is .030, while R̄² for (6.24) is .090. Thus, even after adjusting for the difference in degrees of freedom, the quadratic model wins out. (The quadratic model is also preferred when profit margin is added to each regression.)
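In practice this comparison takes only a few lines. A hedged sketch, assuming the RDCHEM variables are available in a local file such as rdchem.dta (the file path is a placeholder; rsquared_adj is the statsmodels attribute for R̄²):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical path to a copy of the RDCHEM data set.
df = pd.read_stata("rdchem.dta")

m_log = smf.ols("rdintens ~ np.log(sales)", data=df).fit()          # (6.23)
m_quad = smf.ols("rdintens ~ sales + I(sales**2)", data=df).fit()   # (6.24)
print(m_log.rsquared_adj, m_quad.rsquared_adj)  # text reports .030 vs .090
```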
There is an important limitation in using R̄² to choose between nonnested models: we cannot use it to choose between different functional forms for the dependent variable. This is unfortunate, because we often want to decide on whether y or log(y) (or maybe some other transformation) should be used as the dependent variable based on goodness-of-fit. But neither R² nor R̄² can be used for this purpose. The reason is simple: these R-squareds measure the explained proportion of the total variation in whatever dependent variable we are using in the regression, and different nonlinear functions of the dependent variable will have different amounts of variation to explain. For example, the total variations in y and log(y) are not the same and are often very different. Comparing the adjusted R-squareds from regressions with these different forms of the dependent variables does not tell us anything about which model fits better; they are fitting two separate dependent variables.

Example 6.4: CEO Compensation and Firm Performance

Consider two estimated models relating CEO compensation to firm performance:

salary = 830.63 + .0163 sales + 19.63 roe
         (223.90) (.0089)       (11.08)               (6.25)
n = 209, R² = .029, R̄² = .020

and

lsalary = 4.36 + .275 lsales + .0179 roe
          (0.29) (.033)        (.0040)                (6.26)
n = 209, R² = .282, R̄² = .275,

where roe is the return on equity discussed in Chapter 2, and, for simplicity, lsalary and lsales denote the natural logs of salary and sales. We already know how to interpret these different estimated equations. But can we say that one model fits better than the other?

Exploring Further 6.4: Explain why choosing a model by maximizing R̄² or minimizing σ̂ (the standard error of the regression) is the same thing.

The R-squared for equation (6.25) shows that sales and roe explain only about 2.9% of the variation in CEO salary in the sample. Both sales and roe have marginal statistical significance. Equation (6.26) shows that log(sales) and roe explain about 28.2% of the variation in log(salary). In terms of goodness-of-fit, this much higher R-squared would seem to imply that model (6.26) is much better, but this is not necessarily the case. The total sum of squares for salary in the sample is 391,732,982, while the total sum of squares for log(salary) is only 66.72. Thus, there is much less variation in log(salary) that needs to be explained.

At this point, we can use features other than R² or R̄² to decide between these models. For example, log(sales) and roe are much more statistically significant in (6.26) than are sales and roe in (6.25), and the coefficients in (6.26) are probably of more interest. To be sure, however, we will need to make a valid goodness-of-fit comparison. In Section 6.4, we will offer a goodness-of-fit measure that does allow us to compare models where y appears in both level and log form.

6.3c Controlling for Too Many Factors in Regression Analysis

In many of the examples we have covered, and certainly in our discussion of omitted variables bias in Chapter 3, we have worried about omitting important factors from a model that might be correlated with the independent variables. It is also possible to control for too many variables in a regression analysis.

If we overemphasize goodness-of-fit, we open ourselves to controlling for factors in a regression model that should not be controlled for. To avoid this mistake, we need to remember the ceteris paribus interpretation of multiple regression models.

To illustrate this issue, suppose we are doing a study to assess the impact of state beer taxes on traffic fatalities. The idea is that a higher tax on beer will reduce alcohol consumption, and likewise drunk driving, resulting in fewer traffic fatalities. To measure the ceteris paribus effect of taxes on fatalities, we can model fatalities as a function of several factors, including the beer tax:

fatalities = β₀ + β₁tax + β₂miles + β₃percmale + β₄perc16_21 + …,

where miles is total miles driven, percmale is the percentage of the state population that is male, perc16_21 is the percentage of the population between ages 16 and 21, and so on. Notice how we have not included a variable measuring per capita beer consumption. Are we committing an omitted variables error? The answer is no. If we control for beer consumption in this equation, then how would beer taxes affect traffic fatalities? In the equation

fatalities = β₀ + β₁tax + β₂beercons + …,
of pesticide usage among farmers on family health expenditures In addition to pesticide usage amounts should we include the number of doctor visits as an explanatory variable No Health expenditures include doctor visits and we would like to pick up all effects of pesticide use on health expenditures If we include the number of doctor visits as an explanatory variable then we are only measuring the effects of pesticide use on health expenditures other than doctor visits It makes more sense to use number of doctor visits as a dependent variable in a separate regression on pesticide amounts The previous examples are what can be called over controlling for factors in multiple regression Often this results from nervousness about potential biases that might arise by leaving out an important explanatory variable But it is important to remember the ceteris paribus nature of multiple regression In some cases it makes no sense to hold some factors fixed precisely because they should be allowed to change when a policy variable changes Unfortunately the issue of whether or not to control for certain factors is not always clearcut For example Betts 1995 studies the effect of high school quality on subsequent earnings He points out that if better school quality results in more education then controlling for education in the regres sion along with measures of quality will underestimate the return to quality Betts does the analysis with and without years of education in the equation to get a range of estimated effects for quality of schooling To see explicitly how pursuing high Rsquareds can lead to trouble consider the housing price example from Section 45 that illustrates the testing of multiple hypotheses In that case we wanted to test the rationality of housing price assessments We regressed logprice on logassess loglotsize logsqrft and bdrms and tested whether the latter three variables had zero population coefficients while logassess had a coefficient of unity But what if we change the purpose of the analysis and estimate a hedonic price model which allows us to obtain the marginal values of various housing attributes Should we include logassess in the equation The adjusted Rsquared from the regres sion with logassess is 762 while the adjusted Rsquared without it is 630 Based on goodness offit only we should include logassess But this is incorrect if our goal is to determine the effects of lot size square footage and number of bedrooms on housing values Including logassess in the equation amounts to holding one measure of value fixed and then asking how much an additional bedroom would change another measure of value This makes no sense for valuing housing attributes If we remember that different models serve different purposes and we focus on the ceteris pari bus interpretation of regression then we will not include the wrong factors in a regression model 63d Adding Regressors to Reduce the Error Variance We have just seen some examples of where certain independent variables should not be included in a regression model even though they are correlated with the dependent variable From Chapter 3 we know that adding a new independent variable to a regression can exacerbate the multicollinearity problem On the other hand since we are taking something out of the error term adding a variable generally reduces the error variance Generally we cannot know which effect will dominate However there is one case that is clear we should always include independent variables that affect y and are uncorrelated with 
Why? Because adding such a variable does not induce multicollinearity in the population (and therefore multicollinearity in the sample should be negligible), but it will reduce the error variance. In large sample sizes, the standard errors of all OLS estimators will be reduced.

As an example, consider estimating the individual demand for beer as a function of the average county beer price. It may be reasonable to assume that individual characteristics are uncorrelated with county-level prices, and so a simple regression of beer consumption on county price would suffice for estimating the effect of price on individual demand. But it is possible to get a more precise estimate of the price elasticity of beer demand by including individual characteristics, such as age and amount of education. If these factors affect demand and are uncorrelated with price, then the standard error of the price coefficient will be smaller, at least in large samples.

As a second example, consider the grants for computer equipment given at the beginning of Section 6.3. If, in addition to the grant variable, we control for other factors that can explain college GPA, we can probably get a more precise estimate of the effect of the grant. Measures of high school grade point average and rank, SAT and ACT scores, and family background variables are good candidates. Because the grant amounts are randomly assigned, all additional control variables are uncorrelated with the grant amount; in the sample, multicollinearity between the grant amount and other independent variables should be minimal. But adding the extra controls might significantly reduce the error variance, leading to a more precise estimate of the grant effect. Remember, the issue is not unbiasedness here: we obtain an unbiased and consistent estimator whether or not we add the high school performance and family background variables. The issue is getting an estimator with a smaller sampling variance.

A related point is that, when we have random assignment of a policy, we need not worry about whether some of our explanatory variables are endogenous, provided these variables themselves are not affected by the policy. For example, in studying the effect of hours in a job training program on labor earnings, we can include the amount of education reported prior to the job training program. We need not worry that schooling might be correlated with omitted factors, such as ability, because we are not trying to estimate the return to schooling. We are trying to estimate the effect of the job training program, and we can include any controls that are not themselves affected by job training without biasing the job training effect. What we must avoid is including a variable such as the amount of education after the job training program, as some people may decide to get more education because of how many hours they were assigned to the job training program.

Unfortunately, cases where we have information on additional explanatory variables that are uncorrelated with the explanatory variables of interest are somewhat rare in the social sciences. But it is worth remembering that, when these variables are available, they can be included in a model to reduce the error variance without inducing multicollinearity.
6.4 Prediction and Residual Analysis

In Chapter 3, we defined the OLS predicted or fitted values and the OLS residuals. Predictions are certainly useful, but they are subject to sampling variation, because they are obtained using the OLS estimators. Thus, in this section, we show how to obtain confidence intervals for a prediction from the OLS regression line.

From Chapters 3 and 4, we know that the residuals are used to obtain the sum of squared residuals and the R-squared, so they are important for goodness-of-fit and testing. Sometimes, economists study the residuals for particular observations to learn about individuals (or firms, houses, etc.) in the sample.

6.4a Confidence Intervals for Predictions

Suppose we have estimated the equation

ŷ = β̂₀ + β̂₁x₁ + β̂₂x₂ + … + β̂ₖxₖ.    (6.27)

When we plug in particular values of the independent variables, we obtain a prediction for y, which is an estimate of the expected value of y given the particular values for the explanatory variables. For emphasis, let c₁, c₂, …, cₖ denote particular values for each of the k independent variables; these may or may not correspond to an actual data point in our sample. The parameter we would like to estimate is

θ₀ = β₀ + β₁c₁ + β₂c₂ + … + βₖcₖ
   = E(y | x₁ = c₁, x₂ = c₂, …, xₖ = cₖ).    (6.28)

The estimator of θ₀ is

θ̂₀ = β̂₀ + β̂₁c₁ + β̂₂c₂ + … + β̂ₖcₖ.    (6.29)

In practice, this is easy to compute. But what if we want some measure of the uncertainty in this predicted value? It is natural to construct a confidence interval for θ₀, which is centered at θ̂₀.

To obtain a confidence interval for θ₀, we need a standard error for θ̂₀. Then, with a large df, we can construct a 95% confidence interval using the rule of thumb θ̂₀ ± 2·se(θ̂₀). (As always, we can use the exact percentiles in a t distribution.)

How do we obtain the standard error of θ̂₀? This is the same problem we encountered in Section 4.4: we need to obtain a standard error for a linear combination of the OLS estimators. Here, the problem is even more complicated, because all of the OLS estimators generally appear in θ̂₀ (unless some cⱼ are zero). Nevertheless, the same trick that we used in Section 4.4 will work here. Write β₀ = θ₀ − β₁c₁ − … − βₖcₖ and plug this into the equation

y = β₀ + β₁x₁ + … + βₖxₖ + u

to obtain

y = θ₀ + β₁(x₁ − c₁) + β₂(x₂ − c₂) + … + βₖ(xₖ − cₖ) + u.    (6.30)

In other words, we subtract the value cⱼ from each observation on xⱼ, and then we run the regression of

yᵢ on (xᵢ₁ − c₁), …, (xᵢₖ − cₖ), i = 1, 2, …, n.    (6.31)

The predicted value in (6.29) and, more importantly, its standard error are obtained from the intercept (or constant) in regression (6.31).
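The trick in (6.30) and (6.31) takes only a few lines with statsmodels. Here is a minimal sketch on synthetic data; the data-generating values and the point (c1, c2) are arbitrary placeholders:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic sample standing in for real data.
rng = np.random.default_rng(1)
df = pd.DataFrame({"x1": rng.normal(size=300), "x2": rng.normal(size=300)})
df["y"] = 2 + 1.5 * df["x1"] - 0.7 * df["x2"] + rng.normal(size=300)

# Shift each regressor by the value at which we want the prediction; the
# intercept of the shifted regression is theta_0-hat, equation (6.29),
# and its reported standard error is se(theta_0-hat).
c1, c2 = 0.5, -1.0
res = smf.ols("y ~ I(x1 - c1) + I(x2 - c2)", data=df).fit()
print(res.params["Intercept"], res.bse["Intercept"])
```

Fitting the original, unshifted regression would give the same θ̂₀ by plugging in (c₁, c₂), but not its standard error directly; that is the point of the device.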
As an example, we obtain a confidence interval for a prediction from a college GPA regression, where we use high school information.

Example 6.5: Confidence Interval for Predicted College GPA

Using the data in GPA2, we obtain the following equation for predicting college GPA:

colgpa = 1.493 + .00149 sat − .01386 hsperc
         (0.075) (.00007)      (.00056)
       − .06088 hsize + .00546 hsize²                 (6.32)
         (.01650)        (.00227)
n = 4,137, R² = .278, R̄² = .277, σ̂ = .560,

where we have reported estimates to several digits to reduce round-off error. What is predicted college GPA when sat = 1,200, hsperc = 30, and hsize = 5 (which means 500)? This is easy to get by plugging these values into equation (6.32): colgpa = 2.70 (rounded to two digits). Unfortunately, we cannot use equation (6.32) directly to get a confidence interval for the expected colgpa at the given values of the independent variables. One simple way to obtain a confidence interval is to define a new set of independent variables: sat0 = sat − 1,200, hsperc0 = hsperc − 30, hsize0 = hsize − 5, and hsizesq0 = hsize² − 25. When we regress colgpa on these new independent variables, we get

colgpa = 2.700 + .00149 sat0 − .01386 hsperc0
         (.020)  (.00007)       (.00056)
       − .06088 hsize0 + .00546 hsizesq0
         (.01650)         (.00227)
n = 4,137, R² = .278, R̄² = .277, σ̂ = .560.

The only difference between this regression and that in (6.32) is the intercept, which is the prediction we want, along with its standard error, .020. It is not an accident that the slope coefficients, their standard errors, R-squared, and so on are the same as before; this provides a way to check that the proper transformations were done. We can easily construct a 95% confidence interval for the expected college GPA: 2.70 ± 1.96(.020), or about 2.66 to 2.74. This confidence interval is rather narrow, due to the very large sample size.

Because the variance of the intercept estimator is smallest when each explanatory variable has zero sample mean (see Question 2.5 for the simple regression case), it follows from the regression in (6.31) that the variance of the prediction is smallest at the mean values of the xⱼ (that is, cⱼ = x̄ⱼ for all j). This result is not too surprising, since we have the most faith in our regression line near the middle of the data. As the values of the cⱼ get farther away from the x̄ⱼ, Var(ŷ) gets larger and larger.

The previous method allows us to put a confidence interval around the OLS estimate of E(y | x₁, …, xₖ) for any values of the explanatory variables. In other words, we obtain a confidence interval for the average value of y for the subpopulation with a given set of covariates. But a confidence interval for the average person in the subpopulation is not the same as a confidence interval for a particular unit (individual, family, firm, and so on) from the population. In forming a confidence interval for an unknown outcome on y, we must account for another very important source of variation: the variance in the unobserved error, which measures our ignorance of the unobserved factors that affect y.

Let y⁰ denote the value for which we would like to construct a confidence interval, which we sometimes call a prediction interval. For example, y⁰ could represent a person or firm not in our original sample. Let x₁⁰, …, xₖ⁰ be the new values of the independent variables, which we assume we observe, and let u⁰ be the unobserved error. Therefore, we have

y⁰ = β₀ + β₁x₁⁰ + β₂x₂⁰ + … + βₖxₖ⁰ + u⁰.    (6.33)
Because the variance of the intercept estimator is smallest when each explanatory variable has zero sample mean (see Question 2.5 for the simple regression case), it follows from the regression in (6.31) that the variance of the prediction is smallest at the mean values of the x_j. (That is, c_j = \bar{x}_j for all j.) This result is not too surprising, since we have the most faith in our regression line near the middle of the data. As the values of the c_j get farther away from the \bar{x}_j, Var(\hat{y}) gets larger and larger.

The previous method allows us to put a confidence interval around the OLS estimate of E(y | x_1, ..., x_k) for any values of the explanatory variables. In other words, we obtain a confidence interval for the average value of y for the subpopulation with a given set of covariates. But a confidence interval for the average person in the subpopulation is not the same as a confidence interval for a particular unit (individual, family, firm, and so on) from the population. In forming a confidence interval for an unknown outcome on y, we must account for another very important source of variation: the variance in the unobserved error, which measures our ignorance of the unobserved factors that affect y.

Let y^0 denote the value for which we would like to construct a confidence interval, which we sometimes call a prediction interval. For example, y^0 could represent a person or firm not in our original sample. Let x_1^0, ..., x_k^0 be the new values of the independent variables, which we assume we observe, and let u^0 be the unobserved error. Therefore, we have

y^0 = \beta_0 + \beta_1 x_1^0 + \beta_2 x_2^0 + \cdots + \beta_k x_k^0 + u^0.    (6.33)

As before, our best prediction of y^0 is the expected value of y^0 given the explanatory variables, which we estimate from the OLS regression line: \hat{y}^0 = \hat{\beta}_0 + \hat{\beta}_1 x_1^0 + \hat{\beta}_2 x_2^0 + \cdots + \hat{\beta}_k x_k^0. The prediction error in using \hat{y}^0 to predict y^0 is

\hat{e}^0 = y^0 − \hat{y}^0 = (\beta_0 + \beta_1 x_1^0 + \cdots + \beta_k x_k^0) + u^0 − \hat{y}^0.    (6.34)

Now, E(\hat{y}^0) = E(\hat{\beta}_0) + E(\hat{\beta}_1)x_1^0 + E(\hat{\beta}_2)x_2^0 + \cdots + E(\hat{\beta}_k)x_k^0 = \beta_0 + \beta_1 x_1^0 + \cdots + \beta_k x_k^0, because the \hat{\beta}_j are unbiased. (As before, these expectations are all conditional on the sample values of the independent variables.) Because u^0 has zero mean, E(\hat{e}^0) = 0. We have shown that the expected prediction error is zero.

In finding the variance of \hat{e}^0, note that u^0 is uncorrelated with each \hat{\beta}_j, because u^0 is uncorrelated with the errors in the sample used to obtain the \hat{\beta}_j. By basic properties of covariance (see Appendix B), u^0 and \hat{y}^0 are uncorrelated. Therefore, the variance of the prediction error (conditional on all in-sample values of the independent variables) is the sum of the variances:

Var(\hat{e}^0) = Var(\hat{y}^0) + Var(u^0) = Var(\hat{y}^0) + \sigma^2,    (6.35)

where \sigma^2 = Var(u^0) is the error variance. There are two sources of variation in \hat{e}^0. The first is the sampling error in \hat{y}^0, which arises because we have estimated the \beta_j. Because each \hat{\beta}_j has a variance proportional to 1/n, where n is the sample size, Var(\hat{y}^0) is proportional to 1/n. This means that, for large samples, Var(\hat{y}^0) can be very small. By contrast, \sigma^2 is the variance of the error in the population; it does not change with the sample size. In many examples, \sigma^2 will be the dominant term in (6.35).

Under the classical linear model assumptions, the \hat{\beta}_j and u^0 are normally distributed, and so \hat{e}^0 is also normally distributed (conditional on all sample values of the explanatory variables). Earlier, we described how to obtain an unbiased estimator of Var(\hat{y}^0), and we obtained our unbiased estimator of \sigma^2 in Chapter 3. By using these estimators, we can define the standard error of \hat{e}^0 as

se(\hat{e}^0) = \{[se(\hat{y}^0)]^2 + \hat{\sigma}^2\}^{1/2}.    (6.36)

Using the same reasoning as for the t statistics of the \hat{\beta}_j, \hat{e}^0/se(\hat{e}^0) has a t distribution with n − (k + 1) degrees of freedom. Therefore,

P[−t_{.025} ≤ \hat{e}^0/se(\hat{e}^0) ≤ t_{.025}] = .95,

where t_{.025} is the 97.5th percentile in the t_{n−k−1} distribution. For large n − k − 1, remember that t_{.025} ≈ 1.96. Plugging in \hat{e}^0 = y^0 − \hat{y}^0 and rearranging gives a 95% prediction interval for y^0:

\hat{y}^0 ± t_{.025} se(\hat{e}^0);    (6.37)

as usual, except for small df, a good rule of thumb is \hat{y}^0 ± 2 se(\hat{e}^0). This is wider than the confidence interval for \hat{y}^0 itself, because of \hat{\sigma}^2 in (6.36); it often is much wider, to reflect the factors in u^0 that we have not accounted for.
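Equations (6.36) and (6.37) translate directly into code. Here is a short sketch; the function name and inputs are illustrative, with se_yhat0 being the standard error of the mean prediction (obtained, for example, from the centering regression above) and df_resid = n − k − 1:

    import numpy as np
    from scipy import stats

    def prediction_interval(yhat0, se_yhat0, sigma_hat, df_resid, level=0.95):
        se_e0 = np.sqrt(se_yhat0 ** 2 + sigma_hat ** 2)        # equation (6.36)
        t_crit = stats.t.ppf(0.5 + level / 2, df_resid)
        return yhat0 - t_crit * se_e0, yhat0 + t_crit * se_e0  # equation (6.37)

With yhat0 = 2.70, se_yhat0 = .020, sigma_hat = .560, and df_resid = 4,132, this reproduces the interval in Example 6.6 below, roughly 1.60 to 3.80.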
Example 6.6: Confidence Interval for Future College GPA

Suppose we want a 95% CI for the future college GPA of a high school student with sat = 1,200, hsperc = 30, and hsize = 5. In Example 6.5, we obtained a 95% confidence interval for the average college GPA among all students with the particular characteristics sat = 1,200, hsperc = 30, and hsize = 5. Now, we want a 95% confidence interval for any particular student with these characteristics. The 95% prediction interval must account for the variation in the individual, unobserved characteristics that affect college performance. We have everything we need to obtain a CI for colgpa: se(\hat{y}^0) = .020 and \hat{\sigma} = .560, and so, from (6.36), se(\hat{e}^0) = [(.020)^2 + (.560)^2]^{1/2} ≈ .560. Notice how small se(\hat{y}^0) is relative to \hat{\sigma}: virtually all of the variation in \hat{e}^0 comes from the variation in u^0. The 95% CI is 2.70 ± 1.96(.560), or about 1.60 to 3.80. This is a wide confidence interval and shows that, based on the factors we included in the regression, we cannot accurately pin down an individual's future college grade point average. (In one sense, this is good news, as it means that high school rank and performance on the SAT do not preordain one's performance in college.) Evidently, the unobserved characteristics that affect college GPA vary widely among individuals with the same observed SAT score and high school rank.

6.4b Residual Analysis

Sometimes it is useful to examine individual observations to see whether the actual value of the dependent variable is above or below the predicted value; that is, to examine the residuals for the individual observations. This process is called residual analysis. Economists have been known to examine the residuals from a regression in order to aid in the purchase of a home. The following housing price example illustrates residual analysis.

Housing price is related to various observable characteristics of the house. We can list all of the characteristics that we find important, such as size, number of bedrooms, number of bathrooms, and so on. We can use a sample of houses to estimate a relationship between price and attributes, where we end up with a predicted value and an actual value for each house. Then, we can construct the residuals, \hat{u}_i = y_i − \hat{y}_i. The house with the most negative residual is, at least based on the factors we have controlled for, the most underpriced one relative to its observed characteristics. Of course, a selling price substantially below its predicted price could indicate some undesirable feature of the house that we have failed to account for, and which is therefore contained in the unobserved error. In addition to obtaining the prediction and residual, it also makes sense to compute a confidence interval for what the future selling price of the home could be, using the method described in equation (6.37).

Using the data in HPRICE1, we run a regression of price on lotsize, sqrft, and bdrms. In the sample of 88 homes, the most negative residual is −120.206, for the 81st house. Therefore, the asking price for this house is $120,206 below its predicted price.
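A sketch of this calculation, assuming HPRICE1 is loaded as a DataFrame df with columns price, lotsize, sqrft, and bdrms (the loading step is assumed):

    import statsmodels.formula.api as smf

    res = smf.ols("price ~ lotsize + sqrft + bdrms", data=df).fit()
    # Index and size of the most negative residual: the most "underpriced"
    # house relative to the model.
    print(res.resid.idxmin(), res.resid.min())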
There are many other uses of residual analysis. One way to rank law schools is to regress median starting salary on a variety of student characteristics (such as median LSAT scores of entering class, median college GPA of entering class, and so on) and to obtain a predicted value and residual for each law school. The law school with the largest residual has the highest predicted value added. (Of course, there is still much uncertainty about how an individual's starting salary would compare with the median for a law school overall.) These residuals can be used along with the costs of attending each law school to determine the best value; this would require an appropriate discounting of future earnings.

Residual analysis also plays a role in legal decisions. A New York Times article entitled "Judge Says Pupil's Poverty, Not Segregation, Hurts Scores" (6/28/95) describes an important legal case. The issue was whether the poor performance on standardized tests in the Hartford School District, relative to performance in surrounding suburbs, was due to poor school quality at the highly segregated schools. The judge concluded that the "disparity in test scores does not indicate that Hartford is doing an inadequate or poor job in educating its students or that its schools are failing, because the predicted scores based upon the relevant socioeconomic factors are about at the levels that one would expect." This conclusion is based on a regression analysis of average or median scores on socioeconomic characteristics of various school districts in Connecticut. The judge's conclusion suggests that, given the poverty levels of students at Hartford schools, the actual test scores were similar to those predicted from a regression analysis: the residual for Hartford was not sufficiently negative to conclude that the schools themselves were the cause of low test scores.

Exploring Further 6.5: How would you use residual analysis to determine which professional athletes are overpaid or underpaid relative to their performance?

6.4c Predicting y When log(y) Is the Dependent Variable

Because the natural log transformation is used so often for the dependent variable in empirical economics, we devote this subsection to the issue of predicting y when log(y) is the dependent variable. As a byproduct, we will obtain a goodness-of-fit measure for the log model that can be compared with the R-squared from the level model.

To obtain a prediction, it is useful to define logy = log(y); this emphasizes that it is the log of y that is predicted in the model

logy = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k + u.    (6.38)

In this equation, the x_j might be transformations of other variables; for example, we could have x_1 = log(sales), x_2 = log(mktval), and x_3 = ceoten in the CEO salary example. Given the OLS estimators, we know how to predict logy for any value of the independent variables:

\widehat{logy} = \hat{\beta}_0 + \hat{\beta}_1 x_1 + \hat{\beta}_2 x_2 + \cdots + \hat{\beta}_k x_k.    (6.39)

Now, since the exponential undoes the log, our first guess for predicting y is to simply exponentiate the predicted value for log(y): \hat{y} = exp(\widehat{logy}). This does not work; in fact, it will systematically underestimate the expected value of y. If model (6.38) follows the CLM assumptions MLR.1 through MLR.6, it can be shown that

E(y|x) = exp(\sigma^2/2) exp(\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k),

where x denotes the independent variables and \sigma^2 is the variance of u. [If u ~ Normal(0, \sigma^2), then the expected value of exp(u) is exp(\sigma^2/2).] This equation shows that a simple adjustment is needed to predict y:

\hat{y} = exp(\hat{\sigma}^2/2) exp(\widehat{logy}),    (6.40)

where \hat{\sigma}^2 is simply the unbiased estimator of \sigma^2. Because \hat{\sigma} (the standard error of the regression) is always reported, obtaining predicted values for y is easy.
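As a quick numerical sketch of the adjustment in (6.40), with illustrative inputs echoing the CEO salary example later in this section (both numbers are assumptions for the sketch):

    import numpy as np

    sigma_hat = 0.505    # standard error of the log(y) regression (assumed)
    logy_hat = 7.013     # predicted log(y) at the chosen x values (assumed)
    naive = np.exp(logy_hat)                        # systematically too small
    adjusted = np.exp(sigma_hat ** 2 / 2) * naive   # equation (6.40)

Here the adjustment factor exp(.505^2/2) ≈ 1.136 raises the naive prediction by about 14%.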
Because \hat{\sigma}^2 > 0, exp(\hat{\sigma}^2/2) > 1. For large \hat{\sigma}^2, this adjustment factor can be substantially larger than unity. The prediction in (6.40) is not unbiased, but it is consistent. There are no unbiased predictions of y, and in many cases (6.40) works well. However, it does rely on the normality of the error term, u. In Chapter 5, we showed that OLS has desirable properties even when u is not normally distributed. Therefore, it is useful to have a prediction that does not rely on normality. If we just assume that u is independent of the explanatory variables, then we have

E(y|x) = \alpha_0 exp(\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k),    (6.41)

where \alpha_0 is the expected value of exp(u), which must be greater than unity. Given an estimate \hat{\alpha}_0, we can predict y as

\hat{y} = \hat{\alpha}_0 exp(\widehat{logy}),    (6.42)

which again simply requires exponentiating the predicted value from the log model and multiplying the result by \hat{\alpha}_0.

Two approaches suggest themselves for estimating \alpha_0 without the normality assumption. The first is based on \alpha_0 = E[exp(u)]. To estimate \alpha_0, we replace the population expectation with a sample average and then we replace the unobserved errors, u_i, with the OLS residuals, \hat{u}_i = log(y_i) − \hat{\beta}_0 − \hat{\beta}_1 x_{i1} − \cdots − \hat{\beta}_k x_{ik}. This leads to the method of moments estimator (see Appendix C):

\hat{\alpha}_0 = n^{-1} \sum_{i=1}^n exp(\hat{u}_i).    (6.43)

Not surprisingly, \hat{\alpha}_0 is a consistent estimator of \alpha_0, but it is not unbiased, because we have replaced u_i with \hat{u}_i inside a nonlinear function. This version of \hat{\alpha}_0 is a special case of what Duan (1983) called a smearing estimate. Because the OLS residuals have a zero sample average, it can be shown that, for any data set, \hat{\alpha}_0 > 1. (Technically, \hat{\alpha}_0 would equal one if all the OLS residuals were zero, but this never happens in any interesting application.) That \hat{\alpha}_0 is necessarily greater than one is convenient, because it must be that \alpha_0 > 1.

A different estimate of \alpha_0 is based on a simple regression through the origin. To see how it works, define m_i = exp(\beta_0 + \beta_1 x_{i1} + \cdots + \beta_k x_{ik}), so that, from equation (6.41), E(y_i|m_i) = \alpha_0 m_i. If we could observe the m_i, we could obtain an unbiased estimator of \alpha_0 from the regression of y_i on m_i without an intercept. Instead, we replace the \beta_j with their OLS estimates and obtain \hat{m}_i = exp(\widehat{logy}_i), where, of course, the \widehat{logy}_i are the fitted values from the regression of logy_i on x_{i1}, ..., x_{ik} (with an intercept). Then \check{\alpha}_0 [to distinguish it from \hat{\alpha}_0 in equation (6.43)] is the OLS slope estimate from the simple regression of y_i on \hat{m}_i (no intercept):

\check{\alpha}_0 = \left( \sum_{i=1}^n \hat{m}_i^2 \right)^{-1} \left( \sum_{i=1}^n \hat{m}_i y_i \right).    (6.44)

We will call \check{\alpha}_0 the regression estimate of \alpha_0. Like \hat{\alpha}_0, \check{\alpha}_0 is consistent but not unbiased. Interestingly, \check{\alpha}_0 is not guaranteed to be greater than one, although it will be in most applications. If \check{\alpha}_0 is less than one, and especially if it is much less than one, it is likely that the assumption of independence between u and the x_j is violated. If \check{\alpha}_0 < 1, one possibility is to just use the estimate in (6.43), although this may simply be masking a problem with the linear model for log(y).

We summarize the steps.

Predicting y When the Dependent Variable Is log(y):
1. Obtain the fitted values \widehat{logy}_i and residuals \hat{u}_i from the regression of logy on x_1, ..., x_k.
2. Obtain \hat{\alpha}_0 as in equation (6.43) or \check{\alpha}_0 as in equation (6.44).
3. For given values of x_1, ..., x_k, obtain \widehat{logy} from (6.39).
4. Obtain the prediction \hat{y} from (6.42), with \hat{\alpha}_0 or \check{\alpha}_0.
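A compact sketch of these four steps for the CEO salary model used in Example 6.7 below, assuming CEOSAL2 is loaded as a DataFrame df (the data handling is illustrative, not from the text):

    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    # Step 1: regression for log(salary); keep fitted values and residuals.
    res = smf.ols("np.log(salary) ~ np.log(sales) + np.log(mktval) + ceoten",
                  data=df).fit()
    uhat = res.resid
    mhat = np.exp(res.fittedvalues)

    # Step 2: the two estimates of alpha_0.
    alpha_smear = np.exp(uhat).mean()                            # equation (6.43)
    alpha_reg = (mhat * df["salary"]).sum() / (mhat ** 2).sum()  # equation (6.44)

    # Steps 3 and 4: predict log(y) at chosen x values, then rescale.
    new = pd.DataFrame({"sales": [5000], "mktval": [10000], "ceoten": [10]})
    logy0 = res.predict(new)
    y0 = alpha_smear * np.exp(logy0)                             # equation (6.42)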
We now show how to predict CEO salaries using this procedure.

Example 6.7: Predicting CEO Salaries

The model of interest is

log(salary) = \beta_0 + \beta_1 log(sales) + \beta_2 log(mktval) + \beta_3 ceoten + u,

so that \beta_1 and \beta_2 are elasticities and 100 \beta_3 is a semi-elasticity. The estimated equation using CEOSAL2 is

\widehat{lsalary} = 4.504 + .163 lsales + .109 lmktval + .0117 ceoten    (6.45)
                   (.257)  (.039)       (.050)         (.0053)
n = 177, R^2 = .318,

where, for clarity, we let lsalary denote the log of salary, and similarly for lsales and lmktval. Next, we obtain \hat{m}_i = exp(\widehat{lsalary}_i) for each observation in the sample. The Duan smearing estimate from (6.43) is about \hat{\alpha}_0 = 1.136, and the regression estimate from (6.44) is \check{\alpha}_0 = 1.117. We can use either estimate to predict salary for any values of sales, mktval, and ceoten. Let us find the prediction for sales = 5,000 (which means $5 billion, because sales is in millions), mktval = 10,000 (or $10 billion), and ceoten = 10. From (6.45), the prediction for lsalary is 4.504 + .163 log(5,000) + .109 log(10,000) + .0117(10) ≈ 7.013, and exp(7.013) ≈ 1,110.983. Using the estimate of \alpha_0 from (6.43), the predicted salary is about 1,262.077, or $1,262,077. Using the estimate from (6.44) gives an estimated salary of about $1,240,968. These differ from each other by much less than each differs from the naive prediction of $1,110,983.

We can use the previous method of obtaining predictions to determine how well the model with log(y) as the dependent variable explains y. We already have measures for models when y is the dependent variable: the R-squared and the adjusted R-squared. The goal is to find a goodness-of-fit measure in the log(y) model that can be compared with an R-squared from a model where y is the dependent variable.

There are different ways to define a goodness-of-fit measure after retransforming a model for log(y) to predict y. Here we present two approaches that are easy to implement. The first gives the same goodness-of-fit measure whether we estimate \alpha_0 as in (6.40), (6.43), or (6.44). To motivate the measure, recall that, in the linear regression equation estimated by OLS,

\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x_1 + \cdots + \hat{\beta}_k x_k,    (6.46)

the usual R-squared is simply the square of the correlation between y_i and \hat{y}_i (see Section 3.2). Now, if instead we compute fitted values from (6.42), that is, \hat{y}_i = \hat{\alpha}_0 \hat{m}_i for all observations i, then it makes sense to use the square of the correlation between y_i and these fitted values as an R-squared. Because correlation is unaffected if we multiply by a constant, it does not matter which estimate of \alpha_0 we use. In fact, this R-squared measure for y (not log(y)) is just the squared correlation between y_i and \hat{m}_i. We can compare this directly with the R-squared from equation (6.46). The squared correlation measure does not depend on how we estimate \alpha_0.
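Continuing the earlier sketch, the squared-correlation fit measure is essentially one line (again assuming df and mhat from the previous snippet):

    import numpy as np

    r = np.corrcoef(df["salary"], mhat)[0, 1]
    fit_for_y = r ** 2   # comparable with the R-squared from a levels model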
A second approach is to compute an R-squared for y based on a sum of squared residuals. For concreteness, suppose we use equation (6.43) to estimate \alpha_0. Then the residual for predicting y_i is

\hat{r}_i = y_i − \hat{\alpha}_0 exp(\widehat{logy}_i),    (6.47)

and we can use these residuals to compute a sum of squared residuals. Using the formula for R-squared from linear regression, we are led to

1 − \frac{\sum_{i=1}^n \hat{r}_i^2}{\sum_{i=1}^n (y_i − \bar{y})^2}    (6.48)

as an alternative goodness-of-fit measure that can be compared with the R-squared from the linear model for y. Notice that we can compute such a measure for the alternative estimates of \alpha_0 in equations (6.40) and (6.44) by inserting those estimates in place of \hat{\alpha}_0 in (6.47). Unlike the squared correlation between y_i and \hat{m}_i, the R-squared in (6.48) will depend on how we estimate \alpha_0. The estimate that minimizes \sum_{i=1}^n \hat{r}_i^2 is that in equation (6.44), but that does not mean we should prefer it, and certainly not if \check{\alpha}_0 < 1. We are not really trying to choose among the different estimates of \alpha_0; rather, we are finding goodness-of-fit measures that can be compared with the linear model for y.

Example 6.8: Predicting CEO Salaries

After we obtain the \hat{m}_i, we just obtain the correlation between salary_i and \hat{m}_i; it is .493. The square of it is about .243, and this is a measure of how well the log model explains the variation in salary, not log(salary). [The R^2 from (6.45), .318, tells us that the log model explains about 31.8% of the variation in log(salary).] As a competing linear model, suppose we estimate a model with all variables in levels:

salary = \beta_0 + \beta_1 sales + \beta_2 mktval + \beta_3 ceoten + u.    (6.49)

The key is that the dependent variable is salary. (We could use logs of sales or mktval on the right-hand side, but it makes more sense to have all dollar values in levels if one, salary, appears as a level.) The R-squared from estimating this equation using the same 177 observations is .201. Thus, the log model explains more of the variation in salary, and so we prefer it to (6.49) on goodness-of-fit grounds. The log model is also preferred because it seems more realistic and its parameters are easier to interpret.

If we maintain the full set of classical linear model assumptions in the model (6.38), we can easily obtain prediction intervals for y^0 = exp(\beta_0 + \beta_1 x_1^0 + \cdots + \beta_k x_k^0 + u^0) when we have estimated a linear model for log(y). Recall that x_1^0, x_2^0, ..., x_k^0 are known values and u^0 is the unobserved error that partly determines y^0. From equation (6.37), a 95% prediction interval for logy^0 = log(y^0) is simply \widehat{logy}^0 ± t_{.025} se(\hat{e}^0), where se(\hat{e}^0) is obtained from the regression of log(y) on x_1, ..., x_k using the original n observations. Let c_l = \widehat{logy}^0 − t_{.025} se(\hat{e}^0) and c_u = \widehat{logy}^0 + t_{.025} se(\hat{e}^0) be the lower and upper bounds of the prediction interval for logy^0. That is, P(c_l ≤ logy^0 ≤ c_u) = .95. Because the exponential function is strictly increasing, it is also true that P[exp(c_l) ≤ exp(logy^0) ≤ exp(c_u)] = .95; that is, P[exp(c_l) ≤ y^0 ≤ exp(c_u)] = .95. Therefore, we can take exp(c_l) and exp(c_u) as the lower and upper bounds, respectively, for a 95% prediction interval for y^0. For large n, t_{.025} = 1.96, and so a 95% prediction interval for y^0 is exp[−1.96 se(\hat{e}^0)] exp(\hat{\beta}_0 + x^0\hat{\beta}) to exp[1.96 se(\hat{e}^0)] exp(\hat{\beta}_0 + x^0\hat{\beta}), where x^0\hat{\beta} is shorthand for \hat{\beta}_1 x_1^0 + \cdots + \hat{\beta}_k x_k^0. Remember, the \hat{\beta}_j and se(\hat{e}^0) are obtained from the regression with log(y) as the dependent variable.
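A sketch of this retransformed interval for large n, with the inputs treated as given numbers (se_e0 comes from applying (6.36) to the log regression; the function name is illustrative):

    import numpy as np

    def log_model_pi(logy0_hat, se_e0, t_crit=1.96):
        cl = logy0_hat - t_crit * se_e0
        cu = logy0_hat + t_crit * se_e0
        return np.exp(cl), np.exp(cu)   # bounds for y0 itself

For example, log_model_pi(7.013, 0.511) gives roughly (408, 3025) in thousands of dollars, matching the CEO salary calculation below.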
Because we assume normality of u in (6.38), we probably would use (6.40) to obtain a point prediction for y^0. Unlike in equation (6.37), this point prediction will not lie halfway between the lower and upper bounds exp(c_l) and exp(c_u). One can obtain different 95% prediction intervals by choosing different quantiles in the t_{n−k−1} distribution: if q_{\alpha_1} and q_{\alpha_2} are quantiles with \alpha_2 − \alpha_1 = .95, then we can choose c_l = q_{\alpha_1} se(\hat{e}^0) and c_u = q_{\alpha_2} se(\hat{e}^0).

As an example, consider the CEO salary regression, where we make the prediction at the same values of sales, mktval, and ceoten as in Example 6.7. The standard error of the regression in (6.45) is about .505, and the standard error of \widehat{logy}^0 is about .075. Therefore, using equation (6.36), se(\hat{e}^0) ≈ .511; as in the GPA example, the error variance swamps the estimation error in the parameters, even though here the sample size is only 177. A 95% prediction interval for salary^0 is exp[−1.96(.511)] exp(7.013) to exp[1.96(.511)] exp(7.013), or about 408.071 to 3,024.678; that is, $408,071 to $3,024,678. This very wide 95% prediction interval for CEO salary at the given sales, market value, and tenure values shows that there is much else that we have not included in the regression that determines salary. (Incidentally, the point prediction for salary, using (6.40), is about $1,262,075, higher than the predictions using the other estimates of \alpha_0 and closer to the lower bound than the upper bound of the 95% prediction interval.)

Summary

In this chapter, we have covered some important multiple regression analysis topics.

Section 6.1 showed that a change in the units of measurement of an independent variable changes the OLS coefficient in the expected manner: if x_j is multiplied by c, its coefficient is divided by c. If the dependent variable is multiplied by c, all OLS coefficients are multiplied by c. Neither t nor F statistics are affected by changing the units of measurement of any variables.

We discussed beta coefficients, which measure the effects of the independent variables on the dependent variable in standard deviation units. The beta coefficients are obtained from a standard OLS regression after the dependent and independent variables have been transformed into z-scores.
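As a reminder of the mechanics, here is a minimal sketch of beta coefficients via z-scores; the DataFrame df is assumed to contain only the numeric variables involved, and the column names y, x1, and x2 are placeholders:

    import statsmodels.formula.api as smf

    z = (df - df.mean()) / df.std()              # standardize every column
    res = smf.ols("y ~ x1 + x2", data=z).fit()
    # The slopes on x1 and x2 are now beta (standardized) coefficients.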
We provided a detailed discussion of functional form, including the logarithmic transformation, quadratics, and interaction terms. It is helpful to summarize some of our conclusions.

Considerations When Using Logarithms
1. The coefficients have percentage change interpretations. We can be ignorant of the units of measurement of any variable that appears in logarithmic form, and changing units from, say, dollars to thousands of dollars has no effect on a variable's coefficient when that variable appears in logarithmic form.
2. Logs are often used for dollar amounts that are always positive, as well as for variables such as population, especially when there is a lot of variation. They are used less often for variables measured in years, such as schooling, age, and experience. Logs are used infrequently for variables that are already percents or proportions, such as an unemployment rate or a pass rate on a test.
3. Models with log(y) as the dependent variable often more closely satisfy the classical linear model assumptions. For example, the model has a better chance of being linear, homoskedasticity is more likely to hold, and normality is often more plausible.
4. In many cases, taking the log greatly reduces the variation of a variable, making OLS estimates less prone to outlier influence. However, in cases where y is a fraction and close to zero for many observations, log(y_i) can have much more variability than y_i. For values y_i very close to zero, log(y_i) is a negative number very large in magnitude.
5. If y ≥ 0 but y = 0 is possible, we cannot use log(y). Sometimes log(1 + y) is used, but interpretation of the coefficients is difficult.
6. For large changes in an explanatory variable, we can compute a more accurate estimate of the percentage change effect.
7. It is harder (but possible) to predict y when we have estimated a model for log(y).

Considerations When Using Quadratics
1. A quadratic function in an explanatory variable allows for an increasing or decreasing effect.
2. The turning point of a quadratic is easily calculated, and it should be calculated to see if it makes sense.
3. Quadratic functions where the coefficients have the opposite sign have a strictly positive turning point; if the signs of the coefficients are the same, the turning point is at a negative value of x.
4. A seemingly small coefficient on the square of a variable can be practically important in what it implies about a changing slope. One can use a t test to see if the quadratic is statistically significant, and compute the slope at various values of x to see if it is practically important.
5. For a model quadratic in a variable x, the coefficient on x measures the partial effect starting from x = 0, as can be seen in equation (6.11). If zero is not a possible or interesting value of x, one can center x about a more interesting value, such as the average in the sample, before computing the square. This is the same as computing the average partial effect. Computer Exercise C12 provides an example.

Considerations When Using Interactions
1. Interaction terms allow the partial effect of an explanatory variable, say x_1, to depend on the level of another variable, say x_2, and vice versa.
2. Interpreting models with interactions can be tricky. The coefficient on x_1, say \beta_1, measures the partial effect of x_1 on y when x_2 = 0, which may be impossible or uninteresting. Centering x_1 and x_2 around interesting values before constructing the interaction term typically leads to an equation that is visually more appealing. When the variables are centered about their sample averages, the coefficients on the levels become estimated average partial effects.
3. A standard t test can be used to determine if an interaction term is statistically significant. Computing the partial effects at different values of the explanatory variables can be used to determine the practical importance of interactions.

We introduced the adjusted R-squared, \bar{R}^2, as an alternative to the usual R-squared for measuring goodness-of-fit. Whereas R^2 can never fall when another variable is added to a regression, \bar{R}^2 penalizes the number of regressors and can drop when an independent variable is added. This makes \bar{R}^2 preferable for choosing between nonnested models with different numbers of explanatory variables. Neither R^2 nor \bar{R}^2 can be used to compare models with different dependent variables. Nevertheless, it is fairly easy to obtain goodness-of-fit measures for choosing between y and log(y) as the dependent variable, as shown in Section 6.4.
In Section 6.3, we discussed the somewhat subtle problem of relying too much on R^2 or \bar{R}^2 in arriving at a final model: it is possible to control for too many factors in a regression model. For this reason, it is important to think ahead about model specification, particularly the ceteris paribus nature of the multiple regression equation. Explanatory variables that affect y and are uncorrelated with all the other explanatory variables can be used to reduce the error variance without inducing multicollinearity.

In Section 6.4, we demonstrated how to obtain a confidence interval for a prediction made from an OLS regression line. We also showed how a confidence interval can be constructed for a future, unknown value of y. Occasionally, we want to predict y when log(y) is used as the dependent variable in a regression model; Section 6.4 explains this simple method. Finally, we are sometimes interested in knowing about the sign and magnitude of the residuals for particular observations. Residual analysis can be used to determine whether particular members of the sample have predicted values that are well above or well below the actual outcomes.

Key Terms
Adjusted R-Squared; Average Partial Effect (APE); Beta Coefficients; Bootstrap; Bootstrap Standard Error; Interaction Effect; Nonnested Models; Over Controlling; Population R-Squared; Prediction Error; Prediction Interval; Predictions; Quadratic Functions; Resampling Method; Residual Analysis; Smearing Estimate; Standardized Coefficients; Variance of the Prediction Error

Problems

1. The following equation was estimated using the data in CEOSAL1:

\widehat{log(salary)} = 4.322 + .276 log(sales) + .0215 roe − .00008 roe^2
                       (.324)  (.033)           (.0129)     (.00026)
n = 209, R^2 = .282.

This equation allows roe to have a diminishing effect on log(salary). Is this generality necessary? Explain why or why not.

2. Let \hat{\beta}_0, \hat{\beta}_1, ..., \hat{\beta}_k be the OLS estimates from the regression of y_i on x_{i1}, ..., x_{ik}, i = 1, 2, ..., n. For nonzero constants c_1, ..., c_k, argue that the OLS intercept and slopes from the regression of c_0 y_i on c_1 x_{i1}, ..., c_k x_{ik}, i = 1, 2, ..., n, are given by \tilde{\beta}_0 = c_0 \hat{\beta}_0, \tilde{\beta}_1 = (c_0/c_1)\hat{\beta}_1, ..., \tilde{\beta}_k = (c_0/c_k)\hat{\beta}_k. [Hint: Use the fact that the \hat{\beta}_j solve the first order conditions in (3.13), and the \tilde{\beta}_j must solve the first order conditions involving the rescaled dependent and independent variables.]

3. Using the data in RDCHEM, the following equation was obtained by OLS:

\widehat{rdintens} = 2.613 + .00030 sales − .0000000070 sales^2
                    (.429)  (.00014)       (.0000000037)
n = 32, R^2 = .1484.

(i) At what point does the marginal effect of sales on rdintens become negative?
(ii) Would you keep the quadratic term in the model? Explain.
(iii) Define salesbil as sales measured in billions of dollars: salesbil = sales/1,000. Rewrite the estimated equation with salesbil and salesbil^2 as the independent variables. Be sure to report standard errors and the R-squared. [Hint: Note that salesbil^2 = sales^2/(1,000)^2.]
(iv) For the purpose of reporting the results, which equation do you prefer?

4. The following model allows the return to education to depend upon the total amount of both parents' education, called pareduc:

log(wage) = \beta_0 + \beta_1 educ + \beta_2 educ·pareduc + \beta_3 exper + \beta_4 tenure + u.

(i) Show that, in decimal form, the return to another year of education in this model is \Delta log(wage)/\Delta educ = \beta_1 + \beta_2 pareduc. What sign do you expect for \beta_2? Why?
(ii) Using the data in WAGE2, the estimated equation is

\widehat{log(wage)} = 5.65 + .047 educ + .00078 educ·pareduc
                     (1.13)  (.010)     (.00021)
                     + .019 exper + .010 tenure
                       (.004)       (.003)
n = 722, R^2 = .169.

(Only 722 observations contain full information on parents' education.) Interpret the coefficient on the interaction term. It might help to choose two specific values for pareduc, for example, pareduc = 32 if both parents have a college education, or pareduc = 24 if both parents have a high school education, and to compare the estimated return to educ.
(iii) When pareduc is added as a separate variable to the equation, we get

\widehat{log(wage)} = 4.94 + .097 educ + .033 pareduc − .0016 educ·pareduc
                     (.38)  (.027)      (.017)         (.0012)
                     + .020 exper + .010 tenure
                       (.004)       (.003)
n = 722, R^2 = .174.

Does the estimated return to education now depend positively on parent education? Test the null hypothesis that the return to education does not depend on parent education.

5. In Example 4.2, where the percentage of students receiving a passing score on a tenth-grade math exam (math10) is the dependent variable, does it make sense to include sci11, the percentage of eleventh graders passing a science exam, as an additional explanatory variable?

6. When atndrte^2 and ACT·atndrte are added to the equation estimated in (6.19), the R-squared becomes .232. Are these additional terms jointly significant at the 10% level? Would you include them in the model?

7. The following three equations were estimated using the 1,534 observations in 401K:

\widehat{prate} = 80.29 + 5.44 mrate + .269 age − .00013 totemp
                 (.78)   (.52)        (.045)     (.00004)
R^2 = .100, \bar{R}^2 = .098.

\widehat{prate} = 97.32 + 5.02 mrate + .314 age − 2.66 log(totemp)
                 (1.95)  (0.51)       (.044)     (.28)
R^2 = .144, \bar{R}^2 = .142.

\widehat{prate} = 80.62 + 5.34 mrate + .290 age − .00043 totemp + .0000000039 totemp^2
                 (.78)   (.52)        (.045)     (.00009)       (.0000000010)
R^2 = .108, \bar{R}^2 = .106.

Which of these three models do you prefer? Why?

8. Suppose we want to estimate the effects of alcohol consumption (alcohol) on college grade point average (colGPA). In addition to collecting information on grade point averages and alcohol usage, we also obtain attendance information (say, percentage of lectures attended, called attend). A standardized test score (say, SAT) and high school GPA (hsGPA) are also available.
(i) Should we include attend along with alcohol as explanatory variables in a multiple regression model? (Think about how you would interpret \beta_{alcohol}.)
(ii) Should SAT and hsGPA be included as explanatory variables? Explain.

9. If we start with (6.38) under the CLM assumptions, assume large n, and ignore the estimation error in the \hat{\beta}_j, a 95% prediction interval for y^0 is [exp(−1.96\hat{\sigma}) exp(\widehat{logy}^0), exp(1.96\hat{\sigma}) exp(\widehat{logy}^0)]. The point prediction for y^0 is \hat{y}^0 = exp(\hat{\sigma}^2/2) exp(\widehat{logy}^0).
(i) For what values of \hat{\sigma} will the point prediction be in the 95% prediction interval? Does this condition seem likely to hold in most applications?
(ii) Verify that the condition from part (i) is satisfied in the CEO salary example.

10. The following two equations were estimated using the data in MEAPSINGLE. The key explanatory variable is lexppp, the log of expenditures per student at the school level.

\widehat{math4} = 24.49 + 9.01 lexppp − .422 free − .752 lmedinc − .274 pctsgle
                 (59.24) (4.04)        (.071)      (5.358)       (.161)
n = 229, R^2 = .472, \bar{R}^2 = .462.

\widehat{math4} = 149.38 + 1.93 lexppp − .060 free − 10.78 lmedinc − .397 pctsgle + .667 read4
                 (41.70)  (2.82)        (.054)      (3.76)         (.111)         (.042)
n = 229, R^2 = .749, \bar{R}^2 = .743.

(i) If you are a policy maker trying to estimate the causal effect of per-student spending on math test performance, explain why the first equation is more relevant than the second. What is the estimated effect of a 10% increase in expenditures per student?
(ii) Does adding read4 to the regression have strange effects on coefficients and statistical significance other than \beta_{lexppp}?
(iii) How would you explain to someone with only basic knowledge of regression why, in this case, you prefer the equation with the smaller adjusted R-squared?

Computer Exercises

C1 Use the data in KIELMC, only for the year 1981, to answer the following questions. The data are for houses that sold during 1981 in North Andover, Massachusetts; 1981 was the year construction began on a local garbage incinerator.
(i) To study the effects of the incinerator location on housing price, consider the simple regression model

log(price) = \beta_0 + \beta_1 log(dist) + u,

where price is housing price in dollars and dist is distance from the house to the incinerator measured in feet. Interpreting this equation causally, what sign do you expect for \beta_1 if the presence of the incinerator depresses housing prices? Estimate this equation and interpret the results.
(ii) To the simple regression model in part (i), add the variables log(intst), log(area), log(land), rooms, baths, and age, where intst is distance from the home to the interstate, area is square footage of the house, land is the lot size in square feet, rooms is total number of rooms, baths is number of bathrooms, and age is age of the house in years. Now, what do you conclude about the effects of the incinerator? Explain why (i) and (ii) give conflicting results.
(iii) Add [log(intst)]^2 to the model from part (ii). Now what happens? What do you conclude about the importance of functional form?
(iv) Is the square of log(dist) significant when you add it to the model from part (iii)?

C2 Use the data in WAGE1 for this exercise.
(i) Use OLS to estimate the equation

log(wage) = \beta_0 + \beta_1 educ + \beta_2 exper + \beta_3 exper^2 + u

and report the results using the usual format.
(ii) Is exper^2 statistically significant at the 1% level?
(iii) Using the approximation \%\Delta\widehat{wage} ≈ 100(\hat{\beta}_2 + 2\hat{\beta}_3 exper)\Delta exper, find the approximate return to the fifth year of experience. What is the approximate return to the twentieth year of experience?
(iv) At what value of exper does additional experience actually lower predicted log(wage)? How many people have more experience in this sample?

C3 Consider a model where the return to education depends upon the amount of work experience (and vice versa):

log(wage) = \beta_0 + \beta_1 educ + \beta_2 exper + \beta_3 educ·exper + u.

(i) Show that the return to another year of education (in decimal form), holding exper fixed, is \beta_1 + \beta_3 exper.
(ii) State the null hypothesis that the return to education does not depend on the level of exper. What do you think is the appropriate alternative?
(iii) Use the data in WAGE2 to test the null hypothesis in (ii) against your stated alternative.
(iv) Let \theta_1 denote the return to education (in decimal form) when exper = 10: \theta_1 = \beta_1 + 10\beta_3. Obtain \hat{\theta}_1 and a 95% confidence interval for \theta_1. [Hint: Write \beta_1 = \theta_1 − 10\beta_3 and plug this into the equation; then rearrange. This gives the regression for obtaining the confidence interval for \theta_1.]

C4 Use the data in GPA2 for this exercise.
(i) Estimate the model

sat = \beta_0 + \beta_1 hsize + \beta_2 hsize^2 + u,

where hsize is the size of the graduating class (in hundreds), and write the results in the usual form. Is the quadratic term statistically significant?
(ii) Using the estimated equation from part (i), what is the "optimal" high school size? Justify your answer.
(iii) Is this analysis representative of the academic performance of all high school seniors? Explain.
(iv) Find the estimated optimal high school size, using log(sat) as the dependent variable. Is it much different from what you obtained in part (ii)?

C5 Use the housing price data in HPRICE1 for this exercise.
(i) Estimate the model

log(price) = \beta_0 + \beta_1 log(lotsize) + \beta_2 log(sqrft) + \beta_3 bdrms + u

and report the results in the usual OLS format.
(ii) Find the predicted value of log(price), when lotsize = 20,000, sqrft = 2,500, and bdrms = 4. Using the methods in Section 6.4, find the predicted value of price at the same values of the explanatory variables.
(iii) For explaining variation in price, decide whether you prefer the model from part (i) or the model

price = \beta_0 + \beta_1 lotsize + \beta_2 sqrft + \beta_3 bdrms + u.

C6 Use the data in VOTE1 for this exercise.
(i) Consider a model with an interaction between expenditures:

voteA = \beta_0 + \beta_1 prtystrA + \beta_2 expendA + \beta_3 expendB + \beta_4 expendA·expendB + u.

What is the partial effect of expendB on voteA, holding prtystrA and expendA fixed? What is the partial effect of expendA on voteA? Is the expected sign for \beta_4 obvious?
(ii) Estimate the equation in part (i) and report the results in the usual form. Is the interaction term statistically significant?
(iii) Find the average of expendA in the sample. Fix expendA at 300 (for $300,000). What is the estimated effect of another $100,000 spent by Candidate B on voteA? Is this a large effect?
(iv) Now fix expendB at 100. What is the estimated effect of \Delta expendA = 100 on voteA? Does this make sense?
(v) Now, estimate a model that replaces the interaction with shareA, Candidate A's percentage share of total campaign expenditures.
Does it make sense to hold both expendA and expendB fixed, while changing shareA?
(vi) (Requires calculus) In the model from part (v), find the partial effect of expendB on voteA, holding prtystrA and expendA fixed. Evaluate this at expendA = 300 and expendB = 0 and comment on the results.

C7 Use the data in ATTEND for this exercise.
(i) In the model of Example 6.3, argue that \Delta stndfnl/\Delta priGPA ≈ \beta_2 + 2\beta_4 priGPA + \beta_6 atndrte. Use equation (6.19) to estimate the partial effect when priGPA = 2.59 and atndrte = .82. Interpret your estimate.
(ii) Show that the equation can be written as

stndfnl = \theta_0 + \beta_1 atndrte + \theta_2 priGPA + \beta_3 ACT + \beta_4 (priGPA − 2.59)^2 + \beta_5 ACT^2 + \beta_6 priGPA(atndrte − .82) + u,

where \theta_2 = \beta_2 + 2\beta_4(2.59) + \beta_6(.82). (Note that the intercept has changed, but this is unimportant.) Use this to obtain the standard error of \hat{\theta}_2 from part (i).
(iii) Suppose that, in place of priGPA(atndrte − .82), you put (priGPA − 2.59)·(atndrte − .82). Now how do you interpret the coefficients on atndrte and priGPA?

C8 Use the data in HPRICE1 for this exercise.
(i) Estimate the model

price = \beta_0 + \beta_1 lotsize + \beta_2 sqrft + \beta_3 bdrms + u

and report the results in the usual form, including the standard error of the regression. Obtain predicted price, when we plug in lotsize = 10,000, sqrft = 2,300, and bdrms = 4; round this price to the nearest dollar.
(ii) Run a regression that allows you to put a 95% confidence interval around the predicted value in part (i). Note that your prediction will differ somewhat due to rounding error.
(iii) Let price^0 be the unknown future selling price of the house with the characteristics used in parts (i) and (ii). Find a 95% CI for price^0 and comment on the width of this confidence interval.

C9 The data set NBASAL contains salary information and career statistics for 269 players in the National Basketball Association (NBA).
(i) Estimate a model relating points-per-game (points) to years in the league (exper), age, and years played in college (coll). Include a quadratic in exper; the other variables should appear in level form. Report the results in the usual way.
(ii) Holding college years and age fixed, at what value of experience does the next year of experience actually reduce points-per-game? Does this make sense?
(iii) Why do you think coll has a negative and statistically significant coefficient? (Hint: NBA players can be drafted before finishing their college careers and even directly out of high school.)
(iv) Add a quadratic in age to the equation. Is it needed? What does this appear to imply about the effects of age, once experience and education are controlled for?
(v) Now regress log(wage) on points, exper, exper^2, age, and coll. Report the results in the usual format.
(vi) Test whether age and coll are jointly significant in the regression from part (v). What does this imply about whether age and education have separate effects on wage, once productivity and seniority are accounted for?

C10 Use the data in BWGHT2 for this exercise.
(i) Estimate the equation

log(bwght) = \beta_0 + \beta_1 npvis + \beta_2 npvis^2 + u

by OLS, and report the results in the usual way. Is the quadratic term significant?
(ii) Show that, based on the equation from part (i), the number of prenatal visits that maximizes log(bwght) is estimated to be about 22. How many women had at least 22 prenatal visits in the sample?
(iii) Does it make sense that birth weight is actually predicted to decline after 22 prenatal visits? Explain.
(iv) Add mother's age to the equation, using a quadratic functional form. Holding npvis fixed, at what mother's age is the birth weight of the child maximized? What fraction of women in the sample are older than the "optimal" age?
(v) Would you say that mother's age and number of prenatal visits explain a lot of the variation in log(bwght)?
(vi) Using quadratics for both npvis and age, decide whether using the natural log or the level of bwght is better for predicting bwght.

C11 Use APPLE to verify some of the claims made in Section 6.3.
(i) Run the regression of ecolbs on ecoprc, regprc, and report the results in the usual form, including the R-squared and adjusted R-squared. Interpret the coefficients on the price variables and comment on their signs and magnitudes.
(ii) Are the price variables statistically significant? Report the p-values for the individual t tests.
(iii) What is the range of fitted values for ecolbs? What fraction of the sample reports ecolbs = 0? Comment.
(iv) Do you think the price variables together do a good job of explaining variation in ecolbs? Explain.
(v) Add the variables faminc, hhsize (household size), educ, and age to the regression from part (i). Find the p-value for their joint significance. What do you conclude?
(vi) Run separate simple regressions of ecolbs on ecoprc and then ecolbs on regprc. How do the simple regression coefficients compare with the multiple regression from part (i)? Find the correlation coefficient between ecoprc and regprc to help explain your findings.
(vii) Consider a model that adds family income and the quantity demanded for regular apples:

ecolbs = \beta_0 + \beta_1 ecoprc + \beta_2 regprc + \beta_3 faminc + \beta_4 reglbs + u.

From basic economic theory, which explanatory variable does not belong to the equation? When you drop the variables one at a time, do the sizes of the adjusted R-squareds affect your answer?

C12 Use the subset of 401KSUBS with fsize = 1; this restricts the analysis to single-person households. (See also Computer Exercise C8 in Chapter 4.)
(i) What is the youngest age of people in this sample? How many people are at that age?
(ii) In the model

nettfa = \beta_0 + \beta_1 inc + \beta_2 age + \beta_3 age^2 + u,

what is the literal interpretation of \beta_2? By itself, is it of much interest?
(iii) Estimate the model from part (ii) and report the results in standard form. Are you concerned that the coefficient on age is negative? Explain.
(iv) Because the youngest people in the sample are 25, it makes sense to think that, for a given level of income, the lowest average amount of net total financial assets is at age 25. Recall that the partial effect of age on nettfa is \beta_2 + 2\beta_3 age, so the partial effect at age 25 is \beta_2 + 2\beta_3(25) = \beta_2 + 50\beta_3; call this \theta_2. Find \hat{\theta}_2 and obtain the two-sided p-value for testing H_0: \theta_2 = 0. You should conclude that \hat{\theta}_2 is small and very statistically insignificant.
[Hint: One way to do this is to estimate the model nettfa = \alpha_0 + \beta_1 inc + \theta_2 age + \beta_3 (age − 25)^2 + u, where the intercept, \alpha_0, is different from \beta_0. There are other ways, too.]
(v) Because the evidence against H_0: \theta_2 = 0 is very weak, set it to zero and estimate the model

nettfa = \alpha_0 + \beta_1 inc + \beta_3 (age − 25)^2 + u.

In terms of goodness-of-fit, does this model fit better than that in part (ii)?
(vi) For the estimated equation in part (v), set inc = 30 (roughly, the average value) and graph the relationship between nettfa and age, but only for age ≥ 25. Describe what you see.
(vii) Check to see whether including a quadratic in inc is necessary.

C13 Use the data in MEAP00 to answer this question.
(i) Estimate the model

math4 = \beta_0 + \beta_1 lexppp + \beta_2 lenroll + \beta_3 lunch + u

by OLS and report the results in the usual form. Is each explanatory variable statistically significant at the 5% level?
(ii) Obtain the fitted values from the regression in part (i). What is the range of fitted values? How does it compare with the range of the actual data on math4?
(iii) Obtain the residuals from the regression in part (i). What is the building code of the school that has the largest (positive) residual? Provide an interpretation of this residual.
(iv) Add quadratics of all explanatory variables to the equation, and test them for joint significance. Would you leave them in the model?
(v) Returning to the model in part (i), divide the dependent variable and each explanatory variable by its sample standard deviation, and rerun the regression. (Include an intercept, unless you also first subtract the mean from each variable.) In terms of standard deviation units, which explanatory variable has the largest effect on the math pass rate?

C14 Use the data in BENEFITS to answer this question. It is a school-level data set, at the K-5 level, on average teacher salary and benefits. See Example 4.10 for background.
(i) Regress lavgsal on bs and report the results in the usual form. Can you reject H_0: \beta_{bs} = 0 against a two-sided alternative? Can you reject H_0: \beta_{bs} = −1 against H_1: \beta_{bs} > −1? Report the p-values for both tests.
(ii) Define lbs = log(bs). Find the range of values for lbs and find its standard deviation. How do these compare to the range and standard deviation for bs?
(iii) Regress lavgsal on lbs. Does this fit better than the regression from part (i)?
(iv) Estimate the equation

lavgsal = \beta_0 + \beta_1 bs + \beta_2 lenroll + \beta_3 lstaff + \beta_4 lunch + u

and report the results in the usual form. What happens to the coefficient on bs? Is it now statistically different from zero?
(v) Interpret the coefficient on lstaff. Why do you think it is negative?
(vi) Add lunch^2 to the equation from part (iv). Is it statistically significant? Compute the turning point (minimum value) in the quadratic, and show that it is within the range of the observed data on lunch. How many values of lunch are higher than the calculated turning point?
(vii) Based on the findings from part (vi), describe how teacher salaries relate to school poverty rates. In terms of teacher salary, and holding other factors fixed, is it better to teach at a school with lunch = 0 (no poverty), lunch = 50, or lunch = 100 (all kids eligible for the free lunch program)?
Appendix 6A: A Brief Introduction to Bootstrapping

In many cases where formulas for standard errors are hard to obtain mathematically, or where they are thought not to be very good approximations to the true sampling variation of an estimator, we can rely on a resampling method. The general idea is to treat the observed data as a population that we can draw samples from. The most common resampling method is the bootstrap. (There are actually several versions of the bootstrap, but the most general, and most easily applied, is called the nonparametric bootstrap, and that is what we describe here.)

Suppose we have an estimate, \hat{\theta}, of a population parameter, \theta. We obtained this estimate, which could be a function of OLS estimates (or estimates that we cover in later chapters), from a random sample of size n. We would like to obtain a standard error for \hat{\theta} that can be used for constructing t statistics or confidence intervals. Remarkably, we can obtain a valid standard error by computing the estimate from different random samples drawn from the original data.

Implementation is easy. If we list our observations from 1 through n, we draw n numbers randomly, with replacement, from this list. This produces a new data set (of size n) that consists of the original data, but with many observations appearing multiple times (except in the rather unusual case that we resample the original data). Each time we randomly sample from the original data, we can estimate \theta using the same procedure that we used on the original data. Let \hat{\theta}^{(b)} denote the estimate from bootstrap sample b. Now, if we repeat the resampling and estimation m times, we have m new estimates, \{\hat{\theta}^{(b)}: b = 1, 2, ..., m\}. The bootstrap standard error of \hat{\theta} is just the sample standard deviation of the \hat{\theta}^{(b)}, namely,

bse(\hat{\theta}) = \left[ (m − 1)^{-1} \sum_{b=1}^m (\hat{\theta}^{(b)} − \bar{\theta})^2 \right]^{1/2},    (6.50)

where \bar{\theta} is the average of the bootstrap estimates.

If obtaining an estimate of \theta on a sample of size n requires little computational time, as in the case of OLS and all the other estimators we encounter in this text, we can afford to choose m, the number of bootstrap replications, to be large. A typical value is m = 1,000, but even m = 500 or a somewhat smaller value can produce a reliable standard error. Note that the size of m, the number of times we resample the original data, has nothing to do with the sample size, n. (For certain estimation problems beyond the scope of this text, a large n can force one to do fewer bootstrap replications.) Many statistics and econometrics packages have built-in bootstrap commands, and this makes the calculation of bootstrap standard errors simple, especially compared with the work often required to obtain an analytical formula for an asymptotic standard error.

One can actually do better in most cases by using the bootstrap samples to compute p-values for t statistics and F statistics, or for obtaining confidence intervals, rather than obtaining a bootstrap standard error to be used in the construction of t statistics or confidence intervals. See Horowitz (2001) for a comprehensive treatment.
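A minimal sketch of the nonparametric bootstrap in equation (6.50), where estimate is any user-supplied function mapping a resampled data set to a scalar estimate (all names here are illustrative):

    import numpy as np

    def bootstrap_se(df, estimate, m=1000, seed=0):
        rng = np.random.default_rng(seed)
        n = len(df)
        thetas = np.empty(m)
        for b in range(m):
            rows = rng.integers(0, n, size=n)  # draw n observations with replacement
            thetas[b] = estimate(df.iloc[rows])
        return thetas.std(ddof=1)              # equation (6.50)

For example, estimate could run an OLS regression on the resampled DataFrame and return a single coefficient or a nonlinear function of the coefficients.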
Chapter 7: Multiple Regression Analysis with Qualitative Information: Binary (or Dummy) Variables

In previous chapters, the dependent and independent variables in our multiple regression models have had quantitative meaning. Just a few examples include hourly wage rate, years of education, college grade point average, amount of air pollution, level of firm sales, and number of arrests. In each case, the magnitude of the variable conveys useful information. In empirical work, we must also incorporate qualitative factors into regression models. The gender or race of an individual, the industry of a firm (manufacturing, retail, and so on), and the region in the United States where a city is located (South, North, West, and so on) are all considered to be qualitative factors.

Most of this chapter is dedicated to qualitative independent variables. After we discuss the appropriate ways to describe qualitative information in Section 7.1, we show how qualitative explanatory variables can be easily incorporated into multiple regression models in Sections 7.2, 7.3, and 7.4. These sections cover almost all of the popular ways that qualitative independent variables are used in cross-sectional regression analysis.

In Section 7.5, we discuss a binary dependent variable, which is a particular kind of qualitative dependent variable. The multiple regression model has an interesting interpretation in this case and is called the linear probability model. While much maligned by some econometricians, the simplicity of the linear probability model makes it useful in many empirical contexts. We will describe its drawbacks in Section 7.5, but they are often secondary in empirical work.

7.1 Describing Qualitative Information

Qualitative factors often come in the form of binary information: a person is female or male; a person does or does not own a personal computer; a firm offers a certain kind of employee pension plan or it does not; a state administers capital punishment or it does not. In all of these examples, the relevant information can be captured by defining a binary variable or a zero-one variable. In econometrics, binary variables are most commonly called dummy variables, although this name is not especially descriptive.

In defining a dummy variable, we must decide which event is assigned the value one and which is assigned the value zero. For example, in a study of individual wage determination, we might define female to be a binary variable taking on the value one for females and the value zero for males. The name in this case indicates the event with the value one.
The same information is captured by defining male to be one if the person is male and zero if the person is female. Either of these is better than using gender, because this name does not make it clear when the dummy variable is one: does gender = 1 correspond to male or female? What we call our variables is unimportant for getting regression results, but it always helps to choose names that clarify equations and expositions.

Suppose in the wage example that we have chosen the name female to indicate gender. Further, we define a binary variable married to equal one if a person is married and zero otherwise. Table 7.1 gives a partial listing of a wage data set that might result. We see that Person 1 is female and not married, Person 2 is female and married, Person 3 is male and not married, and so on.

Table 7.1  A Partial Listing of the Data in WAGE1

person  wage   educ  exper  female  married
1       3.10   11    2      1       0
2       3.24   12    22     1       1
3       3.00   11    2      0       0
4       6.00   8     44     0       1
5       5.30   12    7      0       1
...     ...    ...   ...    ...     ...
525     11.56  16    5      0       1
526     3.50   14    5      1       0

Exploring Further 7.1: Suppose that, in a study comparing election outcomes between Democratic and Republican candidates, you wish to indicate the party of each candidate. Is a name such as party a wise choice for a binary variable in this case? What would be a better name?

Why do we use the values zero and one to describe qualitative information? In a sense, these values are arbitrary: any two different values would do. The real benefit of capturing qualitative information using zero-one variables is that it leads to regression models where the parameters have very natural interpretations, as we will see now.

7.2 A Single Dummy Independent Variable

How do we incorporate binary information into regression models? In the simplest case, with only a single dummy explanatory variable, we just add it as an independent variable in the equation. For example, consider the following simple model of hourly wage determination:

$wage = \beta_0 + \delta_0\,female + \beta_1 educ + u.$   (7.1)

We use $\delta_0$ as the parameter on female in order to highlight the interpretation of the parameters multiplying dummy variables; later, we will use whatever notation is most convenient.

In model (7.1), only two observed factors affect wage: gender and education. Because female = 1 when the person is female and female = 0 when the person is male, the parameter $\delta_0$ has the following interpretation: $\delta_0$ is the difference in hourly wage between females and males, given the same amount of education (and the same error term u). Thus, the coefficient $\delta_0$ determines whether there is discrimination against women: if $\delta_0 < 0$, then, for the same level of other factors, women earn less than men on average.

In terms of expectations, if we assume the zero conditional mean assumption $E(u \mid female, educ) = 0$, then

$\delta_0 = E(wage \mid female = 1, educ) - E(wage \mid female = 0, educ).$

Because female = 1 corresponds to females and female = 0 corresponds to males, we can write this more simply as

$\delta_0 = E(wage \mid female, educ) - E(wage \mid male, educ).$   (7.2)

The key here is that the level of education is the same in both expectations; the difference, $\delta_0$, is due to gender only.
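To make the mechanics concrete, here is a minimal sketch of estimating a model like (7.1) in Python with statsmodels. The data below are synthetic stand-ins generated for illustration, not the WAGE1 file, and all names are assumptions of the sketch:

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data mimicking a wage equation (illustrative, not WAGE1).
rng = np.random.default_rng(0)
n = 500
female = rng.integers(0, 2, n)                  # 1 = female, 0 = male
educ = rng.integers(8, 19, n)                   # years of education
wage = 1.5 - 1.8 * female + 0.55 * educ + rng.normal(0, 2.0, n)
df = pd.DataFrame({"wage": wage, "female": female, "educ": educ})

res = smf.ols("wage ~ female + educ", data=df).fit()
print(res.params["Intercept"])   # beta_0-hat: intercept for the base group (males)
print(res.params["female"])      # delta_0-hat: female-male gap at given educ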
The situation can be depicted graphically as an intercept shift between males and females. In Figure 7.1, the case $\delta_0 < 0$ is shown, so that men earn a fixed amount more per hour than women. The difference does not depend on the amount of education, and this explains why the wage-education profiles for women and men are parallel.

[Figure 7.1: Graph of $wage = \beta_0 + \delta_0\,female + \beta_1 educ$ for $\delta_0 < 0$. Men: $wage = \beta_0 + \beta_1 educ$; women: $wage = (\beta_0 + \delta_0) + \beta_1 educ$. Both lines have slope $\beta_1$ and differ only by the intercept shift $\delta_0$.]

At this point, you may wonder why we do not also include in (7.1) a dummy variable, say male, which is one for males and zero for females. This would be redundant. In (7.1), the intercept for males is $\beta_0$, and the intercept for females is $\beta_0 + \delta_0$. Because there are just two groups, we only need two different intercepts. This means that, in addition to $\beta_0$, we need to use only one dummy variable; we have chosen to include the dummy variable for females. Using two dummy variables would introduce perfect collinearity because female + male = 1, which means that male is a perfect linear function of female. Including dummy variables for both genders is the simplest example of the so-called dummy variable trap, which arises when too many dummy variables describe a given number of groups. We will discuss this problem in detail later; a quick numerical check appears in the sketch below.

In (7.1), we have chosen males to be the base group or benchmark group, that is, the group against which comparisons are made. This is why $\beta_0$ is the intercept for males, and $\delta_0$ is the difference in intercepts between females and males. We could choose females as the base group by writing the model as

$wage = \alpha_0 + \gamma_0\,male + \beta_1 educ + u,$

where the intercept for females is $\alpha_0$ and the intercept for males is $\alpha_0 + \gamma_0$; this implies that $\alpha_0 = \beta_0 + \delta_0$ and $\alpha_0 + \gamma_0 = \beta_0$. In any application, it does not matter how we choose the base group, but it is important to keep track of which group is the base group.

Some researchers prefer to drop the overall intercept in the model and to include dummy variables for each group. The equation would then be $wage = \beta_0\,male + \alpha_0\,female + \beta_1 educ + u$, where the intercept for men is $\beta_0$ and the intercept for women is $\alpha_0$. There is no dummy variable trap in this case because we do not have an overall intercept. However, this formulation has little to offer, since testing for a difference in the intercepts is more difficult, and there is no generally agreed upon way to compute R-squared in regressions without an intercept. Therefore, we will always include an overall intercept for the base group.
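The dummy variable trap is easy to see numerically: with an intercept, a female dummy, and a male dummy, the design matrix is rank deficient, so OLS has no unique solution. A small self-contained check (values are illustrative):

import numpy as np

# female + male = 1 for every observation, so the male column is a perfect
# linear function of the intercept column and the female column.
female = np.array([1, 1, 0, 0, 1, 0])
male = 1 - female
X = np.column_stack([np.ones(6), female, male])   # intercept, female, male

print(np.linalg.matrix_rank(X))   # prints 2, not 3: perfect collinearity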
Nothing much changes when more explanatory variables are involved. Taking males as the base group, a model that controls for experience and tenure in addition to education is

$wage = \beta_0 + \delta_0\,female + \beta_1 educ + \beta_2 exper + \beta_3 tenure + u.$   (7.3)

If educ, exper, and tenure are all relevant productivity characteristics, the null hypothesis of no difference between men and women is $H_0\colon \delta_0 = 0$. The alternative that there is discrimination against women is $H_1\colon \delta_0 < 0$.

How can we actually test for wage discrimination? The answer is simple: just estimate the model by OLS, exactly as before, and use the usual t statistic. Nothing changes about the mechanics of OLS or the statistical theory when some of the independent variables are defined as dummy variables. The only difference with what we have done up until now is in the interpretation of the coefficient on the dummy variable.

Example 7.1 (Hourly Wage Equation)

Using the data in WAGE1, we estimate model (7.3). For now, we use wage, rather than log(wage), as the dependent variable:

$\widehat{wage} = -1.57 - 1.81\,female + .572\,educ + .025\,exper + .141\,tenure$   (7.4)
(0.72) (0.26) (.049) (.012) (.021)
$n = 526,\ R^2 = .364.$

The negative intercept (the intercept for men in this case) is not very meaningful, because no one has zero values for all of educ, exper, and tenure in the sample. The coefficient on female is interesting because it measures the average difference in hourly wage between a man and a woman who have the same levels of educ, exper, and tenure. If we take a woman and a man with the same levels of education, experience, and tenure, the woman earns, on average, $1.81 less per hour than the man. (Recall that these are 1976 wages.)

It is important to remember that, because we have performed multiple regression and controlled for educ, exper, and tenure, the $1.81 wage differential cannot be explained by different average levels of education, experience, or tenure between men and women. We can conclude that the differential of $1.81 is due to gender, or factors associated with gender, that we have not controlled for in the regression. (In 2013 dollars, the wage differential is about 4.09(1.81) ≈ 7.40.)

It is informative to compare the coefficient on female in equation (7.4) to the estimate we get when all other explanatory variables are dropped from the equation:

$\widehat{wage} = 7.10 - 2.51\,female$   (7.5)
(.21) (.30)
$n = 526,\ R^2 = .116.$

The coefficients in (7.5) have a simple interpretation. The intercept is the average wage for men in the sample (set female = 0), so men earn $7.10 per hour on average. The coefficient on female is the difference in the average wage between women and men. Thus, the average wage for women in the sample is 7.10 − 2.51 = 4.59, or $4.59 per hour. (Incidentally, there are 274 men and 252 women in the sample.)

Equation (7.5) provides a simple way to carry out a comparison-of-means test between the two groups, which in this case are men and women. The estimated difference, −2.51, has a t statistic of −8.37, which is very statistically significant (and, of course, $2.51 is economically large as well). Generally, simple regression on a constant and a dummy variable is a straightforward way to compare the means of two groups. For the usual t test to be valid, we must assume that the homoskedasticity assumption holds, which means that the population variance in wages for men is the same as that for women.

The estimated wage differential between men and women is larger in (7.5) than in (7.4) because (7.5) does not control for differences in education, experience, and tenure, and these are lower, on average, for women than for men in this sample. Equation (7.4) gives a more reliable estimate of the ceteris paribus gender wage gap; it still indicates a very large differential.
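The comparison-of-means equivalence is easy to verify directly. Here is a minimal sketch on synthetic data, sized like the text's sample (274 "men" and 252 "women") but not drawn from WAGE1:

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Regression on a constant and a dummy reproduces the comparison of means:
# intercept = base-group average, dummy coefficient = difference in averages.
rng = np.random.default_rng(1)
wage = np.concatenate([rng.normal(7.10, 3.0, 274),    # "men"
                       rng.normal(4.59, 3.0, 252)])   # "women"
female = np.r_[np.zeros(274), np.ones(252)]
df = pd.DataFrame({"wage": wage, "female": female})

res = smf.ols("wage ~ female", data=df).fit()
print(res.params)                            # intercept and mean difference
print(df.groupby("female")["wage"].mean())   # matches the regression exactly

The t statistic on female here is numerically identical to the pooled-variance two-sample t test, which is why the homoskedasticity assumption matters for its validity.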
In many cases, dummy independent variables reflect choices of individuals or other economic units (as opposed to something predetermined, such as gender). In such situations, the matter of causality is again a central issue. In the following example, we would like to know whether personal computer ownership causes a higher college grade point average.

Example 7.2 (Effects of Computer Ownership on College GPA)

In order to determine the effects of computer ownership on college grade point average, we estimate the model

$colGPA = \beta_0 + \delta_0\,PC + \beta_1 hsGPA + \beta_2 ACT + u,$

where the dummy variable PC equals one if a student owns a personal computer and zero otherwise. There are various reasons PC ownership might have an effect on colGPA. A student's schoolwork might be of higher quality if it is done on a computer, and time can be saved by not having to wait at a computer lab. Of course, a student might be more inclined to play computer games or surf the Internet if he or she owns a PC, so it is not obvious that $\delta_0$ is positive. The variables hsGPA (high school GPA) and ACT (achievement test score) are used as controls: it could be that stronger students, as measured by high school GPA and ACT scores, are more likely to own computers. We control for these factors because we would like to know the average effect on colGPA if a student is picked at random and given a personal computer.

Using the data in GPA1, we obtain

$\widehat{colGPA} = 1.26 + .157\,PC + .447\,hsGPA + .0087\,ACT$   (7.6)
(.33) (.057) (.094) (.0105)
$n = 141,\ R^2 = .219.$

This equation implies that a student who owns a PC has a predicted GPA about .16 points higher than a comparable student without a PC. (Remember, both colGPA and hsGPA are on a four-point scale.) The effect is also very statistically significant, with $t_{PC} = .157/.057 \approx 2.75$.

What happens if we drop hsGPA and ACT from the equation? Clearly, dropping the latter variable should have very little effect, as its coefficient and t statistic are very small. But hsGPA is very significant, and so dropping it could affect the estimate of $\beta_{PC}$. Regressing colGPA on PC gives an estimate on PC equal to about .170, with a standard error of .063; in this case, $\hat{\beta}_{PC}$ and its t statistic do not change by much.

In the exercises at the end of the chapter, you will be asked to control for other factors in the equation to see if the computer ownership effect disappears, or if it at least gets notably smaller.

Each of the previous examples can be viewed as having relevance for policy analysis. In the first example, we were interested in gender discrimination in the workforce. In the second example, we were concerned with the effect of computer ownership on college performance. A special case of policy analysis is program evaluation, where we would like to know the effect of economic or social programs on individuals, firms, neighborhoods, cities, and so on.

In the simplest case, there are two groups of subjects. The control group does not participate in the program. The experimental group or treatment group does take part in the program. These names come from literature in the experimental sciences, and they should not be taken literally. Except in rare cases, the choice of the control and treatment groups is not random. However, in some cases, multiple regression analysis can be used to control for enough other factors in order to estimate the causal effect of the program.

Example 7.3 (Effects of Training Grants on Hours of Training)

Using the 1988 data for Michigan manufacturing firms in JTRAIN, we obtain the following estimated equation:

$\widehat{hrsemp} = 46.67 + 26.25\,grant - .98\,\log(sales) - 6.07\,\log(employ)$   (7.7)
(43.41) (5.59) (3.54) (3.88)
$n = 105,\ R^2 = .237.$

The dependent variable is hours of training per employee, at the firm level. The variable grant is a dummy variable equal to one if the firm received a job training grant for 1988 and zero otherwise. The variables sales and employ represent annual sales and number of employees, respectively. We cannot enter hrsemp in logarithmic form because hrsemp is zero for 29 of the 105 firms used in the regression.

The variable grant is very statistically significant, with $t_{grant} = 4.70$. Controlling for sales and employment, firms that received a grant trained each worker, on average, 26.25 hours more. Because the average number of hours of per worker training in the sample is about 17, with a maximum value of 164, grant has a large effect on training, as is expected.

The coefficient on log(sales) is small and very insignificant. The coefficient on log(employ) means that, if a firm is 10% larger, it trains its workers about .61 hour less. Its t statistic is −1.56, which is only marginally statistically significant.

As with any other independent variable, we should ask whether the measured effect of a qualitative variable is causal. In equation (7.7), is the difference in training between firms that receive grants and those that do not due to the grant, or is grant receipt simply an indicator of something else? It might be that the firms receiving grants would have, on average, trained their workers more even in the absence of a grant. Nothing in this analysis tells us whether we have estimated a causal effect; we must know how the firms receiving grants were determined. We can only hope we have controlled for as many factors as possible that might be related to whether a firm received a grant and to its levels of training.

We will return to policy analysis with dummy variables in Section 7.6, as well as in later chapters.

7.2a Interpreting Coefficients on Dummy Explanatory Variables When the Dependent Variable Is log(y)

A common specification in applied work has the dependent variable appearing in logarithmic form, with one or more dummy variables appearing as independent variables. How do we interpret the dummy variable coefficients in this case? Not surprisingly, the coefficients have a percentage interpretation.

Example 7.4 (Housing Price Regression)

Using the data in HPRICE1, we obtain the equation

$\widehat{\log(price)} = -1.35 + .168\,\log(lotsize) + .707\,\log(sqrft) + .027\,bdrms + .054\,colonial$   (7.8)
(.65) (.038) (.093) (.029) (.045)
$n = 88,\ R^2 = .649.$

All the variables are self-explanatory except colonial, which is a binary variable equal to one if the house is of the colonial style. What does the coefficient on colonial mean? For given levels of lotsize, sqrft, and bdrms, the difference in $\widehat{\log(price)}$ between a house of colonial style and that of another style is .054. This means that a colonial-style house is predicted to sell for about 5.4% more, holding other factors fixed.

This example shows that, when log(y) is the dependent variable in a model, the coefficient on a dummy variable, when multiplied by 100, is interpreted as the percentage difference in y, holding all other factors fixed. When the coefficient on a dummy variable suggests a large proportionate change in y, the exact percentage difference can be obtained exactly as with the semi-elasticity calculation in Section 6.2.

Example 7.5 (Log Hourly Wage Equation)

Let us reestimate the wage equation from Example 7.1, using log(wage) as the dependent variable and adding quadratics in exper and tenure:

$\widehat{\log(wage)} = .417 - .297\,female + .080\,educ + .029\,exper - .00058\,exper^2 + .032\,tenure - .00059\,tenure^2$   (7.9)
(.099) (.036) (.007) (.005) (.00010) (.007) (.00023)
$n = 526,\ R^2 = .441.$

Using the same approximation as in Example 7.4, the coefficient on female implies that, for the same levels of educ, exper, and tenure, women earn about 100(.297) = 29.7% less than men. We can do better than this by computing the exact percentage difference in predicted wages. What we want is the proportionate difference in wages between females and males, holding other factors fixed: $(\widehat{wage}_F - \widehat{wage}_M)/\widehat{wage}_M$. What we have from (7.9) is $\widehat{\log(wage)}_F - \widehat{\log(wage)}_M = -.297$. Exponentiating and subtracting one gives

$(\widehat{wage}_F - \widehat{wage}_M)/\widehat{wage}_M = \exp(-.297) - 1 \approx -.257.$

This more accurate estimate implies that a woman's wage is, on average, 25.7% below a comparable man's wage. If we had made the same correction in Example 7.4, we would have obtained $\exp(.054) - 1 \approx .0555$, or about 5.6%. The correction has a smaller effect in Example 7.4 than in the wage example because the magnitude of the coefficient on the dummy variable is much smaller in (7.8) than in (7.9).

Generally, if $\hat{\beta}_1$ is the coefficient on a dummy variable, say $x_1$, when log(y) is the dependent variable, the exact percentage difference in the predicted y when $x_1 = 1$ versus when $x_1 = 0$ is

$100 \cdot [\exp(\hat{\beta}_1) - 1].$   (7.10)

The estimate $\hat{\beta}_1$ can be positive or negative, and it is important to preserve its sign in computing (7.10).

The logarithmic approximation has the advantage of providing an estimate between the magnitudes obtained by using each group as the base group. In particular, although equation (7.10) gives us a better estimate than $100 \cdot \hat{\beta}_1$ of the percentage by which y for $x_1 = 1$ is greater than y for $x_1 = 0$, (7.10) is not a good estimate if we switch the base group. In Example 7.5, we can estimate the percentage by which a man's wage exceeds a comparable woman's wage, and this estimate is $100 \cdot [\exp(-\hat{\beta}_1) - 1] = 100 \cdot [\exp(.297) - 1] \approx 34.6$. The approximation based on $100 \cdot \hat{\beta}_1$, 29.7, is between 25.7 and 34.6 (and close to the middle). Therefore, it makes sense to report that the difference in predicted wages between men and women is about 29.7%, without having to take a stand on which group is the base group. A quick numerical check of these calculations appears below.
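The exact adjustment in (7.10) is a one-line computation. A minimal sketch, using the coefficient from equation (7.9):

import numpy as np

def exact_pct_diff(b_hat):
    # Exact percentage difference in predicted y implied by a dummy
    # coefficient when log(y) is the dependent variable, equation (7.10).
    return 100.0 * (np.exp(b_hat) - 1.0)

print(exact_pct_diff(-0.297))   # women relative to men: about -25.7%
print(exact_pct_diff(0.297))    # men relative to women: about +34.6%
print(100 * 0.297)              # log approximation, 29.7, lies in between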
7.3 Using Dummy Variables for Multiple Categories

We can use several dummy independent variables in the same equation. For example, we could add the dummy variable married to equation (7.9). The coefficient on married gives the (approximate) proportional differential in wages between those who are and are not married, holding gender, educ, exper, and tenure fixed. When we estimate this model, the coefficient on married (with standard error in parentheses) is .053 (.041), and the coefficient on female becomes −.290 (.036). Thus, the "marriage premium" is estimated to be about 5.3%, but it is not statistically different from zero (t = 1.29). An important limitation of this model is that the marriage premium is assumed to be the same for men and women; this is relaxed in the following example.

Example 7.6 (Log Hourly Wage Equation)

Let us estimate a model that allows for wage differences among four groups: married men, married women, single men, and single women. To do this, we must select a base group; we choose single men. Then, we must define dummy variables for each of the remaining groups. Call these marrmale, marrfem, and singfem. Putting these three variables into (7.9) (and, of course, dropping female, since it is now redundant) gives

$\widehat{\log(wage)} = .321 + .213\,marrmale - .198\,marrfem - .110\,singfem + .079\,educ + .027\,exper - .00054\,exper^2 + .029\,tenure - .00053\,tenure^2$   (7.11)
(.100) (.055) (.058) (.056) (.007) (.005) (.00011) (.007) (.00023)
$n = 526,\ R^2 = .461.$

All of the coefficients, with the exception of singfem, have t statistics well above two in absolute value. The t statistic for singfem is about −1.96, which is just significant at the 5% level against a two-sided alternative.

To interpret the coefficients on the dummy variables, we must remember that the base group is single males. Thus, the estimates on the three dummy variables measure the proportionate difference in wage relative to single males. For example, married men are estimated to earn about 21.3% more than single men, holding levels of education, experience, and tenure fixed. [The more precise estimate from (7.10) is about 23.7%.] A married woman, on the other hand, earns a predicted 19.8% less than a single man with the same levels of the other variables.

Because the base group is represented by the intercept in (7.11), we have included dummy variables for only three of the four groups. If we were to add a dummy variable for single males to (7.11), we would fall into the dummy variable trap by introducing perfect collinearity. Some regression packages will automatically correct this mistake for you, while others will just tell you there is perfect collinearity. It is best to carefully specify the dummy variables, because then we are forced to properly interpret the final model.

Even though single men is the base group in (7.11), we can use this equation to obtain the estimated difference between any two groups. Because the overall intercept is common to all groups, we can ignore it in finding differences. Thus, the estimated proportionate difference between single and married women is −.110 − (−.198) = .088, which means that single women earn about 8.8% more than married women. Unfortunately, we cannot use equation (7.11) for testing whether this estimated difference is statistically significant: knowing the standard errors on marrfem and singfem is not enough to carry out the test (see Section 4.4). The easiest thing to do is to choose one of these groups to be the base group and to reestimate the equation. Nothing substantive changes, but we get the needed estimate and its standard error directly. When we use married women as the base group, we obtain

$\widehat{\log(wage)} = .123 + .411\,marrmale + .198\,singmale + .088\,singfem + \ldots,$
(.106) (.056) (.058) (.052)

where, of course, none of the unreported coefficients or standard errors have changed. The estimate on singfem is, as expected, .088. Now, we have a standard error to go along with this estimate. The t statistic for the null that there is no difference in the population between married and single women is $t_{singfem} = .088/.052 \approx 1.69$. This is marginal evidence against the null hypothesis. We also see that the estimated difference between married men and married women is very statistically significant ($t_{marrmale} = 7.34$).

The previous example illustrates a general principle for including dummy variables to indicate different groups: if the regression model is to have different intercepts for, say, g groups or categories, we need to include g − 1 dummy variables in the model along with an intercept. The intercept for the base group is the overall intercept in the model, and the dummy variable coefficient for a particular group represents the estimated difference in intercepts between that group and the base group. Including g dummy variables along with an intercept will result in the dummy variable trap. An alternative is to include g dummy variables and to exclude an overall intercept.

Exploring Further 7.2: In the baseball salary data found in MLB1, players are given one of six positions: frstbase, scndbase, thrdbase, shrtstop, outfield, or catcher. To allow for salary differentials across position, with outfielders as the base group, which dummy variables would you include as independent variables?

Including g dummies without an overall intercept is sometimes useful, but it has two practical drawbacks. First, it makes it more cumbersome to test for differences relative to a base group. Second, regression packages usually change the way R-squared is computed when an overall intercept is not included. In particular, in the formula $R^2 = 1 - SSR/SST$, the total sum of squares SST is replaced with a total sum of squares that does not center $y_i$ about its mean, say $SST_0 = \sum_{i=1}^n y_i^2$. The resulting R-squared, say $R_0^2 = 1 - SSR/SST_0$, is sometimes called the uncentered R-squared.

Unfortunately, $R_0^2$ is rarely suitable as a goodness-of-fit measure. It is always true that $SST_0 \ge SST$, with equality only if $\bar{y} = 0$. Often, $SST_0$ is much larger than SST, which means that $R_0^2$ is much larger than $R^2$. For example, if in the previous example we regress log(wage) on marrmale, singmale, marrfem, singfem, and the other explanatory variables, without an intercept, the reported R-squared from Stata, which is $R_0^2$, is .948. This high R-squared is an artifact of not centering the total sum of squares in the calculation. The correct R-squared is given in equation (7.11) as .461. Some regression packages, including Stata, have an option to force calculation of the centered R-squared even though an overall intercept has not been included, and using this option is generally a good idea. The sketch below illustrates how far apart the two measures can be.
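A minimal sketch of the two R-squared definitions, on synthetic data chosen so that y has a large mean relative to its variation:

import numpy as np

def r_squared(y, yhat, centered=True):
    # `centered` selects the standard definition (SST centers y about its
    # mean) or the uncentered variant with SST_0 = sum of y_i squared.
    ssr = np.sum((y - yhat) ** 2)
    sst = np.sum((y - y.mean()) ** 2) if centered else np.sum(y ** 2)
    return 1.0 - ssr / sst

rng = np.random.default_rng(2)
y = 5.0 + rng.normal(0.0, 1.0, 500)   # large mean, modest variation
yhat = np.full_like(y, y.mean())      # the constant predictor

print(r_squared(y, yhat, centered=True))    # 0.0: no fit beyond the mean
print(r_squared(y, yhat, centered=False))   # roughly .96, inflated by the mean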
In the vast majority of cases, any R-squared based on comparing an SSR and SST should have SST computed by centering the $y_i$ about $\bar{y}$. We can think of this SST as the sum of squared residuals obtained if we just use the sample average, $\bar{y}$, to predict each $y_i$. Surely we are setting the bar pretty low for any model if all we measure is its fit relative to using a constant predictor. For a model without an intercept that fits poorly, it is possible that SSR > SST, which means $R^2$ would be negative. The uncentered R-squared will always be between zero and one, which likely explains why it is usually the default when an intercept is not estimated in regression models.

7.3a Incorporating Ordinal Information by Using Dummy Variables

Suppose that we would like to estimate the effect of city credit ratings on the municipal bond interest rate (MBR). Several financial companies, such as Moody's Investors Service and Standard & Poor's, rate the quality of debt for local governments, where the ratings depend on things like probability of default. (Local governments prefer lower interest rates in order to reduce their costs of borrowing.) For simplicity, suppose that rankings take on the integer values {0, 1, 2, 3, 4}, with zero being the worst credit rating and four being the best. This is an example of an ordinal variable. Call this variable CR for concreteness. The question we need to address is: How do we incorporate the variable CR into a model to explain MBR?

One possibility is to just include CR as we would include any other explanatory variable:

$MBR = \beta_0 + \beta_1 CR + \text{other factors},$

where we do not explicitly show what other factors are in the model. Then $\beta_1$ is the percentage point change in MBR when CR increases by one unit, holding other factors fixed. Unfortunately, it is rather hard to interpret a one-unit increase in CR. We know the quantitative meaning of another year of education, or another dollar spent per student, but things like credit ratings typically have only ordinal meaning. We know that a CR of four is better than a CR of three, but is the difference between four and three the same as the difference between one and zero? If not, then it might not make sense to assume that a one-unit increase in CR has a constant effect on MBR.

A better approach, which we can implement because CR takes on relatively few values, is to define dummy variables for each value of CR. Thus, let $CR_1 = 1$ if CR = 1, and $CR_1 = 0$ otherwise; $CR_2 = 1$ if CR = 2, and $CR_2 = 0$ otherwise; and so on. Effectively, we take the single credit rating and turn it into five categories. Then, we can estimate the model

$MBR = \beta_0 + \delta_1 CR_1 + \delta_2 CR_2 + \delta_3 CR_3 + \delta_4 CR_4 + \text{other factors}.$   (7.12)

Following our rule for including dummy variables in a model, we include four dummy variables because we have five categories. The omitted category here is a credit rating of zero, and so it is the base group. (This is why we do not need to define a dummy variable for this category.) The coefficients are easy to interpret: $\delta_1$ is the difference in MBR (other factors fixed) between a municipality with a credit rating of one and a municipality with a credit rating of zero; $\delta_2$ is the difference in MBR between a municipality with a credit rating of two and a municipality with a credit rating of zero; and so on. The movement between each credit rating is allowed to have a different effect, so using (7.12) is much more flexible than simply putting CR in as a single variable. Once the dummy variables are defined, estimating (7.12) is straightforward.

Exploring Further 7.3: In model (7.12), how would you test the null hypothesis that credit rating has no effect on MBR?

Equation (7.12) contains the model with a constant partial effect as a special case. One way to write the three restrictions that imply a constant partial effect is $\delta_2 = 2\delta_1$, $\delta_3 = 3\delta_1$, and $\delta_4 = 4\delta_1$. When we plug these into equation (7.12) and rearrange, we get $MBR = \beta_0 + \delta_1(CR_1 + 2CR_2 + 3CR_3 + 4CR_4) + \text{other factors}$. Now, the term multiplying $\delta_1$ is simply the original credit rating variable, CR. To obtain the F statistic for testing the constant partial effect restrictions, we obtain the unrestricted R-squared from (7.12) and the restricted R-squared from the regression of MBR on CR and the other factors we have controlled for. The F statistic is obtained as in equation (4.41) with q = 3; a numerical sketch of the whole procedure follows.
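A minimal sketch of the procedure on synthetic data (there is no real MBR data file here, so the data-generating step is purely illustrative). In a statsmodels formula, C(CR) expands the rating into four dummies with CR = 0 as the base group:

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(3)
n = 300
cr = rng.integers(0, 5, n)                        # ratings 0, 1, 2, 3, 4
mbr = 8.0 - 0.4 * cr + rng.normal(0.0, 0.5, n)    # synthetic interest rates
df = pd.DataFrame({"MBR": mbr, "CR": cr})

unrestricted = smf.ols("MBR ~ C(CR)", data=df).fit()   # model like (7.12)
restricted = smf.ols("MBR ~ CR", data=df).fit()        # constant partial effect

q = 3   # restrictions: delta_2 = 2*delta_1, delta_3 = 3*delta_1, delta_4 = 4*delta_1
F = ((restricted.ssr - unrestricted.ssr) / q) \
    / (unrestricted.ssr / unrestricted.df_resid)
print(F)   # compare with critical values from F(3, unrestricted.df_resid)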
Example 7.7 (Effects of Physical Attractiveness on Wage)

Hamermesh and Biddle (1994) used measures of physical attractiveness in a wage equation. (The file BEAUTY contains fewer variables but more observations than used by Hamermesh and Biddle. See Computer Exercise C12.) Each person in the sample was ranked by an interviewer for physical attractiveness, using five categories: homely, quite plain, average, good looking, and strikingly beautiful or handsome. Because there are so few people at the two extremes, the authors put people into one of three groups for the regression analysis: average, below average, and above average, where the base group is average. Using data from the 1977 Quality of Employment Survey, after controlling for the usual productivity characteristics, Hamermesh and Biddle estimated an equation for men,

$\widehat{\log(wage)} = \hat{\beta}_0 - .164\,belavg + .016\,abvavg + \text{other factors}$
(.046) (.033)
$n = 700,\ R^2 = .403,$

and an equation for women,

$\widehat{\log(wage)} = \hat{\beta}_0 - .124\,belavg + .035\,abvavg + \text{other factors}$
(.066) (.049)
$n = 409,\ R^2 = .330.$

The other factors controlled for in the regressions include education, experience, tenure, marital status, and race; see Table 3 in Hamermesh and Biddle's paper for a more complete list. (In order to save space, the coefficients on the other variables are not reported in the paper, and neither is the intercept.)

For men, those with below average looks are estimated to earn about 16.4% less than an average-looking man who is the same in other respects, including education, experience, tenure, marital status, and race. The effect is statistically different from zero, with t = −3.57. Men with above average looks are estimated to earn only 1.6% more than men with average looks, and the effect is not statistically significant (t ≈ .5). A woman with below average looks earns about 12.4% less than an otherwise comparable average-looking woman, with t = −1.88. As was the case for men, the estimate on abvavg is much smaller in magnitude and not statistically different from zero.

In related work, Biddle and Hamermesh (1998) revisit the effects of looks on earnings using a more homogeneous group: graduates of a particular law school. The authors continue to find that physical appearance has an effect on annual earnings, something that is perhaps not too surprising among people practicing law.

In some cases, the ordinal variable takes on too many values, so that a dummy variable cannot be included for each value. For example, the file LAWSCH85 contains data on median starting salaries for law school graduates. One of the key explanatory variables is the rank of the law school. Because each law school has a different rank, we clearly cannot include a dummy variable for each rank. If we do not wish to put the rank directly in the equation, we can break it down into categories. The following example shows how this is done.

Example 7.8 (Effects of Law School Rankings on Starting Salaries)

Define the dummy variables top10, r11_25, r26_40, r41_60, and r61_100 to take on the value unity when the variable rank falls into the appropriate range. We let schools ranked below 100 be the base group. The estimated equation is

$\widehat{\log(salary)} = 9.17 + .700\,top10 + .594\,r11\_25 + .375\,r26\_40 + .263\,r41\_60 + .132\,r61\_100 + .0057\,LSAT + .041\,GPA + .036\,\log(libvol) + .0008\,\log(cost)$   (7.13)
(.41) (.053) (.039) (.034) (.028) (.021) (.0031) (.074) (.026) (.0251)
$n = 136,\ R^2 = .911,\ \bar{R}^2 = .905.$

We see immediately that all of the dummy variables defining the different ranks are very statistically significant. The estimate on r61_100 means that, holding LSAT, GPA, libvol, and cost fixed, the median salary at a law school ranked between 61 and 100 is about 13.2% higher than that at a law school ranked below 100. The difference between a top 10 school and a below 100 school is quite large. Using the exact calculation given in equation (7.10) gives $\exp(.700) - 1 \approx 1.014$, and so the predicted median salary is more than 100% higher at a top 10 school than it is at a below 100 school.

As an indication of whether breaking the rank into different groups is an improvement, we can compare the adjusted R-squared in (7.13) with the adjusted R-squared from including rank as a single variable: the former is .905 and the latter is .836, so the additional flexibility of (7.13) is warranted.

Interestingly, once the rank is put into the (admittedly somewhat arbitrary) given categories, all of the other variables become insignificant. In fact, a test for joint significance of LSAT, GPA, log(libvol), and log(cost) gives a p-value of .055, which is borderline significant. When rank is included in its original form, the p-value for joint significance is zero to four decimal places.

One final comment about this example: in deriving the properties of ordinary least squares, we assumed that we had a random sample. The current application violates that assumption because of the way rank is defined: a school's rank necessarily depends on the rank of the other schools in the sample, and so the data cannot represent independent draws from the population of all law schools. This does not cause any serious problems provided the error term is uncorrelated with the explanatory variables.

7.4 Interactions Involving Dummy Variables

7.4a Interactions among Dummy Variables

Just as variables with quantitative meaning can be interacted in regression models, so can dummy variables. We have effectively seen an example of this in Example 7.6, where we defined four categories based on marital status and gender. In fact, we can recast that model by adding an interaction term between female and married to the model where female and married appear separately. This allows the marriage premium to depend on gender, just as it did in equation (7.11). For purposes of comparison, the estimated model with the female-married interaction term is

$\widehat{\log(wage)} = .321 - .110\,female + .213\,married - .301\,female \cdot married + \ldots,$   (7.14)
(.100) (.056) (.055) (.072)

where the rest of the regression is necessarily identical to (7.11). Equation (7.14) shows explicitly that there is a statistically significant interaction between gender and marital status. This model also allows us to obtain the estimated wage differential among all four groups, but here we must be careful to plug in the correct combination of zeros and ones.

Setting female = 0 and married = 0 corresponds to the group single men, which is the base group, since this eliminates female, married, and female·married. We can find the intercept for married men by setting female = 0 and married = 1 in (7.14); this gives an intercept of .321 + .213 = .534, and so on.

Equation (7.14) is just a different way of finding wage differentials across all gender-marital status combinations. It allows us to easily test the null hypothesis that the gender differential does not depend on marital status (equivalently, that the marriage differential does not depend on gender). Equation (7.11) is more convenient for testing for wage differentials between any group and the base group of single men.

Example 7.9 (Effects of Computer Usage on Wages)

Krueger (1993) estimates the effects of computer usage on wages. He defines a dummy variable, which we call compwork, equal to one if an individual uses a computer at work. Another dummy variable, comphome, equals one if the person uses a computer at home. Using 13,379 people from the 1989 Current Population Survey, Krueger (1993, Table 4) obtains

$\widehat{\log(wage)} = \hat{\beta}_0 + .177\,compwork + .070\,comphome + .017\,compwork \cdot comphome + \text{other factors}.$   (7.15)
(.009) (.019) (.023)

(The other factors are the standard ones for wage regressions, including education, experience, gender, and marital status; see Krueger's paper for the exact list.)
Krueger does not report the intercept because it is not of any importance; all we need to know is that the base group consists of people who do not use a computer at home or at work. It is worth noticing that the estimated return to using a computer at work (but not at home) is about 17.7%. [The more precise estimate, from (7.10), is 19.4%.] Similarly, people who use computers at home but not at work have about a 7% wage premium over those who do not use a computer at all. The differential between those who use a computer at both places, relative to those who use a computer in neither place, is about 26.4% (obtained by adding all three coefficients and multiplying by 100), or the more precise estimate of 30.2% obtained from equation (7.10).

The interaction term in (7.15) is not statistically significant, nor is it very big economically. But it is causing little harm by being in the equation.

7.4b Allowing for Different Slopes

We have now seen several examples of how to allow different intercepts for any number of groups in a multiple regression model. There are also occasions for interacting dummy variables with explanatory variables that are not dummy variables to allow for a difference in slopes. Continuing with the wage example, suppose that we wish to test whether the return to education is the same for men and women, allowing for a constant wage differential between men and women (a differential for which we have already found evidence). For simplicity, we include only education and gender in the model. What kind of model allows for different returns to education? Consider the model

$\log(wage) = (\beta_0 + \delta_0\,female) + (\beta_1 + \delta_1\,female)educ + u.$   (7.16)

If we plug female = 0 into (7.16), then we find that the intercept for males is $\beta_0$, and the slope on education for males is $\beta_1$. For females, we plug in female = 1; thus, the intercept for females is $\beta_0 + \delta_0$, and the slope is $\beta_1 + \delta_1$. Therefore, $\delta_0$ measures the difference in intercepts between women and men, and $\delta_1$ measures the difference in the return to education between women and men.

Two of the four cases for the signs of $\delta_0$ and $\delta_1$ are presented in Figure 7.2. Graph (a) shows the case where the intercept for women is below that for men, and the slope of the line is smaller for women than for men. This means that women earn less than men at all levels of education, and the gap increases as educ gets larger. In graph (b), the intercept for women is below that for men, but the slope on education is larger for women. This means that women earn less than men at low levels of education, but the gap narrows as education increases. At some point, a woman earns more than a man with the same level of education (and this amount of education is easily found once we have the estimated equation).

[Figure 7.2: Graphs of equation (7.16). Each panel plots log(wage) against educ, with separate lines for men and women. (a) $\delta_0 < 0$, $\delta_1 < 0$; (b) $\delta_0 < 0$, $\delta_1 > 0$.]

How can we estimate model (7.16)? To apply OLS, we must write the model with an interaction between female and educ:

$\log(wage) = \beta_0 + \delta_0\,female + \beta_1 educ + \delta_1\,female \cdot educ + u.$   (7.17)

The parameters can now be estimated from the regression of log(wage) on female, educ, and female·educ. Obtaining the interaction term is easy in any regression package. Do not be daunted by the odd nature of female·educ, which is zero for any man in the sample and equal to the level of education for any woman in the sample.
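A minimal sketch of estimating (7.17) in statsmodels, on synthetic data (the coefficients in the data-generating step are chosen only for illustration). In a formula, female*educ expands to female + educ + female:educ, so the package constructs the product column itself:

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(4)
n = 500
female = rng.integers(0, 2, n)
educ = rng.integers(8, 19, n)
lwage = 0.40 - 0.22 * female + 0.08 * educ \
        - 0.005 * female * educ + rng.normal(0.0, 0.4, n)
df = pd.DataFrame({"lwage": lwage, "female": female, "educ": educ})

res = smf.ols("lwage ~ female * educ", data=df).fit()
print(res.params["female:educ"])    # delta_1-hat: the slope difference
print(res.tvalues["female:educ"])   # t statistic for H0: delta_1 = 0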
An important hypothesis is that the return to education is the same for women and men. In terms of model (7.17), this is stated as $H_0\colon \delta_1 = 0$, which means that the slope of log(wage) with respect to educ is the same for men and women. Note that this hypothesis puts no restrictions on the difference in intercepts, $\delta_0$. A wage differential between men and women is allowed under this null, but it must be the same at all levels of education. This situation is described by Figure 7.1.

We are also interested in the hypothesis that average wages are identical for men and women who have the same levels of education. This means that $\delta_0$ and $\delta_1$ must both be zero under the null hypothesis. In equation (7.17), we must use an F test to test $H_0\colon \delta_0 = 0, \delta_1 = 0$. In the model with just an intercept difference, we reject this hypothesis because $H_0\colon \delta_0 = 0$ is soundly rejected against $H_1\colon \delta_0 < 0$.

Example 7.10 (Log Hourly Wage Equation)

We add quadratics in experience and tenure to (7.17):

$\widehat{\log(wage)} = .389 - .227\,female + .082\,educ - .0056\,female \cdot educ + .029\,exper - .00058\,exper^2 + .032\,tenure - .00059\,tenure^2$   (7.18)
(.119) (.168) (.008) (.0131) (.005) (.00011) (.007) (.00024)
$n = 526,\ R^2 = .441.$

The estimated return to education for men in this equation is .082, or 8.2%. For women, it is .082 − .0056 = .0764, or about 7.6%. The difference, −.56 percentage points (just over one-half a percentage point less for women), is neither economically large nor statistically significant: the t statistic is −.0056/.0131 ≈ −.43. Thus, we conclude that there is no evidence against the hypothesis that the return to education is the same for men and women.

The coefficient on female, while remaining economically large, is no longer significant at conventional levels (t = −1.35). Its coefficient and t statistic in the equation without the interaction were −.297 and −8.25, respectively [see equation (7.9)]. Should we now conclude that there is no statistically significant evidence of lower pay for women at the same levels of educ, exper, and tenure? This would be a serious error. Because we have added the interaction female·educ to the equation, the coefficient on female is now estimated much less precisely than it was in equation (7.9): the standard error has increased by almost fivefold (.168/.036 ≈ 4.67). This occurs because female and female·educ are highly correlated in the sample.

In this example, there is a useful way to think about the multicollinearity in equation (7.17), and in the more general equation estimated in (7.18): $\delta_0$ measures the wage differential between women and men when educ = 0. Very few people in the sample have very low levels of education, so it is not surprising that we have a difficult time estimating the differential at educ = 0 (nor is the differential at zero years of education very informative). More interesting would be to estimate the gender differential at, say, the average education level in the sample, about 12.5. To do this, we would replace female·educ with female·(educ − 12.5) and rerun the regression; this changes only the coefficient on female and its standard error. (See Computer Exercise C7; a sketch of the recentering appears below.)
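The recentering is one extra column. A minimal sketch, again on synthetic data generated as in the previous snippet:

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(5)
n = 500
df = pd.DataFrame({"female": rng.integers(0, 2, n),
                   "educ": rng.integers(8, 19, n)})
df["lwage"] = (0.40 - 0.22 * df["female"] + 0.08 * df["educ"]
               - 0.005 * df["female"] * df["educ"] + rng.normal(0.0, 0.4, n))

# Interact female with education centered at the sample mean. The fit is
# identical to female*educ, but the female coefficient (and its standard
# error) now refer to the differential at average education, not educ = 0.
df["educ_c"] = df["educ"] - df["educ"].mean()
res = smf.ols("lwage ~ female + educ + female:educ_c", data=df).fit()
print(res.params["female"], res.bse["female"])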
If we compute the F statistic for $H_0\colon \delta_0 = 0, \delta_1 = 0$, we obtain F = 34.33, which is a huge value for an F random variable with numerator df = 2 and denominator df = 518: the p-value is zero to four decimal places. In the end, we prefer model (7.9), which allows for a constant wage differential between women and men.

Exploring Further 7.4: How would you augment the model estimated in (7.18) to allow the return to tenure to differ by gender?

As a more complicated example involving interactions, we now look at the effects of race and city racial composition on major league baseball player salaries.

Example 7.11 (Effects of Race on Baseball Player Salaries)

Using MLB1, the following equation is estimated for the 330 major league baseball players for which city racial composition statistics are available. The variables black and hispan are binary indicators for the individual players. (The base group is white players.) The variable percblck is the percentage of the team's city that is black, and perchisp is the percentage of Hispanics. The other variables measure aspects of player productivity and longevity. Here, we are interested in race effects after controlling for these other factors. In addition to including black and hispan in the equation, we add the interactions black·percblck and hispan·perchisp. The estimated equation is

$\widehat{\log(salary)} = 10.34 + .0673\,years + .0089\,gamesyr + .00095\,bavg + .0146\,hrunsyr + .0045\,rbisyr + .0072\,runsyr + .0011\,fldperc + .0075\,allstar - .198\,black - .190\,hispan + .0125\,black \cdot percblck + .0201\,hispan \cdot perchisp$   (7.19)
(2.18) (.0129) (.0034) (.00151) (.0164) (.0076) (.0046) (.0021) (.0029) (.125) (.153) (.0050) (.0098)
$n = 330,\ R^2 = .638.$

First, we should test whether the four race variables, black, hispan, black·percblck, and hispan·perchisp, are jointly significant. Using the same 330 players, the R-squared when the four race variables are dropped is .626. Since there are four restrictions, and df = 330 − 13 in the unrestricted model, the F statistic is about 2.63, which yields a p-value of .034. Thus, these variables are jointly significant at the 5% level (though not at the 1% level).

How do we interpret the coefficients on the race variables? In the following discussion, all productivity factors are held fixed. First, consider what happens for black players, holding perchisp fixed. The coefficient −.198 on black literally means that, if a black player is in a city with no blacks (percblck = 0), then the black player earns about 19.8% less than a comparable white player. As percblck increases (which means the white population decreases, since perchisp is held fixed), the salary of blacks increases relative to that for whites. In a city with 10% blacks, log(salary) for blacks, compared to that for whites, is −.198 + .0125(10) = −.073, so salary is about 7.3% less for blacks than for whites in such a city. When percblck = 20, blacks earn about 5.2% more than whites. The largest percentage of blacks in a city is about 74% (Detroit).

Similarly, Hispanics earn less than whites in cities with a low percentage of Hispanics. But we can easily find the value of perchisp that makes the differential between whites and Hispanics equal zero: it must make $-.190 + .0201\,perchisp = 0$, which gives $perchisp \approx 9.45$. For cities in which the percentage of Hispanics is less than 9.45%, Hispanics are predicted to earn less than whites (for a given black population), and the opposite is true if the percentage of Hispanics is above 9.45%. Twelve of the 22 cities represented in the sample have Hispanic populations that are less than 9.45% of the total population. The largest percentage of Hispanics is about 31%.

How do we interpret these findings? We cannot simply claim discrimination exists against blacks and Hispanics, because the estimates imply that whites earn less than blacks and Hispanics in cities heavily populated by minorities. The importance of city composition on salaries might be due to player preferences: perhaps the best black players live disproportionately in cities with more blacks, and the best Hispanic players tend to be in cities with more Hispanics. The estimates in (7.19) allow us to determine that some relationship is present, but we cannot distinguish between these two hypotheses.

7.4c Testing for Differences in Regression Functions across Groups

The previous examples illustrate that interacting dummy variables with other independent variables can be a powerful tool. Sometimes, we wish to test the null hypothesis that two populations or groups follow the same regression function, against the alternative that one or more of the slopes differ across the groups. We will also see examples of this in Chapter 13, when we discuss pooling different cross sections over time.

Suppose we want to test whether the same regression model describes college grade point averages for male and female college athletes. The equation is

$cumgpa = \beta_0 + \beta_1 sat + \beta_2 hsperc + \beta_3 tothrs + u,$

where sat is SAT score, hsperc is high school rank percentile, and tothrs is total hours of college courses. We know that, to allow for an intercept difference, we can include a dummy variable for either males or females. If we want any of the slopes to depend on gender, we simply interact the appropriate variable with, say, female, and include it in the equation.

If we are interested in testing whether there is any difference between men and women, then we must allow a model where the intercept and all slopes can be different across the two groups:

$cumgpa = \beta_0 + \delta_0\,female + \beta_1 sat + \delta_1\,female \cdot sat + \beta_2 hsperc + \delta_2\,female \cdot hsperc + \beta_3 tothrs + \delta_3\,female \cdot tothrs + u.$   (7.20)

The parameter $\delta_0$ is the difference in the intercept between women and men, $\delta_1$ is the slope difference with respect to sat between women and men, and so on. The null hypothesis that cumgpa follows the same model for males and females is stated as

$H_0\colon \delta_0 = 0,\ \delta_1 = 0,\ \delta_2 = 0,\ \delta_3 = 0.$   (7.21)

If one of the $\delta_j$ is different from zero, then the model is different for men and women.

Using the spring semester data from the file GPA3, the full model is estimated as
cumgpa 5 148 2 353 female 1 0011 sat 1 00075 female sat 10212 14112 100022 1000392 20085 hsperc 2 00055 female hsperc 1 0023 tothrs 100142 1003162 100092 722 200012 female tothrs 1001632 n 5 366 R2 5 406 R2 5 394 None of the four terms involving the female dummy variable is very statistically significant only the femalesat interaction has a t statistic close to two But we know better than to rely on the individual t statistics for testing a joint hypothesis such as 721 To compute the F statistic we must estimate the restricted model which results from dropping female and all of the interactions this gives an R2 the restricted R2 of about 352 so the F statistic is about 814 the pvalue is zero to five decimal places which causes us to soundly reject 721 Thus men and women athletes do follow different GPA models even though each term in 722 that allows women and men to be different is individually insignificant at the 5 level The large standard errors on female and the interaction terms make it difficult to tell exactly how men and women differ We must be very careful in interpreting equation 722 because in obtaining differences between women and men the interaction terms must be taken into account If we look only at the female variable we would wrongly conclude that cumgpa is about 353 less for women than for men holding other factors fixed This is the estimated difference only when sat hsperc and tothrs are all set to zero which is not close to being a possible scenario At sat 5 1 100 hsperc 5 10 and tothrs 5 50 the predicted difference between a woman and a man is 2353 1 00075111002 2 000551102 2 000121502 461 That is the female athlete is pre dicted to have a GPA that is almost onehalf a point higher than the comparable male athlete In a model with three variables sat hsperc and tothrs it is pretty simple to add all of the inter actions to test for group differences In some cases many more explanatory variables are involved and then it is convenient to have a different way to compute the statistic It turns out that the sum of squared residuals form of the F statistic can be computed easily even when many independent vari ables are involved In the general model with k explanatory variables and an intercept suppose we have two groups call them g 5 1 and g 5 2 We would like to test whether the intercept and all slopes are the same across the two groups Write the model as y 5 bg 0 1 bg 1x1 1 bg 2x2 1 p 1 bg kxk 1 u 723 Copyright 2016 Cengage Learning All Rights Reserved May not be copied scanned or duplicated in whole or in part Due to electronic rights some third party content may be suppressed from the eBook andor eChapters Editorial review has deemed that any suppressed content does not materially affect the overall learning experience Cengage Learning reserves the right to remove additional content at any time if subsequent rights restrictions require it CHAPTeR 7 Multiple Regression Analysis with Qualitative Information 223 for g 5 1 and g 5 2 The hypothesis that each beta in 723 is the same across the two groups involves k 1 1 restrictions in the GPA example k 1 1 5 4 The unrestricted model which we can think of as having a group dummy variable and k interaction terms in addition to the inter cept and variables themselves has n 2 21k 1 12 degrees of freedom In the GPA example n 2 21k 1 12 5 366 2 2142 5 358 So far there is nothing new The key insight is that the sum of squared residuals from the unrestricted model can be obtained from two separate regressions one for each group Let SSR1 
In the general model with k explanatory variables and an intercept, suppose we have two groups; call them g = 1 and g = 2. We would like to test whether the intercept and all slopes are the same across the two groups. Write the model as

y = βg,0 + βg,1 x₁ + βg,2 x₂ + … + βg,k xₖ + u      (7.23)

for g = 1 and g = 2. The hypothesis that each beta in (7.23) is the same across the two groups involves k + 1 restrictions (in the GPA example, k + 1 = 4). The unrestricted model, which we can think of as having a group dummy variable and k interaction terms in addition to the intercept and the variables themselves, has n − 2(k + 1) degrees of freedom. (In the GPA example, n − 2(k + 1) = 366 − 2(4) = 358.) So far, there is nothing new.

The key insight is that the sum of squared residuals from the unrestricted model can be obtained from two separate regressions, one for each group. Let SSR₁ be the sum of squared residuals obtained from estimating (7.23) for the first group; this involves n₁ observations. Let SSR₂ be the sum of squared residuals obtained from estimating the model using the second group (n₂ observations). In the previous example, if group 1 is females, then n₁ = 90 and n₂ = 276. Now, the sum of squared residuals for the unrestricted model is simply SSRᵤᵣ = SSR₁ + SSR₂. The restricted sum of squared residuals is just the SSR from pooling the groups and estimating a single equation, say SSR_P. Once we have these, we compute the F statistic as usual:

F = [SSR_P − (SSR₁ + SSR₂)] / (SSR₁ + SSR₂) · [n − 2(k + 1)] / (k + 1),      (7.24)

where n is the total number of observations. This particular F statistic is usually called the Chow statistic in econometrics. Because the Chow test is just an F test, it is only valid under homoskedasticity. In particular, under the null hypothesis, the error variances for the two groups must be equal. As usual, normality is not needed for asymptotic analysis.

To apply the Chow statistic to the GPA example, we need the SSR from the regression that pooled the groups together: this is SSR_P = 85.515. The SSR for the 90 women in the sample is SSR₁ = 19.603, and the SSR for the men is SSR₂ = 58.752. Thus, SSRᵤᵣ = 19.603 + 58.752 = 78.355. The F statistic is [(85.515 − 78.355)/78.355](358/4) ≈ 8.18; of course, subject to rounding error, this is what we get using the R-squared form of the test in the models with and without the interaction terms. (A word of caution: there is no simple R-squared form of the test if separate regressions have been estimated for each group; the R-squared form can be used only if interactions have been included to create the unrestricted model.)

One important limitation of the traditional Chow test, regardless of the method used to implement it, is that the null hypothesis allows for no differences at all between the groups. In many cases, it is more interesting to allow for an intercept difference between the groups and then to test for slope differences; we saw one example of this in the wage equation in Example 7.10. There are two ways to allow the intercepts to differ under the null hypothesis. One is to include the group dummy and all interaction terms, as in equation (7.22), but then test joint significance of the interaction terms only. The second approach, which produces an identical statistic, is to form a sum-of-squared-residuals F statistic as in equation (7.24), but where the restricted SSR, called "SSR_P" in equation (7.24), is obtained using a regression that contains only an intercept shift. Because we are testing k restrictions rather than k + 1, the F statistic becomes

F = [SSR_P − (SSR₁ + SSR₂)] / (SSR₁ + SSR₂) · [n − 2(k + 1)] / k.

Using this approach in the GPA example, SSR_P is obtained from the regression cumgpa on female, sat, hsperc, and tothrs, using the data for both male and female student-athletes.

Because there are relatively few explanatory variables in the GPA example, it is easy to estimate (7.20) and test H₀: δ₁ = 0, δ₂ = 0, δ₃ = 0 (with δ₀ unrestricted under the null). The F statistic for the three exclusion restrictions gives a p-value equal to .205, and so we do not reject the null hypothesis at even the 20% significance level.
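The SSR form in (7.24) translates directly into a few lines of code. Again a hedged sketch under the same assumptions as the previous snippet (statsmodels and a DataFrame df holding the GPA3 spring sample); it is not the text's own program.

    # Minimal sketch of the Chow statistic (7.24) from three regressions.
    import statsmodels.formula.api as smf
    from scipy import stats

    formula = "cumgpa ~ sat + hsperc + tothrs"
    ssr_p = smf.ols(formula, data=df).fit().ssr                     # pooled, SSR_P
    ssr_1 = smf.ols(formula, data=df[df["female"] == 1]).fit().ssr  # women, SSR_1
    ssr_2 = smf.ols(formula, data=df[df["female"] == 0]).fit().ssr  # men,   SSR_2

    n, k = len(df), 3
    ssr_ur = ssr_1 + ssr_2
    F = (ssr_p - ssr_ur) / ssr_ur * (n - 2 * (k + 1)) / (k + 1)
    p = stats.f.sf(F, k + 1, n - 2 * (k + 1))
    print(F, p)   # about 8.18, with a p-value of essentially zero

Allowing an intercept difference under the null only requires replacing the pooled formula with "cumgpa ~ female + sat + hsperc + tothrs" and dividing by k instead of k + 1.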
Failure to reject the hypothesis that the parameters multiplying the interaction terms are all zero suggests that the best model allows for an intercept difference only:

cumgpa = 1.39 + .310 female + .0012 sat − .0084 hsperc + .0025 tothrs      (7.25)
        (0.18)  (.059)        (.0002)     (.0012)        (.0007)
n = 366, R² = .398, adjusted R² = .392.

The slope coefficients in (7.25) are close to those for the base group (males) in (7.22); dropping the interactions changes very little. However, female in (7.25) is highly significant: its t statistic is over 5, and the estimate implies that, at given levels of sat, hsperc, and tothrs, a female athlete has a predicted GPA that is .31 points higher than that of a male athlete. This is a practically important difference.

7.5 A Binary Dependent Variable: The Linear Probability Model

By now, we have learned much about the properties and applicability of the multiple linear regression model. In the last several sections, we studied how, through the use of binary independent variables, we can incorporate qualitative information as explanatory variables in a multiple regression model. In all of the models up until now, the dependent variable y has had quantitative meaning (for example, y is a dollar amount, a test score, a percentage, or the log of these). What happens if we want to use multiple regression to explain a qualitative event?

In the simplest case, and one that often arises in practice, the event we would like to explain is a binary outcome. In other words, our dependent variable, y, takes on only two values: zero and one. For example, y can be defined to indicate whether an adult has a high school education; y can indicate whether a college student used illegal drugs during a given school year; or y can indicate whether a firm was taken over by another firm during a given year. In each of these examples, we can let y = 1 denote one of the outcomes and y = 0 the other outcome.

What does it mean to write down a multiple regression model, such as

y = β₀ + β₁x₁ + … + βₖxₖ + u,      (7.26)

when y is a binary variable? Because y can take on only two values, βⱼ cannot be interpreted as the change in y given a one-unit increase in xⱼ, holding all other factors fixed: y either changes from zero to one or from one to zero (or does not change). Nevertheless, the βⱼ still have useful interpretations. If we assume that the zero conditional mean assumption MLR.4 holds, that is, E(u|x₁, …, xₖ) = 0, then we have, as always,

E(y|x) = β₀ + β₁x₁ + … + βₖxₖ,

where x is shorthand for all of the explanatory variables. The key point is that when y is a binary variable taking on the values zero and one, it is always true that P(y = 1|x) = E(y|x): the probability of "success" (that is, the probability that y = 1) is the same as the expected value of y. Thus, we have the important equation

P(y = 1|x) = β₀ + β₁x₁ + … + βₖxₖ,      (7.27)

which says that the probability of success, say p(x) = P(y = 1|x), is a linear function of the xⱼ. Equation (7.27) is an example of a binary response model, and P(y = 1|x) is also called the response probability.
We will cover other binary response models in Chapter 17. Because probabilities must sum to one, P(y = 0|x) = 1 − P(y = 1|x) is also a linear function of the xⱼ.

The multiple linear regression model with a binary dependent variable is called the linear probability model (LPM) because the response probability is linear in the parameters βⱼ. In the LPM, βⱼ measures the change in the probability of success when xⱼ changes, holding other factors fixed:

ΔP(y = 1|x) = βⱼΔxⱼ.      (7.28)

With this in mind, the multiple regression model can allow us to estimate the effect of various explanatory variables on qualitative events. The mechanics of OLS are the same as before.

If we write the estimated equation as

ŷ = β̂₀ + β̂₁x₁ + … + β̂ₖxₖ,

we must now remember that ŷ is the predicted probability of success. Therefore, β̂₀ is the predicted probability of success when each xⱼ is set to zero, which may or may not be interesting. The slope coefficient β̂₁ measures the predicted change in the probability of success when x₁ increases by one unit.

To correctly interpret a linear probability model, we must know what constitutes a "success." Thus, it is a good idea to give the dependent variable a name that describes the event y = 1. As an example, let inlf ("in the labor force") be a binary variable indicating labor force participation by a married woman during 1975: inlf = 1 if the woman reports working for a wage outside the home at some point during the year, and zero otherwise. We assume that labor force participation depends on other sources of income, including husband's earnings (nwifeinc, measured in thousands of dollars), years of education (educ), past years of labor market experience (exper), age, number of children less than six years old (kidslt6), and number of kids between 6 and 18 years of age (kidsge6). Using the data in MROZ from Mroz (1987), we estimate the following linear probability model, where 428 of the 753 women in the sample report being in the labor force at some point during 1975:

inlf = .586 − .0034 nwifeinc + .038 educ + .039 exper − .00060 exper²
      (.154)  (.0014)          (.007)      (.006)       (.00018)
     − .016 age − .262 kidslt6 + .013 kidsge6      (7.29)
      (.002)      (.034)         (.013)
n = 753, R² = .264.

Using the usual t statistics, all variables in (7.29) except kidsge6 are statistically significant, and all of the significant variables have the effects we would expect based on economic theory (or common sense).

To interpret the estimates, we must remember that a change in the independent variable changes the probability that inlf = 1. For example, the coefficient on educ means that, with everything else in (7.29) held fixed, another year of education increases the probability of labor force participation by .038. If we take this equation literally, 10 more years of education increases the probability of being in the labor force by .038(10) = .38, which is a pretty large increase in a probability. The relationship between the probability of labor force participation and educ is plotted in Figure 7.3. The other independent variables are fixed at the values nwifeinc = 50, exper = 5, age = 30, kidslt6 = 1, and kidsge6 = 0 for illustration purposes. The predicted probability is negative until education equals 3.84 years. This should not cause too much concern because, in this sample, no woman has less than five years of education. The largest reported education is 17 years, and this leads to a predicted probability of .5. If we set the other independent variables at different values, the range of predicted probabilities would change.
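Estimating (7.29) is ordinary OLS with a zero-one dependent variable. The sketch below is illustrative only, assuming the MROZ data are available (here again via the hypothetical community wooldridge data package, with the column names used in the text).

    # Minimal sketch: the linear probability model (7.29) estimated by OLS.
    import wooldridge
    import statsmodels.formula.api as smf

    mroz = wooldridge.data("mroz")
    lpm = smf.ols("inlf ~ nwifeinc + educ + exper + I(exper**2)"
                  " + age + kidslt6 + kidsge6", data=mroz).fit()
    print(lpm.summary())   # slopes are changes in P(inlf = 1), as in (7.28)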
But the marginal effect of another year of education on the probability of labor force participation is always .038.

The coefficient on nwifeinc implies that, if Δnwifeinc = 10 (which means an increase of $10,000), the probability that a woman is in the labor force falls by .034. This is not an especially large effect, given that an increase in income of $10,000 is substantial in terms of 1975 dollars. Experience has been entered as a quadratic to allow the effect of past experience to have a diminishing effect on the labor force participation probability. Holding other factors fixed, the estimated change in the probability is approximated as .039 − 2(.00060)exper = .039 − .0012 exper. The point at which past experience has no effect on the probability of labor force participation is .039/.0012 = 32.5, which is a high level of experience: only 13 of the 753 women in the sample have more than 32 years of experience.

Unlike the number of older children, the number of young children has a huge impact on labor force participation. Having one additional child less than six years old reduces the probability of participation by .262, at given levels of the other variables. In the sample, just under 20% of the women have at least one young child.

This example illustrates how easy linear probability models are to estimate and interpret, but it also highlights some shortcomings of the LPM. First, it is easy to see that, if we plug certain combinations of values for the independent variables into (7.29), we can get predictions either less than zero or greater than one. Since these are predicted probabilities, and probabilities must be between zero and one, this can be a little embarrassing. For example, what would it mean to predict that a woman is in the labor force with a probability of −.10? In fact, of the 753 women in the sample, 16 of the fitted values from (7.29) are less than zero, and 17 of the fitted values are greater than one.

A related problem is that a probability cannot be linearly related to the independent variables for all their possible values. For example, (7.29) predicts that the effect of going from zero children to one young child reduces the probability of working by .262. This is also the predicted drop if the woman goes from having one young child to two. It seems more realistic that the first small child would reduce the probability by a large amount, but subsequent children would have a smaller marginal effect.

[Figure 7.3: Estimated relationship between the probability of being in the labor force and years of education, with other explanatory variables fixed. The fitted line has slope .038, equals −.146 at educ = 0, crosses zero at educ = 3.84, and reaches .5 at educ = 17.]
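Checking the out-of-range predictions claim is straightforward once a fitted model object is in hand; the count below continues the hypothetical lpm sketch from above.

    # Minimal sketch: fitted "probabilities" from (7.29) outside the unit interval.
    fitted = lpm.fittedvalues
    print((fitted < 0).sum(), (fitted > 1).sum())   # 16 and 17 in this sample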
In fact, when taken to the extreme, (7.29) implies that going from zero to four young children reduces the probability of working by Δinlf = .262(Δkidslt6) = .262(4) = 1.048, which is impossible.

Even with these problems, the linear probability model is useful and often applied in economics. It usually works well for values of the independent variables that are near the averages in the sample. In the labor force participation example, no women in the sample have four young children; in fact, only three women have three young children. Over 96% of the women have either no young children or one small child, and so we should probably restrict attention to this case when interpreting the estimated equation.

Predicted probabilities outside the unit interval are a little troubling when we want to make predictions. Still, there are ways to use the estimated probabilities (even if some are negative or greater than one) to predict a zero-one outcome. As before, let ŷᵢ denote the fitted values, which may not be bounded between zero and one. Define a predicted value as ỹᵢ = 1 if ŷᵢ ≥ .5 and ỹᵢ = 0 if ŷᵢ < .5. Now we have a set of predicted values, ỹᵢ, i = 1, …, n, that, like the yᵢ, are either zero or one. We can use the data on yᵢ and ỹᵢ to obtain the frequencies with which we correctly predict yᵢ = 1 and yᵢ = 0, as well as the proportion of overall correct predictions. The latter measure, when turned into a percentage, is a widely used goodness-of-fit measure for binary dependent variables: the percent correctly predicted. An example is given in Computer Exercise C9(v), and further discussion, in the context of more advanced models, can be found in Section 17.1.

Due to the binary nature of y, the linear probability model does violate one of the Gauss-Markov assumptions. When y is a binary variable, its variance, conditional on x, is

Var(y|x) = p(x)[1 − p(x)],      (7.30)

where p(x) is shorthand for the probability of success: p(x) = β₀ + β₁x₁ + … + βₖxₖ. This means that, except in the case where the probability does not depend on any of the independent variables, there must be heteroskedasticity in a linear probability model. We know from Chapter 3 that this does not cause bias in the OLS estimators of the βⱼ. But we also know from Chapters 4 and 5 that homoskedasticity is crucial for justifying the usual t and F statistics, even in large samples. Because the standard errors in (7.29) are not generally valid, we should use them with caution. We will show how to correct the standard errors for heteroskedasticity in Chapter 8. It turns out that, in many applications, the usual OLS statistics are not far off, and it is still acceptable in applied work to present a standard OLS analysis of a linear probability model.
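As a concrete illustration of the percent correctly predicted, continuing the hypothetical MROZ sketch from above:

    # Minimal sketch: percent correctly predicted for the LPM, .5 threshold.
    pred = (lpm.fittedvalues >= 0.5).astype(int)   # the y-tilde values
    print(100 * (pred == mroz["inlf"]).mean())     # overall percent correct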
Example 7.12 A Linear Probability Model of Arrests

Let arr86 be a binary variable equal to unity if a man was arrested during 1986, and zero otherwise. The population is a group of young men in California born in 1960 or 1961 who have at least one arrest prior to 1986. A linear probability model for describing arr86 is

arr86 = β₀ + β₁pcnv + β₂avgsen + β₃tottime + β₄ptime86 + β₅qemp86 + u,

where
pcnv = the proportion of prior arrests that led to a conviction,
avgsen = the average sentence served from prior convictions (in months),
tottime = months spent in prison since age 18 prior to 1986,
ptime86 = months spent in prison in 1986, and
qemp86 = the number of quarters (0 to 4) that the man was legally employed in 1986.

The data we use are in CRIME1, the same data set used for Example 3.5. Here, we use a binary dependent variable because only 7.2% of the men in the sample were arrested more than once. About 27.7% of the men were arrested at least once during 1986. The estimated equation is

arr86 = .441 − .162 pcnv + .0061 avgsen − .0023 tottime − .022 ptime86 − .043 qemp86      (7.31)
       (.017)  (.021)      (.0065)        (.0050)         (.005)          (.005)
n = 2,725, R² = .0474.

The intercept, .441, is the predicted probability of arrest for someone who has not been convicted (and so pcnv and avgsen are both zero), has spent no time in prison since age 18, spent no time in prison in 1986, and was unemployed during the entire year. The variables avgsen and tottime are insignificant, both individually and jointly (the F test gives p-value = .347), and avgsen has a counterintuitive sign if longer sentences are supposed to deter crime. Grogger (1991), using a superset of these data and different econometric methods, found that tottime has a statistically significant positive effect on arrests and concluded that tottime is a measure of human capital built up in criminal activity.

Increasing the probability of conviction does lower the probability of arrest, but we must be careful when interpreting the magnitude of the coefficient. The variable pcnv is a proportion between zero and one; thus, changing pcnv from zero to one essentially means a change from no chance of being convicted to being convicted with certainty. Even this large change reduces the probability of arrest only by .162; increasing pcnv by .5 decreases the probability of arrest by .081.

The incarcerative effect is given by the coefficient on ptime86. If a man is in prison, he cannot be arrested. Since ptime86 is measured in months, six more months in prison reduces the probability of arrest by .022(6) = .132. Equation (7.31) gives another example of where the linear probability model cannot be true over all ranges of the independent variables. If a man is in prison all 12 months of 1986, he cannot be arrested in 1986. Setting all other variables equal to zero, the predicted probability of arrest when ptime86 = 12 is .441 − .022(12) = .177, which is not zero. Nevertheless, if we start from the unconditional probability of arrest, .277, 12 months in prison reduces the probability to essentially zero: .277 − .022(12) = .013.

Finally, employment reduces the probability of arrest in a significant way. All other factors fixed, a man employed in all four quarters is .172 less likely to be arrested than a man who is not employed at all.

We can also include dummy independent variables in models with dummy dependent variables. The coefficient measures the predicted difference in probability relative to the base group. For example, if we add two race dummies, black and hispan, to the arrest equation, we obtain

arr86 = .380 − .152 pcnv + .0046 avgsen − .0026 tottime − .024 ptime86 − .038 qemp86 + .170 black + .096 hispan      (7.32)
       (.019)  (.021)      (.0064)        (.0049)         (.005)          (.005)        (.024)       (.021)
n = 2,725, R² = .0682.

The coefficient on black means that, all other factors being equal, a black man has a .17 higher chance of being arrested than a white man (the base group). Another way to say this is that the probability of arrest is 17 percentage points higher for blacks than for whites. The difference is statistically significant as well. Similarly, Hispanic men have a .096 higher chance of being arrested than white men.
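Since (7.30) guarantees heteroskedasticity in any LPM, a natural check on equations like (7.32) is to recompute the standard errors in a heteroskedasticity-robust form, previewing Chapter 8. A hedged sketch under the same hypothetical wooldridge-package assumption; arr86 is constructed from the arrest count narr86 as the example describes.

    # Minimal sketch: the arrest LPM (7.32) with robust (HC1) standard errors.
    import wooldridge
    import statsmodels.formula.api as smf

    crime1 = wooldridge.data("crime1")
    crime1["arr86"] = (crime1["narr86"] > 0).astype(int)   # arrested in 1986?

    arr = smf.ols("arr86 ~ pcnv + avgsen + tottime + ptime86 + qemp86"
                  " + black + hispan", data=crime1).fit(cov_type="HC1")
    print(arr.summary())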
Exploring Further 7.5: What is the predicted probability of arrest for a black man with no prior convictions (so that pcnv, avgsen, tottime, and ptime86 are all zero) who was employed all four quarters in 1986? Does this seem reasonable?

7.6 More on Policy Analysis and Program Evaluation

We have seen some examples of models containing dummy variables that can be useful for evaluating policy. Example 7.3 gave an example of program evaluation, where some firms received job training grants and others did not.

As we mentioned earlier, we must be careful when evaluating programs because, in most examples in the social sciences, the control and treatment groups are not randomly assigned. Consider again the Holzer et al. (1993) study, where we are now interested in the effect of the job training grants on worker productivity (as opposed to the amount of job training). The equation of interest is

log(scrap) = β₀ + β₁grant + β₂log(sales) + β₃log(employ) + u,

where scrap is the firm's scrap rate, and the latter two variables are included as controls. The binary variable grant indicates whether the firm received a grant in 1988 for job training.

Before we look at the estimates, we might be worried that the unobserved factors affecting worker productivity (such as average levels of education, ability, experience, and tenure) might be correlated with whether the firm receives a grant. Holzer et al. point out that grants were given on a first-come, first-served basis. But this is not the same as giving out grants randomly. It might be that firms with less productive workers saw an opportunity to improve productivity and therefore were more diligent in applying for the grants.

Using the data in JTRAIN for 1988, when firms actually were eligible to receive the grants, we obtain

log(scrap) = 4.99 − .052 grant − .455 log(sales) + .639 log(employ)      (7.33)
            (4.66)  (.431)       (.373)            (.365)
n = 50, R² = .072.

Seventeen out of the 50 firms received a training grant, and the average scrap rate is 3.47 across all firms. The point estimate of −.052 on grant means that, for given sales and employ, firms receiving a grant have scrap rates about 5.2% lower than firms without grants. This is the direction of the expected effect if the training grants are effective, but the t statistic is very small. Thus, from this cross-sectional analysis, we must conclude that the grants had no effect on firm productivity. We will return to this example in Chapter 9 and show how adding information from a prior year leads to a much different conclusion.
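Equation (7.33) is a plain OLS regression on the 1988 cross section. The sketch below is one hypothetical way to reproduce it, assuming the wooldridge package's jtrain panel with a year column and missing scrap rates dropped.

    # Minimal sketch of (7.33): grant effect on log scrap rates, 1988 firms only.
    import numpy as np
    import wooldridge
    import statsmodels.formula.api as smf

    jtrain = wooldridge.data("jtrain")
    sub = jtrain[(jtrain["year"] == 1988) & jtrain["scrap"].notna()]

    res = smf.ols("np.log(scrap) ~ grant + np.log(sales) + np.log(employ)",
                  data=sub).fit()
    print(res.params)   # coefficient on grant near -.052, with a tiny t statistic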
Even in cases where the policy analysis does not involve assigning units to a control group and a treatment group, we must be careful to include factors that might be systematically related to the binary independent variable of interest. A good example of this is testing for racial discrimination. Race is something that is not determined by an individual or by government administrators. In fact, race would appear to be the perfect example of an exogenous explanatory variable, given that it is determined at birth. However, for historical reasons, race is often related to other relevant factors: there are systematic differences in backgrounds across race, and these differences can be important in testing for current discrimination.

As an example, consider testing for discrimination in loan approvals. If we can collect data on, say, individual mortgage applications, then we can define the dummy dependent variable approved as equal to one if a mortgage application was approved, and zero otherwise. A systematic difference in approval rates across races is an indication of discrimination. However, since approval depends on many other factors, including income, wealth, credit ratings, and a general ability to pay back the loan, we must control for them if there are systematic differences in these factors across race. A linear probability model to test for discrimination might look like the following:

approved = β₀ + β₁nonwhite + β₂income + β₃wealth + β₄credrate + other factors.

Discrimination against minorities is indicated by a rejection of H₀: β₁ = 0 in favor of H₁: β₁ < 0, because β₁ is the amount by which the probability of a nonwhite getting an approval differs from the probability of a white getting an approval, given the same levels of other variables in the equation. If income, wealth, and so on are systematically different across races, then it is important to control for these factors in a multiple regression analysis.

Another problem that often arises in policy and program evaluation is that individuals (or firms or cities) choose whether or not to participate in certain behaviors or programs. For example, individuals choose to use illegal drugs or drink alcohol. If we want to examine the effects of such behaviors on unemployment status, earnings, or criminal behavior, we should be concerned that drug usage might be correlated with other factors that can affect employment and criminal outcomes. Children eligible for programs such as Head Start participate based on parental decisions. Since family background plays a role in Head Start decisions and affects student outcomes, we should control for these factors when examining the effects of Head Start [see, for example, Currie and Thomas (1995)]. Individuals selected by employers or government agencies to participate in job training programs can participate or not, and this decision is unlikely to be random [see, for example, Lynch (1992)]. Cities and states choose whether to implement certain gun control laws, and it is likely that this decision is systematically related to other factors that affect violent crime [see, for example, Kleck and Patterson (1993)].

The previous paragraph gives examples of what are generally known as self-selection problems in economics. Literally, the term comes from the fact that individuals self-select into certain behaviors or programs: participation is not randomly determined. The term is used generally when a binary indicator of participation might be systematically related to unobserved factors. Thus, if we write the simple model

y = β₀ + β₁partic + u,      (7.34)

where y is an outcome variable and partic is a binary variable equal to unity if the individual, firm, or city participates in a behavior or a program (or has a certain kind of law), then we are worried that the average value of u depends on participation: E(u|partic = 1) ≠ E(u|partic = 0). As we know, this causes the simple regression estimator of β₁ to be biased, and so we will not uncover the true effect of participation. Thus, the self-selection problem is another way that an explanatory variable (partic in this case) can be endogenous.
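A tiny simulation makes the bias in (7.34) concrete. This is an illustrative sketch with made-up numbers, not anything from the text: the unobservable u both raises the outcome and makes participation more likely, while the true participation effect is set to zero.

    # Minimal sketch: self-selection bias in y = b0 + b1*partic + u.
    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(0)
    n = 5000
    u = rng.normal(size=n)                               # unobserved factor
    partic = (u + rng.normal(size=n) > 0).astype(float)  # selection depends on u
    y = 0.0 * partic + u                                 # true effect of partic is 0

    res = sm.OLS(y, sm.add_constant(partic)).fit()
    print(res.params)   # slope well above 0: E(u|partic=1) != E(u|partic=0)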
By now, we know that multiple regression analysis can, to some degree, alleviate the self-selection problem. Factors in the error term in (7.34) that are correlated with partic can be included in a multiple regression equation, assuming, of course, that we can collect data on these factors. Unfortunately, in many cases, we are worried that unobserved factors are related to participation, in which case multiple regression produces biased estimators.

With standard multiple regression analysis using cross-sectional data, we must be aware of finding spurious effects of programs on outcome variables due to the self-selection problem. A good example of this is contained in Currie and Cole (1993). These authors examine the effect of AFDC (Aid to Families with Dependent Children) participation on the birth weight of a child. Even after controlling for a variety of family and background characteristics, the authors obtain OLS estimates that imply participation in AFDC lowers birth weight. As the authors point out, it is hard to believe that AFDC participation itself causes lower birth weight. [See Currie (1995) for additional examples.] Using a different econometric method that we will discuss in Chapter 15, Currie and Cole find evidence for either no effect or a positive effect of AFDC participation on birth weight.

When the self-selection problem causes standard multiple regression analysis to be biased due to a lack of sufficient control variables, the more advanced methods covered in Chapters 13, 14, and 15 can be used instead.

7.7 Interpreting Regression Results with Discrete Dependent Variables

A binary response is the most extreme form of a discrete random variable: it takes on only two values, zero and one. As we discussed in Section 7.5, the parameters in a linear probability model can be interpreted as measuring the change in the probability that y = 1 due to a one-unit increase in an explanatory variable. We also discussed that, because y is a zero-one outcome, P(y = 1) = E(y), and this equality continues to hold when we condition on explanatory variables.

Other discrete dependent variables arise in practice, and we have already seen some examples, such as the number of times someone is arrested in a given year (Example 3.5). Studies on factors affecting fertility often use the number of living children as the dependent variable in a regression analysis. As with the number of arrests, the number of living children takes on a small set of integer values, and zero is a common value.
The data in FERTIL2, which contains information on a large sample of women in Botswana, is one such example. Often, demographers are interested in the effects of education on fertility, with special attention to trying to determine whether education has a causal effect on fertility. Such examples raise a question about how one interprets regression coefficients: after all, one cannot have a fraction of a child.

To illustrate the issues, the regression below uses the data in FERTIL2:

children = −1.997 + .175 age − .090 educ      (7.35)
           (.094)   (.003)     (.006)
n = 4,361, R² = .560.

At this time, we ignore the issue of whether this regression adequately controls for all factors that affect fertility. Instead, we focus on interpreting the regression coefficients.

Consider the main coefficient of interest, β̂educ = −.090. If we take this estimate literally, it says that each additional year of education reduces the estimated number of children by .090, something obviously impossible for any particular woman. A similar problem arises when trying to interpret β̂age = .175. How can we make sense of these coefficients?

To interpret regression results generally, even in cases where y is discrete and takes on a small number of values, it is useful to remember the interpretation of OLS as estimating the effects of the xⱼ on the expected (or average) value of y. Generally, under Assumptions MLR.1 and MLR.4,

E(y|x₁, x₂, …, xₖ) = β₀ + β₁x₁ + … + βₖxₖ.      (7.36)

Therefore, βⱼ is the effect of a ceteris paribus increase of xⱼ on the expected value of y. As we discussed in Section 6.4, for a given set of xⱼ values, we interpret the predicted value, β̂₀ + β̂₁x₁ + … + β̂ₖxₖ, as an estimate of E(y|x₁, x₂, …, xₖ). Therefore, β̂ⱼ is our estimate of how the average of y changes when Δxⱼ = 1 (keeping other factors fixed).

Seen in this light, we can now provide meaning to regression results as in equation (7.35). The coefficient β̂educ = −.090 means that we estimate that average fertility falls by .09 children, given one more year of education. A nice way to summarize this interpretation is that if each woman in a group of 100 obtains another year of education, we estimate there will be nine fewer children among them.

Adding dummy variables to regressions when y is itself discrete causes no problems when we interpret the estimated effect in terms of average values. Using the data in FERTIL2, we get

children = −2.071 + .177 age − .079 educ − .362 electric      (7.37)
           (.095)   (.003)     (.006)      (.068)
n = 4,358, R² = .562,

where electric is a dummy variable equal to one if the woman lives in a home with electricity. Of course, it cannot be true that a particular woman who has electricity has .362 fewer children than an otherwise comparable woman who does not. But we can say that, when comparing 100 women with electricity to 100 women without, at the same age and level of education, we estimate the former group to have about 36 fewer children.
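A sketch of (7.35) and (7.37), under the same hypothetical wooldridge-package assumption as the earlier snippets; the coefficients are read as effects on average fertility.

    # Minimal sketch: discrete dependent variable (number of children) by OLS.
    import wooldridge
    import statsmodels.formula.api as smf

    fertil2 = wooldridge.data("fertil2")
    print(smf.ols("children ~ age + educ", data=fertil2).fit().params)
    print(smf.ols("children ~ age + educ + electric", data=fertil2).fit().params)
    # educ near -.090 in the first fit: nine fewer children per 100 women,
    # each with one more year of education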
Incidentally, when y is discrete, the linear model does not always provide the best estimates of partial effects on E(y|x₁, x₂, …, xₖ). Chapter 17 contains more advanced models and estimation methods that tend to fit the data better when the range of y is limited in some substantive way. Nevertheless, a linear model estimated by OLS often provides a good approximation to the true partial effects, at least on average.

Summary

In this chapter, we have learned how to use qualitative information in regression analysis. In the simplest case, a dummy variable is defined to distinguish between two groups, and the coefficient estimate on the dummy variable estimates the ceteris paribus difference between the two groups. Allowing for more than two groups is accomplished by defining a set of dummy variables: if there are g groups, then g − 1 dummy variables are included in the model. All estimates on the dummy variables are interpreted relative to the base or benchmark group (the group for which no dummy variable is included in the model).

Dummy variables are also useful for incorporating ordinal information, such as a credit or a beauty rating, in regression models. We simply define a set of dummy variables representing different outcomes of the ordinal variable, allowing one of the categories to be the base group.

Dummy variables can be interacted with quantitative variables to allow slope differences across different groups. In the extreme case, we can allow each group to have its own slope on every variable, as well as its own intercept. The Chow test can be used to detect whether there are any differences across groups. In many cases, it is more interesting to test whether, after allowing for an intercept difference, the slopes for two different groups are the same. A standard F test can be used for this purpose in an unrestricted model that includes interactions between the group dummy and all variables.

The linear probability model, which is simply estimated by OLS, allows us to explain a binary response using regression analysis. The OLS estimates are now interpreted as changes in the probability of "success" (y = 1), given a one-unit increase in the corresponding explanatory variable. The LPM does have some drawbacks: it can produce predicted probabilities that are less than zero or greater than one, it implies a constant marginal effect of each explanatory variable that appears in its original form, and it contains heteroskedasticity. The first two problems are often not serious when we are obtaining estimates of the partial effects of the explanatory variables for the middle ranges of the data. Heteroskedasticity does invalidate the usual OLS standard errors and test statistics, but, as we will see in the next chapter, this is easily fixed in large enough samples.

Section 7.6 provides a discussion of how binary variables are used to evaluate policies and programs. As in all regression analysis, we must remember that program participation, or some other binary regressor with policy implications, might be correlated with unobserved factors that affect the dependent variable, resulting in the usual omitted variables bias.

We ended this chapter with a general discussion of how to interpret regression equations when the dependent variable is discrete. The key is to remember that the coefficients can be interpreted as the effects on the expected value of the dependent variable.
Key Terms

Base Group; Benchmark Group; Binary Variable; Chow Statistic; Control Group; Difference in Slopes; Dummy Variable Trap; Dummy Variables; Experimental Group; Interaction Term; Intercept Shift; Linear Probability Model (LPM); Ordinal Variable; Percent Correctly Predicted; Policy Analysis; Program Evaluation; Response Probability; Self-Selection; Treatment Group; Uncentered R-Squared; Zero-One Variable.

Problems

1 Using the data in SLEEP75 (see also Problem 3 in Chapter 3), we obtain the estimated equation

sleep = 3,840.83 − .163 totwrk − 11.71 educ − 8.70 age + .128 age² + 87.75 male
       (235.11)    (.018)        (5.86)       (11.21)    (.134)      (34.33)
n = 706, R² = .123, adjusted R² = .117.

The variable sleep is total minutes per week spent sleeping at night, totwrk is total weekly minutes spent working, educ and age are measured in years, and male is a gender dummy.
(i) All other factors being equal, is there evidence that men sleep more than women? How strong is the evidence?
(ii) Is there a statistically significant tradeoff between working and sleeping? What is the estimated tradeoff?
(iii) What other regression do you need to run to test the null hypothesis that, holding other factors fixed, age has no effect on sleeping?

2 The following equations were estimated using the data in BWGHT:

log(bwght) = 4.66 − .0044 cigs + .0093 log(faminc) + .016 parity + .027 male + .055 white
            (.22)   (.0009)      (.0059)             (.006)        (.010)      (.013)
n = 1,388, R² = .0472

and

log(bwght) = 4.65 − .0052 cigs + .0110 log(faminc) + .017 parity + .034 male + .045 white − .0030 motheduc + .0032 fatheduc
            (.38)   (.0010)      (.0085)             (.006)        (.011)      (.015)       (.0030)          (.0026)
n = 1,191, R² = .0493.

The variables are defined as in Example 4.9, but we have added a dummy variable for whether the child is male and a dummy variable indicating whether the child is classified as white.
(i) In the first equation, interpret the coefficient on the variable cigs. In particular, what is the effect on birth weight from smoking 10 more cigarettes per day?
(ii) How much more is a white child predicted to weigh than a nonwhite child, holding the other factors in the first equation fixed? Is the difference statistically significant?
(iii) Comment on the estimated effect and statistical significance of motheduc.
(iv) From the given information, why are you unable to compute the F statistic for joint significance of motheduc and fatheduc? What would you have to do to compute the F statistic?

3 Using the data in GPA2, the following equation was estimated:

sat = 1,028.10 + 19.30 hsize − 2.19 hsize² − 45.09 female − 169.81 black + 62.31 female·black
     (6.29)      (3.83)        (.53)         (4.29)         (12.71)        (18.15)
n = 4,137, R² = .0858.

The variable sat is the combined SAT score; hsize is size of the student's high school graduating class, in hundreds; female is a gender dummy variable; and black is a race dummy variable equal to one for blacks and zero otherwise.
(i) Is there strong evidence that hsize² should be included in the model? From this equation, what is the optimal high school size?
(ii) Holding hsize fixed, what is the estimated difference in SAT score between nonblack females and nonblack males? How statistically significant is this estimated difference?
(iii) What is the estimated difference in SAT score between nonblack males and black males? Test the null hypothesis that there is no difference between their scores, against the alternative that there is a difference.
(iv) What is the estimated difference in SAT score between black females and nonblack females? What would you need to do to test whether the difference is statistically significant?

4 An equation explaining chief executive officer salary is

log(salary) = 4.59 + .257 log(sales) + .011 roe + .158 finance + .181 consprod − .283 utility
             (.30)   (.032)            (.004)     (.089)         (.085)          (.099)
n = 209, R² = .357.

The data used are in CEOSAL1, where finance, consprod, and utility are binary variables indicating the financial, consumer products, and utilities industries. The omitted industry is transportation.
(i) Compute the approximate percentage difference in estimated salary between the utility and transportation industries, holding sales and roe fixed. Is the difference statistically significant at the 1% level?
(ii) Use equation (7.10) to obtain the exact percentage difference in estimated salary between the utility and transportation industries, and compare this with the answer obtained in part (i).
(iii) What is the approximate percentage difference in estimated salary between the consumer products and finance industries? Write an equation that would allow you to test whether the difference is statistically significant.

5 In Example 7.2, let noPC be a dummy variable equal to one if the student does not own a PC, and zero otherwise.
(i) If noPC is used in place of PC in equation (7.6), what happens to the intercept in the estimated equation? What will be the coefficient on noPC? (Hint: Write PC = 1 − noPC and plug this into the equation colGPA = β̂₀ + δ̂₀PC + β̂₁hsGPA + β̂₂ACT.)
(ii) What will happen to the R-squared if noPC is used in place of PC?
(iii) Should PC and noPC both be included as independent variables in the model? Explain.

6 To test the effectiveness of a job training program on the subsequent wages of workers, we specify the model

log(wage) = β₀ + β₁train + β₂educ + β₃exper + u,

where train is a binary variable equal to unity if a worker participated in the program. Think of the error term u as containing unobserved worker ability. If less able workers have a greater chance of being selected for the program, and you use an OLS analysis, what can you say about the likely bias in the OLS estimator of β₁? (Hint: Refer back to Chapter 3.)

7 In the example in equation (7.29), suppose that we define outlf to be one if the woman is out of the labor force, and zero otherwise.
(i) If we regress outlf on all of the independent variables in equation (7.29), what will happen to the intercept and slope estimates? (Hint: inlf = 1 − outlf. Plug this into the population equation inlf = β₀ + β₁nwifeinc + β₂educ + … and rearrange.)
(ii) What will happen to the standard errors on the intercept and slope estimates?
(iii) What will happen to the R-squared?

8 Suppose you collect data from a survey on wages, education, experience, and gender. In addition, you ask for information about marijuana usage. The original question is, "On how many separate occasions last month did you smoke marijuana?"
(i) Write an equation that would allow you to estimate the effects of marijuana usage on wage, while controlling for other factors. You should be able to make statements such as, "Smoking marijuana five more times per month is estimated to change wage by x%."
(ii) Write a model that would allow you to test whether drug usage has different effects on wages for men and women. How would you test that there are no differences in the effects of drug usage for men and women?
(iii) Suppose you think it is better to measure marijuana usage by putting people into one of four categories: nonuser, light user (1 to 5 times per month), moderate user (6 to 10 times per month), and heavy user (more than 10 times per month). Now, write a model that allows you to estimate the effects of marijuana usage on wage.
(iv) Using the model in part (iii), explain in detail how to test the null hypothesis that marijuana usage has no effect on wage. Be very specific and include a careful listing of degrees of freedom.
(v) What are some potential problems with drawing causal inference using the survey data that you collected?

9 Let d be a dummy (binary) variable and let z be a quantitative variable. Consider the model

y = β₀ + δ₀d + β₁z + δ₁d·z + u;

this is a general version of a model with an interaction between a dummy variable and a quantitative variable. (An example is in equation (7.17).)
(i) Since it changes nothing important, set the error to zero, u = 0. Then, when d = 0, we can write the relationship between y and z as the function f₀(z) = β₀ + β₁z. Write the same relationship when d = 1, where you should use f₁(z) on the left-hand side to denote the linear function of z.
(ii) Assuming that δ₁ ≠ 0 (which means the two lines are not parallel), show that the value of z* such that f₀(z*) = f₁(z*) is z* = −δ₀/δ₁. This is the point at which the two lines intersect, as in Figure 7.2(b). Argue that z* is positive if and only if δ₀ and δ₁ have opposite signs.
(iii) Using the data in TWOYEAR, the following equation can be estimated:

log(wage) = 2.289 − .357 female + .050 totcoll + .030 female·totcoll
           (.011)   (.015)        (.003)         (.005)
n = 6,763, R² = .202,

where all coefficients and standard errors have been rounded to three decimal places. Using this equation, find the value of totcoll such that the predicted values of log(wage) are the same for men and women.
(iv) Based on the equation in part (iii), can women realistically get enough years of college so that their earnings catch up to those of men? Explain.

10 For a child i living in a particular school district, let voucherᵢ be a dummy variable equal to one if a child is selected to participate in a school voucher program, and let scoreᵢ be that child's score on a subsequent standardized exam. Suppose that the participation variable, voucherᵢ, is completely randomized in the sense that it is independent of both observed and unobserved factors that can affect the test score.
(i) If you run a simple regression of scoreᵢ on voucherᵢ, using a random sample of size n, does the OLS estimator provide an unbiased estimator of the effect of the voucher program?
(ii) Suppose you can collect additional background information, such as family income, family structure (e.g., whether the child lives with both parents), and parents' education levels. Do you need to control for these factors to obtain an unbiased estimator of the effects of the voucher program? Explain.
(iii) Why should you include the family background variables in the regression? Is there a situation in which you would not include the background variables?

11 The following equations were estimated using the data in ECONMATH, with standard errors reported under coefficients. The average class score, measured as a percentage, is about 72.2; exactly 50% of the students are male; and the average of colgpa (grade point average at the start of the term) is about 2.81.

score = 32.31 + 14.32 colgpa
       (2.00)   (0.70)
n = 856, R² = .329, adjusted R² = .328

score = 29.66 + 3.83 male + 14.57 colgpa
       (2.04)   (0.74)      (0.69)
n = 856, R² = .349, adjusted R² = .348

score = 30.36 + 2.47 male + 14.33 colgpa + 0.479 male·colgpa
       (2.86)   (3.96)      (0.98)         (1.383)
n = 856, R² = .349, adjusted R² = .347

score = 30.36 + 3.82 male + 14.33 colgpa + 0.479 male·(colgpa − 2.81)
       (2.86)   (0.74)      (0.98)         (1.383)
n = 856, R² = .349, adjusted R² = .347

(i) Interpret the coefficient on male in the second equation and construct a 95% confidence interval for βmale. Does the confidence interval exclude zero?
(ii) In the third equation, why is the estimate on male so imprecise? Should we now conclude that there are no gender differences in score after controlling for colgpa? (Hint: You might want to compute an F statistic for the null hypothesis that there is no gender difference in the model with the interaction.)
(iii) Compared with the third equation, why is the coefficient on male in the last equation so much closer to that in the second equation and just as precisely estimated?

Computer Exercises

C1 Use the data in GPA1 for this exercise.
(i) Add the variables mothcoll and fathcoll to the equation estimated in (7.6) and report the results in the usual form. What happens to the estimated effect of PC ownership? Is PC still statistically significant?
(ii) Test for joint significance of mothcoll and fathcoll in the equation from part (i) and be sure to report the p-value.
(iii) Add hsGPA² to the model from part (i) and decide whether this generalization is needed.

C2 Use the data in WAGE2 for this exercise.
(i) Estimate the model

log(wage) = β₀ + β₁educ + β₂exper + β₃tenure + β₄married + β₅black + β₆south + β₇urban + u

and report the results in the usual form. Holding other factors fixed, what is the approximate difference in monthly salary between blacks and nonblacks? Is this difference statistically significant?
(ii) Add the variables exper² and tenure² to the equation and show that they are jointly insignificant at even the 20% level.
(iii) Extend the original model to allow the return to education to depend on race, and test whether the return to education does depend on race.
(iv) Again, start with the original model, but now allow wages to differ across four groups of people: married and black, married and nonblack, single and black, and single and nonblack. What is the estimated wage differential between married blacks and married nonblacks?

C3 A model that allows major league baseball player salary to differ by position is

log(salary) = β₀ + β₁years + β₂gamesyr + β₃bavg + β₄hrunsyr + β₅rbisyr + β₆runsyr + β₇fldperc + β₈allstar + β₉frstbase + β₁₀scndbase + β₁₁thrdbase + β₁₂shrtstop + β₁₃catcher + u,

where outfield is the base group.
(i) State the null hypothesis that, controlling for other factors, catchers and outfielders earn, on average, the same amount. Test this hypothesis using the data in MLB1 and comment on the size of the estimated salary differential.
(ii) State and test the null hypothesis that there is no difference in average salary across positions, once other factors have been controlled for.
(iii) Are the results from parts (i) and (ii) consistent? If not, explain what is happening.

C4 Use the data in GPA2 for this exercise.
(i) Consider the equation

colgpa = β₀ + β₁hsize + β₂hsize² + β₃hsperc + β₄sat + β₅female + β₆athlete + u,

where colgpa is cumulative college grade point average; hsize is size of high school graduating class, in hundreds; hsperc is academic percentile in graduating class; sat is combined SAT score; female is a binary gender variable; and athlete is a binary variable, which is one for student-athletes. What are your expectations for the coefficients in this equation? Which ones are you unsure about?
(ii) Estimate the equation in part (i) and report the results in the usual form. What is the estimated GPA differential between athletes and nonathletes? Is it statistically significant?
(iii) Drop sat from the model and reestimate the equation. Now, what is the estimated effect of being an athlete? Discuss why the estimate is different than that obtained in part (ii).
(iv) In the model from part (i), allow the effect of being an athlete to differ by gender and test the null hypothesis that there is no ceteris paribus difference between women athletes and women nonathletes.
(v) Does the effect of sat on colgpa differ by gender? Justify your answer.

C5 In Problem 2 in Chapter 4, we added the return on the firm's stock, ros, to a model explaining CEO salary; ros turned out to be insignificant. Now, define a dummy variable, rosneg, which is equal to one if ros < 0 and equal to zero if ros ≥ 0. Use CEOSAL1 to estimate the model

log(salary) = β₀ + β₁log(sales) + β₂roe + β₃rosneg + u.

Discuss the interpretation and statistical significance of β̂₃.

C6 Use the data in SLEEP75 for this exercise. The equation of interest is

sleep = β₀ + β₁totwrk + β₂educ + β₃age + β₄age² + β₅yngkid + u.

(i) Estimate this equation separately for men and women and report the results in the usual form. Are there notable differences in the two estimated equations?
(ii) Compute the Chow test for equality of the parameters in the sleep equation for men and women. Use the form of the test that adds male and the interaction terms male·totwrk, …, male·yngkid and uses the full set of observations.
What are the relevant df for the test? Should you reject the null at the 5% level?
(iii) Now, allow for a different intercept for males and females, and determine whether the interaction terms involving male are jointly significant.
(iv) Given the results from parts (ii) and (iii), what would be your final model?

C7 Use the data in WAGE1 for this exercise.
(i) Use equation (7.18) to estimate the gender differential when educ = 12.5. Compare this with the estimated differential when educ = 0.
(ii) Run the regression used to obtain (7.18), but with female·(educ − 12.5) replacing female·educ. How do you interpret the coefficient on female now?
(iii) Is the coefficient on female in part (ii) statistically significant? Compare this with (7.18) and comment.

C8 Use the data in LOANAPP for this exercise. The binary variable to be explained is approve, which is equal to one if a mortgage loan to an individual was approved. The key explanatory variable is white, a dummy variable equal to one if the applicant was white. The other applicants in the data set are black and Hispanic. To test for discrimination in the mortgage loan market, a linear probability model can be used:

approve = β₀ + β₁white + other factors.

(i) If there is discrimination against minorities, and the appropriate factors have been controlled for, what is the sign of β₁?
(ii) Regress approve on white and report the results in the usual form. Interpret the coefficient on white. Is it statistically significant? Is it practically large?
(iii) As controls, add the variables hrat, obrat, loanprc, unem, male, married, dep, sch, cosign, chist, pubrec, mortlat1, mortlat2, and vr. What happens to the coefficient on white? Is there still evidence of discrimination against nonwhites?
(iv) Now, allow the effect of race to interact with the variable measuring other obligations as a percentage of income (obrat). Is the interaction term significant?
(v) Using the model from part (iv), what is the effect of being white on the probability of approval when obrat = 32, which is roughly the mean value in the sample? Obtain a 95% confidence interval for this effect.

C9 There has been much interest in whether the presence of 401(k) pension plans, available to many U.S. workers, increases net savings. The data set 401KSUBS contains information on net financial assets (nettfa), family income (inc), a binary variable for eligibility in a 401(k) plan (e401k), and several other variables.
(i) What fraction of the families in the sample are eligible for participation in a 401(k) plan?
(ii) Estimate a linear probability model explaining 401(k) eligibility in terms of income, age, and gender. Include income and age in quadratic form, and report the results in the usual form.
(iii) Would you say that 401(k) eligibility is independent of income and age? What about gender? Explain.
(iv) Obtain the fitted values from the linear probability model estimated in part (ii). Are any fitted values negative or greater than one?
(v) Using the fitted values ê401kᵢ from part (iv), define ẽ401kᵢ = 1 if ê401kᵢ ≥ .5 and ẽ401kᵢ = 0 if ê401kᵢ < .5. Out of 9,275 families, how many are predicted to be eligible for a 401(k) plan?
families, how many are predicted to be eligible for a 401(k) plan?
(vi) For the 5,638 families not eligible for a 401(k), what percentage of these are predicted not to have a 401(k), using the predictor ẽ401kᵢ? For the 3,637 families eligible for a 401(k) plan, what percentage are predicted to have one? (It is helpful if your econometrics package has a "tabulate" command.)
(vii) The overall percent correctly predicted is about 64.9%. Do you think this is a complete description of how well the model does, given your answers in part (vi)?
(viii) Add the variable pira as an explanatory variable to the linear probability model. Other things equal, if a family has someone with an individual retirement account, how much higher is the estimated probability that the family is eligible for a 401(k) plan? Is it statistically different from zero at the 10% level?

C10 Use the data in NBASAL for this exercise.
(i) Estimate a linear regression model relating points per game to experience in the league and position (guard, forward, or center). Include experience in quadratic form, and use centers as the base group. Report the results in the usual form.
(ii) Why do you not include all three position dummy variables in part (i)?
(iii) Holding experience fixed, does a guard score more than a center? How much more? Is the difference statistically significant?
(iv) Now add marital status to the equation. Holding position and experience fixed, are married players more productive (based on points per game)?
(v) Add interactions of marital status with both experience variables. In this expanded model, is there strong evidence that marital status affects points per game?
(vi) Estimate the model from part (iv) but use assists per game as the dependent variable. Are there any notable differences from part (iv)? Discuss.

C11 Use the data in 401KSUBS for this exercise.
(i) Compute the average, standard deviation, minimum, and maximum values of nettfa in the sample.
(ii) Test the hypothesis that average nettfa does not differ by 401(k) eligibility status; use a two-sided alternative. What is the dollar amount of the estimated difference?
(iii) From part (ii) of Computer Exercise C9, it is clear that e401k is not exogenous in a simple regression model; at a minimum, it changes by income and age. Estimate a multiple linear regression model for nettfa that includes income, age, and e401k as explanatory variables. The income and age variables should appear as quadratics. Now, what is the estimated dollar effect of 401(k) eligibility?
(iv) To the model estimated in part (iii), add the interactions e401k·(age − 41) and e401k·(age − 41)². Note that the average age in the sample is about 41, so that in the new model, the coefficient on e401k is the estimated effect of 401(k) eligibility at the average age. Which interaction term is significant?
(v) Comparing the estimates from parts (iii) and (iv), do the estimated effects of 401(k) eligibility at age 41 differ much? Explain.
(vi) Now, drop the interaction terms from the model, but define five family size dummy variables: fsize1, fsize2, fsize3, fsize4, and fsize5. The variable fsize5 is unity for families with five or more
members. Include the family size dummies in the model estimated from part (iii); be sure to choose a base group. Are the family dummies significant at the 1% level?
(vii) Now, do a Chow test for the model

nettfa = β₀ + β₁inc + β₂inc² + β₃age + β₄age² + β₅e401k + u

across the five family size categories, allowing for intercept differences. The restricted sum of squared residuals, SSRᵣ, is obtained from part (vi) because that regression assumes all slopes are the same. The unrestricted sum of squared residuals is SSRᵤᵣ = SSR₁ + SSR₂ + ... + SSR₅, where SSR_f is the sum of squared residuals for the equation estimated using only family size f. You should convince yourself that there are 30 parameters in the unrestricted model (5 intercepts plus 25 slopes) and 10 parameters in the restricted model (5 intercepts plus 5 slopes). Therefore, the number of restrictions being tested is q = 20, and the df for the unrestricted model is 9,275 − 30 = 9,245.

C12 Use the data set in BEAUTY, which contains a subset of the variables (but more usable observations) than in the regressions reported by Hamermesh and Biddle (1994).
(i) Find the separate fractions of men and women that are classified as having above average looks. Are more people rated as having above average or below average looks?
(ii) Test the null hypothesis that the population fractions of above-average-looking women and men are the same. Report the one-sided p-value that the fraction is higher for women. (Hint: Estimating a simple linear probability model is easiest.)
(iii) Now estimate the model

log(wage) = β₀ + β₁belavg + β₂abvavg + u

separately for men and women, and report the results in the usual form. In both cases, interpret the coefficient on belavg. Explain in words what the hypothesis H₀: β₁ = 0 against H₁: β₁ < 0 means, and find the p-values for men and women.
(iv) Is there convincing evidence that women with above average looks earn more than women with average looks? Explain.
(v) For both men and women, add the explanatory variables educ, exper, exper², union, goodhlth, black, married, south, bigcity, smllcity, and service. Do the effects of the looks variables change in important ways?
(vi) Use the SSR form of the Chow F statistic to test whether the slopes of the regression functions in part (v) differ across men and women. Be sure to allow for an intercept shift under the null.

C13 Use the data in APPLE to answer this question.
(i) Define a binary variable as ecobuy = 1 if ecolbs > 0 and ecobuy = 0 if ecolbs = 0. In other words, ecobuy indicates whether, at the prices given, a family would buy any ecologically friendly apples. What fraction of families claim they would buy ecolabeled apples?
(ii) Estimate the linear probability model

ecobuy = β₀ + β₁ecoprc + β₂regprc + β₃faminc + β₄hhsize + β₅educ + β₆age + u,

and report the results in the usual form. Carefully interpret the coefficients on the price variables.
(iii) Are the nonprice variables jointly significant in the LPM? (Use the usual F statistic, even though it is not valid when there is heteroskedasticity.) Which explanatory variable other than the price variables
seems to have the most important effect on the decision to buy ecolabeled apples? Does this make sense to you?
(iv) In the model from part (ii), replace faminc with log(faminc). Which model fits the data better, using faminc or log(faminc)? Interpret the coefficient on log(faminc).
(v) In the estimation in part (iv), how many estimated probabilities are negative? How many are bigger than one? Should you be concerned?
(vi) For the estimation in part (iv), compute the percent correctly predicted for each outcome, ecobuy = 0 and ecobuy = 1. Which outcome is best predicted by the model?

C14 Use the data in CHARITY to answer this question. The variable respond is a dummy variable equal to one if a person responded with a contribution on the most recent mailing sent by a charitable organization. The variable resplast is a dummy variable equal to one if the person responded to the previous mailing, avggift is the average of past gifts (in Dutch guilders), and propresp is the proportion of times the person has responded to past mailings.
(i) Estimate a linear probability model relating respond to resplast and avggift. Report the results in the usual form, and interpret the coefficient on resplast.
(ii) Does the average value of past gifts seem to affect the probability of responding?
(iii) Add the variable propresp to the model, and interpret its coefficient. (Be careful here: an increase of one in propresp is the largest possible change.)
(iv) What happened to the coefficient on resplast when propresp was added to the regression? Does this make sense?
(v) Add mailsyear, the number of mailings per year, to the model. How big is its estimated effect? Why might this not be a good estimate of the causal effect of mailings on responding?

C15 Use the data in FERTIL2 to answer this question.
(i) Find the smallest and largest values of children in the sample. What is the average of children? Does any woman have exactly the average number of children?
(ii) What percentage of women have electricity in the home?
(iii) Compute the average of children for those without electricity and do the same for those with electricity. Comment on what you find. Test whether the population means are the same using a simple regression.
(iv) From part (iii), can you infer that having electricity causes women to have fewer children? Explain.
(v) Estimate a multiple regression model of the kind reported in equation (7.37), but add age², urban, and the three religious affiliation dummies. How does the estimated effect of having electricity compare with that in part (iii)? Is it still statistically significant?
(vi) To the equation in part (v), add an interaction between electric and educ. Is its coefficient statistically significant? What happens to the coefficient on electric?
(vii) The median and mode value for educ is 7. In the equation from part (vi), use the centered interaction term electric·(educ − 7) in place of electric·educ. What happens to the coefficient on electric compared with part (vi)? Why? How does the coefficient on electric compare with that in part (v)?

C16 Use the data in CATHOLIC to answer this question.
(i) In the entire sample,
what percentage of the students attend a Catholic high school? What is the average of math12 in the entire sample?
(ii) Run a simple regression of math12 on cathhs and report the results in the usual way. Interpret what you have found.
(iii) Now add the variables lfaminc, motheduc, and fatheduc to the regression from part (ii). How many observations are used in the regression? What happens to the coefficient on cathhs, along with its statistical significance?
(iv) Return to the simple regression of math12 on cathhs, but restrict the regression to observations used in the multiple regression from part (iii). Do any important conclusions change?
(v) To the multiple regression in part (iii), add interactions between cathhs and each of the other explanatory variables. Are the interaction terms individually or jointly significant?
(vi) What happens to the coefficient on cathhs in the regression from part (v)? Explain why this coefficient is not very interesting.
(vii) Compute the average partial effect of cathhs in the model estimated in part (v). How does it compare with the coefficients on cathhs in parts (iii) and (v)?

Chapter 8  Heteroskedasticity

The homoskedasticity assumption, introduced in Chapter 3 for multiple regression, states that the variance of the unobserved error, u, conditional on the explanatory variables, is constant. Homoskedasticity fails whenever the variance of the unobserved factors changes across different segments of the population, where the segments are determined by the different values of the explanatory variables. For example, in a savings equation, heteroskedasticity is present if the variance of the unobserved factors affecting savings increases with income.

In Chapters 4 and 5, we saw that homoskedasticity is needed to justify the usual t tests, F tests, and confidence intervals for OLS estimation of the linear regression model, even with large sample sizes. In this chapter, we discuss the available remedies when heteroskedasticity occurs, and we also show how to test for its presence. We begin by briefly reviewing the consequences of heteroskedasticity for ordinary least squares estimation.

8.1 Consequences of Heteroskedasticity for OLS

Consider again the multiple linear regression model

y = β₀ + β₁x₁ + β₂x₂ + ... + βₖxₖ + u.   (8.1)

In Chapter 3, we proved unbiasedness of the OLS estimators β̂₀, β̂₁, β̂₂, ..., β̂ₖ under the first four Gauss-Markov assumptions, MLR.1 through MLR.4. In Chapter 5, we showed that the same four assumptions imply consistency of OLS. The homoskedasticity assumption MLR.5, stated in terms of the error variance as Var(u|x₁, x₂, ..., xₖ) = σ², played no role in showing whether OLS was unbiased
or consistent. It is important to remember that heteroskedasticity does not cause bias or inconsistency in the OLS estimators of the βⱼ, whereas something like omitting an important variable would have this effect.

The interpretation of our goodness-of-fit measures, R² and R̄², is also unaffected by the presence of heteroskedasticity. Why? Recall from Section 6.3 that the usual R-squared and the adjusted R-squared are different ways of estimating the population R-squared, which is simply 1 − σᵤ²/σᵧ², where σᵤ² is the population error variance and σᵧ² is the population variance of y. The key point is that because both variances in the population R-squared are unconditional variances, the population R-squared is unaffected by the presence of heteroskedasticity in Var(u|x₁, ..., xₖ). Further, SSR/n consistently estimates σᵤ², and SST/n consistently estimates σᵧ², whether or not Var(u|x₁, ..., xₖ) is constant. The same is true when we use the degrees of freedom adjustments. Therefore, R² and R̄² are both consistent estimators of the population R-squared, whether or not the homoskedasticity assumption holds.

If heteroskedasticity does not cause bias or inconsistency in the OLS estimators, why did we introduce it as one of the Gauss-Markov assumptions? Recall from Chapter 3 that the estimators of the variances, Var(β̂ⱼ), are biased without the homoskedasticity assumption. Since the OLS standard errors are based directly on these variances, they are no longer valid for constructing confidence intervals and t statistics. The usual OLS t statistics do not have t distributions in the presence of heteroskedasticity, and the problem is not resolved by using large sample sizes. (We will see this explicitly for the simple regression case in the next section, where we derive the variance of the OLS slope estimator under heteroskedasticity and propose a valid estimator in the presence of heteroskedasticity.) Similarly, F statistics are no longer F distributed, and the LM statistic no longer has an asymptotic chi-square distribution. In summary, the statistics we used to test hypotheses under the Gauss-Markov assumptions are not valid in the presence of heteroskedasticity.

We also know that the Gauss-Markov Theorem, which says that OLS is best linear unbiased, relies crucially on the homoskedasticity assumption. If Var(u|x) is not constant, OLS is no longer BLUE. In addition, OLS is no longer asymptotically efficient in the class of estimators described in Theorem 5.3. As we will see in Section 8.4, it is possible to find estimators that are more efficient than OLS in the presence of heteroskedasticity (although it requires knowing the form of the heteroskedasticity). With relatively large sample sizes, it might not be so important to obtain an efficient estimator. In the next section, we show how the usual OLS test statistics can be modified so that they are valid, at least asymptotically.
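These consequences are easy to see numerically. The following minimal simulation (all names and numbers are my own, purely illustrative, not from the text) draws repeated samples with Var(u|x) = x²: the OLS slope estimates center on the true value, confirming there is no bias, but the usual standard error formula systematically misses the true sampling variation of the slope.

```python
import numpy as np

# Monte Carlo sketch: OLS stays unbiased under heteroskedasticity,
# but the usual standard error misestimates the sampling variation.
rng = np.random.default_rng(123)
n, reps, beta1 = 200, 5000, 0.5
slopes, usual_ses = [], []
for _ in range(reps):
    x = rng.uniform(1, 10, n)
    u = rng.normal(scale=x)            # Var(u|x) = x^2: heteroskedastic
    y = 1.0 + beta1 * x + u
    xd = x - x.mean()
    b1 = (xd @ y) / (xd @ xd)          # OLS slope
    resid = y - y.mean() - b1 * xd     # OLS residuals (intercept model)
    sigma2 = resid @ resid / (n - 2)
    usual_ses.append(np.sqrt(sigma2 / (xd @ xd)))  # usual OLS formula
    slopes.append(b1)

print(np.mean(slopes))     # close to 0.5: no bias
print(np.std(slopes))      # true sampling sd of the slope
print(np.mean(usual_ses))  # usual SE: clearly off from the line above
```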
8.2 Heteroskedasticity-Robust Inference after OLS Estimation

Because testing hypotheses is such an important component of any econometric analysis, and the usual OLS inference is generally faulty in the presence of heteroskedasticity, we must decide if we should entirely abandon OLS. Fortunately, OLS is still useful. In the last two decades, econometricians have learned how to adjust standard errors and t, F, and LM statistics so that they are valid in the presence of heteroskedasticity of unknown form. This is very convenient because it means we can report new statistics that work regardless of the kind of heteroskedasticity present in the population. The methods in this section are known as heteroskedasticity-robust procedures because they are valid, at least in large samples, whether or not the errors have constant variance, and we do not need to know which is the case.

We begin by sketching how the variances, Var(β̂ⱼ), can be estimated in the presence of heteroskedasticity. A careful derivation of the theory is well beyond the scope of this text, but the application of heteroskedasticity-robust methods is very easy now because many statistics and econometrics packages compute these statistics as an option.

First, consider the model with a single independent variable, where we include an i subscript for emphasis:

yᵢ = β₀ + β₁xᵢ + uᵢ.

We assume throughout that the first four Gauss-Markov assumptions hold. If the errors contain heteroskedasticity, then Var(uᵢ|xᵢ) = σᵢ², where we put an i subscript on σ² to indicate that the variance of the error depends upon the particular value of xᵢ.

Write the OLS estimator as

β̂₁ = β₁ + [Σᵢ₌₁ⁿ (xᵢ − x̄)uᵢ] / [Σᵢ₌₁ⁿ (xᵢ − x̄)²].

Under Assumptions MLR.1 through MLR.4 (that is, without the homoskedasticity assumption), and conditioning on the values xᵢ in the sample, we can use the same arguments from Chapter 2 to show that

Var(β̂₁) = [Σᵢ₌₁ⁿ (xᵢ − x̄)²σᵢ²] / SSTₓ²,   (8.2)

where SSTₓ = Σᵢ₌₁ⁿ (xᵢ − x̄)² is the total sum of squares of the xᵢ. When σᵢ² = σ² for all i, this formula reduces to the usual form, σ²/SSTₓ. Equation (8.2) explicitly shows that, for the simple regression case, the variance formula derived under homoskedasticity is no longer valid when heteroskedasticity is present.

Since the standard error of β̂₁ is based directly on estimating Var(β̂₁), we need a way to estimate equation (8.2) when heteroskedasticity is present. White (1980) showed how this can be done. Let ûᵢ denote the OLS residuals from the initial regression of y on x. Then, a valid estimator of Var(β̂₁), for heteroskedasticity of any form (including homoskedasticity), is

[Σᵢ₌₁ⁿ (xᵢ − x̄)²ûᵢ²] / SSTₓ²,   (8.3)

which is easily computed from the data after the OLS regression.

In what sense is (8.3) a valid estimator of Var(β̂₁)? This is pretty subtle. Briefly, it can be shown that when equation (8.3) is multiplied by the sample size n, it converges in probability to E[(xᵢ − μₓ)²uᵢ²]/(σₓ²)², which is the probability limit of n times (8.2). Ultimately, this is what is necessary for justifying the use of standard errors to construct confidence intervals and t statistics. The law of large numbers and the central limit theorem play key roles in establishing these convergences. You can refer to White's original paper for details, but that paper is quite technical. See also Wooldridge (2010, Chapter 4).

A similar formula works in the general multiple regression model

y = β₀ + β₁x₁ + ... + βₖxₖ + u.

It can be shown that a valid estimator of Var(β̂ⱼ), under Assumptions MLR.1 through MLR.4, is

Var̂(β̂ⱼ) = [Σᵢ₌₁ⁿ r̂ᵢⱼ²ûᵢ²] / SSRⱼ²,   (8.4)

where r̂ᵢⱼ denotes the ith residual from regressing xⱼ on all other independent variables, and SSRⱼ is the sum of squared residuals from this regression (see Section 3.2 for the partialling out representation of the OLS estimates). The square root of the quantity in (8.4) is called the heteroskedasticity-robust standard error for β̂ⱼ. In econometrics, these robust standard errors are usually attributed to White (1980). Earlier works in statistics, notably those by Eicker (1967) and Huber (1967), pointed to the possibility of obtaining such robust standard errors. In applied work, these are sometimes called White, Huber, or Eicker standard errors (or some hyphenated combination of these names). We will just refer to them as heteroskedasticity-robust standard errors, or even just robust standard errors when the context is clear.

Sometimes, as a degrees of freedom correction, (8.4) is multiplied by n/(n − k − 1) before taking the square root. The reasoning for this adjustment is that, if the squared OLS residuals ûᵢ² were the same for all observations i (the strongest possible form of homoskedasticity in a sample), we would get the usual OLS standard errors. Other modifications of (8.4) are studied in MacKinnon and White (1985). Since all forms have only asymptotic justification and they are asymptotically equivalent, no form is uniformly preferred above all others; typically, we use whatever form is computed by the regression package at hand.
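In practice, the robust standard errors come from a package option rather than hand computation. Here is a minimal sketch using statsmodels on synthetic data (all variable names and numbers are hypothetical): the cov_type="HC0" option implements (8.4) directly, and "HC1" applies the degrees of freedom correction just mentioned.

```python
import numpy as np
import statsmodels.api as sm

# Synthetic data with error variance increasing in x.
rng = np.random.default_rng(0)
x = rng.uniform(1, 10, 500)
y = 1.0 + 0.5 * x + rng.normal(scale=x)

X = sm.add_constant(x)
usual = sm.OLS(y, X).fit()                 # usual OLS standard errors
robust = sm.OLS(y, X).fit(cov_type="HC1")  # White SEs with the df correction

print(usual.bse)   # based on sigma^2 / SST_x: invalid here
print(robust.bse)  # heteroskedasticity-robust standard errors
```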
Once heteroskedasticity-robust standard errors are obtained, it is simple to construct a heteroskedasticity-robust t statistic. Recall that the general form of the t statistic is

t = (estimate − hypothesized value) / standard error.   (8.5)

Because we are still using the OLS estimates and we have chosen the hypothesized value ahead of time, the only difference between the usual OLS t statistic and the heteroskedasticity-robust t statistic is in how the standard error in the denominator is computed.

The term SSRⱼ in equation (8.4) can be replaced with SSTⱼ(1 − Rⱼ²), where SSTⱼ is the total sum of squares of xⱼ and Rⱼ² is the usual R-squared from regressing xⱼ on all other explanatory variables. (We implicitly used this equivalence in deriving equation (3.51).) Consequently, little sample variation in xⱼ, or a strong linear relationship between xⱼ and the other explanatory variables (that is, multicollinearity), can cause the heteroskedasticity-robust standard errors to be large. We discussed these issues with the usual OLS standard errors in Section 3.4.

Example 8.1  Log Wage Equation with Heteroskedasticity-Robust Standard Errors

We estimate the model in Example 7.6, but we report the heteroskedasticity-robust standard errors along with the usual OLS standard errors. Some of the estimates are reported to more digits so that we can compare the usual standard errors with the heteroskedasticity-robust standard errors:

log(wage) = .321 + .213 marrmale − .198 marrfem − .110 singfem      (8.6)
           (.100)  (.055)          (.058)          (.056)
           [.109]  [.057]          [.058]          [.057]
        + .0789 educ + .0268 exper − .00054 exper²
          (.0067)      (.0055)       (.00011)
          [.0074]      [.0051]       [.00011]
        + .0291 tenure − .00053 tenure²
          (.0068)        (.00023)
          [.0069]        [.00024]

n = 526, R² = .461.

The usual OLS standard errors are in parentheses, ( ), below the corresponding OLS estimate, and the heteroskedasticity-robust standard errors are in brackets, [ ]. The numbers in brackets are the only new things, since the equation is still estimated by OLS.
Several things are apparent from equation (8.6). First, in this particular application, any variable that was statistically significant using the usual t statistic is still statistically significant using the heteroskedasticity-robust t statistic. This occurs because the two sets of standard errors are not very different. (The associated p-values will differ slightly because the robust t statistics are not identical to the usual, nonrobust t statistics.) The largest relative change in standard errors is for the coefficient on educ: the usual standard error is .0067, and the robust standard error is .0074. Still, the robust standard error implies a robust t statistic above 10.

Equation (8.6) also shows that the robust standard errors can be either larger or smaller than the usual standard errors. For example, the robust standard error on exper is .0051, whereas the usual standard error is .0055. We do not know which will be larger ahead of time. As an empirical matter, the robust standard errors are often found to be larger than the usual standard errors.

Before leaving this example, we must emphasize that we do not know, at this point, whether heteroskedasticity is even present in the population model underlying equation (8.6). All we have done is report, along with the usual standard errors, those that are valid (asymptotically) whether or not heteroskedasticity is present. We can see that no important conclusions are overturned by using the robust standard errors in this example. This often happens in applied work, but in other cases, the differences between the usual and robust standard errors are much larger. As an example of where the differences are substantial, see Computer Exercise C2.

At this point, you may be asking the following question: if the heteroskedasticity-robust standard errors are valid more often than the usual OLS standard errors, why do we bother with the usual standard errors at all? This is a sensible question. One reason the usual standard errors are still used in cross-sectional work is that, if the homoskedasticity assumption holds and the errors are normally distributed, then the usual t statistics have exact t distributions, regardless of the sample size (see Chapter 4). The robust standard errors and robust t statistics are justified only as the sample size becomes large, even if the CLM assumptions are true. With small sample sizes, the robust t statistics can have distributions that are not very close to the t distribution, and that could throw off our inference.

In large sample sizes, we can make a case for always reporting only the heteroskedasticity-robust standard errors in cross-sectional applications, and this practice is being followed more and more in applied work. It is also common to report both standard errors, as in equation (8.6), so that a reader can determine whether any conclusions are sensitive to the standard error in use.

It is also possible to obtain F and LM statistics that are robust to heteroskedasticity of an unknown, arbitrary form.
The heteroskedasticity-robust F statistic (or a simple transformation of it) is also called a heteroskedasticity-robust Wald statistic. A general treatment of the Wald statistic requires matrix algebra and is sketched in Appendix E; see Wooldridge (2010, Chapter 4) for a more detailed treatment. Nevertheless, using heteroskedasticity-robust statistics for multiple exclusion restrictions is straightforward because many econometrics packages now compute such statistics routinely.

Example 8.2  Heteroskedasticity-Robust F Statistic

Using the data for the spring semester in GPA3, we estimate the following equation:

cumgpa = 1.47 + .00114 sat − .00857 hsperc + .00250 tothrs      (8.7)
        (.23)   (.00018)     (.00124)        (.00073)
        [.22]   [.00019]     [.00140]        [.00073]
      + .303 female − .128 black − .059 white
        (.059)        (.147)       (.141)
        [.059]        [.118]       [.110]

n = 366, R² = .4006, R̄² = .3905.

Again, the differences between the usual standard errors and the heteroskedasticity-robust standard errors are not very big, and use of the robust t statistics does not change the statistical significance of any independent variable. Joint significance tests are not much affected either. Suppose we wish to test the null hypothesis that, after the other factors are controlled for, there are no differences in cumgpa by race. This is stated as H₀: β_black = 0, β_white = 0. The usual F statistic is easily obtained, once we have the R-squared from the restricted model; this turns out to be .3983. The F statistic is then [(.4006 − .3983)/(1 − .4006)](359/2) ≈ .69. If heteroskedasticity is present, this version of the test is invalid. The heteroskedasticity-robust version has no simple form, but it can be computed using certain statistical packages. The value of the heteroskedasticity-robust F statistic turns out to be .75, which differs only slightly from the nonrobust version. The p-value for the robust test is .474, which is not close to standard significance levels. We fail to reject the null hypothesis using either test.

Because the usual sum of squared residuals form of the F statistic is not valid under heteroskedasticity, we must be careful in computing a Chow test of common coefficients across two groups. The form of the statistic in equation (7.24) is not valid if heteroskedasticity is present, including the simple case where the error variance differs across the two groups. Instead, we can obtain a heteroskedasticity-robust Chow test by including a dummy variable distinguishing the two groups, along with interactions between that dummy variable and all other explanatory variables. We can then test whether there is no difference in the two regression functions (by testing that the coefficients on the dummy variable and all interactions are zero), or just test whether the slopes are all the same, in which case we leave the coefficient on the dummy variable unrestricted. See Computer Exercise C14 for an example.
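As a sketch of how such a joint test is run in practice, consider the following statsmodels code on synthetic data (the regressors and the tested restriction are hypothetical, standing in for H₀: β_black = 0, β_white = 0; this is not the GPA3 regression itself). Fitting with a robust covariance makes f_test report a heteroskedasticity-robust Wald-type F statistic.

```python
import numpy as np
import statsmodels.api as sm

# Hypothetical data: y on four regressors plus a constant.
rng = np.random.default_rng(1)
n = 400
X = sm.add_constant(rng.normal(size=(n, 4)))
beta = np.array([1.0, 0.3, 0.0, 0.0, -0.2])
y = X @ beta + rng.normal(size=n) * (1 + X[:, 1] ** 2)  # heteroskedastic noise

res = sm.OLS(y, X).fit(cov_type="HC1")  # robust covariance matrix

# Jointly test that the 2nd and 3rd slope coefficients are zero.
R = np.zeros((2, X.shape[1]))
R[0, 2] = 1.0
R[1, 3] = 1.0
print(res.f_test(R))  # robust F statistic, df, and p-value
```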
8.2a Computing Heteroskedasticity-Robust LM Tests

Not all regression packages compute F statistics that are robust to heteroskedasticity. Therefore, it is sometimes convenient to have a way of obtaining a test of multiple exclusion restrictions that is robust to heteroskedasticity and does not require a particular kind of econometric software. It turns out that a heteroskedasticity-robust LM statistic is easily obtained using virtually any regression package.

To illustrate computation of the robust LM statistic, consider the model

y = β₀ + β₁x₁ + β₂x₂ + β₃x₃ + β₄x₄ + β₅x₅ + u,

and suppose we would like to test H₀: β₄ = 0, β₅ = 0. To obtain the usual LM statistic, we would first estimate the restricted model (that is, the model without x₄ and x₅) to obtain the residuals, ũ. Then, we would regress ũ on all of the independent variables, and LM = n·Rᵤ̃², where Rᵤ̃² is the usual R-squared from this regression.

Obtaining a version that is robust to heteroskedasticity requires more work. One way to compute the statistic requires only OLS regressions. We need the residuals, say r̃₁, from the regression of x₄ on x₁, x₂, x₃. Also, we need the residuals, say r̃₂, from the regression of x₅ on x₁, x₂, x₃. Thus, we regress each of the independent variables excluded under the null on all of the included independent variables. We keep the residuals each time. The final step appears odd, but it is, after all, just a computational device. Run the regression of

1 on r̃₁ũ, r̃₂ũ,   (8.8)

without an intercept. Yes, we actually define a dependent variable equal to the value one for all observations. We regress this onto the products r̃₁ũ and r̃₂ũ. The robust LM statistic turns out to be n − SSR₁, where SSR₁ is just the usual sum of squared residuals from regression (8.8).

Exploring Further 8.1: Evaluate the following statement: "The heteroskedasticity-robust standard errors are always bigger than the usual standard errors."

The reason this works is somewhat technical. Basically, this is doing for the LM test what the robust standard errors do for the t test. See Wooldridge (1991b) or Davidson and MacKinnon (1993) for a more detailed discussion.

We now summarize the computation of the heteroskedasticity-robust LM statistic in the general case.

A Heteroskedasticity-Robust LM Statistic:
1. Obtain the residuals ũ from the restricted model.
2. Regress each of the independent variables excluded under the null on all of the included independent variables; if there are q excluded variables, this leads to q sets of residuals (r̃₁, r̃₂, ..., r̃_q).
3. Find the products between each r̃ⱼ and ũ (for all observations).
4. Run the regression of 1 on r̃₁ũ, r̃₂ũ, ..., r̃_qũ, without an intercept. The heteroskedasticity-robust LM statistic is n − SSR₁, where SSR₁ is just the usual sum of squared residuals from this final regression. Under H₀, LM is distributed approximately as χ²_q.

Once the robust LM statistic is obtained, the rejection rule and computation of p-values are the same as for the usual LM statistic in Section 5.2.
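Because steps 1 through 4 are nothing more than a handful of auxiliary OLS regressions, the statistic is easy to compute with basic linear algebra. Below is a minimal numpy sketch of the recipe; the function name and array layout are my own, not from the text.

```python
import numpy as np

def robust_lm(y, X_incl, X_excl):
    """Heteroskedasticity-robust LM test of H0: coefficients on X_excl are zero.

    y: (n,) response; X_incl: (n, k1) included regressors (first column ones);
    X_excl: (n, q) regressors excluded under the null.
    """
    # Step 1: residuals u~ from the restricted model.
    b = np.linalg.lstsq(X_incl, y, rcond=None)[0]
    u = y - X_incl @ b
    # Step 2: residuals r~_j from regressing each excluded variable
    # on all of the included variables.
    G = np.linalg.lstsq(X_incl, X_excl, rcond=None)[0]
    R = X_excl - X_incl @ G
    # Step 3: products r~_j * u~, observation by observation.
    P = R * u[:, None]
    # Step 4: regress 1 on the products, with no intercept;
    # LM = n - SSR1, approximately chi-square(q) under H0.
    ones = np.ones_like(y)
    c = np.linalg.lstsq(P, ones, rcond=None)[0]
    ssr1 = np.sum((ones - P @ c) ** 2)
    return y.shape[0] - ssr1
```

The p-value then comes from the χ²_q distribution, exactly as for the usual LM statistic.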
Example 8.3  Heteroskedasticity-Robust LM Statistic

We use the data in CRIME1 to test whether the average sentence length served for past convictions affects the number of arrests in the current year (1986). The estimated model is

narr86 = .561 − .136 pcnv + .0178 avgsen − .00052 avgsen²      (8.9)
        (.036)  (.040)      (.0097)        (.00030)
        [.040]  [.034]      [.0101]        [.00021]
      − .0394 ptime86 − .0505 qemp86 − .00148 inc86
        (.0087)          (.0144)        (.00034)
        [.0062]          [.0142]        [.00023]
      + .325 black + .193 hispan
        (.045)       (.040)
        [.058]       [.040]

n = 2,725, R² = .0728.

In this example, there are more substantial differences between some of the usual standard errors and the robust standard errors. For example, the usual t statistic on avgsen² is about −1.73, while the robust t statistic is about −2.48. Thus, avgsen² is more significant using the robust standard error.

The effect of avgsen on narr86 is somewhat difficult to reconcile. Because the relationship is quadratic, we can figure out where avgsen has a positive effect on narr86 and where the effect becomes negative. The turning point is .0178/[2(.00052)] ≈ 17.12; recall that this is measured in months. Literally, this means that narr86 is positively related to avgsen when avgsen is less than 17 months; then avgsen has the expected deterrent effect after 17 months.

To see whether average sentence length has a statistically significant effect on narr86, we must test the joint hypothesis H₀: β_avgsen = 0, β_avgsen² = 0. Using the usual LM statistic (see Section 5.2), we obtain LM = 3.54; in a chi-square distribution with two df, this yields a p-value = .170. Thus, we do not reject H₀ at even the 15% level. The heteroskedasticity-robust LM statistic is LM = 4.00 (rounded to two decimal places), with a p-value = .135. This is still not very strong evidence against H₀; avgsen does not appear to have a strong effect on narr86. (Incidentally, when avgsen appears alone in (8.9), that is, without the quadratic term, its usual t statistic is .658 and its robust t statistic is .592.)

8.3 Testing for Heteroskedasticity

The heteroskedasticity-robust standard errors provide a simple method for computing t statistics that are asymptotically t distributed whether or not heteroskedasticity is present. We have also seen that heteroskedasticity-robust F and LM statistics are available. Implementing these tests does not require knowing whether or not heteroskedasticity is present. Nevertheless, there are still some good reasons for having simple tests that can detect its presence. First, as we mentioned in the previous section, the usual t statistics have exact t distributions under the classical linear model assumptions. For this reason, many economists still prefer to see the usual OLS standard errors and test statistics reported, unless there is evidence of heteroskedasticity. Second, if heteroskedasticity is present, the OLS estimator is no longer the best linear unbiased estimator. As we will see in Section 8.4, it is possible to obtain a better estimator than OLS when the form of heteroskedasticity is known.

Many tests for heteroskedasticity have been suggested over the years. Some of them, while having the ability to detect heteroskedasticity, do not directly test the assumption that the variance of the error does not depend upon the independent variables. We will restrict ourselves to more modern tests, which detect the kind of heteroskedasticity that invalidates the usual OLS statistics. This also has the benefit of putting all tests in the same framework.

As usual, we start with the linear model

y = β₀ + β₁x₁ + β₂x₂ + ... + βₖxₖ + u,   (8.10)

where Assumptions MLR.1 through MLR.4 are
maintained in this section. In particular, we assume that E(u|x₁, x₂, ..., xₖ) = 0, so that OLS is unbiased and consistent.

We take the null hypothesis to be that Assumption MLR.5 is true:

H₀: Var(u|x₁, x₂, ..., xₖ) = σ².   (8.11)

That is, we assume that the ideal assumption of homoskedasticity holds, and we require the data to tell us otherwise. If we cannot reject (8.11) at a sufficiently small significance level, we usually conclude that heteroskedasticity is not a problem. However, remember that we never accept H₀; we simply fail to reject it.

Because we are assuming that u has a zero conditional expectation, Var(u|x) = E(u²|x), and so the null hypothesis of homoskedasticity is equivalent to

H₀: E(u²|x₁, x₂, ..., xₖ) = E(u²) = σ².

This shows that, in order to test for violation of the homoskedasticity assumption, we want to test whether u² is related (in expected value) to one or more of the explanatory variables. If H₀ is false, the expected value of u², given the independent variables, can be virtually any function of the xⱼ. A simple approach is to assume a linear function:

u² = δ₀ + δ₁x₁ + δ₂x₂ + ... + δₖxₖ + v,   (8.12)

where v is an error term with mean zero given the xⱼ. Pay close attention to the dependent variable in this equation: it is the square of the error in the original regression equation, (8.10). The null hypothesis of homoskedasticity is

H₀: δ₁ = δ₂ = ... = δₖ = 0.   (8.13)

Under the null hypothesis, it is often reasonable to assume that the error in (8.12), v, is independent of x₁, x₂, ..., xₖ. Then, we know from Section 5.2 that either the F or LM statistics for the overall significance of the independent variables in explaining u² can be used to test (8.13). Both statistics would have asymptotic justification, even though u² cannot be normally distributed. (For example, if u is normally distributed, then u²/σ² is distributed as χ²₁.) If we could observe the u² in the sample, then we could easily compute this statistic by running the OLS regression of u² on x₁, x₂, ..., xₖ, using all n observations.

As we have emphasized before, we never know the actual errors in the population model, but we do have estimates of them: the OLS residual, ûᵢ, is an estimate of the error uᵢ for observation i. Thus, we can estimate the equation

û² = δ₀ + δ₁x₁ + δ₂x₂ + ... + δₖxₖ + error   (8.14)

and compute the F or LM statistics for the joint significance of x₁, ..., xₖ. It turns out that using the OLS residuals in place of the errors does not affect the large sample distribution of the F or LM statistics, although showing this is pretty complicated.

The F and LM statistics both depend on the R-squared from regression (8.14); call this R²û², to distinguish it from the R-squared in estimating equation (8.10). Then, the F statistic is

F = [R²û²/k] / [(1 − R²û²)/(n − k − 1)],   (8.15)

where k is the number of regressors in (8.14); this is the same number of independent variables in (8.10). Computing (8.15) by hand is rarely necessary, because most regression packages automatically compute the F statistic for overall significance of a regression. This F statistic has (approximately) an F_{k,n−k−1} distribution under the null hypothesis of homoskedasticity.
The LM statistic for heteroskedasticity is just the sample size times the R-squared from (8.14):

LM = n·R²û².   (8.16)

Under the null hypothesis, LM is distributed asymptotically as χ²ₖ. This is also very easy to obtain after running regression (8.14).

The LM version of the test is typically called the Breusch-Pagan test for heteroskedasticity (BP test). Breusch and Pagan (1979) suggested a different form of the test that assumes the errors are normally distributed. Koenker (1981) suggested the form of the LM statistic in (8.16), and it is generally preferred due to its greater applicability.

We summarize the steps for testing for heteroskedasticity using the BP test.

The Breusch-Pagan Test for Heteroskedasticity:
1. Estimate the model (8.10) by OLS, as usual. Obtain the squared OLS residuals, û², one for each observation.
2. Run the regression in (8.14). Keep the R-squared from this regression, R²û².
3. Form either the F statistic or the LM statistic and compute the p-value (using the F_{k,n−k−1} distribution in the former case and the χ²ₖ distribution in the latter case). If the p-value is sufficiently small, that is, below the chosen significance level, then we reject the null hypothesis of homoskedasticity.

If the BP test results in a small enough p-value, some corrective measure should be taken. One possibility is to just use the heteroskedasticity-robust standard errors and test statistics discussed in the previous section. Another possibility is discussed in Section 8.4.
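Under these assumptions, the three steps can be wrapped in a short helper. Here is a sketch using statsmodels (the helper name is mine; the library also ships a canned version, statsmodels.stats.diagnostic.het_breuschpagan, which returns the same LM and F forms):

```python
import statsmodels.api as sm
from scipy import stats

def breusch_pagan(res):
    """LM and F forms of the BP test from a fitted statsmodels OLS result.
    Assumes the model's design matrix includes a constant."""
    X = res.model.exog
    aux = sm.OLS(res.resid ** 2, X).fit()  # auxiliary regression (8.14)
    k = X.shape[1] - 1                     # number of slope coefficients
    lm = aux.nobs * aux.rsquared           # LM = n * R-squared, eq. (8.16)
    lm_pval = stats.chi2.sf(lm, k)
    # The overall F of the auxiliary regression is exactly (8.15).
    return lm, lm_pval, aux.fvalue, aux.f_pvalue
```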
Example 8.4  Heteroskedasticity in Housing Price Equations

We use the data in HPRICE1 to test for heteroskedasticity in a simple housing price equation. The estimated equation using the levels of all variables is

price = −21.77 + .00207 lotsize + .123 sqrft + 13.85 bdrms      (8.17)
        (29.48)  (.00064)         (.013)       (9.01)

n = 88, R² = .672.

This equation tells us nothing about whether the error in the population model is heteroskedastic. We need to regress the squared OLS residuals on the independent variables. The R-squared from the regression of û² on lotsize, sqrft, and bdrms is R²û² = .1601. With n = 88 and k = 3, this produces an F statistic for significance of the independent variables of F = [.1601/(1 − .1601)](84/3) ≈ 5.34. The associated p-value is .002, which is strong evidence against the null. The LM statistic is 88(.1601) ≈ 14.09; this gives a p-value ≈ .0028 (using the χ²₃ distribution), giving essentially the same conclusion as the F statistic. This means that the usual standard errors reported in (8.17) are not reliable.

In Chapter 6, we mentioned that one benefit of using the logarithmic functional form for the dependent variable is that heteroskedasticity is often reduced. In the current application, let us put price, lotsize, and sqrft in logarithmic form, so that the elasticities of price with respect to lotsize and sqrft are constant. The estimated equation is

log(price) = −1.30 + .168 log(lotsize) + .700 log(sqrft) + .037 bdrms      (8.18)
             (.65)   (.038)              (.093)            (.028)

n = 88, R² = .643.

Regressing the squared OLS residuals from this regression on log(lotsize), log(sqrft), and bdrms gives R²û² = .0480. Thus, F = 1.41 (p-value = .245) and LM = 4.22 (p-value = .239). Therefore, we fail to reject the null hypothesis of homoskedasticity in the model with the logarithmic functional forms. The occurrence of less heteroskedasticity with the dependent variable in logarithmic form has been noticed in many empirical applications.

If we suspect that heteroskedasticity depends only upon certain independent variables, we can easily modify the Breusch-Pagan test: we simply regress û² on whatever independent variables we choose and carry out the appropriate F or LM test. Remember that the appropriate degrees of freedom depends upon the number of independent variables in the regression with û² as the dependent variable; the number of independent variables showing up in equation (8.10) is irrelevant. If the squared residuals are regressed on only a single independent variable, the test for heteroskedasticity is just the usual t statistic on the variable. A significant t statistic suggests that heteroskedasticity is a problem.

Exploring Further 8.2: Consider wage equation (7.11), where you think that the conditional variance of log(wage) does not depend on educ, exper, or tenure. However, you are worried that the variance of log(wage) differs across the four demographic groups of married males, married females, single males, and single females. What regression would you run to test for heteroskedasticity? What are the degrees of freedom in the F test?

8.3a The White Test for Heteroskedasticity

In Chapter 5, we showed that the usual OLS standard errors and test statistics are asymptotically valid, provided all of the Gauss-Markov assumptions hold. It turns out that the homoskedasticity assumption, Var(u|x₁, ..., xₖ) = σ², can be replaced with the weaker assumption that the squared error, u², is uncorrelated with all the independent variables (xⱼ), the squares of the independent variables (xⱼ²), and all the cross products (xⱼxₕ for j ≠ h). This observation motivated White (1980) to propose a test for heteroskedasticity that adds the squares and cross products of all the independent variables to equation (8.14). The test is explicitly intended to test for forms of heteroskedasticity that invalidate the usual OLS standard errors and test statistics.

When the model contains k = 3 independent variables, the White test is based on an estimation of

û² = δ₀ + δ₁x₁ + δ₂x₂ + δ₃x₃ + δ₄x₁² + δ₅x₂² + δ₆x₃²
     + δ₇x₁x₂ + δ₈x₁x₃ + δ₉x₂x₃ + error.   (8.19)

Compared with the Breusch-Pagan test, this equation has six more regressors. The White test for heteroskedasticity is the LM statistic for testing that all of the δⱼ in equation (8.19) are zero, except for the intercept. Thus, nine restrictions are being tested in this case. We can also use an F test of this hypothesis; both tests have asymptotic justification.

With only three independent variables in the original model, equation (8.19) has nine independent variables. With six independent variables in the original model, the White regression would generally involve 27 regressors (unless some are redundant). This abundance of regressors is a weakness in the pure form of the White test: it uses many degrees of freedom for models with just a moderate number of independent variables.
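The pure form of the test is available canned in statsmodels. A sketch on synthetic data (variable names and numbers are hypothetical, for illustration only):

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_white

rng = np.random.default_rng(0)
n = 300
X = sm.add_constant(rng.normal(size=(n, 3)))  # constant plus x1, x2, x3
beta = np.array([1.0, 0.5, -0.3, 0.2])
y = X @ beta + rng.normal(size=n) * (1 + 0.5 * np.abs(X[:, 1]))

res = sm.OLS(y, X).fit()
# het_white augments the auxiliary regression with squares and cross
# products, as in equation (8.19), and reports LM and F versions.
lm, lm_pval, f, f_pval = het_white(res.resid, res.model.exog)
print(lm, lm_pval)
```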
It is possible to obtain a test that is easier to implement than the White test and more conserving on degrees of freedom. To create the test, recall that the difference between the White and Breusch-Pagan tests is that the former includes the squares and cross products of the independent variables. We can preserve the spirit of the White test while conserving on degrees of freedom by using the OLS fitted values in a test for heteroskedasticity. Remember that the fitted values are defined, for each observation i, by

ŷᵢ = β̂₀ + β̂₁xᵢ₁ + β̂₂xᵢ₂ + ... + β̂ₖxᵢₖ.

These are just linear functions of the independent variables. If we square the fitted values, we get a particular function of all the squares and cross products of the independent variables. This suggests testing for heteroskedasticity by estimating the equation

û² = δ₀ + δ₁ŷ + δ₂ŷ² + error,   (8.20)

where ŷ stands for the fitted values. It is important not to confuse ŷ and y in this equation. We use the fitted values because they are functions of the independent variables (and the estimated parameters); using y in (8.20) does not produce a valid test for heteroskedasticity.

We can use the F or LM statistic for the null hypothesis H₀: δ₁ = 0, δ₂ = 0 in equation (8.20). This results in two restrictions in testing the null of homoskedasticity, regardless of the number of independent variables in the original model. Conserving on degrees of freedom in this way is often a good idea, and it also makes the test easy to implement.

Since ŷ is an estimate of the expected value of y, given the xⱼ, using (8.20) to test for heteroskedasticity is useful in cases where the variance is thought to change with the level of the expected value, E(y|x). The test from (8.20) can be viewed as a special case of the White test, since equation (8.20) can be shown to impose restrictions on the parameters in equation (8.19).

A Special Case of the White Test for Heteroskedasticity:
1. Estimate the model (8.10) by OLS, as usual. Obtain the OLS residuals û and the fitted values ŷ. Compute the squared OLS residuals û² and the squared fitted values ŷ².
2. Run the regression in equation (8.20). Keep the R-squared from this regression, R²û².
3. Form either the F or LM statistic and compute the p-value (using the F_{2,n−3} distribution in the former case and the χ²₂ distribution in the latter case).
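A sketch of this special case as a helper on a fitted OLS result (my own function name, mirroring equation (8.20)):

```python
import numpy as np
import statsmodels.api as sm
from scipy import stats

def special_white(res):
    """LM form of the special-case White test: regress squared residuals on
    the fitted values and their squares; LM = n * R-squared ~ chi-square(2)."""
    yhat = res.fittedvalues
    Z = sm.add_constant(np.column_stack([yhat, yhat ** 2]))  # regressors of (8.20)
    aux = sm.OLS(res.resid ** 2, Z).fit()
    lm = aux.nobs * aux.rsquared
    return lm, stats.chi2.sf(lm, 2)
```

Applied to the regression behind (8.18), this routine should reproduce the LM ≈ 3.45 computation in Example 8.5 below.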
Example 8.5  Special Form of the White Test in the Log Housing Price Equation

We apply the special case of the White test to equation (8.18), where we use the LM form of the statistic. The important thing to remember is that the chi-square distribution always has two df. The regression of û² on lpricê, (lpricê)², where lpricê denotes the fitted values from (8.18), produces R²û² = .0392; thus, LM = 88(.0392) ≈ 3.45, and the p-value = .178. This is stronger evidence of heteroskedasticity than is provided by the Breusch-Pagan test, but we still fail to reject homoskedasticity at even the 15% level.

Before leaving this section, we should discuss one important caveat. We have interpreted a rejection using one of the heteroskedasticity tests as evidence of heteroskedasticity. This is appropriate provided we maintain Assumptions MLR.1 through MLR.4. But, if MLR.4 is violated (in particular, if the functional form of E(y|x) is misspecified), then a test for heteroskedasticity can reject H₀, even if Var(y|x) is constant. For example, if we omit one or more quadratic terms in a regression model, or use the level model when we should use the log, a test for heteroskedasticity can be significant. This has led some economists to view tests for heteroskedasticity as general misspecification tests. However, there are better, more direct tests for functional form misspecification, and we will cover some of them in Section 9.1. It is better to use explicit tests for functional form first, since functional form misspecification is more important than heteroskedasticity. Then, once we are satisfied with the functional form, we can test for heteroskedasticity.

8.4 Weighted Least Squares Estimation

If heteroskedasticity is detected using one of the tests in Section 8.3, we know from Section 8.2 that one possible response is to use heteroskedasticity-robust statistics after estimation by OLS. Before the development of heteroskedasticity-robust statistics, the response to a finding of heteroskedasticity was to specify its form and use a weighted least squares method, which we develop in this section. As we will argue, if we have correctly specified the form of the variance (as a function of explanatory variables), then weighted least squares (WLS) is more efficient than OLS, and WLS leads to new t and F statistics that have t and F distributions. We will also discuss the implications of using the wrong form of the variance in the WLS procedure.

8.4a The Heteroskedasticity Is Known up to a Multiplicative Constant

Let x denote all the explanatory variables in equation (8.10) and assume that

Var(u|x) = σ²h(x),   (8.21)

where h(x) is some function of the explanatory variables that determines the heteroskedasticity. Since variances must be positive, h(x) > 0 for all possible values of the independent variables. For now, we assume that the function h(x) is known. The population parameter σ² is unknown, but we will be able to estimate it from a data sample.

For a random drawing from the population, we can write σᵢ² = Var(uᵢ|xᵢ) = σ²h(xᵢ) = σ²hᵢ, where we again use the notation xᵢ to denote all independent variables for observation i, and hᵢ changes with each observation because the independent variables change across observations. For example, consider the simple savings function

savᵢ = β₀ + β₁incᵢ + uᵢ   (8.22)
Var(uᵢ|incᵢ) = σ²incᵢ.   (8.23)
Here, h(x) = h(inc) = inc: the variance of the error is proportional to the level of income. This means that, as income increases, the variability in savings increases. (If β₁ > 0, the expected value of savings also increases with income.) Because inc is always positive, the variance in equation (8.23) is always guaranteed to be positive. The standard deviation of uᵢ, conditional on incᵢ, is σ√incᵢ.

How can we use the information in equation (8.21) to estimate the βⱼ? Essentially, we take the original equation,

yᵢ = β₀ + β₁xᵢ₁ + β₂xᵢ₂ + ... + βₖxᵢₖ + uᵢ,   (8.24)

which contains heteroskedastic errors, and transform it into an equation that has homoskedastic errors (and satisfies the other Gauss-Markov assumptions). Since hᵢ is just a function of xᵢ, uᵢ/√hᵢ has a zero expected value conditional on xᵢ. Further, since Var(uᵢ|xᵢ) = E(uᵢ²|xᵢ) = σ²hᵢ, the variance of uᵢ/√hᵢ (conditional on xᵢ) is σ²:

E[(uᵢ/√hᵢ)²] = E(uᵢ²)/hᵢ = (σ²hᵢ)/hᵢ = σ²,

where we have suppressed the conditioning on xᵢ for simplicity. We can divide equation (8.24) by √hᵢ to get

yᵢ/√hᵢ = β₀/√hᵢ + β₁(xᵢ₁/√hᵢ) + β₂(xᵢ₂/√hᵢ) + ... + βₖ(xᵢₖ/√hᵢ) + uᵢ/√hᵢ   (8.25)

or

yᵢ* = β₀xᵢ₀* + β₁xᵢ₁* + ... + βₖxᵢₖ* + uᵢ*,   (8.26)

where xᵢ₀* = 1/√hᵢ and the other starred variables denote the corresponding original variables divided by √hᵢ.

Equation (8.26) looks a little peculiar, but the important thing to remember is that we derived it so we could obtain estimators of the βⱼ that have better efficiency properties than OLS. The intercept β₀ in the original equation (8.24) is now multiplying the variable xᵢ₀* = 1/√hᵢ. Each slope parameter βⱼ multiplies a new variable that rarely has a useful interpretation. This should not cause problems if we recall that, for interpreting the parameters and the model, we always want to return to the original equation (8.24).

In the preceding savings example, the transformed equation looks like

savᵢ/√incᵢ = β₀(1/√incᵢ) + β₁√incᵢ + uᵢ*,

where we use the fact that incᵢ/√incᵢ = √incᵢ. Nevertheless, β₁ is the marginal propensity to save out of income, an interpretation we obtain from equation (8.22).

Equation (8.26) is linear in its parameters (so it satisfies MLR.1), and the random sampling assumption has not changed. Further, uᵢ* has a zero mean and a constant variance (σ²), conditional on xᵢ*. This means that if the original equation satisfies the first four Gauss-Markov assumptions, then the transformed equation (8.26) satisfies all five Gauss-Markov assumptions. Also, if uᵢ has a normal distribution, then uᵢ* has a normal distribution with variance σ². Therefore, the transformed equation satisfies the classical linear model assumptions (MLR.1 through MLR.6) if the original model does so, except for the homoskedasticity assumption.

Since we know that OLS has appealing properties (is BLUE, for example) under the Gauss-Markov assumptions, the discussion in the previous paragraph suggests estimating the parameters in equation (8.26) by ordinary least squares. These estimators, β₀*, β₁*, ..., βₖ*, will be different from the OLS estimators in the original equation. The βⱼ* are examples of generalized least squares (GLS) estimators. In this case, the GLS estimators are used to account for heteroskedasticity in the errors. We will encounter other GLS estimators in Chapter 12.

Because equation (8.26) satisfies all of the ideal assumptions, standard errors, t statistics, and F statistics can all be obtained from regressions using the transformed variables. The sum of squared residuals from (8.26) divided by the degrees of freedom is an unbiased estimator of σ².
The GLS estimators for correcting heteroskedasticity are called weighted least squares (WLS) estimators. This name comes from the fact that the β*_j minimize the weighted sum of squared residuals, where each squared residual is weighted by 1/h_i. The idea is that less weight is given to observations with a higher error variance; OLS gives each observation the same weight because it is best when the error variance is identical for all partitions of the population. Mathematically, the WLS estimators are the values of the b_j that make

Σ_{i=1}^n (y_i − b_0 − b_1 x_i1 − b_2 x_i2 − … − b_k x_ik)²/h_i      (8.27)

as small as possible. Bringing the square root of 1/h_i inside the squared residual shows that the weighted sum of squared residuals is identical to the sum of squared residuals in the transformed variables:

Σ_{i=1}^n (y*_i − b_0 x*_i0 − b_1 x*_i1 − b_2 x*_i2 − … − b_k x*_ik)².

Since OLS minimizes the sum of squared residuals (regardless of the definitions of the dependent variable and the independent variables), it follows that the WLS estimators that minimize (8.27) are simply the OLS estimators from (8.26). Note carefully that the squared residuals in (8.27) are weighted by 1/h_i, whereas the transformed variables in (8.26) are weighted by 1/√h_i.

A weighted least squares estimator can be defined for any set of positive weights. OLS is the special case that gives equal weight to all observations. The efficient procedure, GLS, weights each squared residual by the inverse of the conditional variance of u_i given x_i.

Obtaining the transformed variables in equation (8.25) in order to manually perform weighted least squares can be tedious, and the chance of making mistakes is nontrivial. Fortunately, most modern regression packages have a feature for computing weighted least squares. Typically, along with the dependent and independent variables in the original model, we just specify the weighting function, 1/h_i, appearing in (8.27). That is, we specify weights proportional to the inverse of the variance. In addition to making mistakes less likely, this forces us to interpret weighted least squares estimates in the original model. In fact, we can write out the estimated equation in the usual way. The estimates and standard errors will be different from OLS, but the way we interpret those estimates, standard errors, and test statistics is the same.
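Joint hypothesis tests after WLS also go through the standard machinery, provided the same weights are used for the restricted and unrestricted models (a point taken up again after Example 8.7 below). Here is a brief sketch in Python with statsmodels; the variance model Var(u|x) ∝ x1 and all data are assumptions made purely for illustration.

```python
import numpy as np
import statsmodels.api as sm

# Hypothetical model with Var(u|x) proportional to x1; WLS expects weights = 1/h_i.
rng = np.random.default_rng(1)
n = 400
x1 = rng.uniform(1, 10, n)
x2 = rng.normal(size=n)
x3 = rng.normal(size=n)
y = 1 + 0.5 * x1 + 0.3 * x2 + rng.normal(0, np.sqrt(x1), n)

X = sm.add_constant(np.column_stack((x1, x2, x3)))
res = sm.WLS(y, X, weights=1.0 / x1).fit()

# Joint test of H0: the coefficients on x2 and x3 are both zero. statsmodels'
# default column labels ('x1', 'x2', 'x3') happen to match our variables here.
# The package reuses the same weights for the implied restricted fit, so the
# F statistic has its usual justification (given the variance model).
print(res.f_test("x2 = 0, x3 = 0"))
```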
Econometrics packages that have a built-in WLS option will report an R-squared (and adjusted R-squared) along with WLS estimates and standard errors. Typically, the WLS R-squared is obtained from the weighted SSR, obtained from minimizing equation (8.27), and a weighted total sum of squares (SST), obtained by using the same weights but setting all of the slope coefficients in equation (8.27), b_1, b_2, …, b_k, to zero. As a goodness-of-fit measure, this R-squared is not especially useful, as it effectively measures explained variation in y*_i rather than y_i. Nevertheless, the WLS R-squareds computed as just described are appropriate for computing F statistics for exclusion restrictions, provided we have properly specified the variance function. As in the case of OLS, the SST terms cancel, and so we obtain the F statistic based on the weighted SSR.

The R-squared from running the OLS regression in equation (8.26) is even less useful as a goodness-of-fit measure, as the computation of SST would make little sense: one would necessarily exclude an intercept from the regression, in which case regression packages typically compute the SST without properly centering the y*_i. This is another reason for using a WLS option that is preprogrammed in a regression package, because then at least the reported R-squared properly compares the model with all of the independent variables to a model with only an intercept. Because the SST cancels out when testing exclusion restrictions, improperly computing SST does not affect the R-squared form of the F statistic. Nevertheless, computing such an R-squared tempts one to think the equation fits better than it does.

Example 8.6 Financial Wealth Equation

We now estimate equations that explain net total financial wealth (nettfa, measured in $1,000s) in terms of income (inc, also measured in $1,000s) and some other variables, including age, gender, and an indicator for whether the person is eligible for a 401(k) pension plan. We use the data on single people (fsize = 1) in 401KSUBS. In Computer Exercise C12 in Chapter 6, it was found that a specific quadratic function in age, namely (age − 25)², fit the data just as well as an unrestricted quadratic. Plus, the restricted form gives a simplified interpretation because the minimum age in the sample is 25: nettfa is an increasing function of age after age = 25.

The results are reported in Table 8.1. Because we suspect heteroskedasticity, we report the heteroskedasticity-robust standard errors for OLS. The weighted least squares estimates, and their standard errors, are obtained under the assumption Var(u|inc) = σ² inc.

Without controlling for other factors, another dollar of income is estimated to increase nettfa by about 82¢ when OLS is used; the WLS estimate is smaller, about 79¢. The difference is not large; we certainly do not expect them to be identical. The WLS coefficient does have a smaller standard error than OLS, almost 40% smaller, provided we assume the model Var(nettfa|inc) = σ² inc is correct.

Adding the other controls reduces the inc coefficient somewhat, with the OLS estimate still larger than the WLS estimate. Again, the WLS estimate of β_inc is more precise. Age has an increasing effect starting at age = 25, with the OLS estimate showing a larger effect. The WLS estimate of β_age is more precise in this case. Gender does not have a statistically significant effect on nettfa, but being eligible for a 401(k) plan does: the OLS estimate is that those eligible, holding fixed income, age, and gender, have net total financial assets about $6,890 higher. The WLS estimate is substantially below the OLS estimate and suggests a misspecification of the functional form in the mean equation. (One possibility is to interact e401k and inc; see Computer Exercise C11.)

Using WLS, the F statistic for joint significance of (age − 25)², male, and e401k is about 30.8 if we use the R-squareds reported in Table 8.1. With 3 and 2,012 degrees of freedom, the p-value is zero to more than 15 decimal places; of course, this is not surprising given the very large t statistics for the age and 401(k) variables.

Table 8.1  Dependent Variable: nettfa

Independent Variables   (1) OLS         (2) WLS        (3) OLS         (4) WLS
inc                     .821 (.104)     .787 (.063)    .771 (.100)     .740 (.064)
(age − 25)²             –               –              .0251 (.0043)   .0175 (.0019)
male                    –               –              2.48 (2.06)     1.84 (1.56)
e401k                   –               –              6.89 (2.29)     5.19 (1.70)
intercept               −10.57 (2.53)   −9.58 (1.65)   −20.98 (3.50)   −16.70 (1.96)
Observations            2,017           2,017          2,017           2,017
R-squared               .0827           .0709          .1279           .1115

Exploring Further 8.3
Using the OLS residuals obtained from the OLS regression reported in column (1) of Table 8.1, the regression of û² on inc yields a t statistic of 2.96. Does it appear we should worry about heteroskedasticity in the financial wealth equation?

Assuming that the error variance in the financial wealth equation is proportional to income is essentially arbitrary. In fact, in most cases, our choice of weights in WLS has a degree of arbitrariness. However, there is one case where the weights needed for WLS arise naturally from an underlying econometric model. This happens when, instead of using individual-level data, we only have averages of data across some group or geographic region. For example, suppose we are interested in the relationship between the amount a worker contributes to his or her 401(k) pension plan and the generosity of the plan. Let i denote a particular firm and let e denote an employee within the firm. A simple model is

contrib_ie = β_0 + β_1 earns_ie + β_2 age_ie + β_3 mrate_i + u_ie,      (8.28)

where contrib_ie is the annual contribution by employee e who works for firm i, earns_ie is annual earnings for this person, and age_ie is the person's age. The variable mrate_i is the amount the firm puts into an employee's account for every dollar the employee contributes.

If (8.28) satisfies the Gauss-Markov assumptions, then we could estimate it, given a sample on individuals across various employers. Suppose, however, that we only have average values of contributions, earnings, and age by employer. In other words, individual-level data are not available. Thus, let contrib_i denote the average contribution for people at firm i, and similarly for earns_i and age_i. Let m_i denote the number of employees at firm i; we assume that this is a known quantity. Then, if we average equation (8.28) across all employees at firm i, we obtain the firm-level equation

contrib_i = β_0 + β_1 earns_i + β_2 age_i + β_3 mrate_i + ū_i,      (8.29)

where ū_i = m_i⁻¹ Σ_{e=1}^{m_i} u_ie is the average error across all employees in firm i. If we have n firms in our sample, then (8.29) is just a standard multiple linear regression model that can be estimated by OLS. The estimators are unbiased if the original model (8.28) satisfies the Gauss-Markov assumptions and the individual errors, u_ie, are independent of the firm's size, m_i [because then the expected value of ū_i, given the explanatory variables in (8.29), is zero].

If the individual-level equation (8.28) satisfies the homoskedasticity assumption, and the errors within firm i are uncorrelated across employees, then we can show that the firm-level equation (8.29) has a particular kind of heteroskedasticity. Specifically, if Var(u_ie) = σ² for all i and e, and Cov(u_ie, u_ig) = 0 for every pair of employees e ≠ g within firm i, then Var(ū_i) = σ²/m_i; this is just the usual formula for the variance of an average of uncorrelated random variables with common variance.
In other words, the variance of the error term ū_i decreases with firm size. In this case, h_i = 1/m_i, and so the most efficient procedure is weighted least squares, with weights equal to the number of employees at the firm (1/h_i = m_i). This ensures that larger firms receive more weight. This gives us an efficient way of estimating the parameters in the individual-level model when we only have averages at the firm level.

A similar weighting arises when we are using per capita data at the city, county, state, or country level. If the individual-level equation satisfies the Gauss-Markov assumptions, then the error in the per capita equation has a variance proportional to one over the size of the population. Therefore, weighted least squares with weights equal to the population is appropriate. For example, suppose we have city-level data on per capita beer consumption (in ounces), the percentage of people in the population over 21 years old, average adult education levels, average income levels, and the city price of beer. Then the city-level model

beerpc = β_0 + β_1 perc21 + β_2 avgeduc + β_3 incpc + β_4 price + u

can be estimated by weighted least squares, with the weights being the city population.

The advantage of weighting by firm size, city population, and so on relies on the underlying individual equation being homoskedastic. If heteroskedasticity exists at the individual level, then the proper weighting depends on the form of heteroskedasticity. Further, if there is correlation across errors within a group (say, a firm), then Var(ū_i) ≠ σ²/m_i; see Problem 7. Uncertainty about the form of Var(ū_i) in equations such as (8.29) is why more and more researchers simply use OLS and compute robust standard errors and test statistics when estimating models using per capita data. An alternative is to weight by group size but to report the heteroskedasticity-robust statistics in the WLS estimation. This ensures that, while the estimation is efficient if the individual-level model satisfies the Gauss-Markov assumptions, heteroskedasticity at the individual level or within-group correlation is accounted for through robust inference.
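To make the group-average case concrete, here is a minimal sketch in Python (numpy and statsmodels assumed available) that weights firm-level averages by firm size and, as the paragraph above suggests, still reports robust standard errors. All names and simulated values are illustrative assumptions.

```python
import numpy as np
import statsmodels.api as sm

# Firm-level averages: if the employee-level equation is homoskedastic with
# uncorrelated errors within firms, Var(ubar_i) = sigma^2 / m_i, so
# h_i = 1/m_i and the WLS weights are 1/h_i = m_i (firm size).
rng = np.random.default_rng(2)
n_firms = 200
m = rng.integers(5, 500, n_firms)            # employees per firm
mrate = rng.uniform(0, 1, n_firms)
earns_bar = rng.normal(30, 5, n_firms)
ubar = rng.normal(0, 3.0 / np.sqrt(m))       # sd = sigma / sqrt(m_i)
contrib_bar = 1 + 0.05 * earns_bar + 2 * mrate + ubar

X = sm.add_constant(np.column_stack((earns_bar, mrate)))
# Weight by firm size, but report robust standard errors in case the
# individual-level model is heteroskedastic or errors are correlated within firms.
res = sm.WLS(contrib_bar, X, weights=m).fit(cov_type="HC1")
print(res.summary())
```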
8.4b The Heteroskedasticity Function Must Be Estimated: Feasible GLS

In the previous subsection, we saw some examples of where the heteroskedasticity is known up to a multiplicative form. In most cases, the exact form of heteroskedasticity is not obvious. In other words, it is difficult to find the function h(x_i) of the previous section. Nevertheless, in many cases we can model the function h and use the data to estimate the unknown parameters in this model. This results in an estimate of each h_i, denoted ĥ_i. Using ĥ_i instead of h_i in the GLS transformation yields an estimator called the feasible GLS (FGLS) estimator. Feasible GLS is sometimes called estimated GLS, or EGLS.

There are many ways to model heteroskedasticity, but we will study one particular, fairly flexible approach. Assume that

Var(u|x) = σ² exp(δ_0 + δ_1 x_1 + δ_2 x_2 + … + δ_k x_k),      (8.30)

where x_1, x_2, …, x_k are the independent variables appearing in the regression model [see equation (8.1)], and the δ_j are unknown parameters. Other functions of the x_j can appear, but we will focus primarily on (8.30). In the notation of the previous subsection, h(x) = exp(δ_0 + δ_1 x_1 + δ_2 x_2 + … + δ_k x_k).

You may wonder why we have used the exponential function in (8.30). After all, when testing for heteroskedasticity using the Breusch-Pagan test, we assumed that heteroskedasticity was a linear function of the x_j. Linear alternatives such as (8.12) are fine when testing for heteroskedasticity, but they can be problematic when correcting for heteroskedasticity using weighted least squares. We have encountered the reason for this problem before: linear models do not ensure that predicted values are positive, and our estimated variances must be positive in order to perform WLS.

If the parameters δ_j were known, then we would just apply WLS, as in the previous subsection. This is not very realistic. It is better to use the data to estimate these parameters, and then to use these estimates to construct weights. How can we estimate the δ_j? Essentially, we will transform this equation into a linear form that, with slight modification, can be estimated by OLS. Under assumption (8.30), we can write

u² = σ² exp(δ_0 + δ_1 x_1 + δ_2 x_2 + … + δ_k x_k)·v,

where v has a mean equal to unity, conditional on x = (x_1, x_2, …, x_k). If we assume that v is actually independent of x, we can write

log(u²) = α_0 + δ_1 x_1 + δ_2 x_2 + … + δ_k x_k + e,      (8.31)

where e has a zero mean and is independent of x; the intercept in this equation is different from δ_0, but this is not important in implementing WLS. The dependent variable is the log of the squared error. Since (8.31) satisfies the Gauss-Markov assumptions, we can get unbiased estimators of the δ_j by using OLS.

As usual, we must replace the unobserved u with the OLS residuals. Therefore, we run the regression of

log(û²) on x_1, x_2, …, x_k.      (8.32)

Actually, what we need from this regression are the fitted values; call these ĝ_i. Then, the estimates of h_i are simply

ĥ_i = exp(ĝ_i).      (8.33)

We now use WLS with weights 1/ĥ_i in place of 1/h_i in equation (8.27). We summarize the steps.

A Feasible GLS Procedure to Correct for Heteroskedasticity:
1. Run the regression of y on x_1, x_2, …, x_k and obtain the residuals, û.
2. Create log(û²) by first squaring the OLS residuals and then taking the natural log.
3. Run the regression in equation (8.32) and obtain the fitted values, ĝ.
4. Exponentiate the fitted values from (8.32): ĥ = exp(ĝ).
5. Estimate the equation y = β_0 + β_1 x_1 + … + β_k x_k + u by WLS, using weights 1/ĥ. In other words, we replace h_i with ĥ_i in equation (8.27). Remember, the squared residual for observation i gets weighted by 1/ĥ_i. If instead we first transform all variables and run OLS, each variable gets multiplied by 1/√ĥ_i, including the intercept.
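The five steps translate directly into code. Below is a minimal sketch in Python with numpy and statsmodels, run on simulated data; the function name fgls and all simulated values are our own illustrative choices, not part of the text.

```python
import numpy as np
import statsmodels.api as sm

def fgls(y, X):
    """Feasible GLS assuming Var(u|x) = sigma^2 * exp(x'delta), following
    the five-step procedure in the text. X must already include a constant."""
    uhat = sm.OLS(y, X).fit().resid             # step 1: OLS residuals
    logu2 = np.log(uhat ** 2)                   # step 2: log of squared residuals
    ghat = sm.OLS(logu2, X).fit().fittedvalues  # step 3: fitted values from (8.32)
    hhat = np.exp(ghat)                         # step 4: hhat > 0 by construction
    return sm.WLS(y, X, weights=1.0 / hhat).fit()  # step 5: weights 1/hhat

# Illustrative usage on simulated data with exponential heteroskedasticity.
rng = np.random.default_rng(3)
n = 1000
x = rng.uniform(0, 2, n)
y = 1 + 2 * x + rng.normal(0, np.exp(0.5 * (0.5 + 0.8 * x)), n)
res = fgls(y, sm.add_constant(x))
print(res.params)
```

Exponentiating in step 4 is what guarantees strictly positive estimated variances, which is exactly why the exponential model (8.30) is preferred over a linear variance model for correction (as opposed to testing).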
If we could use h_i rather than ĥ_i in the WLS procedure, we know that our estimators would be unbiased; in fact, they would be the best linear unbiased estimators, assuming that we have properly modeled the heteroskedasticity. Having to estimate h_i using the same data means that the FGLS estimator is no longer unbiased (so it cannot be BLUE, either). Nevertheless, the FGLS estimator is consistent and asymptotically more efficient than OLS. This is difficult to show because of the estimation of the variance parameters, but if we ignore this, as it turns out we may, the proof is similar to showing that OLS is efficient in the class of estimators in Theorem 5.3. At any rate, for large sample sizes, FGLS is an attractive alternative to OLS when there is evidence of heteroskedasticity that inflates the standard errors of the OLS estimates.

We must remember that the FGLS estimators are estimators of the parameters in the usual population model y = β_0 + β_1 x_1 + … + β_k x_k + u. Just as the OLS estimates measure the marginal impact of each x_j on y, so do the FGLS estimates. We use the FGLS estimates in place of the OLS estimates because the FGLS estimators are more efficient and have associated test statistics with the usual t and F distributions, at least in large samples. If we have some doubt about the variance specified in equation (8.30), we can use heteroskedasticity-robust standard errors and test statistics in the transformed equation.

Another useful alternative for estimating h_i is to replace the independent variables in regression (8.32) with the OLS fitted values and their squares. In other words, obtain the ĝ_i as the fitted values from the regression of

log(û²) on ŷ, ŷ²      (8.34)

and then obtain the ĥ_i exactly as in equation (8.33). This changes only step 3 in the previous procedure.

If we use regression (8.32) to estimate the variance function, you may wonder whether we can simply test for heteroskedasticity using this same regression (an F or LM test can be used). In fact, Park (1966) suggested this. Unfortunately, when compared with the tests discussed in Section 8.3, the Park test has some problems. First, the null hypothesis must be something stronger than homoskedasticity: effectively, u and x must be independent. This is not required in the Breusch-Pagan or White tests. Second, using the OLS residuals û in place of u in (8.32) can cause the F statistic to deviate from the F distribution, even in large sample sizes. This is not an issue in the other tests we have covered. For these reasons, the Park test is not recommended when testing for heteroskedasticity. Regression (8.32) works well for weighted least squares because we only need consistent estimators of the δ_j, and regression (8.32) certainly delivers those.

Example 8.7 Demand for Cigarettes

We use the data in SMOKE to estimate a demand function for daily cigarette consumption. Since most people do not smoke, the dependent variable, cigs, is zero for most observations. A linear model is not ideal because it can result in negative predicted values. Nevertheless, we can still learn something about the determinants of cigarette smoking by using a linear model.

The equation estimated by ordinary least squares, with the usual OLS standard errors in parentheses, is

cigs = −3.64 + .880 log(income) − .751 log(cigpric)
      (24.08)  (.728)             (5.773)
     − .501 educ + .771 age − .0090 age² − 2.83 restaurn      (8.35)
      (.167)       (.160)     (.0017)      (1.11)
n = 807, R² = .0526,

where cigs = number of cigarettes smoked per day, income = annual income, cigpric = the per-pack price of cigarettes
(in cents), educ = years of schooling, age = age measured in years, and restaurn = a binary indicator equal to unity if the person resides in a state with restaurant smoking restrictions. Since we are also going to do weighted least squares, we do not report the heteroskedasticity-robust standard errors for OLS. (Incidentally, 13 out of the 807 fitted values are less than zero; this is less than 2% of the sample and is not a major cause for concern.)

Neither income nor cigarette price is statistically significant in (8.35), and their effects are not practically large. For example, if income increases by 10%, cigs is predicted to increase by (.880/100)(10) = .088, or less than one-tenth of a cigarette per day. The magnitude of the price effect is similar.

Each year of education reduces the average cigarettes smoked per day by one-half of a cigarette, and the effect is statistically significant. Cigarette smoking is also related to age, in a quadratic fashion. Smoking increases with age up until age = .771/[2(.0090)] ≈ 42.83, and then smoking decreases with age. Both terms in the quadratic are statistically significant. The presence of a restriction on smoking in restaurants decreases cigarette smoking by almost three cigarettes per day, on average.

Do the errors underlying equation (8.35) contain heteroskedasticity? The Breusch-Pagan regression of the squared OLS residuals on the independent variables in (8.35) [see equation (8.14)] produces R²_û² = .040. This small R-squared may seem to indicate no heteroskedasticity, but we must remember to compute either the F or LM statistic. If the sample size is large, a seemingly small R²_û² can result in a very strong rejection of homoskedasticity. The LM statistic is LM = 807(.040) = 32.28, and this is the outcome of a χ²_6 random variable. The p-value is less than .000015, which is very strong evidence of heteroskedasticity.
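As an aside, the LM computation is a one-liner once the auxiliary R-squared is in hand. The following minimal sketch in Python (scipy assumed available) reproduces the two figures just reported; it uses only the published n and R-squared, not the underlying data.

```python
from scipy import stats

# Breusch-Pagan LM statistic from Example 8.7: n = 807 observations and
# R-squared = .040 from regressing the squared OLS residuals on the six regressors.
n, r2, k = 807, 0.040, 6
lm = n * r2                        # LM = 32.28
pval = stats.chi2.sf(lm, df=k)     # upper-tail area of a chi-square(6) variable
print(lm, pval)                    # p-value is below .000015
```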
Therefore, we estimate the equation using the feasible GLS procedure based on equation (8.32). The weighted least squares estimates are

cigs = 5.64 + 1.30 log(income) − 2.94 log(cigpric)
      (17.80)  (.44)             (4.46)
     − .463 educ + .482 age − .0056 age² − 3.46 restaurn      (8.36)
      (.120)       (.097)     (.0009)      (.80)
n = 807, R² = .1134.

The income effect is now statistically significant and larger in magnitude. The price effect is also notably bigger, but it is still statistically insignificant. (One reason for this is that cigpric varies only across states in the sample, so there is much less variation in log(cigpric) than in log(income), educ, and age.)

The estimates on the other variables have, naturally, changed somewhat, but the basic story is still the same. Cigarette smoking is negatively related to schooling, has a quadratic relationship with age, and is negatively affected by restaurant smoking restrictions.

We must be a little careful in computing F statistics for testing multiple hypotheses after estimation by WLS. (This is true whether the sum of squared residuals or R-squared form of the F statistic is used.) It is important that the same weights be used to estimate the unrestricted and restricted models. We should first estimate the unrestricted model by OLS. Once we have obtained the weights, we can use them to estimate the restricted model as well. The F statistic can be computed as usual. Fortunately, many regression packages have a simple command for testing joint restrictions after WLS estimation, so we need not perform the restricted regression ourselves.

Example 8.7 hints at an issue that sometimes arises in applications of weighted least squares: the OLS and WLS estimates can be substantially different. This is not such a big problem in the demand for cigarettes equation because all the coefficients maintain the same signs, and the biggest changes are on variables that were statistically insignificant when the equation was estimated by OLS. The OLS and WLS estimates will always differ due to sampling error. The issue is whether their difference is enough to change important conclusions.

If OLS and WLS produce statistically significant estimates that differ in sign (for example, the OLS price elasticity is positive and significant, while the WLS price elasticity is negative and significant), or if the difference in magnitudes of the estimates is practically large, we should be suspicious. Typically, this indicates that one of the other Gauss-Markov assumptions is false, particularly the zero conditional mean assumption on the error (MLR.4). If E(y|x) ≠ β_0 + β_1 x_1 + … + β_k x_k, then OLS and WLS have different expected values and probability limits. For WLS to be consistent for the β_j, it is not enough for u to be uncorrelated with each x_j; we need the stronger assumption MLR.4 in the linear model MLR.1. Therefore, a significant difference between OLS and WLS can indicate a functional form misspecification in E(y|x). The Hausman test [Hausman (1978)] can be used to formally compare the OLS and WLS estimates to see whether they differ by more than sampling error suggests they should, but this test is beyond the scope of this text. In many cases, an informal "eyeballing" of the estimates is sufficient to detect a problem.

Exploring Further 8.4
Let û_i be the WLS residuals from (8.36), which are not weighted, and let cigs-hat_i be the fitted values. (These are obtained using the same formulas as OLS; they differ because of different estimates of the β_j.) One way to determine whether heteroskedasticity has been eliminated is to use û_i²/ĥ_i = (û_i/√ĥ_i)² in a test for heteroskedasticity. (If h_i = Var(u_i|x_i), then the transformed residuals should show little evidence of heteroskedasticity.) There are many possibilities, but one, based on White's test in the transformed equation, is to regress û_i²/ĥ_i on cigs-hat_i/√ĥ_i and cigs-hat_i²/ĥ_i (including an intercept). The joint F statistic when we use SMOKE is 11.15. Does it appear that our correction for heteroskedasticity has actually eliminated the heteroskedasticity?

8.4c What If the Assumed Heteroskedasticity Function Is Wrong?

We just noted that if OLS and WLS produce very different estimates, it is likely that the conditional mean E(y|x) is misspecified. What are the properties of WLS if the variance function we use is misspecified, in the sense that Var(y|x) ≠ σ²h(x) for our chosen function h(x)? The most important issue
is whether misspecification of h(x) causes bias or inconsistency in the WLS estimator. Fortunately, the answer is no, at least under MLR.4. Recall that, if E(u|x) = 0, then any function of x is uncorrelated with u, and so the weighted error, u/√h(x), is uncorrelated with the weighted regressors, x_j/√h(x), for any function h(x) that is always positive. This is why, as we just discussed, we can take large differences between the OLS and WLS estimators as indicative of functional form misspecification. If we estimate parameters in the function, say h(x, δ̂), then we can no longer claim that WLS is unbiased, but it will generally be consistent (whether or not the variance function is correctly specified).

If WLS is at least consistent under MLR.1 to MLR.4, what are the consequences of using WLS with a misspecified variance function? There are two. The first, which is very important, is that the usual WLS standard errors and test statistics, computed under the assumption that Var(y|x) = σ²h(x), are no longer valid, even in large samples. For example, the WLS estimates and standard errors in column (4) of Table 8.1 assume that Var(nettfa|inc, age, male, e401k) = Var(nettfa|inc) = σ² inc; so we are assuming not only that the variance depends just on income, but also that it is a linear function of income. If this assumption is false, the standard errors (and any statistics we obtain using those standard errors) are not valid. Fortunately, there is an easy fix: just as we can obtain standard errors for the OLS estimates that are robust to arbitrary heteroskedasticity, we can obtain standard errors for WLS that allow the variance function to be arbitrarily misspecified.

It is easy to see why this works. Write the transformed equation as

y_i/√h_i = β_0(1/√h_i) + β_1(x_i1/√h_i) + … + β_k(x_ik/√h_i) + u_i/√h_i.

Now, if Var(u_i|x_i) ≠ σ²h_i, then the weighted error u_i/√h_i is heteroskedastic. So we can just apply the usual heteroskedasticity-robust standard errors after estimating this equation by OLS, which, remember, is identical to WLS.

To see how robust inference with WLS works in practice, column (1) of Table 8.2 reproduces the last column of Table 8.1, and column (2) contains standard errors robust to Var(u_i|x_i) ≠ σ² inc_i. The standard errors in column (2) allow the variance function to be misspecified. We see that, for the income and age variables, the robust standard errors are somewhat above the usual WLS standard errors, certainly by enough to stretch the confidence intervals. On the other hand, the robust standard errors for male and e401k are actually smaller than those that assume a correct variance function. We saw that this could happen with the heteroskedasticity-robust standard errors for OLS, too.

Even if we use flexible forms of variance functions, such as that in (8.30), there is no guarantee that we have the correct model. While exponential heteroskedasticity is appealing and reasonably flexible, it is, after all, just a model. Therefore, it is always a good idea to compute fully robust standard errors and test statistics after WLS estimation.

Table 8.2  WLS Estimation of the nettfa Equation

Independent Variables   With Nonrobust Standard Errors   With Robust Standard Errors
inc                     .740 (.064)                      .740 (.075)
(age − 25)²             .0175 (.0019)                    .0175 (.0026)
male                    1.84 (1.56)                      1.84 (1.31)
e401k                   5.19 (1.70)                      5.19 (1.57)
intercept               −16.70 (1.96)                    −16.70 (2.24)
Observations            2,017                            2,017
R-squared               .1115                            .1115
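Computing fully robust standard errors after WLS takes one extra option in most packages. Here is a minimal sketch in Python with statsmodels, on simulated data whose true variance deliberately differs from the assumed σ² inc; all names and values are illustrative stand-ins for the nettfa example, not the actual data.

```python
import numpy as np
import statsmodels.api as sm

# Simulated data where the assumed variance model (sigma^2 * inc) is wrong:
# the true error standard deviation is inc**0.6, not proportional to sqrt(inc).
rng = np.random.default_rng(4)
n = 2000
inc = rng.uniform(10, 100, n)
age_dev2 = rng.uniform(0, 1600, n)               # stand-in for (age - 25)^2
y = -17 + 0.74 * inc + 0.017 * age_dev2 + rng.normal(0, inc ** 0.6)

X = sm.add_constant(np.column_stack((inc, age_dev2)))
wls_usual = sm.WLS(y, X, weights=1.0 / inc).fit()
wls_robust = sm.WLS(y, X, weights=1.0 / inc).fit(cov_type="HC0")

print(wls_usual.bse)    # valid only if Var(y|x) = sigma^2 * inc
print(wls_robust.bse)   # valid under arbitrary misspecification of h(x)
```

The point estimates are identical in the two fits; only the estimated covariance matrix, and hence the standard errors, changes.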
A modern criticism of WLS is that, if the variance function is misspecified, it is not guaranteed to be more efficient than OLS. In fact, that is the case: if Var(y|x) is neither constant nor equal to σ²h(x), where h(x) is the proposed model of heteroskedasticity, then we cannot rank OLS and WLS in terms of variances (or asymptotic variances when the variance parameters must be estimated). However, this theoretically correct criticism misses an important practical point. Namely, in cases of strong heteroskedasticity, it is often better to use a wrong form of heteroskedasticity and apply WLS than to ignore heteroskedasticity altogether in estimation and use OLS. Models such as (8.30) can well approximate a variety of heteroskedasticity functions and may produce estimators with smaller asymptotic variances. Even in Example 8.6, where the form of heteroskedasticity was assumed to have the simple form Var(nettfa|x) = σ² inc, the fully robust standard errors for WLS are well below the fully robust standard errors for OLS. (Comparing robust standard errors for the two estimators puts them on equal footing: we assume neither homoskedasticity nor that the variance has the form σ² inc.) For example, the robust standard error for the WLS estimator of β_inc is about .075, which is 25% lower than the robust standard error for OLS (about .100). For the coefficient on (age − 25)², the robust standard error for WLS is about .0026, almost 40% below the robust standard error for OLS (about .0043).

8.4d Prediction and Prediction Intervals with Heteroskedasticity

If we start with the standard linear model under MLR.1 to MLR.4, but allow for heteroskedasticity of the form Var(y|x) = σ²h(x) [see equation (8.21)], the presence of heteroskedasticity affects the point prediction of y only insofar as it affects estimation of the β_j. Of course, it is natural to use WLS on a sample of size n to obtain the β̂_j. Our prediction of an unobserved outcome, y^0, given known values of the explanatory variables x^0, has the same form as in Section 6.4: ŷ^0 = β̂_0 + x^0 β̂. This makes sense: once we know E(y|x), we base our prediction on it; the structure of Var(y|x) plays no direct role.

On the other hand, prediction intervals do depend directly on the nature of Var(y|x). Recall that in Section 6.4 we constructed a prediction interval under the classical linear model assumptions. Suppose now that all the CLM assumptions hold, except that (8.21) replaces the homoskedasticity assumption, MLR.5. We know that the WLS estimators are BLUE and, because of normality, have (conditional) normal distributions. We can obtain se(ŷ^0) using the same method in Section 6.4, except that now we use WLS. [A simple approach is to write y_i = θ_0 + β_1(x_i1 − x^0_1) + … + β_k(x_ik − x^0_k) + u_i, where the x^0_j are the values of the explanatory variables for which we want a predicted value of y. We can estimate this equation by WLS and then obtain ŷ^0 = θ̂_0 and se(ŷ^0) = se(θ̂_0).] We also need to estimate the standard deviation of u^0, the unobserved part of y^0. But Var(u^0|x = x^0) = σ²h(x^0), and so se(u^0) = σ̂√h(x^0), where σ̂ is the standard error of the regression from the WLS estimation. Therefore, a 95% prediction interval is

ŷ^0 ± t_.025 · se(ê^0),      (8.37)

where se(ê^0) = {[se(ŷ^0)]² + σ̂²h(x^0)}^(1/2).

This interval is exact only if we do not have to estimate the variance function. If we estimate parameters, as in model (8.30), then we cannot
obtain an exact interval. In fact, accounting for the estimation error in the β̂_j and the δ̂_j (the variance parameters) becomes very difficult. We saw two examples in Section 6.4 where the estimation error in the parameters was swamped by the variation in the unobservables, u^0. Therefore, we might still use equation (8.37) with h(x^0) simply replaced by ĥ(x^0). In fact, if we are willing to ignore the parameter estimation error entirely, we can drop se(ŷ^0) from se(ê^0). [Remember, se(ŷ^0) converges to zero at the rate 1/√n, while se(û^0) is roughly constant.]

We can also obtain a prediction for y in the model

log(y) = β_0 + β_1 x_1 + … + β_k x_k + u,      (8.38)

where u is heteroskedastic. We assume that u has a conditional normal distribution with a specific form of heteroskedasticity. We assume the exponential form in equation (8.30), but add the normality assumption:

u|x_1, x_2, …, x_k ~ Normal[0, exp(δ_0 + δ_1 x_1 + … + δ_k x_k)].      (8.39)

As a notational shorthand, write the variance function as exp(δ_0 + xδ). Then, because log(y) given x has a normal distribution with mean β_0 + xβ and variance exp(δ_0 + xδ), it follows that

E(y|x) = exp(β_0 + xβ + exp(δ_0 + xδ)/2).      (8.40)

Now, we estimate the β_j and δ_j using WLS estimation of (8.38). That is, after using OLS to obtain the residuals, run the regression in (8.32) to obtain fitted values,

ĝ_i = α̂_0 + δ̂_1 x_i1 + … + δ̂_k x_ik,      (8.41)

and then compute the ĥ_i as in (8.33). Using these ĥ_i, obtain the WLS estimates, β̂_j, and also compute σ̂² from the weighted squared residuals. Now, compared with the original model for Var(u|x), δ_0 = α_0 + log(σ²), and so Var(u|x) = σ² exp(α_0 + δ_1 x_1 + … + δ_k x_k). Therefore, the estimated variance is σ̂² exp(ĝ_i) = σ̂²ĥ_i, and the fitted value for y_i is

ŷ_i = exp(m̂_i + σ̂²ĥ_i/2),      (8.42)

where m̂_i denotes the fitted value of log(y) for observation i from the WLS estimation. We can use these fitted values to obtain an R-squared measure, as described in Section 6.4: use the squared correlation coefficient between y_i and ŷ_i.

For any values of the explanatory variables x^0, we can estimate E(y|x = x^0) as

Ê(y|x = x^0) = exp(β̂_0 + x^0 β̂ + σ̂² exp(α̂_0 + x^0 δ̂)/2),      (8.43)

where the β̂_j are the WLS estimates, α̂_0 is the intercept in (8.41), the δ̂_j are the slopes from the same regression, and σ̂² is obtained from the WLS estimation.

Obtaining a proper standard error for the prediction in (8.42) is very complicated analytically, but, as in Section 6.4, it would be fairly easy to obtain a standard error using a resampling method such as the bootstrap described in Appendix 6A.

Obtaining a prediction interval is more of a challenge when we estimate a model for heteroskedasticity, and a full treatment is complicated. Nevertheless, we saw in Section 6.4 two examples where the error variance swamps the estimation error, and we would make only a small mistake by ignoring the estimation error in all parameters. Using arguments similar to those in Section 6.4, an approximate 95% prediction interval (for large sample sizes) is

exp[−1.96·σ̂√ĥ(x^0)]·exp(β̂_0 + x^0 β̂)  to  exp[1.96·σ̂√ĥ(x^0)]·exp(β̂_0 + x^0 β̂),

where ĥ(x^0) is the estimated variance function evaluated at x^0: ĥ(x^0) = exp(α̂_0 + δ̂_1 x^0_1 + … + δ̂_k x^0_k). As in Section 6.4, we obtain this approximate interval by simply exponentiating the endpoints.
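The chain of calculations in (8.40) through (8.43) is easier to follow in code. Below is a minimal sketch in Python (numpy and statsmodels assumed available) on simulated data; like the text's approximate interval, it ignores parameter estimation error, and all simulated values are illustrative assumptions.

```python
import numpy as np
import statsmodels.api as sm

# log(y) = b0 + b1*x + u with Var(u|x) = exp(d0 + d1*x); simulated example.
rng = np.random.default_rng(5)
n = 1500
x = rng.uniform(0, 3, n)
u = rng.normal(0, np.sqrt(np.exp(-1.0 + 0.6 * x)))
logy = 0.5 + 0.4 * x + u
X = sm.add_constant(x)

# FGLS pieces: variance regression (8.32)/(8.41), then WLS of (8.38).
uhat = sm.OLS(logy, X).fit().resid
var_reg = sm.OLS(np.log(uhat ** 2), X).fit()
hhat = np.exp(var_reg.fittedvalues)
wls = sm.WLS(logy, X, weights=1.0 / hhat).fit()
sigma2 = wls.scale                    # sigma^2-hat from the weighted residuals

# Evaluate at a new point x0, per (8.43) and the approximate 95% interval.
x0 = np.array([1.0, 2.0])             # constant and x = 2
h0 = np.exp(var_reg.params @ x0)      # hhat(x0) = exp(alpha0-hat + x0*delta-hat)
mid = np.exp(wls.params @ x0)         # exp(b0-hat + x0*b-hat)
yhat0 = mid * np.exp(sigma2 * h0 / 2)                  # estimate of E(y|x = x0)
half = 1.96 * np.sqrt(sigma2 * h0)
lo, hi = mid * np.exp(-half), mid * np.exp(half)       # exponentiated endpoints
print(yhat0, (lo, hi))
```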
8.5 The Linear Probability Model Revisited

As we saw in Section 7.5, when the dependent variable y is a binary variable, the model must contain heteroskedasticity, unless all of the slope parameters are zero. We are now in a position to deal with this problem.

The simplest way to deal with heteroskedasticity in the linear probability model is to continue to use OLS estimation, but to also compute robust standard errors for test statistics. This ignores the fact that we actually know the form of heteroskedasticity for the LPM. Nevertheless, OLS estimation of the LPM is simple and often produces satisfactory results.

Example 8.8 Labor Force Participation of Married Women

In the labor force participation example in Section 7.5 [see equation (7.29)], we reported the usual OLS standard errors. Now, we compute the heteroskedasticity-robust standard errors as well. These are reported in brackets below the usual standard errors:

inlf = .586 − .0034 nwifeinc + .038 educ + .039 exper
      (.154)  (.0014)           (.007)      (.006)
      [.151]  [.0015]           [.007]      [.006]
     − .00060 exper² − .016 age − .262 kidslt6 + .0130 kidsge6      (8.44)
      (.00018)           (.002)     (.034)          (.0132)
      [.00019]           [.002]     [.032]          [.0135]
n = 753, R² = .264.

Several of the robust and OLS standard errors are the same to the reported degree of precision; in all cases, the differences are practically very small. Therefore, while heteroskedasticity is a problem in theory, it is not in practice, at least not for this example. It often turns out that the usual OLS standard errors and test statistics are similar to their heteroskedasticity-robust counterparts. Furthermore, it requires minimal effort to compute both.

Generally, the OLS estimators are inefficient in the LPM. Recall that the conditional variance of y in the LPM is

Var(y|x) = p(x)[1 − p(x)],      (8.45)

where

p(x) = β_0 + β_1 x_1 + … + β_k x_k      (8.46)

is the response probability (the probability of success, y = 1). It seems natural to use weighted least squares, but there are a couple of hitches. The probability p(x) clearly depends on the unknown population parameters, β_j. Nevertheless, we do have unbiased estimators of these parameters, namely the OLS estimators. When the OLS estimators are plugged into equation (8.46), we obtain the OLS fitted values. Thus, for each observation i, Var(y_i|x_i) is estimated by

ĥ_i = ŷ_i(1 − ŷ_i),      (8.47)

where ŷ_i is the OLS fitted value for observation i. Now, we apply feasible GLS, just as in Section 8.4.

Unfortunately, being able to estimate h_i for each i does not mean that we can proceed directly with WLS estimation. The problem is one that we briefly discussed in Section 7.5: the fitted values ŷ_i need not fall in the unit interval. If either ŷ_i < 0 or ŷ_i > 1, equation (8.47) shows that ĥ_i will be negative. Since WLS proceeds by multiplying observation i by 1/√ĥ_i, the method will fail if ĥ_i is negative (or zero) for any observation. In other words, all of the weights for WLS must be positive. In some cases, 0 < ŷ_i < 1 for all i, in which case WLS can be used to estimate the LPM. In cases with many observations and small probabilities of success or failure, it is very common to find some fitted values outside the unit interval. If this happens, as it does
in the labor force participation example in equation (8.44), it is easiest to abandon WLS and to report the heteroskedasticity-robust statistics. An alternative is to adjust those fitted values that are less than zero or greater than unity, and then to apply WLS. One suggestion is to set ŷ_i = .01 if ŷ_i < 0 and ŷ_i = .99 if ŷ_i > 1. Unfortunately, this requires an arbitrary choice on the part of the researcher; for example, why not use .001 and .999 as the adjusted values? If many fitted values are outside the unit interval, the adjustment to the fitted values can affect the results; in this situation, it is probably best to just use OLS.

Estimating the Linear Probability Model by Weighted Least Squares:
1. Estimate the model by OLS and obtain the fitted values, ŷ.
2. Determine whether all of the fitted values are inside the unit interval. If so, proceed to step 3. If not, some adjustment is needed to bring all fitted values into the unit interval.
3. Construct the estimated variances in equation (8.47).
4. Estimate the equation y = β_0 + β_1 x_1 + … + β_k x_k + u by WLS, using weights 1/ĥ.

Example 8.9 Determinants of Personal Computer Ownership

We use the data in GPA1 to estimate the probability of owning a computer. Let PC denote a binary indicator equal to unity if the student owns a computer, and zero otherwise. The variable hsGPA is high school GPA, ACT is achievement test score, and parcoll is a binary indicator equal to unity if at least one parent attended college. (Separate college indicators for the mother and the father do not yield individually significant results, as these are pretty highly correlated.)

The equation estimated by OLS is

PC = −.0004 + .065 hsGPA + .0006 ACT + .221 parcoll
     (.4905)   (.137)       (.0155)     (.093)      (8.48)
     [.4888]   [.139]       [.0158]     [.087]
n = 141, R² = .0415.

Just as with Example 8.8, there are no striking differences between the usual and robust standard errors. Nevertheless, we also estimate the model by WLS. Because all of the OLS fitted values are inside the unit interval, no adjustments are needed:

PC = .026 + .033 hsGPA + .0043 ACT + .215 parcoll
    (.477)   (.130)       (.0155)     (.086)      (8.49)
n = 141, R² = .0464.

There are no important differences in the OLS and WLS estimates. The only significant explanatory variable is parcoll, and in both cases we estimate that the probability of PC ownership is about .22 higher if at least one parent attended college.
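The four-step LPM procedure is short enough to wrap in a function. Here is a minimal sketch in Python with statsmodels on simulated data; the function name lpm_wls and the simulated probabilities are our own illustrative choices, and the function simply stops (step 2) rather than adjusting out-of-range fitted values.

```python
import numpy as np
import statsmodels.api as sm

def lpm_wls(y, X):
    """WLS for the linear probability model, following the four steps in the
    text. Raises an error if any fitted value lies outside the unit interval."""
    yhat = sm.OLS(y, X).fit().fittedvalues            # step 1: OLS fitted values
    if yhat.min() <= 0 or yhat.max() >= 1:            # step 2: check unit interval
        raise ValueError("fitted values outside the unit interval; "
                         "adjust them or fall back on OLS with robust SEs")
    hhat = yhat * (1 - yhat)                          # step 3: estimate Var(y|x)
    return sm.WLS(y, X, weights=1.0 / hhat).fit()     # step 4: WLS, weights 1/hhat

# Hypothetical usage: binary outcome with probabilities well inside (0, 1).
rng = np.random.default_rng(6)
n = 800
x = rng.uniform(-2, 2, n)
p = 0.4 + 0.1 * x                    # true response probability, in [.2, .6]
y = (rng.uniform(size=n) < p).astype(float)
res = lpm_wls(y, sm.add_constant(x))
print(res.params)
```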
Summary

We began by reviewing the properties of ordinary least squares in the presence of heteroskedasticity. Heteroskedasticity does not cause bias or inconsistency in the OLS estimators, but the usual standard errors and test statistics are no longer valid. We showed how to compute heteroskedasticity-robust standard errors and t statistics, something that is routinely done by many regression packages. Most regression packages also compute a heteroskedasticity-robust F-type statistic.

We discussed two common ways to test for heteroskedasticity: the Breusch-Pagan test and a special case of the White test. Both of these statistics involve regressing the squared OLS residuals on either the independent variables (BP) or the fitted and squared fitted values (White). A simple F test is asymptotically valid; there are also Lagrange multiplier versions of the tests.

OLS is no longer the best linear unbiased estimator in the presence of heteroskedasticity. When the form of heteroskedasticity is known, GLS estimation can be used. This leads to weighted least squares as a means of obtaining the BLUE estimator. The test statistics from the WLS estimation are either exactly valid when the error term is normally distributed or asymptotically valid under nonnormality. This assumes, of course, that we have the proper model of heteroskedasticity.

More commonly, we must estimate a model for the heteroskedasticity before applying WLS. The resulting feasible GLS estimator is no longer unbiased, but it is consistent and asymptotically efficient. The usual statistics from the WLS regression are asymptotically valid. We discussed a method to ensure that the estimated variances are strictly positive for all observations, something needed to apply WLS.

As we discussed in Chapter 7, the linear probability model for a binary dependent variable necessarily has a heteroskedastic error term. A simple way to deal with this problem is to compute heteroskedasticity-robust statistics. Alternatively, if all the fitted values (that is, the estimated probabilities) are strictly between zero and one, weighted least squares can be used to obtain asymptotically efficient estimators.

Key Terms
Breusch-Pagan Test for Heteroskedasticity (BP Test)
Feasible GLS (FGLS) Estimator
Generalized Least Squares (GLS) Estimators
Heteroskedasticity of Unknown Form
Heteroskedasticity-Robust F Statistic
Heteroskedasticity-Robust LM Statistic
Heteroskedasticity-Robust Standard Error
Heteroskedasticity-Robust t Statistic
Weighted Least Squares (WLS) Estimators
White Test for Heteroskedasticity

Problems

1. Which of the following are consequences of heteroskedasticity?
(i) The OLS estimators, β̂_j, are inconsistent.
(ii) The usual F statistic no longer has an F distribution.
(iii) The OLS estimators are no longer BLUE.

2. Consider a linear model to explain monthly beer consumption:
beer = β_0 + β_1 inc + β_2 price + β_3 educ + β_4 female + u
E(u|inc, price, educ, female) = 0
Var(u|inc, price, educ, female) = σ² inc².
Write the transformed equation that has a homoskedastic error term.

3. True or False: WLS is preferred to OLS when an important variable has been omitted from the model.

4. Using the data in GPA3, the following equation was estimated for the fall and second semester students:

trmgpa = −2.12 + .900 crsgpa + .193 cumgpa + .0014 tothrs
         (.55)   (.175)        (.064)        (.0012)
         [.55]   [.166]        [.074]        [.0012]
       + .0018 sat − .0039 hsperc + .351 female − .157 season
         (.0002)     (.0018)         (.085)        (.098)
         [.0002]     [.0019]         [.079]        [.080]
n = 269, R² = .465.

Here, trmgpa is term GPA, crsgpa is a weighted average of overall GPA in courses taken, cumgpa is GPA prior to the current semester, tothrs is total credit hours prior to the semester, sat is SAT score, hsperc is graduating percentile in high school class, female is a gender dummy, and season is a dummy variable equal to unity if the student's sport is in season during the fall. The usual and
heteroskedasticity-robust standard errors are reported in parentheses and brackets, respectively.
(i) Do the variables crsgpa, cumgpa, and tothrs have the expected estimated effects? Which of these variables are statistically significant at the 5% level? Does it matter which standard errors are used?
(ii) Why does the hypothesis H0: β_crsgpa = 1 make sense? Test this hypothesis against the two-sided alternative at the 5% level, using both standard errors. Describe your conclusions.
(iii) Test whether there is an in-season effect on term GPA, using both standard errors. Does the significance level at which the null can be rejected depend on the standard error used?

5. The variable smokes is a binary variable equal to one if a person smokes, and zero otherwise. Using the data in SMOKE, we estimate a linear probability model for smokes:

smokes = .656 − .069 log(cigpric) + .012 log(income) − .029 educ
         (.855)  (.204)              (.026)             (.006)
         [.856]  [.207]              [.026]             [.006]
       + .020 age − .00026 age² − .101 restaurn − .026 white
         (.006)     (.00006)       (.039)          (.052)
         [.005]     [.00006]       [.038]          [.050]
n = 807, R² = .062.

The variable white equals one if the respondent is white, and zero otherwise; the other independent variables are defined in Example 8.7. Both the usual and heteroskedasticity-robust standard errors are reported.
(i) Are there any important differences between the two sets of standard errors?
(ii) Holding other factors fixed, if education increases by four years, what happens to the estimated probability of smoking?
(iii) At what point does another year of age reduce the probability of smoking?
(iv) Interpret the coefficient on the binary variable restaurn (a dummy variable equal to one if the person lives in a state with restaurant smoking restrictions).
(v) Person number 206 in the data set has the following characteristics: cigpric = 67.44, income = 6,500, educ = 16, age = 77, restaurn = 0, white = 0, and smokes = 0. Compute the predicted probability of smoking for this person and comment on the result.

6. There are different ways to combine features of the Breusch-Pagan and White tests for heteroskedasticity. One possibility not covered in the text is to run the regression

û_i² on x_i1, x_i2, …, x_ik, ŷ_i²,  i = 1, …, n,

where the û_i are the OLS residuals and the ŷ_i are the OLS fitted values. Then, we would test the joint significance of x_i1, x_i2, …, x_ik and ŷ_i². (Of course, we always include an intercept in this regression.)
(i) What are the df associated with the proposed F test for heteroskedasticity?
(ii) Explain why the R-squared from the regression above will always be at least as large as the R-squareds for the BP regression and the special case of the White test.
(iii) Does part (ii) imply that the new test always delivers a smaller p-value than either the BP or the special case of the White statistic? Explain.
(iv) Suppose someone suggests also adding ŷ_i to the newly proposed test. What do you think of this idea?

7. Consider a model at the employee level,

y_ie = β_0 + β_1 x_ie1 + β_2 x_ie2 + … + β_k x_iek + f_i + v_ie,

where the unobserved variable f_i is a "firm effect" common to all employees at a given firm i. The error term v_ie is specific to employee e at firm i. The composite error is
u_ie = f_i + v_ie, such as in equation (8.28).
(i) Assume that Var(f_i) = σ_f², Var(v_ie) = σ_v², and f_i and v_ie are uncorrelated. Show that Var(u_ie) = σ_f² + σ_v²; call this σ².
(ii) Now suppose that, for e ≠ g, v_ie and v_ig are uncorrelated. Show that Cov(u_ie, u_ig) = σ_f².
(iii) Let ū_i = m_i⁻¹ Σ_{e=1}^{m_i} u_ie be the average of the composite errors within a firm. Show that Var(ū_i) = σ_f² + σ_v²/m_i.
(iv) Discuss the relevance of part (iii) for WLS estimation using data averaged at the firm level, where the weight used for observation i is the usual firm size.

8. The following equations were estimated using the data in ECONMATH. The first equation is for men and the second is for women. The third and fourth equations combine men and women:

score = 20.52 + 13.60 colgpa + 0.670 act
        (3.72)  (0.94)         (0.150)
n = 406, R² = .4025, SSR = 38,781.38

score = 13.79 + 11.89 colgpa + 1.03 act
        (4.11)  (1.09)         (0.18)
n = 408, R² = .3666, SSR = 48,029.82

score = 15.60 + 3.17 male + 12.82 colgpa + 0.838 act
        (2.80)  (0.73)      (0.72)         (0.116)
n = 814, R² = .3946, SSR = 87,128.96

score = 13.79 + 6.73 male + 11.89 colgpa + 1.03 act + 1.72 male·colgpa − 0.364 male·act
        (3.91)  (5.55)      (1.04)         (0.17)     (1.44)            (0.232)
n = 814, R² = .3968, SSR = 86,811.20

(i) Compute the usual Chow statistic for testing the null hypothesis that the regression equations are the same for men and women. Find the p-value of the test.
(ii) Compute the usual Chow statistic for testing the null hypothesis that the slope coefficients are the same for men and women, and report the p-value.
(iii) Do you have enough information to compute heteroskedasticity-robust versions of the tests in parts (i) and (ii)? Explain.

Computer Exercises

C1. Consider the following model to explain sleeping behavior:

sleep = β_0 + β_1 totwrk + β_2 educ + β_3 age + β_4 age² + β_5 yngkid + β_6 male + u.

(i) Write down a model that allows the variance of u to differ between men and women. The variance should not depend on other factors.
(ii) Use the data in SLEEP75 to estimate the parameters of the model for heteroskedasticity. (You have to estimate the sleep equation by OLS first to obtain the OLS residuals.) Is the estimated variance of u higher for men or for women?
(iii) Is the variance of u statistically different for men and for women?

C2. (i) Use the data in HPRICE1 to obtain the heteroskedasticity-robust standard errors for equation (8.17). Discuss any important differences with the usual standard errors.
(ii) Repeat part (i) for equation (8.18).
(iii) What does this example suggest about heteroskedasticity and the transformation used for the dependent variable?

C3. Apply the full White test for heteroskedasticity [see equation (8.19)] to equation (8.18). Using the
chi-square form of the statistic, obtain the p-value. What do you conclude?

C4. Use VOTE1 for this exercise.
(i) Estimate a model with voteA as the dependent variable and prtystrA, democA, log(expendA), and log(expendB) as independent variables. Obtain the OLS residuals, û_i, and regress these on all of the independent variables. Explain why you obtain R² = 0.
(ii) Now, compute the Breusch-Pagan test for heteroskedasticity. Use the F statistic version and report the p-value.
(iii) Compute the special case of the White test for heteroskedasticity, again using the F statistic form. How strong is the evidence for heteroskedasticity now?

C5. Use the data in PNTSPRD for this exercise.
(i) The variable sprdcvr is a binary variable equal to one if the Las Vegas point spread for a college basketball game was covered. The expected value of sprdcvr, say μ, is the probability that the spread is covered in a randomly selected game. Test H0: μ = .5 against H1: μ ≠ .5 at the 10% significance level and discuss your findings. (Hint: This is easily done using a t test by regressing sprdcvr on an intercept only.)
(ii) How many games in the sample of 553 were played on a neutral court?
(iii) Estimate the linear probability model

sprdcvr = β_0 + β_1 favhome + β_2 neutral + β_3 fav25 + β_4 und25 + u

and report the results in the usual form. (Report the usual OLS standard errors and the heteroskedasticity-robust standard errors.) Which variable is most significant, both practically and statistically?
(iv) Explain why, under the null hypothesis H0: β_1 = β_2 = β_3 = β_4 = 0, there is no heteroskedasticity in the model.
(v) Use the usual F statistic to test the hypothesis in part (iv). What do you conclude?
(vi) Given the previous analysis, would you say that it is possible to systematically predict whether the Las Vegas spread will be covered using information available prior to the game?

C6. In Example 7.12, we estimated a linear probability model for whether a young man was arrested during 1986:

arr86 = β_0 + β_1 pcnv + β_2 avgsen + β_3 tottime + β_4 ptime86 + β_5 qemp86 + u.

(i) Using the data in CRIME1, estimate this model by OLS and verify that all fitted values are strictly between zero and one. What are the smallest and largest fitted values?
(ii) Estimate the equation by weighted least squares, as discussed in Section 8.5.
(iii) Use the WLS estimates to determine whether avgsen and tottime are jointly significant at the 5% level.

C7. Use the data in LOANAPP for this exercise.
(i) Estimate the equation in part (iii) of Computer Exercise C8 in Chapter 7, computing the heteroskedasticity-robust standard errors. Compare the 95% confidence interval on β_white with the nonrobust confidence interval.
(ii) Obtain the fitted values from the regression in part (i). Are any of them less than zero? Are any of them greater than one? What does this mean about applying weighted least squares?

C8. Use the data set GPA1 for this exercise.
(i) Use OLS to estimate a model relating colGPA to hsGPA, ACT, skipped, and PC. Obtain the OLS residuals.
(ii) Compute the special case of the White test for heteroskedasticity. In the regression of û_i² on colGPA-hat_i and colGPA-hat_i², obtain the fitted values, say ĥ_i.
C9 In Example 8.7, we computed the OLS and a set of WLS estimates in a cigarette demand equation.
(i) Obtain the OLS estimates in equation (8.35).
(ii) Obtain the $\hat{h}_i$ used in the WLS estimation of equation (8.36) and reproduce equation (8.36). From this equation, obtain the unweighted residuals and fitted values; call these $\tilde{u}_i$ and $\tilde{y}_i$, respectively. (For example, in Stata, the unweighted residuals and fitted values are given by default.)
(iii) Let $\breve{u}_i = \tilde{u}_i/\sqrt{\hat{h}_i}$ and $\breve{y}_i = \tilde{y}_i/\sqrt{\hat{h}_i}$ be the weighted quantities. Carry out the special case of the White test for heteroskedasticity by regressing $\breve{u}_i^2$ on $\breve{y}_i$, $\breve{y}_i^2$, being sure to include an intercept, as always. Do you find heteroskedasticity in the weighted residuals?
(iv) What does the finding from part (iii) imply about the proposed form of heteroskedasticity used in obtaining (8.36)?
(v) Obtain valid standard errors for the WLS estimates that allow the variance function to be misspecified.

C10 Use the data set 401KSUBS for this exercise.
(i) Using OLS, estimate a linear probability model for e401k, using as explanatory variables inc, inc², age, age², and male. Obtain both the usual OLS standard errors and the heteroskedasticity-robust versions. Are there any important differences?
(ii) In the special case of the White test for heteroskedasticity, where we regress the squared OLS residuals on a quadratic in the OLS fitted values, $\hat{u}_i^2$ on $\hat{y}_i$, $\hat{y}_i^2$, $i = 1, \dots, n$, argue that the probability limit of the coefficient on $\hat{y}_i$ should be one, the probability limit of the coefficient on $\hat{y}_i^2$ should be $-1$, and the probability limit of the intercept should be zero. {Hint: Remember that $\mathrm{Var}(y|x_1, \dots, x_k) = p(\mathbf{x})[1 - p(\mathbf{x})]$, where $p(\mathbf{x}) = \beta_0 + \beta_1 x_1 + \dots + \beta_k x_k$.}
(iii) For the model estimated from part (i), obtain the White test and see if the coefficient estimates roughly correspond to the theoretical values described in part (ii).
(iv) After verifying that the fitted values from part (i) are all between zero and one, obtain the weighted least squares estimates of the linear probability model. Do they differ in important ways from the OLS estimates?

C11 Use the data in 401KSUBS for this question, restricting the sample to fsize = 1.
(i) To the model estimated in Table 8.1, add the interaction term e401k·inc. Estimate the equation by OLS and obtain the usual and robust standard errors. What do you conclude about the statistical significance of the interaction term?
(ii) Now, estimate the more general model by WLS using the same weights, $1/inc_i$, as in Table 8.1. Compute the usual and robust standard error for the WLS estimator. Is the interaction term statistically significant using the robust standard error?
(iii) Discuss the WLS coefficient on e401k in the more general model. Is it of much interest by itself? Explain.
(iv) Reestimate the model by WLS but use the interaction term e401k·(inc − 30); the average income in the sample is about 29.44. Now interpret the coefficient on e401k.
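Several of these exercises (C6, C7, C10) estimate a linear probability model and then reweight by the implied variance $h_i = \hat{y}_i(1 - \hat{y}_i)$. A minimal sketch under placeholder names (a DataFrame df with a binary outcome y and regressors x1, x2; none of these names come from the exercises themselves):

```python
# Sketch of WLS for a linear probability model.
import statsmodels.formula.api as smf

ols = smf.ols("y ~ x1 + x2", data=df).fit()
yhat = ols.fittedvalues

# WLS is feasible only if every fitted probability is strictly inside (0, 1).
assert (yhat.min() > 0) and (yhat.max() < 1), "some fitted values outside (0, 1)"

h = yhat * (1 - yhat)        # Var(y|x) implied by the linear probability model
wls = smf.wls("y ~ x1 + x2", data=df, weights=1.0 / h).fit()
print(wls.summary())
```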
C12 Use the data in MEAP00 to answer this question.
(i) Estimate the model

$math4 = \beta_0 + \beta_1 lunch + \beta_2 \log(enroll) + \beta_3 \log(exppp) + u$

by OLS and obtain the usual standard errors and the fully robust standard errors. How do they generally compare?
(ii) Apply the special case of the White test for heteroskedasticity. What is the value of the F test? What do you conclude?
(iii) Obtain $\hat{g}_i$ as the fitted values from the regression of $\log(\hat{u}_i^2)$ on $\widehat{math4}_i$, $\widehat{math4}_i^2$, where the $\widehat{math4}_i$ are the OLS fitted values and the $\hat{u}_i$ are the OLS residuals. Let $\hat{h}_i = \exp(\hat{g}_i)$. Use the $\hat{h}_i$ to obtain WLS estimates. Are there big differences with the OLS coefficients?
(iv) Obtain the standard errors for WLS that allow misspecification of the variance function. Do these differ much from the usual WLS standard errors?
(v) For estimating the effect of spending on math4, does OLS or WLS appear to be more precise?

C13 Use the data in FERTIL2 to answer this question.
(i) Estimate the model

$children = \beta_0 + \beta_1 age + \beta_2 age^2 + \beta_3 educ + \beta_4 electric + \beta_5 urban + u$

and report the usual and heteroskedasticity-robust standard errors. Are the robust standard errors always bigger than the nonrobust ones?
(ii) Add the three religious dummy variables and test whether they are jointly significant. What are the p-values for the nonrobust and robust tests?
(iii) From the regression in part (ii), obtain the fitted values $\hat{y}$ and the residuals, $\hat{u}$. Regress $\hat{u}^2$ on $\hat{y}$, $\hat{y}^2$ and test the joint significance of the two regressors. Conclude that heteroskedasticity is present in the equation for children.
(iv) Would you say the heteroskedasticity you found in part (iii) is practically important?

C14 Use the data in BEAUTY for this question.
(i) Using the data pooled for men and women, estimate the equation

$lwage = \beta_0 + \beta_1 belavg + \beta_2 abvavg + \beta_3 female + \beta_4 educ + \beta_5 exper + \beta_6 exper^2 + u,$

and report the results using heteroskedasticity-robust standard errors below the coefficients. Are any of the coefficients surprising in either their signs or magnitudes? Is the coefficient on female practically large and statistically significant?
(ii) Add interactions of female with all other explanatory variables in the equation from part (i) (five interactions in all). Compute the usual F test of joint significance of the five interactions and a heteroskedasticity-robust version. Does using the heteroskedasticity-robust version change the outcome in any important way?
(iii) In the full model with interactions, determine whether those involving the looks variables (female·belavg and female·abvavg) are jointly significant. Are their coefficients practically small?
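C13(ii) and C14(ii) both call for heteroskedasticity-robust joint tests. A sketch of one way to compute such a test in statsmodels, using placeholder variable names (d1, d2, d3 stand in for the dummies being tested jointly):

```python
# Heteroskedasticity-robust joint (Wald) test of several exclusion restrictions.
import statsmodels.formula.api as smf

res = smf.ols("y ~ x1 + x2 + d1 + d2 + d3", data=df).fit(cov_type="HC0")

# Because the model was fit with a robust covariance matrix, this Wald/F
# test of H0: the coefficients on d1, d2, d3 are all zero is itself robust.
print(res.f_test("d1 = 0, d2 = 0, d3 = 0"))
```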
Chapter 9
More on Specification and Data Issues

In Chapter 8, we dealt with one failure of the Gauss-Markov assumptions. While heteroskedasticity in the errors can be viewed as a problem with a model, it is a relatively minor one. The presence of heteroskedasticity does not cause bias or inconsistency in the OLS estimators. Also, it is fairly easy to adjust confidence intervals and t and F statistics to obtain valid inference after OLS estimation, or even to get more efficient estimators by using weighted least squares.

In this chapter, we return to the much more serious problem of correlation between the error, u, and one or more of the explanatory variables. Remember from Chapter 3 that if u is, for whatever reason, correlated with the explanatory variable $x_j$, then we say that $x_j$ is an endogenous explanatory variable. We also provide a more detailed discussion of three reasons why an explanatory variable can be endogenous; in some cases, we discuss possible remedies.

We have already seen in Chapters 3 and 5 that omitting a key variable can cause correlation between the error and some of the explanatory variables, which generally leads to bias and inconsistency in all of the OLS estimators. In the special case that the omitted variable is a function of an explanatory variable in the model, the model suffers from functional form misspecification.

We begin in the first section by discussing the consequences of functional form misspecification and how to test for it. In Section 9.2, we show how the use of proxy variables can solve, or at least mitigate, omitted variables bias. In Section 9.4, we derive and explain the bias in OLS that can arise under certain forms of measurement error. Additional data problems are discussed in Section 9.5.

All of the procedures in this chapter are based on OLS estimation. As we will see, certain problems that cause correlation between the error and some explanatory variables cannot be solved by using OLS on a single cross section. We postpone a treatment of alternative estimation methods until Part 3.

9.1 Functional Form Misspecification

A multiple regression model suffers from functional form misspecification when it does not properly account for the relationship between the dependent and the observed explanatory variables. For example, if hourly wage is determined by

$\log(wage) = \beta_0 + \beta_1 educ + \beta_2 exper + \beta_3 exper^2 + u,$

but we omit the squared experience term, $exper^2$, then we are committing a functional form misspecification. We already know from Chapter 3 that this generally leads to biased estimators of $\beta_0$, $\beta_1$, and $\beta_2$. (We do not estimate $\beta_3$ because $exper^2$ is excluded from the model.) Thus, misspecifying how exper affects $\log(wage)$ generally results in a biased estimator of the return to education, $\beta_1$. The amount of this bias depends on the size of $\beta_3$ and the correlation among educ, exper, and $exper^2$.

Things are worse for estimating the return to experience: even if we could get an unbiased estimator of $\beta_2$, we would not be able to estimate the return to experience, because it equals $\beta_2 + 2\beta_3 exper$ (in decimal form). Just using the biased estimator of $\beta_2$ can be misleading, especially at extreme values of exper.
As another example, suppose the $\log(wage)$ equation is

$\log(wage) = \beta_0 + \beta_1 educ + \beta_2 exper + \beta_3 exper^2 + \beta_4 female + \beta_5 female \cdot educ + u,$  (9.1)

where female is a binary variable. If we omit the interaction term, $female \cdot educ$, then we are misspecifying the functional form. In general, we will not get unbiased estimators of any of the other parameters, and since the return to education depends on gender, it is not clear what return we would be estimating by omitting the interaction term.

Omitting functions of independent variables is not the only way that a model can suffer from misspecified functional form. For example, if (9.1) is the true model satisfying the first four Gauss-Markov assumptions, but we use wage rather than $\log(wage)$ as the dependent variable, then we will not obtain unbiased or consistent estimators of the partial effects. The tests that follow have some ability to detect this kind of functional form problem, but there are better tests that we will mention in the subsection on testing against nonnested alternatives.

Misspecifying the functional form of a model can certainly have serious consequences. Nevertheless, in one important respect, the problem is minor: by definition, we have data on all the necessary variables for obtaining a functional relationship that fits the data well. This can be contrasted with the problem addressed in the next section, where a key variable is omitted on which we cannot collect data.

We already have a very powerful tool for detecting misspecified functional form: the F test for joint exclusion restrictions. It often makes sense to add quadratic terms of any significant variables to a model and to perform a joint test of significance. If the additional quadratics are significant, they can be added to the model (at the cost of complicating the interpretation of the model). However, significant quadratic terms can be symptomatic of other functional form problems, such as using the level of a variable when the logarithm is more appropriate, or vice versa. It can be difficult to pinpoint the precise reason that a functional form is misspecified. Fortunately, in many cases, using logarithms of certain variables and adding quadratics are sufficient for detecting many important nonlinear relationships in economics.
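A minimal sketch of this strategy in statsmodels, with placeholder names (a DataFrame df with outcome y and regressors x1, x2): fit the base model, add the squares, and compare the two fits with a joint F test.

```python
# Add quadratics in the explanatory variables and test them jointly.
import statsmodels.formula.api as smf

base = smf.ols("y ~ x1 + x2", data=df).fit()
quad = smf.ols("y ~ x1 + x2 + I(x1**2) + I(x2**2)", data=df).fit()

# compare_f_test returns (F statistic, p-value, number of restrictions)
# for H0: the two squared terms are jointly zero.
print(quad.compare_f_test(base))
```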
Example 9.1 Economic Model of Crime

Table 9.1 contains OLS estimates of the economic model of crime (see Example 8.3). We first estimate the model without any quadratic terms; those results are in column (1). In column (2), the squares of pcnv, ptime86, and inc86 are added; we chose to include the squares of these variables because each level term is significant in column (1). The variable qemp86 is a discrete variable taking on only five values, so we do not include its square in column (2).

TABLE 9.1  Dependent Variable: narr86

Independent Variables     (1)               (2)
pcnv                      −.133 (.040)      .533 (.154)
pcnv²                     —                 −.730 (.156)
avgsen                    −.011 (.012)      −.017 (.012)
tottime                   .012 (.009)       .012 (.009)
ptime86                   −.041 (.009)      .287 (.004)
ptime86²                  —                 −.0296 (.0039)
qemp86                    −.051 (.014)      −.014 (.017)
inc86                     −.0015 (.0003)    −.0034 (.0008)
inc86²                    —                 −.000007 (.000003)
black                     .327 (.045)       .292 (.045)
hispan                    .194 (.040)       .164 (.039)
intercept                 .596 (.036)       .505 (.037)
Observations              2,725             2,725
R-squared                 .0723             .1035

Each of the squared terms is significant, and together they are jointly very significant (F = 31.37, with df = 3 and 2,713; the p-value is essentially zero). Thus, it appears that the initial model overlooked some potentially important nonlinearities.

The presence of the quadratics makes interpreting the model somewhat difficult. For example, pcnv no longer has a strict deterrent effect: the relationship between narr86 and pcnv is positive up until pcnv = .365, and then the relationship is negative. We might conclude that there is little or no deterrent effect at lower values of pcnv; the effect only kicks in at higher prior conviction rates. We would have to use more sophisticated functional forms than the quadratic to verify this conclusion. It may be that pcnv is not entirely exogenous. For example, men who have not been convicted in the past (so that pcnv = 0) are perhaps casual criminals, and so they are less likely to be arrested in 1986. This could be biasing the estimates.

Similarly, the relationship between narr86 and ptime86 is positive up until ptime86 = 4.85 (almost five months in prison), and then the relationship is negative. The vast majority of men in the sample spent no time in prison in 1986, so again we must be careful in interpreting the results.

Exploring Further 9.1: Why do we not include the squares of black and hispan in column (2) of Table 9.1? Would it make sense to add interactions of black and hispan with some of the other variables reported in the table?

Legal income has a negative effect on narr86 until inc86 = 242.85; since income is measured in hundreds of dollars, this means an annual income of $24,285. Only 4.6% of the men in the sample have incomes above this level. Thus, we can conclude that narr86 and inc86 are negatively related, with a diminishing effect.

Example 9.1 is a tricky functional form problem due to the nature of the dependent variable. Other models are theoretically better suited for handling dependent variables taking on a small number of integer values. We will briefly cover these models in Chapter 17.

9.1a RESET as a General Test for Functional Form Misspecification

Some tests have been proposed to detect general functional form misspecification. Ramsey's (1969) regression specification error test (RESET) has proven to be useful in this regard.

The idea behind RESET is fairly simple. If the original model

$y = \beta_0 + \beta_1 x_1 + \dots + \beta_k x_k + u$  (9.2)

satisfies MLR.4, then no nonlinear functions of the independent variables should be significant when added to equation (9.2). In Example 9.1, we added quadratics in the significant explanatory variables. Although this often detects functional form problems, it has the drawback of using up many degrees of freedom if there are many explanatory variables in the original model (much as the straight form of the White test for heteroskedasticity consumes degrees of freedom). Further, certain kinds of neglected nonlinearities will not be picked up by adding quadratic terms. RESET adds polynomials in the OLS fitted values to equation (9.2) to detect general kinds of functional form misspecification.
To implement RESET, we must decide how many functions of the fitted values to include in an expanded regression. There is no right answer to this question, but the squared and cubed terms have proven to be useful in most applications.

Let $\hat{y}$ denote the OLS fitted values from estimating (9.2). Consider the expanded equation

$y = \beta_0 + \beta_1 x_1 + \dots + \beta_k x_k + \delta_1 \hat{y}^2 + \delta_2 \hat{y}^3 + error.$  (9.3)

This equation seems a little odd, because functions of the fitted values from the initial estimation now appear as explanatory variables. In fact, we will not be interested in the estimated parameters from (9.3); we only use this equation to test whether (9.2) has missed important nonlinearities. The thing to remember is that $\hat{y}^2$ and $\hat{y}^3$ are just nonlinear functions of the $x_j$.

The null hypothesis is that (9.2) is correctly specified. Thus, RESET is the F statistic for testing $H_0\colon \delta_1 = 0, \delta_2 = 0$ in the expanded model (9.3). A significant F statistic suggests some sort of functional form problem. The distribution of the F statistic is approximately $F_{2,\,n-k-3}$ in large samples under the null hypothesis (and the Gauss-Markov assumptions): the df in the expanded equation (9.3) is $n - k - 1 - 2 = n - k - 3$. An LM version is also available (and the chi-square distribution will have two df). Further, the test can be made robust to heteroskedasticity using the methods discussed in Section 8.2.
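RESET is short enough to construct by hand, which makes the logic transparent. A minimal sketch under placeholder names (df with y, x1, x2); recent versions of statsmodels also ship a prepackaged version (statsmodels.stats.diagnostic.linear_reset), but the manual route follows equation (9.3) directly.

```python
# Manual RESET: add squared and cubed fitted values, test them jointly.
import statsmodels.formula.api as smf

base = smf.ols("y ~ x1 + x2", data=df).fit()
df["yhat2"] = base.fittedvalues ** 2
df["yhat3"] = base.fittedvalues ** 3

expanded = smf.ols("y ~ x1 + x2 + yhat2 + yhat3", data=df).fit()

# Joint F test of H0: delta1 = delta2 = 0; approximately F(2, n-k-3)
# under the null and the Gauss-Markov assumptions.
print(expanded.compare_f_test(base))
```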
properly specified RESET has no power for detecting heteroskedasticity The bottom line is that RESET is a functional form test and nothing more 91b Tests against Nonnested Alternatives Obtaining tests for other kinds of functional form misspecificationfor example trying to decide whether an independent variable should appear in level or logarithmic formtakes us outside the realm of classical hypothesis testing It is possible to test the model y 5 b0 1 b1x1 1 b2x2 1 u 96 against the model y 5 b0 1 b1log1x12 1 b2log1x22 1 u 97 and vice versa However these are nonnested models see Chapter 6 and so we cannot simply use a standard F test Two different approaches have been suggested The first is to construct a comprehen sive model that contains each model as a special case and then to test the restrictions that led to each of the models In the current example the comprehensive model is y 5 g0 1 g1x1 1 g2x2 1 g3log1x12 1 g4log1x22 1 u 98 We can first test H0 g3 5 0 g4 5 0 as a test of 96 We can also test H0 g1 5 0 g2 5 0 as a test of 97 This approach was suggested by Mizon and Richard 1986 Another approach has been suggested by Davidson and MacKinnon 1981 They point out that if model 96 holds with E1u0x1 x22 5 0 the fitted values from the other model 97 should be insig nificant when added to equation 96 Therefore to test whether 96 is the correct model we first estimate model 97 by OLS to obtain the fitted values call these yˇ The DavidsonMacKinnon test is obtained from the t statistic on yˇ in the auxiliary equation y 5 b0 1 b1x1 1 b2x2 1 u1yˇ 1 error Copyright 2016 Cengage Learning All Rights Reserved May not be copied scanned or duplicated in whole or in part Due to electronic rights some third party content may be suppressed from the eBook andor eChapters Editorial review has deemed that any suppressed content does not materially affect the overall learning experience Cengage Learning reserves the right to remove additional content at any time if subsequent rights restrictions require it CHAPTER 9 More on Specification and Data Issues 279 Because the yˇ are just nonlinear functions of x1 and x2 they should be insignificant if 96 is the cor rect conditional mean model Therefore a significant t statistic against a twosided alternative is a rejection of 96 Similarly if y denotes the fitted values from estimating 96 the test of 97 is the t statistic on y in the model y 5 b0 1 b1log1x12 1 b2log1x22 1 u1y 1 error a significant t statistic is evidence against 97 The same two tests can be used for testing any two nonnested models with the same dependent variable There are a few problems with nonnested testing First a clear winner need not emerge Both models could be rejected or neither model could be rejected In the latter case we can use the adjusted Rsquared to choose between them If both models are rejected more work needs to be done However it is important to know the practical consequences from using one form or the other if the effects of key independent variables on y are not very different then it does not really matter which model is used A second problem is that rejecting 96 using say the DavidsonMacKinnon test does not mean that 97 is the correct model Model 96 can be rejected for a variety of functional form misspecifications An even more difficult problem is obtaining nonnested tests when the competing models have different dependent variables The leading case is y versus log1y2 We saw in Chapter 6 that just obtaining goodnessoffit measures that can be compared requires some care Tests 
have been pro posed to solve this problem but they are beyond the scope of this text See Wooldridge 1994a for a test that has a simple interpretation and is easy to implement 92 Using Proxy Variables for Unobserved Explanatory Variables A more difficult problem arises when a model excludes a key variable usually because of data una vailability Consider a wage equation that explicitly recognizes that ability abil affects log1wage2 log1wage2 5 b0 1 b1educ 1 b2exper 1 b3abil 1 u 99 This model shows explicitly that we want to hold ability fixed when measuring the return to educ and exper If say educ is correlated with abil then putting abil in the error term causes the OLS estimator of b1 and b2 to be biased a theme that has appeared repeatedly Our primary interest in equation 99 is in the slope parameters b1 and b2 We do not really care whether we get an unbiased or consistent estimator of the intercept b0 as we will see shortly this is not usually possible Also we can never hope to estimate b3 because abil is not observed in fact we would not know how to interpret b3 anyway since ability is at best a vague concept How can we solve or at least mitigate the omitted variables bias in an equation like 99 One possibility is to obtain a proxy variable for the omitted variable Loosely speaking a proxy variable is something that is related to the unobserved variable that we would like to control for in our analy sis In the wage equation one possibility is to use the intelligence quotient or IQ as a proxy for abil ity This does not require IQ to be the same thing as ability what we need is for IQ to be correlated with ability something we clarify in the following discussion All of the key ideas can be illustrated in a model with three independent variables two of which are observed y 5 b0 1 b1x1 1 b2x2 1 b3xp 3 1 u 910 Copyright 2016 Cengage Learning All Rights Reserved May not be copied scanned or duplicated in whole or in part Due to electronic rights some third party content may be suppressed from the eBook andor eChapters Editorial review has deemed that any suppressed content does not materially affect the overall learning experience Cengage Learning reserves the right to remove additional content at any time if subsequent rights restrictions require it PART 1 Regression Analysis with CrossSectional Data 280 We assume that data are available on y x1 and x2in the wage example these are log1wage2 educ and exper respectively The explanatory variable xp 3 is unobserved but we have a proxy variable for xp 3 Call the proxy variable x3 What do we require of x3 At a minimum it should have some relationship to xp 3 This is captured by the simple regression equation xp 3 5 d0 1 d3x3 1 v3 911 where v3 is an error due to the fact that xp 3 and x3 are not exactly related The parameter d3 measures the relationship between xp 3 and x3 typically we think of xp 3 and x3 as being positively related so that d3 0 If d3 5 0 then x3 is not a suitable proxy for xp 3 The intercept d0 in 911 which can be positive or negative simply allows xp 3 and x3 to be measured on different scales For exam ple unobserved ability is certainly not required to have the same average value as IQ in the US population How can we use x3 to get unbiased or at least consistent estimators of b1 and b2 The proposal is to pretend that x3 and xp 3 are the same so that we run the regression of y on x1 x2 x3 912 We call this the plugin solution to the omitted variables problem because x3 is just plugged in for xp 3 before we run OLS If x3 is truly 
If $x_3$ is truly related to $x_3^*$, this seems like a sensible thing to do. However, since $x_3$ and $x_3^*$ are not the same, we should determine when this procedure does, in fact, give consistent estimators of $\beta_1$ and $\beta_2$.

The assumptions needed for the plug-in solution to provide consistent estimators of $\beta_1$ and $\beta_2$ can be broken down into assumptions about u and $v_3$:

(1) The error u is uncorrelated with $x_1$, $x_2$, and $x_3^*$, which is just the standard assumption in model (9.10). In addition, u is uncorrelated with $x_3$. This latter assumption just means that $x_3$ is irrelevant in the population model once $x_1$, $x_2$, and $x_3^*$ have been included. This is essentially true by definition, since $x_3$ is a proxy variable for $x_3^*$: it is $x_3^*$ that directly affects y, not $x_3$. Thus, the assumption that u is uncorrelated with $x_1$, $x_2$, $x_3^*$, and $x_3$ is not very controversial. (Another way to state this assumption is that the expected value of u, given all these variables, is zero.)

(2) The error $v_3$ is uncorrelated with $x_1$, $x_2$, and $x_3$. Assuming that $v_3$ is uncorrelated with $x_1$ and $x_2$ requires $x_3$ to be a "good" proxy for $x_3^*$. This is easiest to see by writing the analog of these assumptions in terms of conditional expectations:

$\mathrm{E}(x_3^*|x_1, x_2, x_3) = \mathrm{E}(x_3^*|x_3) = \delta_0 + \delta_3 x_3.$  (9.13)

The first equality, which is the most important one, says that, once $x_3$ is controlled for, the expected value of $x_3^*$ does not depend on $x_1$ or $x_2$. Alternatively, $x_3^*$ has zero correlation with $x_1$ and $x_2$ once $x_3$ is partialled out.

In the wage equation (9.9), where IQ is the proxy for ability, condition (9.13) becomes

$\mathrm{E}(abil|educ, exper, IQ) = \mathrm{E}(abil|IQ) = \delta_0 + \delta_3 IQ.$

Thus, the average level of ability only changes with IQ, not with educ and exper. Is this reasonable? Maybe it is not exactly true, but it may be close to being true. It is certainly worth including IQ in the wage equation to see what happens to the estimated return to education.

We can easily see why the previous assumptions are enough for the plug-in solution to work. If we plug equation (9.11) into equation (9.10) and do simple algebra, we get

$y = (\beta_0 + \beta_3\delta_0) + \beta_1 x_1 + \beta_2 x_2 + \beta_3\delta_3 x_3 + u + \beta_3 v_3.$

Call the composite error in this equation $e = u + \beta_3 v_3$; it depends on the error in the model of interest, (9.10), and the error in the proxy variable equation, $v_3$. Since u and $v_3$ both have zero mean and each is uncorrelated with $x_1$, $x_2$, and $x_3$, e also has zero mean and is uncorrelated with $x_1$, $x_2$, and $x_3$. Write this equation as

$y = \alpha_0 + \beta_1 x_1 + \beta_2 x_2 + \alpha_3 x_3 + e,$

where $\alpha_0 = \beta_0 + \beta_3\delta_0$ is the new intercept and $\alpha_3 = \beta_3\delta_3$ is the slope parameter on the proxy variable $x_3$. As we alluded to earlier, when we run the regression in (9.12), we will not get unbiased estimators of $\beta_0$ and $\beta_3$; instead, we will get unbiased (or at least consistent) estimators of $\alpha_0$, $\beta_1$, $\beta_2$, and $\alpha_3$. The important thing is that we get good estimates of the parameters $\beta_1$ and $\beta_2$. In most cases, the estimate of $\alpha_3$ is actually more interesting than an estimate of $\beta_3$ would be anyway. For example, in the wage equation, $\alpha_3$ measures the proportionate change in wage predicted by one more point on the IQ score.
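A small simulation makes the plug-in algebra concrete. All numbers below (the choices of $\beta_j$, $\delta_0$, $\delta_3$, and the correlation structure) are illustrative assumptions, not anything from the text; the point is that the plug-in regression recovers $\beta_1$ and $\beta_2$, while the intercept and the coefficient on $x_3$ estimate $\alpha_0$ and $\alpha_3$.

```python
# Simulation sketch of the plug-in solution under assumptions (1)-(2).
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
x3 = rng.normal(size=n)                        # observed proxy
x3s = 1.0 + 0.8 * x3 + rng.normal(size=n)      # (9.11): d0 = 1, d3 = .8, v3 indep.
x1 = 0.5 * x3 + rng.normal(size=n)             # x1 correlated with x3* through x3
x2 = rng.normal(size=n)
y = 2.0 + 0.5 * x1 - 0.3 * x2 + 0.7 * x3s + rng.normal(size=n)   # (9.10)

# Omitting the unobservable: the coefficient on x1 is biased upward.
X_omit = np.column_stack([np.ones(n), x1, x2])
print(np.linalg.lstsq(X_omit, y, rcond=None)[0])

# Plug-in regression of y on (x1, x2, x3): recovers beta1 = .5 and
# beta2 = -.3; the intercept estimates a0 = b0 + b3*d0 = 2.7 and the x3
# slope estimates a3 = b3*d3 = .56, not b0 and b3 themselves.
X_plug = np.column_stack([np.ones(n), x1, x2, x3])
print(np.linalg.lstsq(X_plug, y, rcond=None)[0])
```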
Example 9.3 IQ as a Proxy for Ability

The file WAGE2, from Blackburn and Neumark (1992), contains information on monthly earnings, education, several demographic variables, and IQ scores for 935 men in 1980. As a method to account for omitted ability bias, we add IQ to a standard log wage equation. The results are shown in Table 9.2.

Our primary interest is in what happens to the estimated return to education. Column (1) contains the estimates without using IQ as a proxy variable. The estimated return to education is 6.5%. If we think omitted ability is positively correlated with educ, then we assume that this estimate is too high. (More precisely, the average estimate across all random samples would be too high.)

TABLE 9.2  Dependent Variable: log(wage)

Independent Variables     (1)              (2)              (3)
educ                      .065 (.006)      .054 (.007)      .018 (.041)
exper                     .014 (.003)      .014 (.003)      .014 (.003)
tenure                    .012 (.002)      .011 (.002)      .011 (.002)
married                   .199 (.039)      .200 (.039)      .201 (.039)
south                     −.091 (.026)     −.080 (.026)     −.080 (.026)
urban                     .184 (.027)      .182 (.027)      .184 (.027)
black                     −.188 (.038)     −.143 (.039)     −.147 (.040)
IQ                        —                .0036 (.0010)    −.0009 (.0052)
educ·IQ                   —                —                .00034 (.00038)
intercept                 5.395 (.113)     5.176 (.128)     5.648 (.546)
Observations              935              935              935
R-squared                 .253             .263             .263
can be used in place of IQ or along with IQ to estimate the return to education see Computer Exercise C2 It is easy to see how using a proxy variable can still lead to bias if the proxy variable does not sat isfy the preceding assumptions Suppose that instead of 911 the unobserved variable xp 3 is related to all of the observed variables by xp 3 5 d0 1 d1x1 1 d2x2 1 d3x3 1 v3 914 where v3 has a zero mean and is uncorrelated with x1 x2 and x3 Equation 911 assumes that d1 and d2 are both zero By plugging equation 914 into 910 we get y 5 1b0 1 b3d02 1 1b1 1 b3d12x1 1 1b2 1 b3d22x2 1 b3d3x3 1 u 1 b3v3 915 from which it follows that plim1b 12 5 b1 1 b3d1 and plim1b 22 5 b2 1 b3d2 This follows because the error in 915 u 1 b3v3 has zero mean and is uncorrelated with x1 x2 and x3 In the previous example where x1 5 educ and xp 3 5 abil b3 0 so there is a positive bias inconsistency if abil has a positive partial correlation with educ 1d1 02 Thus we could still be getting an upward bias in the return to education by using IQ as a proxy for abil if IQ is not a good proxy But we can reasonably hope that this bias is smaller than if we ignored the problem of omitted ability entirely A complaint that is sometimes aired about including variables such as IQ in a regression that includes educ is that it exacerbates the problem of multicollinearity likely leading to a less precise estimate of beduc But this complaint misses two important points First the inclusion of IQ reduces the error variance because the part of ability explained by IQ has been removed from the error Typically What do you make of the small and statistically insignificant coefficient on educ in column 3 of Table 92 Hint When educ IQ is in the equation what is the interpretation of the coefficient on educ Exploring FurthEr 92 Copyright 2016 Cengage Learning All Rights Reserved May not be copied scanned or duplicated in whole or in part Due to electronic rights some third party content may be suppressed from the eBook andor eChapters Editorial review has deemed that any suppressed content does not materially affect the overall learning experience Cengage Learning reserves the right to remove additional content at any time if subsequent rights restrictions require it CHAPTER 9 More on Specification and Data Issues 283 this will be reflected in a smaller standard error of the regression although it need not get smaller because of its degreesoffreedom adjustment Second and most importantly the added multicol linearity is a necessary evil if we want to get an estimator of beduc with less bias the reason educ and IQ are correlated is that educ and abil are thought to be correlated and IQ is a proxy for abil If we could observe abil we would include it in the regression and of course there would be unavoidable multicollinearity caused by correlation between educ and abil Proxy variables can come in the form of binary information as well In Example 79 see equation 715 we discussed Kruegers 1993 estimates of the return to using a computer on the job Krueger also included a binary variable indicating whether the worker uses a computer at home as well as an interaction term between computer usage at work and at home His primary reason for including computer usage at home in the equation was to proxy for unobserved technical ability that could affect wage directly and be related to computer usage at work 92a Using Lagged Dependent Variables as Proxy Variables In some applications like the earlier wage example we have at least a vague idea about which 
9.2a Using Lagged Dependent Variables as Proxy Variables

In some applications, like the earlier wage example, we have at least a vague idea about which unobserved factor we would like to control for. This facilitates choosing proxy variables. In other applications, we suspect that one or more of the independent variables is correlated with an omitted variable, but we have no idea how to obtain a proxy for that omitted variable. In such cases, we can include, as a control, the value of the dependent variable from an earlier time period. This is especially useful for policy analysis.

Using a lagged dependent variable in a cross-sectional equation increases the data requirements, but it also provides a simple way to account for historical factors that cause current differences in the dependent variable that are difficult to account for in other ways. For example, some cities have had high crime rates in the past. Many of the same unobserved factors contribute to both high current and past crime rates. Likewise, some universities are traditionally better in academics than other universities. Inertial effects are also captured by putting in lags of y.

Consider a simple equation to explain city crime rates:

$crime = \beta_0 + \beta_1 unem + \beta_2 expend + \beta_3 crime_{-1} + u,$  (9.16)

where crime is a measure of per capita crime, unem is the city unemployment rate, expend is per capita spending on law enforcement, and $crime_{-1}$ indicates the crime rate measured in some earlier year (this could be the past year or several years ago). We are interested in the effects of unem on crime, as well as of law enforcement expenditures on crime.

What is the purpose of including $crime_{-1}$ in the equation? Certainly we expect that $\beta_3 > 0$, because crime has inertia. But the main reason for putting this in the equation is that cities with high historical crime rates may spend more on crime prevention. Thus, factors unobserved to us (the econometricians) that affect crime are likely to be correlated with expend (and unem). If we use a pure cross-sectional analysis, we are unlikely to get an unbiased estimator of the causal effect of law enforcement expenditures on crime. But, by including $crime_{-1}$ in the equation, we can at least do the following experiment: if two cities have the same previous crime rate and current unemployment rate, then $\beta_2$ measures the effect of another dollar of law enforcement on crime.

Example 9.4 City Crime Rates

We estimate a constant elasticity version of the crime model in equation (9.16) (unem, because it is a percentage, is left in level form). The data in CRIME2 are from 46 cities for the year 1987. The crime rate is also available for 1982, and we use that as an additional independent variable in trying to control for city unobservables that affect crime and may be correlated with current law enforcement expenditures. Table 9.3 contains the results.

TABLE 9.3  Dependent Variable: log(crmrte87)

Independent Variables     (1)              (2)
unem87                    −.029 (.032)     .009 (.020)
log(lawexpc87)            .203 (.173)      −.140 (.109)
log(crmrte82)             —                1.194 (.132)
intercept                 3.34 (1.25)      .076 (.821)
Observations              46               46
R-squared                 .057             .680

Without the lagged crime rate in the equation, the effects of the unemployment rate and expenditures on law enforcement are counterintuitive; neither is statistically significant, although the t statistic on log(lawexpc87) is 1.17. One possibility is that increased law enforcement expenditures improve reporting conventions, and so more crimes are reported. But it is also likely that cities with high recent crime rates spend more on law enforcement.
Adding the log of the crime rate from five years earlier has a large effect on the expenditures coefficient. The elasticity of the crime rate with respect to expenditures becomes −.14, with t = −1.28. This is not strongly significant, but it suggests that a more sophisticated model with more cities in the sample could produce significant results.

Not surprisingly, the current crime rate is strongly related to the past crime rate. The estimate indicates that, if the crime rate in 1982 was 1% higher, then the crime rate in 1987 is predicted to be about 1.19% higher. We cannot reject the hypothesis that the elasticity of current crime with respect to past crime is unity [t = (1.194 − 1)/.132 ≈ 1.47]. Adding the past crime rate increases the explanatory power of the regression markedly, but this is no surprise. The primary reason for including the lagged crime rate is to obtain a better estimate of the ceteris paribus effect of log(lawexpc87) on log(crmrte87).

The practice of putting in a lagged y as a general way of controlling for unobserved variables is hardly perfect. But it can aid in getting a better estimate of the effects of policy variables on various outcomes. When the data are available, additional lags also can be included.

Adding lagged values of y is not the only way to use two years of data to control for omitted factors. When we discuss panel data methods in Chapters 13 and 14, we will cover other ways to use repeated data on the same cross-sectional units at different points in time.

9.2b A Different Slant on Multiple Regression

The discussion of proxy variables in this section suggests an alternative way of interpreting a multiple regression analysis when we do not necessarily observe all relevant explanatory variables. Until now, we have specified the population model of interest with an additive error, as in equation (9.9). Our discussion of that example hinged upon whether we have a suitable proxy variable (IQ score in this case, other test scores more generally) for the unobserved explanatory variable, which we called "ability."

A less structured, more general approach to multiple regression is to forego specifying models with unobservables. Rather, we begin with the premise that we have access to a set of observable explanatory variables, which includes the variable of primary interest, such as years of schooling,
estimate the ceteris paribus effect of education holding IQ and the other observed factors fixed There is no need to discuss whether IQ is a suitable proxy for ability Consequently while we may not be answering the question underlying equation 99 we are answering a question of interest if two people have the same IQ levels and same values of experience tenure and so on yet they differ in education levels by a year what is the expected difference in their log wages As another example if we include as an explanatory variable the poverty rate in a schoollevel regression to assess the effects of spending on standardized test scores we should recognize that the poverty rate only crudely captures the relevant differences in children and parents across schools But often it is all we have and it is better to control for the poverty rate than to do nothing because we cannot find suitable proxies for student ability parental involvement and so on Almost certainly controlling for the poverty rate gets us closer to the ceteris paribus effects of spending than if we leave the poverty rate out of the analysis In some applications of regression analysis we are interested simply in predicting the outcome y given a set of explanatory variables 1x1 p xk2 In such cases it makes little sense to think in terms of bias in estimated coefficients due to omitted variables Instead we should focus on obtaining a model that predicts as well as possible and make sure we do not include as regressors variables that cannot be observed at the time of prediction For example an admissions officer for a college or university might be interested in predicting success in college as measured by grade point average in terms of variables that can be measured at application time Those variables would include high school performance maybe just grade point average but perhaps performance in specific kinds of courses standardized test scores participation in various activities such as debate or math club and even family background variables We would not include a variable measuring college class attend ance because we do not observe attendance in college at application time Nor would we wring our hands about potential biases caused by omitting an attendance variable we have no interest in say measuring the effect of high school GPA holding attendance in college fixed Likewise we would not worry about biases in coefficients because we cannot observe factors such as motivation Naturally for predictive purposes it would probably help substantially if we had a measure of motivation but in its absence we fit the best model we can with observed explanatory variables 93 Models with Random Slopes In our treatment of regression so far we have assumed that the slope coefficients are the same across individuals in the population or that if the slopes differ they differ by measurable characteristics in which case we are led to regression models containing interaction terms For example as we saw in Section 74 we can allow the return to education to differ by men and women by interacting educa tion with a gender dummy in a log wage equation Here we are interested in a related but different question What if the partial effect of a variable depends on unobserved factors that vary by population unit If we have only one explanatory vari able x we can write a general model for a random draw i from the population for emphasis as yi 5 ai 1 bixi 917 where ai is the intercept for unit i and bi is the slope In the simple regression model from Chapter 2 we 
In the simple regression model from Chapter 2, we assumed $b_i = \beta$ and labeled $a_i$ as the error, $u_i$. The model in (9.17) is sometimes called a random coefficient model or random slope model, because the unobserved slope coefficient, $b_i$, is viewed as a random draw from the population, along with the observed data, $(x_i, y_i)$, and the unobserved intercept, $a_i$. As an example, if $y_i = \log(wage_i)$ and $x_i = educ_i$, then (9.17) allows the return to education, $b_i$, to vary by person. If, say, $b_i$ contains unmeasured ability (just as $a_i$ would), the partial effect of another year of schooling can depend on ability.

With a random sample of size n, we (implicitly) draw n values of $b_i$, along with n values of $a_i$ (and the observed data on x and y). Naturally, we cannot estimate a slope (or, for that matter, an intercept) for each i. But we can hope to estimate the average slope (and average intercept), where the average is across the population. Therefore, define $\alpha = \mathrm{E}(a_i)$ and $\beta = \mathrm{E}(b_i)$. Then $\beta$ is the average of the partial effect of x on y, and so we call $\beta$ the average partial effect (APE), or the average marginal effect (AME). In the context of a log wage equation, $\beta$ is the average return to a year of schooling in the population.

If we write $a_i = \alpha + c_i$ and $b_i = \beta + d_i$, then $d_i$ is the individual-specific deviation from the APE. By construction, $\mathrm{E}(c_i) = 0$ and $\mathrm{E}(d_i) = 0$. Substituting into (9.17) gives

$y_i = \alpha + \beta x_i + c_i + d_i x_i \equiv \alpha + \beta x_i + u_i,$  (9.18)

where $u_i = c_i + d_i x_i$. (To make the notation easier to follow, we now use $\alpha$, the mean value of $a_i$, as the intercept, and $\beta$, the mean of $b_i$, as the slope.) In other words, we can write the random coefficient model as a constant coefficient model, but where the error term contains an interaction between an unobservable, $d_i$, and the observed explanatory variable, $x_i$.

When would a simple regression of $y_i$ on $x_i$ provide an unbiased estimate of $\beta$ (and $\alpha$)? We can apply the result for unbiasedness from Chapter 2: if $\mathrm{E}(u_i|x_i) = 0$, then OLS is generally unbiased. When $u_i = c_i + d_i x_i$, sufficient is $\mathrm{E}(c_i|x_i) = \mathrm{E}(c_i) = 0$ and $\mathrm{E}(d_i|x_i) = \mathrm{E}(d_i) = 0$. We can write these in terms of the unit-specific intercept and slope as

$\mathrm{E}(a_i|x_i) = \mathrm{E}(a_i)$ and $\mathrm{E}(b_i|x_i) = \mathrm{E}(b_i);$  (9.19)

that is, $a_i$ and $b_i$ are both mean independent of $x_i$. This is a useful finding: if we allow for unit-specific slopes, OLS consistently estimates the population average of those slopes when they are mean independent of the explanatory variable. (See Problem 6 for a weaker set of conditions that imply consistency of OLS.)

The error term in (9.18) almost certainly contains heteroskedasticity. In fact, if $\mathrm{Var}(c_i|x_i) = \sigma_c^2$, $\mathrm{Var}(d_i|x_i) = \sigma_d^2$, and $\mathrm{Cov}(c_i, d_i|x_i) = 0$, then

$\mathrm{Var}(u_i|x_i) = \sigma_c^2 + \sigma_d^2 x_i^2,$  (9.20)

and so there must be heteroskedasticity in $u_i$ unless $\sigma_d^2 = 0$, which means $b_i = \beta$ for all i. We know how to account for heteroskedasticity of this kind. We can use OLS and compute heteroskedasticity-robust standard errors and test statistics, or we can estimate the variance function in (9.20) and apply weighted least squares. (Of course, the latter strategy imposes homoskedasticity on the random intercept and slope, and so we would want to make a WLS analysis fully robust to violations of (9.20).)
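A short simulation illustrates both claims: OLS recovers the average intercept and slope when (9.19) holds, and robust standard errors handle the heteroskedasticity in (9.20). All parameter values below are illustrative assumptions.

```python
# Simulation sketch of the random coefficient model (9.17)-(9.18).
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 50_000
x = rng.normal(loc=2.0, size=n)
a_i = 1.0 + rng.normal(scale=0.5, size=n)   # alpha = 1.0 plus c_i
b_i = 0.3 + rng.normal(scale=0.2, size=n)   # beta = 0.3 plus d_i, indep. of x
y = a_i + b_i * x                           # (9.17)

# OLS with heteroskedasticity-robust standard errors, since
# Var(u|x) = sigma_c^2 + sigma_d^2 * x^2 as in (9.20).
res = sm.OLS(y, sm.add_constant(x)).fit(cov_type="HC1")
print(res.params)   # approximately (1.0, 0.3): the average intercept and slope
print(res.bse)      # robust standard errors
```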
Because of equation (9.20), some authors like to view heteroskedasticity in regression models generally as arising from random slope coefficients. But we should remember that the form of (9.20) is special, and it does not allow for heteroskedasticity in $a_i$ or $b_i$. We cannot convincingly distinguish between a random slope model, where the intercept and slope are independent of $x_i$, and a constant slope model with heteroskedasticity in $a_i$.

The treatment for multiple regression is similar. Generally, write

$y_i = a_i + b_{i1}x_{i1} + b_{i2}x_{i2} + \dots + b_{ik}x_{ik}.$  (9.21)

Then, by writing $a_i = \alpha + c_i$ and $b_{ij} = \beta_j + d_{ij}$, we have

$y_i = \alpha + \beta_1 x_{i1} + \dots + \beta_k x_{ik} + u_i,$  (9.22)

where $u_i = c_i + d_{i1}x_{i1} + \dots + d_{ik}x_{ik}$. If we maintain the mean independence assumptions $\mathrm{E}(a_i|\mathbf{x}_i) = \mathrm{E}(a_i)$ and $\mathrm{E}(b_{ij}|\mathbf{x}_i) = \mathrm{E}(b_{ij})$, $j = 1, \dots, k$, then $\mathrm{E}(y_i|\mathbf{x}_i) = \alpha + \beta_1 x_{i1} + \dots + \beta_k x_{ik}$, and so OLS using a random sample produces unbiased estimators of $\alpha$ and the $\beta_j$. As in the simple regression case, $\mathrm{Var}(u_i|\mathbf{x}_i)$ is almost certainly heteroskedastic.

We can allow the $b_{ij}$ to depend on observable explanatory variables as well as unobservables. For example, suppose with $k = 2$ the effect of $x_{i2}$ depends on $x_{i1}$, and we write $b_{i2} = \beta_2 + d_1(x_{i1} - \mu_1) + d_{i2}$, where $\mu_1 = \mathrm{E}(x_{i1})$. If we assume $\mathrm{E}(d_{i2}|\mathbf{x}_i) = 0$ (and similarly for $c_i$ and $d_{i1}$), then $\mathrm{E}(y_i|x_{i1}, x_{i2}) = \alpha + \beta_1 x_{i1} + \beta_2 x_{i2} + d_1(x_{i1} - \mu_1)x_{i2}$, which means we have an interaction between $x_{i1}$ and $x_{i2}$. Because we have subtracted the mean, $\mu_1$, from $x_{i1}$, $\beta_2$ is the APE of $x_{i2}$.

The bottom line of this section is that allowing for random slopes is fairly straightforward if the slopes are independent, or at least mean independent, of the explanatory variables. In addition, we can easily model the slopes as functions of the exogenous variables, which leads to models with squares and interactions. Of course, in Chapter 6, we discussed how such models can be useful without ever introducing the notion of a random slope. The random slopes specification provides a separate justification for such models. Estimation becomes considerably more difficult if the random intercept, as well as some slopes, are correlated with some of the regressors. We cover the problem of endogenous explanatory variables in Chapter 15.
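The demeaning trick above is easy to implement. A minimal sketch under placeholder names (df with y, x1, x2): interacting x2 with the demeaned x1 leaves the coefficient on x2 interpretable as the APE of x2.

```python
# Demeaned interaction so the x2 coefficient is the average partial effect.
import statsmodels.formula.api as smf

df["x1_dm"] = df["x1"] - df["x1"].mean()
res = smf.ols("y ~ x1 + x2 + x1_dm:x2", data=df).fit(cov_type="HC1")
print(res.params["x2"])   # estimated APE of x2
```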
structure to the omit ted variableproxy variable problem discussed in the previous section but they are conceptually dif ferent In the proxy variable case we are looking for a variable that is somehow associated with the unobserved variable In the measurement error case the variable that we do not observe has a welldefined quantitative meaning such as a marginal tax rate or annual income but our recorded measures of it may contain error For example reported annual income is a measure of actual annual income whereas IQ score is a proxy for ability Another important difference between the proxy variable and measurement error problems is that in the latter case often the mismeasured independent variable is the one of primary interest In the proxy variable case the partial effect of the omitted variable is rarely of central interest we are usually concerned with the effects of the other independent variables Before we consider details we should remember that measurement error is an issue only when the variables for which the econometrician can collect data differ from the variables that influence decisions by individuals families firms and so on 94a Measurement Error in the Dependent Variable We begin with the case where only the dependent variable is measured with error Let yp denote the variable in the population as always that we would like to explain For example yp could be annual family savings The regression model has the usual form yp 5 b0 1 b1x1 1 p 1 bkxk 1 u 923 Copyright 2016 Cengage Learning All Rights Reserved May not be copied scanned or duplicated in whole or in part Due to electronic rights some third party content may be suppressed from the eBook andor eChapters Editorial review has deemed that any suppressed content does not materially affect the overall learning experience Cengage Learning reserves the right to remove additional content at any time if subsequent rights restrictions require it PART 1 Regression Analysis with CrossSectional Data 288 and we assume it satisfies the GaussMarkov assumptions We let y represent the observable measure of yp In the savings case y is reported annual savings Unfortunately families are not perfect in their reporting of annual family savings it is easy to leave out categories or to overestimate the amount contributed to a fund Generally we can expect y and yp to differ at least for some subset of families in the population The measurement error in the population is defined as the difference between the observed value and the actual value e0 5 y 2 yp 924 For a random draw i from the population we can write ei0 5 yi 2 yp i but the important thing is how the measurement error in the population is related to other factors To obtain an estimable model we write yp 5 y 2 e0 plug this into equation 923 and rearrange y 5 b0 1 b1x1 1 p 1 bkxk 1 u 1 e0 925 The error term in equation 925 is u 1 e0 Because y x1 x2 p xk are observed we can estimate this model by OLS In effect we just ignore the fact that y is an imperfect measure of yp and proceed as usual When does OLS with y in place of yp produce consistent estimators of the bj Since the original model 923 satisfies the GaussMarkov assumptions u has zero mean and is uncorrelated with each xj It is only natural to assume that the measurement error has zero mean if it does not then we simply get a biased estimator of the intercept b0 which is rarely a cause for concern Of much more importance is our assumption about the relationship between the measurement error e0 and the explanatory variables xj The 
The usual assumption is that the measurement error in $y$ is statistically independent of each explanatory variable. If this is true, then the OLS estimators from (9.25) are unbiased and consistent. Further, the usual OLS inference procedures ($t$, $F$, and $LM$ statistics) are valid.

If $e_0$ and $u$ are uncorrelated, as is usually assumed, then $\mathrm{Var}(u + e_0) = \sigma_u^2 + \sigma_0^2 > \sigma_u^2$. This means that measurement error in the dependent variable results in a larger error variance than when no error occurs; this, of course, results in larger variances of the OLS estimators. This is to be expected, and there is nothing we can do about it (except collect better data). The bottom line is that, if the measurement error is uncorrelated with the independent variables, then OLS estimation has good properties.

Example 9.5 Savings Function with Measurement Error

Consider a savings function

$sav^* = \beta_0 + \beta_1 inc + \beta_2 size + \beta_3 educ + \beta_4 age + u$,

but where actual savings ($sav^*$) may deviate from reported savings ($sav$). The question is whether the size of the measurement error in $sav$ is systematically related to the other variables. It might be reasonable to assume that the measurement error is not correlated with inc, size, educ, and age. On the other hand, we might think that families with higher incomes, or more education, report their savings more accurately. We can never know whether the measurement error is correlated with inc or educ, unless we can collect data on $sav^*$; then, the measurement error can be computed for each observation as $e_{i0} = sav_i - sav_i^*$.

When the dependent variable is in logarithmic form, so that $\log(y^*)$ is the dependent variable, it is natural for the measurement error equation to be of the form

$\log(y) = \log(y^*) + e_0$    (9.26)

This follows from a multiplicative measurement error for $y$: $y = y^* a_0$, where $a_0 > 0$ and $e_0 = \log(a_0)$.

Example 9.6 Measurement Error in Scrap Rates

In Section 7-6, we discussed an example where we wanted to determine whether job training grants reduce the scrap rate in manufacturing firms. We certainly might think the scrap rate reported by firms is measured with error. (In fact, most firms in the sample do not even report a scrap rate.) In a simple regression framework, this is captured by

$\log(scrap^*) = \beta_0 + \beta_1 grant + u$,

where $scrap^*$ is the true scrap rate and grant is the dummy variable indicating whether a firm received a grant. The measurement error equation is

$\log(scrap) = \log(scrap^*) + e_0$.

Is the measurement error, $e_0$, independent of whether the firm receives a grant? A cynical person might think that a firm receiving a grant is more likely to underreport its scrap rate in order to make the grant look effective. If this happens, then, in the estimable equation

$\log(scrap) = \beta_0 + \beta_1 grant + u + e_0$,

the error $u + e_0$ is negatively correlated with grant. This would produce a downward bias in $\beta_1$, which would tend to make the training program look more effective than it actually was. (Remember, a more negative $\beta_1$ means the program was more effective, since increased worker productivity is associated with a lower scrap rate.)

The bottom line of this subsection is that measurement error in the dependent variable can cause biases in OLS if it is systematically related to one or more of the explanatory variables. If the measurement error is just a random reporting error that is independent of the explanatory variables, as is often assumed, then OLS is perfectly appropriate.
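The claim that well-behaved measurement error in $y$ only inflates the error variance is easy to check by simulation. The following is a minimal sketch, not from the text; all variable names and parameter values are illustrative assumptions:

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 5000
x = rng.normal(size=n)
u = rng.normal(size=n)                  # structural error in (9.23)
e0 = rng.normal(scale=2.0, size=n)      # reporting error, independent of x and u

y_star = 1.0 + 0.5 * x + u              # true y*, with beta1 = 0.5
y = y_star + e0                         # observed y, as in (9.24)

X = sm.add_constant(x)
fit_star = sm.OLS(y_star, X).fit()      # infeasible regression on y*
fit_obs = sm.OLS(y, X).fit()            # feasible regression on mismeasured y

# Both slopes are close to 0.5; the mismeasured y just has a larger standard error.
print(fit_star.params[1], fit_star.bse[1])
print(fit_obs.params[1], fit_obs.bse[1])

Both slope estimates are centered on the true value; only the precision differs, matching $\mathrm{Var}(u + e_0) = \sigma_u^2 + \sigma_0^2$.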
9-4b Measurement Error in an Explanatory Variable

Traditionally, measurement error in an explanatory variable has been considered a much more important problem than measurement error in the dependent variable. In this subsection, we will see why this is the case.

We begin with the simple regression model

$y = \beta_0 + \beta_1 x_1^* + u$    (9.27)

and we assume that this satisfies at least the first four Gauss-Markov assumptions. This means that estimation of (9.27) by OLS would produce unbiased and consistent estimators of $\beta_0$ and $\beta_1$. The problem is that $x_1^*$ is not observed. Instead, we have a measure of $x_1^*$; call it $x_1$. For example, $x_1^*$ could be actual income, and $x_1$ could be reported income.

The measurement error in the population is simply

$e_1 = x_1 - x_1^*$    (9.28)

and this can be positive, negative, or zero. We assume that the average measurement error in the population is zero: $E(e_1) = 0$. This is natural, and, in any case, it does not affect the important conclusions that follow. A maintained assumption in what follows is that $u$ is uncorrelated with $x_1^*$ and $x_1$. In conditional expectation terms, we can write this as $E(y|x_1^*, x_1) = E(y|x_1^*)$, which just says that $x_1$ does not affect $y$ after $x_1^*$ has been controlled for. We used the same assumption in the proxy variable case, and it is not controversial; it holds almost by definition.

We want to know the properties of OLS if we simply replace $x_1^*$ with $x_1$ and run the regression of $y$ on $x_1$. They depend crucially on the assumptions we make about the measurement error. Two assumptions have been the focus in the econometrics literature, and they both represent polar extremes. The first assumption is that $e_1$ is uncorrelated with the observed measure, $x_1$:

$\mathrm{Cov}(x_1, e_1) = 0$    (9.29)

From the relationship in (9.28), if assumption (9.29) is true, then $e_1$ must be correlated with the unobserved variable $x_1^*$. To determine the properties of OLS in this case, we write $x_1^* = x_1 - e_1$ and plug this into equation (9.27):

$y = \beta_0 + \beta_1 x_1 + (u - \beta_1 e_1)$    (9.30)

Because we have assumed that $u$ and $e_1$ both have zero mean and are uncorrelated with $x_1$, $u - \beta_1 e_1$ has zero mean and is uncorrelated with $x_1$. It follows that OLS estimation with $x_1$ in place of $x_1^*$ produces a consistent estimator of $\beta_1$ (and also $\beta_0$). Since $u$ is uncorrelated with $e_1$, the variance of the error in (9.30) is $\mathrm{Var}(u - \beta_1 e_1) = \sigma_u^2 + \beta_1^2 \sigma_{e_1}^2$. Thus, except when $\beta_1 = 0$, measurement error increases the error variance. But this does not affect any of the OLS properties (except that the variances of the $\hat\beta_j$ will be larger than if we observe $x_1^*$ directly).

The assumption that $e_1$ is uncorrelated with $x_1$ is analogous to the proxy variable assumption we made in Section 9-2. Since this assumption implies that OLS has all of its nice properties, this is not usually what econometricians have in mind when they refer to measurement error in an explanatory variable.
The classical errors-in-variables (CEV) assumption is that the measurement error is uncorrelated with the unobserved explanatory variable:

$\mathrm{Cov}(x_1^*, e_1) = 0$    (9.31)

This assumption comes from writing the observed measure as the sum of the true explanatory variable and the measurement error, $x_1 = x_1^* + e_1$, and then assuming the two components of $x_1$ are uncorrelated. (This has nothing to do with assumptions about $u$; we always maintain that $u$ is uncorrelated with $x_1^*$ and $x_1$, and therefore with $e_1$.)

If assumption (9.31) holds, then $x_1$ and $e_1$ must be correlated:

$\mathrm{Cov}(x_1, e_1) = E(x_1 e_1) = E(x_1^* e_1) + E(e_1^2) = 0 + \sigma_{e_1}^2 = \sigma_{e_1}^2$    (9.32)

Thus, the covariance between $x_1$ and $e_1$ is equal to the variance of the measurement error under the CEV assumption.

Referring to equation (9.30), we can see that correlation between $x_1$ and $e_1$ is going to cause problems. Because $u$ and $x_1$ are uncorrelated, the covariance between $x_1$ and the composite error $u - \beta_1 e_1$ is

$\mathrm{Cov}(x_1, u - \beta_1 e_1) = -\beta_1 \mathrm{Cov}(x_1, e_1) = -\beta_1 \sigma_{e_1}^2$.

Thus, in the CEV case, the OLS regression of $y$ on $x_1$ gives a biased and inconsistent estimator. Using the asymptotic results in Chapter 5, we can determine the amount of inconsistency in OLS. The probability limit of $\hat\beta_1$ is $\beta_1$ plus the ratio of the covariance between $x_1$ and $u - \beta_1 e_1$ and the variance of $x_1$:

$\mathrm{plim}(\hat\beta_1) = \beta_1 + \dfrac{\mathrm{Cov}(x_1, u - \beta_1 e_1)}{\mathrm{Var}(x_1)} = \beta_1 - \dfrac{\beta_1 \sigma_{e_1}^2}{\sigma_{x_1^*}^2 + \sigma_{e_1}^2} = \beta_1 \left( 1 - \dfrac{\sigma_{e_1}^2}{\sigma_{x_1^*}^2 + \sigma_{e_1}^2} \right) = \beta_1 \left( \dfrac{\sigma_{x_1^*}^2}{\sigma_{x_1^*}^2 + \sigma_{e_1}^2} \right)$    (9.33)

where we have used the fact that $\mathrm{Var}(x_1) = \mathrm{Var}(x_1^*) + \mathrm{Var}(e_1)$.

Equation (9.33) is very interesting. The term multiplying $\beta_1$, which is the ratio $\mathrm{Var}(x_1^*)/\mathrm{Var}(x_1)$, is always less than one [an implication of the CEV assumption (9.31)]. Thus, $\mathrm{plim}(\hat\beta_1)$ is always closer to zero than is $\beta_1$. This is called the attenuation bias in OLS due to CEV: on average (or in large samples), the estimated OLS effect will be attenuated. In particular, if $\beta_1$ is positive, $\hat\beta_1$ will tend to underestimate $\beta_1$. This is an important conclusion, but it relies on the CEV setup.

If the variance of $x_1^*$ is large relative to the variance in the measurement error, then the inconsistency in OLS will be small. This is because $\mathrm{Var}(x_1^*)/\mathrm{Var}(x_1)$ will be close to unity when $\sigma_{x_1^*}^2/\sigma_{e_1}^2$ is large. Therefore, depending on how much variation there is in $x_1^*$ relative to $e_1$, measurement error need not cause large biases.

Things are more complicated when we add more explanatory variables. For illustration, consider the model

$y = \beta_0 + \beta_1 x_1^* + \beta_2 x_2 + \beta_3 x_3 + u$    (9.34)

where the first of the three explanatory variables is measured with error. We make the natural assumption that $u$ is uncorrelated with $x_1^*, x_2, x_3$, and $x_1$. Again, the crucial assumption concerns the measurement error $e_1$. In almost all cases, $e_1$ is assumed to be uncorrelated with $x_2$ and $x_3$, the explanatory variables not measured with error. The key issue is whether $e_1$ is uncorrelated with $x_1$. If it is, then the OLS regression of $y$ on $x_1, x_2$, and $x_3$ produces consistent estimators. This is easily seen by writing

$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + u - \beta_1 e_1$    (9.35)

where $u$ and $e_1$ are both uncorrelated with all the explanatory variables.
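A quick simulation makes the attenuation factor in (9.33) concrete. This is a sketch under assumed values ($\sigma_{x_1^*}^2 = 2$, $\sigma_{e_1}^2 = 1$, $\beta_1 = 0.5$), not anything computed in the text:

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 200000
sig2_xstar, sig2_e, beta1 = 2.0, 1.0, 0.5

x_star = rng.normal(scale=np.sqrt(sig2_xstar), size=n)
e1 = rng.normal(scale=np.sqrt(sig2_e), size=n)    # CEV: e1 uncorrelated with x_star
x1 = x_star + e1                                  # observed, mismeasured regressor
y = 1.0 + beta1 * x_star + rng.normal(size=n)

b1_hat = sm.OLS(y, sm.add_constant(x1)).fit().params[1]
print(b1_hat)                                     # approx. 0.5 * 2/(2 + 1) = 0.333
print(beta1 * sig2_xstar / (sig2_xstar + sig2_e)) # the plim from equation (9.33)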
Under the CEV assumption in (9.31), OLS will be biased and inconsistent, because $e_1$ is correlated with $x_1$ in equation (9.35). Remember, this means that, in general, all OLS estimators will be biased, not just $\hat\beta_1$. What about the attenuation bias derived in equation (9.33)? It turns out that there is still an attenuation bias for estimating $\beta_1$: it can be shown that

$\mathrm{plim}(\hat\beta_1) = \beta_1 \left( \dfrac{\sigma_{r_1^*}^2}{\sigma_{r_1^*}^2 + \sigma_{e_1}^2} \right)$    (9.36)

where $r_1^*$ is the population error in the equation $x_1^* = \alpha_0 + \alpha_1 x_2 + \alpha_2 x_3 + r_1^*$. Formula (9.36) also works in the general $k$ variable case when $x_1$ is the only mismeasured variable.

Things are less clear-cut for estimating the $\beta_j$ on the variables not measured with error. In the special case that $x_1^*$ is uncorrelated with $x_2$ and $x_3$, $\hat\beta_2$ and $\hat\beta_3$ are consistent. But this is rare in practice. Generally, measurement error in a single variable causes inconsistency in all estimators. Unfortunately, the sizes, and even the directions, of the biases are not easily derived.

Example 9.7 GPA Equation with Measurement Error

Consider the problem of estimating the effect of family income on college grade point average, after controlling for hsGPA (high school grade point average) and SAT (scholastic aptitude test). It could be that, though family income is important for performance before college, it has no direct effect on college performance. To test this, we might postulate the model

$colGPA = \beta_0 + \beta_1 faminc^* + \beta_2 hsGPA + \beta_3 SAT + u$,

where $faminc^*$ is actual annual family income. (This might appear in logarithmic form, but for the sake of illustration, we leave it in level form.) Precise data on colGPA, hsGPA, and SAT are relatively easy to obtain. But family income, especially as reported by students, could be easily mismeasured. If $faminc = faminc^* + e_1$ and the CEV assumptions hold, then using reported family income in place of actual family income will bias the OLS estimator of $\beta_1$ toward zero. One consequence of the downward bias is that a test of $H_0: \beta_1 = 0$ will have less chance of detecting $\beta_1 > 0$.

Of course, measurement error can be present in more than one explanatory variable, or in some explanatory variables and the dependent variable. As we discussed earlier, any measurement error in the dependent variable is usually assumed to be uncorrelated with all the explanatory variables, whether it is observed or not. Deriving the bias in the OLS estimators under extensions of the CEV assumptions is complicated and does not lead to clear results.

In some cases, it is clear that the CEV assumption in (9.31) cannot be true. Consider a variant on Example 9.7:

$colGPA = \beta_0 + \beta_1 smoked^* + \beta_2 hsGPA + \beta_3 SAT + u$,

where $smoked^*$ is the actual number of times a student smoked marijuana in the last 30 days. The variable smoked is the answer to the question: "On how many separate occasions did you smoke marijuana in the last 30 days?" Suppose we postulate the standard measurement error model

$smoked = smoked^* + e_1$.

Even if we assume that students try to report the truth, the CEV assumption is unlikely to hold. People who do not smoke marijuana at all, so that $smoked^* = 0$, are likely to report $smoked = 0$,
so the measurement error is probably zero for students who never smoke marijuana. When $smoked^* > 0$, it is much more likely that the student miscounts how many times he or she smoked marijuana in the last 30 days. This means that the measurement error, $e_1$, and the actual number of times smoked, $smoked^*$, are correlated, which violates the CEV assumption in (9.31). Unfortunately, deriving the implications of measurement error that do not satisfy (9.29) or (9.31) is difficult and beyond the scope of this text.

Before leaving this section, we emphasize that the CEV assumption (9.31), while more believable than assumption (9.29), is still a strong assumption. The truth is probably somewhere in between, and if $e_1$ is correlated with both $x_1^*$ and $x_1$, OLS is inconsistent. This raises an important question: must we live with inconsistent estimators under CEV, or other kinds of measurement error that are correlated with $x_1$? Fortunately, the answer is no. Chapter 15 shows how, under certain assumptions, the parameters can be consistently estimated in the presence of general measurement error. We postpone this discussion until later because it requires us to leave the realm of OLS estimation. (See Problem 7 for how multiple measures can be used to reduce the attenuation bias.)

Exploring Further 9.3
Let $educ^*$ be actual amount of schooling, measured in years (which can be a noninteger), and let educ be reported highest grade completed. Do you think educ and $educ^*$ are related by the CEV model?

9-5 Missing Data, Nonrandom Samples, and Outlying Observations

The measurement error problem discussed in the previous section can be viewed as a data problem: we cannot obtain data on the variables of interest. Further, under the CEV model, the composite error term is correlated with the mismeasured independent variable, violating the Gauss-Markov assumptions. Another data problem we discussed frequently in earlier chapters is multicollinearity among the explanatory variables. Remember that correlation among the explanatory variables does not violate any assumptions. When two independent variables are highly correlated, it can be difficult to estimate the partial effect of each. But this is properly reflected in the usual OLS statistics.

In this section, we provide an introduction to data problems that can violate the random sampling assumption, MLR.2. We can isolate cases in which nonrandom sampling has no practical effect on OLS. In other cases, nonrandom sampling causes the OLS estimators to be biased and inconsistent. A more complete treatment that establishes several of the claims made here is given in Chapter 17.

9-5a Missing Data

The missing data problem can arise in a variety of forms. Often, we collect a random sample of people, schools, cities, and so on, and then discover later that information is missing on some key variables for several units in the sample. For example, in the data set BWGHT, 196 of the 1,388 observations have no information on father's education. In the data set on median starting law school salaries, LAWSCH85, six of the 156 schools have no reported information on median LSAT scores
for the entering class; other variables are also missing for some of the law schools.

If data are missing for an observation on either the dependent variable or one of the independent variables, then the observation cannot be used in a standard multiple regression analysis. In fact, provided missing data have been properly indicated, all modern regression packages keep track of missing data and simply ignore observations when computing a regression. We saw this explicitly in the birth weight scenario in Example 4.9, when 197 observations were dropped due to missing information on parents' education.

In the literature on missing data, an estimator that uses only observations with a complete set of data on $y$ and $x_1, \ldots, x_k$ is called a complete cases estimator. As mentioned earlier, this estimator is computed as the default for OLS and all estimators covered later in the text. Other than reducing the sample size, are there any statistical consequences of using the OLS estimator and ignoring the missing data? If, in the language of the missing data literature [see, for example, Little and Rubin (2002, Chapter 1)], the data are missing completely at random (sometimes called MCAR), then missing data cause no statistical problems. The MCAR assumption implies that the reason the data are missing is independent, in a statistical sense, of both the observed and unobserved factors affecting $y$. In effect, we can still assume that the data have been obtained by random sampling from the population, so that Assumption MLR.2 continues to hold.

When MCAR holds, there are ways to use partial information obtained from units that are dropped from the complete case estimation. Suppose, for example, that for a multiple regression model, data are always available for $y$ and $x_1, x_2, \ldots, x_{k-1}$ but are sometimes missing for the explanatory variable $x_k$. A common solution is to create two new variables. For a unit $i$, the first variable, say $z_{ik}$, is defined to be $x_{ik}$ when $x_{ik}$ is observed and zero otherwise. The second variable is a missing data indicator, say $m_{ik}$, which equals one when $x_{ik}$ is missing and equals zero when $x_{ik}$ is observed. Having defined these two variables, all of the units are used in the regression

$y_i$ on $x_{i1}, x_{i2}, \ldots, x_{i,k-1}, z_{ik}, m_{ik}$, $i = 1, \ldots, n$.

This procedure can be shown to produce unbiased and consistent estimators of all parameters, provided the missing data mechanism for $x_k$ is MCAR. Incidentally, it is a very poor idea to omit $m_{ik}$ from the regression, as that is the same thing as assuming $x_{ik}$ is zero whenever it is missing. Replacing missing values with zero and not including the missing data indicator can cause substantial bias in the OLS estimators.
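A minimal sketch of the missing-indicator regression just described, on simulated data with MCAR missingness in a single regressor; the column names (z_xk, m_xk, and so on) are illustrative assumptions, not from the text:

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(2)
n = 2000
df = pd.DataFrame({"x1": rng.normal(size=n), "xk": rng.normal(size=n)})
df["y"] = 1.0 + df["x1"] + 2.0 * df["xk"] + rng.normal(size=n)
df.loc[rng.random(n) < 0.2, "xk"] = np.nan        # MCAR: 20% of xk missing at random

df["m_xk"] = df["xk"].isna().astype(int)          # missing data indicator m_ik
df["z_xk"] = df["xk"].fillna(0.0)                 # z_ik: xk when observed, else zero

# All n units are used; omitting m_xk would wrongly treat missing xk as true zeros.
fit = smf.ols("y ~ x1 + z_xk + m_xk", data=df).fit()
print(fit.params)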
A similar trick can be used when data are missing on more than one explanatory variable, but not on $y$. (Problem 10 provides the argument in the simple regression model.) An important point is that the estimator that uses all of the data and adds missing data indicators is actually less robust than the complete cases estimator. As will be seen in the next subsection, the complete cases estimator turns out to be consistent even when the reason the data are missing is systematically related to $(x_1, x_2, \ldots, x_k)$, provided it does not depend on the unobserved error $u$. There are more complicated schemes for using partial information that are based on filling in the missing data, but these are beyond the scope of this text. [The reader is referred to Little and Rubin (2002).]

9-5b Nonrandom Samples

The MCAR assumption ensures that units for which we observe a full set of data are not systematically different from units for which some variables are missing. Unfortunately, MCAR is often unrealistic. An example of a missing data mechanism that does not satisfy MCAR can be gotten by looking at the data set CARD, where the measure of IQ is missing for 949 men. If the probability that the IQ score is missing is, say, higher for men with lower IQ scores, the mechanism violates MCAR. For example, in the birth weight data set, what if the probability that education is missing is higher for those people with lower than average levels of education? Or, in Section 9-2, we used a wage data set that included IQ scores. This data set was constructed by omitting several people from the sample for whom IQ scores were not available. If obtaining an IQ score is easier for those with higher IQs, the sample is not representative of the population. The random sampling assumption MLR.2 is violated, and we must worry about the consequences for OLS estimation.

Fortunately, certain types of nonrandom sampling do not cause bias or inconsistency in OLS. Under the Gauss-Markov assumptions (but without MLR.2), it turns out that the sample can be chosen on the basis of the independent variables without causing any statistical problems. This is called sample selection based on the independent variables, and it is an example of exogenous sample selection. In the statistics literature, exogenous sample selection due to missing data is often called missing at random, but this is not a particularly good label because the probability of missing data is allowed to depend on the explanatory variables. [See Little and Rubin (2002, Chapter 1).]

To illustrate exogenously missing data, suppose that we are estimating a saving function, where annual saving depends on income, age, family size, and some unobserved factors, $u$. A simple model is

$saving = \beta_0 + \beta_1 income + \beta_2 age + \beta_3 size + u$    (9.37)

Suppose that our data set was based on a survey of people over 35 years of age, thereby leaving us with a nonrandom sample of all adults. While this is not ideal, we can still get unbiased and consistent estimators of the parameters in the population model (9.37) using the nonrandom sample. We will not show this formally here, but the reason OLS on the nonrandom sample is unbiased is that the regression function $E(saving|income, age, size)$ is the same for any subset of the population described by income, age, or size. Provided there is enough variation in the independent variables in the subpopulation, selection on the basis of the independent variables is not a serious problem, other than that it results in smaller sample sizes.

In the IQ example just mentioned, things are not so clear-cut, because no fixed rule based on IQ is used to include someone in the sample. Rather, the probability of being in the sample increases with IQ. If the other factors determining selection into the sample are independent of the error term in the wage equation, then we have another case of exogenous sample selection, and OLS using the selected sample will have all of its desirable properties under the other Gauss-Markov assumptions.
The situation is much different when selection is based on the dependent variable, $y$, which is called sample selection based on the dependent variable and is an example of endogenous sample selection. If the sample is based on whether the dependent variable is above or below a given value, bias always occurs in OLS in estimating the population model. For example, suppose we wish to estimate the relationship between individual wealth and several other factors in the population of all adults:

$wealth = \beta_0 + \beta_1 educ + \beta_2 exper + \beta_3 age + u$    (9.38)

Suppose that only people with wealth below $250,000 are included in the sample. This is a nonrandom sample from the population of interest, and it is based on the value of the dependent variable. Using a sample on people with wealth below $250,000 will result in biased and inconsistent estimators of the parameters in (9.38). Briefly, this occurs because the population regression $E(wealth|educ, exper, age)$ is not the same as the expected value conditional on wealth being less than $250,000.
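The contrast between selecting on an explanatory variable and selecting on the dependent variable shows up clearly in a short simulation in the spirit of the saving example. This is a sketch; the cutoffs and parameter values are arbitrary assumptions chosen only to make the contrast visible:

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
n = 50000
income = rng.normal(50.0, 10.0, size=n)
saving = 2.0 + 0.2 * income + rng.normal(scale=5.0, size=n)

def ols_slope(keep):
    X = sm.add_constant(income[keep])
    return sm.OLS(saving[keep], X).fit().params[1]

print(ols_slope(income > 50.0))   # selection on x: slope still close to 0.2
print(ols_slope(saving < 10.0))   # selection on y: attenuated, inconsistent for 0.2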
Other sampling schemes lead to nonrandom samples from the population, usually intentionally. A common method of data collection is stratified sampling, in which the population is divided into nonoverlapping, exhaustive groups, or strata. Then, some groups are sampled more frequently than is dictated by their population representation, and some groups are sampled less frequently. For example, some surveys purposely oversample minority groups or low-income groups. Whether special methods are needed again hinges on whether the stratification is exogenous (based on exogenous explanatory variables) or endogenous (based on the dependent variable). Suppose that a survey of military personnel oversampled women because the initial interest was in studying the factors that determine pay for women in the military. (Oversampling a group that is relatively small in the population is common in collecting stratified samples.) Provided men were sampled as well, we can use OLS on the stratified sample to estimate any gender differential, along with the returns to education and experience for all military personnel. (We might be willing to assume that the returns to education and experience are not gender specific.) OLS is unbiased and consistent because the stratification is with respect to an explanatory variable, namely, gender.

If, instead, the survey oversampled lower-paid military personnel, then OLS using the stratified sample does not consistently estimate the parameters of the military wage equation, because the stratification is endogenous. In such cases, special econometric methods are needed [see Wooldridge (2010, Chapter 19)].

Stratified sampling is a fairly obvious form of nonrandom sampling. Other sample selection issues are more subtle. For instance, in several previous examples, we have estimated the effects of various variables, particularly education and experience, on hourly wage. The data set WAGE1 that we have used throughout is essentially a random sample of working individuals. Labor economists are often interested in estimating the effect of, say, education on the wage offer. The idea is this: every person of working age faces an hourly wage offer, and he or she can either work at that wage or not work. For someone who does work, the wage offer is just the wage earned. For people who do not work, we usually cannot observe the wage offer. Now, since the wage offer equation

$\log(wage^o) = \beta_0 + \beta_1 educ + \beta_2 exper + u$    (9.39)

represents the population of all working-age people, we cannot estimate it using a random sample from this population; instead, we have data on the wage offer only for working people (although we can get data on educ and exper for nonworking people). If we use a random sample on working people to estimate (9.39), will we get unbiased estimators? This case is not clear-cut. Since the sample is selected based on someone's decision to work (as opposed to the size of the wage offer), this is not like the previous case. However, since the decision to work might be related to unobserved factors that affect the wage offer, selection might be endogenous, and this can result in a sample selection bias in the OLS estimators. We will cover methods that can be used to test and correct for sample selection bias in Chapter 17.

Exploring Further 9.4
Suppose we are interested in the effects of campaign expenditures by incumbents on voter support. Some incumbents choose not to run for reelection. If we can only collect voting and spending outcomes on incumbents that actually do run, is there likely to be endogenous sample selection?

9-5c Outliers and Influential Observations

In some applications, especially, but not only, with small data sets, the OLS estimates are sensitive to the inclusion of one or several observations. A complete treatment of outliers and influential observations is beyond the scope of this book, because a formal development requires matrix algebra. Loosely speaking, an observation is an influential observation if dropping it from the analysis changes the key OLS estimates by a practically large amount. The notion of an outlier is also a bit vague, because it requires comparing values of the variables for one observation with those for the remaining sample. Nevertheless, one wants to be on the lookout for unusual observations, because they can greatly affect the OLS estimates.

OLS is susceptible to outlying observations because it minimizes the sum of squared residuals: large residuals (positive or negative) receive a lot of weight in the least squares minimization problem. If the estimates change by a practically large amount when we slightly modify our sample, we should be concerned.

When statisticians and econometricians study the problem of outliers theoretically, sometimes the data are viewed as being from a random sample from a given population (albeit with an unusual distribution that can result in extreme values), and sometimes the outliers are assumed to come from a different population. From a practical perspective, outlying observations can occur for two reasons. The easiest case to deal with is when a mistake has been made in entering the data. Adding extra zeros to a number or misplacing a decimal point can throw off the OLS estimates, especially in small sample sizes.
It is always a good idea to compute summary statistics, especially minimums and maximums, in order to catch mistakes in data entry. Unfortunately, incorrect entries are not always obvious.

Outliers can also arise when sampling from a small population if one or several members of the population are very different in some relevant aspect from the rest of the population. The decision to keep or drop such observations in a regression analysis can be a difficult one, and the statistical properties of the resulting estimators are complicated. Outlying observations can provide important information by increasing the variation in the explanatory variables (which reduces standard errors). But OLS results should probably be reported with and without outlying observations in cases where one or several data points substantially change the results.

Example 9.8 R&D Intensity and Firm Size

Suppose that R&D expenditures as a percentage of sales (rdintens) are related to sales (in millions) and profits as a percentage of sales (profmarg):

$rdintens = \beta_0 + \beta_1 sales + \beta_2 profmarg + u$    (9.40)

The OLS equation using data on 32 chemical companies in RDCHEM is

$\widehat{rdintens} = 2.625 + .000053\,sales + .0446\,profmarg$
               (0.586)  (.000044)     (.0462)
$n = 32$, $R^2 = .0761$, $\bar{R}^2 = .0124$.

Neither sales nor profmarg is statistically significant at even the 10% level in this regression.

Of the 32 firms, 31 have annual sales less than $20 billion. One firm has annual sales of almost $40 billion. Figure 9.1 (a scatterplot of R&D intensity against firm sales, in millions of dollars, with the largest firm marked as a possible outlier) shows how far this firm is from the rest of the sample. In terms of sales, this firm is over twice as large as every other firm, so it might be a good idea to estimate the model without it. When we do this, we obtain

$\widehat{rdintens} = 2.297 + .000186\,sales + .0478\,profmarg$
               (0.592)  (.000084)     (.0445)
$n = 31$, $R^2 = .1728$, $\bar{R}^2 = .1137$.

When the largest firm is dropped from the regression, the coefficient on sales more than triples, and it now has a t statistic over two. Using the sample of smaller firms, we would conclude that there is a statistically significant positive effect between R&D intensity and firm size. The profit margin is still not significant, and its coefficient has not changed by much.
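The with/without comparison in Example 9.8 is easy to reproduce. The sketch below assumes the RDCHEM data have been saved to a CSV file with the textbook's variable names; the file name and format are assumptions, not something specified in the text:

import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("rdchem.csv")          # hypothetical path to the RDCHEM data
full = smf.ols("rdintens ~ sales + profmarg", data=df).fit()
trimmed = smf.ols("rdintens ~ sales + profmarg",
                  data=df[df["sales"] < 20000]).fit()   # drop the $40 billion firm

# The sales coefficient should roughly triple once the largest firm is excluded.
print(full.params["sales"], trimmed.params["sales"])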
Sometimes, outliers are defined by the size of the residual in an OLS regression where all of the observations are used. Generally, this is not a good idea, because the OLS estimates adjust to make the sum of squared residuals as small as possible. In the previous example, including the largest firm flattened the OLS regression line considerably, which made the residual for that estimation not especially large. In fact, the residual for the largest firm is 1.62 when all 32 observations are used. This value of the residual is not even one estimated standard deviation, $\hat\sigma = 1.82$, from the mean of the residuals, which is zero by construction.

Studentized residuals are obtained from the original OLS residuals by dividing them by an estimate of their standard deviation (conditional on the explanatory variables in the sample). The formula for the studentized residuals relies on matrix algebra, but it turns out there is a simple trick to compute a studentized residual for any observation. Namely, define a dummy variable equal to one for that observation, say, observation $h$, and then include the dummy variable in the regression (using all observations) along with the other explanatory variables. The coefficient on the dummy variable has a useful interpretation: it is the residual for observation $h$ computed from the regression line using only the other observations. Therefore, the dummy's coefficient can be used to see how far off the observation is from the regression line obtained without using that observation. Even better, the t statistic on the dummy variable is equal to the studentized residual for observation $h$. Under the classical linear model assumptions, this t statistic has a $t_{n-k-2}$ distribution. Therefore, a large value of the t statistic (in absolute value) implies a large residual relative to its estimated standard deviation.

For Example 9.8, if we define a dummy variable for the largest firm (observation 10 in the data file) and include it as an additional regressor, its coefficient is 6.57, verifying that the observation for the largest firm is very far from the regression line obtained using the other observations. However, when studentized, the residual is only 1.82. While this is a marginally significant t statistic (two-sided p-value = .08), it is not close to being the largest studentized residual in the sample. If we use the same method for the observation with the highest value of rdintens (the first observation, with rdintens = 9.42), the coefficient on the dummy variable is 6.72, with a t statistic of 4.56. Therefore, by this measure, the first observation is more of an outlier than the tenth. Yet dropping the first observation changes the coefficient on sales by only a small amount (to about .000051 from .000053), although the coefficient on profmarg becomes larger and statistically significant. So is the first observation an outlier, too? These calculations show the conundrum one can enter when trying to determine observations that should be excluded from a regression analysis, even when the data set is small. Unfortunately, the size of the studentized residual need not correspond to how influential an observation is for the OLS slope estimates, and certainly not for all of them at once.

A general problem with using studentized residuals is that, in effect, all other observations are used to estimate the regression line to compute the residual for a particular observation. In other words, when the studentized residual is obtained for the first observation, the tenth observation has been used in estimating the intercept and slope. Given how flat the regression line is with the largest firm (tenth observation) included, it is not too surprising that the first observation, with its high value of rdintens, is far off the regression line.
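Continuing the previous sketch, the dummy-variable trick for a studentized residual takes a few extra lines; the observation position h is an assumption chosen for illustration:

import statsmodels.formula.api as smf

h = 9                                    # the largest firm: observation 10, 0-based
df["d_h"] = 0.0
df.loc[df.index[h], "d_h"] = 1.0         # dummy equal to one only for observation h

fit = smf.ols("rdintens ~ sales + profmarg + d_h", data=df).fit()
print(fit.params["d_h"])                 # residual of h from the leave-h-out line
print(fit.tvalues["d_h"])                # this t statistic is the studentized residual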
Of course, we can add two dummy variables at the same time, one for the first observation and one for the tenth, which has the effect of using only the remaining 30 observations to estimate the regression line. If we estimate the equation without the first and tenth observations, the results are

$\widehat{rdintens} = 1.939 + .000160\,sales + .0701\,profmarg$
               (0.459)  (.000065)     (.0343)
$n = 30$, $R^2 = .2711$, $\bar{R}^2 = .2171$.

The coefficient on the dummy for the first observation is 6.47 ($t = 4.58$), and for the tenth observation it is $-5.41$ ($t = -1.95$). Notice that the coefficients on sales and profmarg are both statistically significant, the latter at just about the 5% level against a two-sided alternative (p-value = .051). Even in this regression, there are still two observations with studentized residuals greater than two (corresponding to the two remaining observations with R&D intensities above six).

Certain functional forms are less sensitive to outlying observations. In Section 6-2, we mentioned that, for most economic variables, the logarithmic transformation significantly narrows the range of the data and also yields functional forms, such as constant elasticity models, that can explain a broader range of data.

Example 9.9 R&D Intensity

We can test whether R&D intensity increases with firm size by starting with the model

$rd = sales^{\beta_1} \exp(\beta_0 + \beta_2 profmarg + u)$    (9.41)

Then, holding other factors fixed, R&D intensity increases with sales if and only if $\beta_1 > 1$. Taking the log of (9.41) gives

$\log(rd) = \beta_0 + \beta_1 \log(sales) + \beta_2 profmarg + u$    (9.42)

When we use all 32 firms, the regression equation is

$\widehat{\log(rd)} = -4.378 + 1.084\,\log(sales) + .0217\,profmarg$
              (.468)   (.060)           (.0128)
$n = 32$, $R^2 = .9180$, $\bar{R}^2 = .9123$,

while dropping the largest firm gives

$\widehat{\log(rd)} = -4.404 + 1.088\,\log(sales) + .0218\,profmarg$
              (.511)   (.067)           (.0130)
$n = 31$, $R^2 = .9037$, $\bar{R}^2 = .8968$.

Practically, these results are the same. In neither case do we reject the null $H_0: \beta_1 = 1$ against $H_1: \beta_1 > 1$. (Why?)

In some cases, certain observations are suspected at the outset of being fundamentally different from the rest of the sample. This often happens when we use data at very aggregated levels, such as the city, county, or state level. The following is an example.

Example 9.10 State Infant Mortality Rates

Data on infant mortality, per capita income, and measures of health care can be obtained at the state level from the Statistical Abstract of the United States. We will provide a fairly simple analysis here, just to illustrate the effect of outliers. The data are for the year 1990, and we have all 50 states in the United States, plus the District of Columbia (D.C.). The variable infmort is number of deaths within the first year per 1,000 live births, pcinc is per capita income, physic is physicians per 100,000 members of the civilian population, and popul is the population (in thousands). The data are contained in INFMRT. We include all independent variables in logarithmic form:

$\widehat{infmort} = 33.86 - 4.68\,\log(pcinc) + 4.15\,\log(physic) - .088\,\log(popul)$    (9.43)
              (20.43)  (2.60)           (1.51)             (.287)
$n = 51$, $R^2 = .139$, $\bar{R}^2 = .084$.

Higher per capita income is estimated to lower infant mortality, an expected result.
But more physicians per capita is associated with higher infant mortality rates, something that is counterintuitive. Infant mortality rates do not appear to be related to population size.

The District of Columbia is unusual in that it has pockets of extreme poverty and great wealth in a small area. In fact, the infant mortality rate for D.C. in 1990 was 20.7, compared with 12.4 for the highest state. It also has 615 physicians per 100,000 of the civilian population, compared with 337 for the highest state. The high number of physicians coupled with the high infant mortality rate in D.C. could certainly influence the results. If we drop D.C. from the regression, we obtain

$\widehat{infmort} = 23.95 - .57\,\log(pcinc) - 2.74\,\log(physic) + .629\,\log(popul)$    (9.44)
              (12.42)  (1.64)          (1.19)             (.191)
$n = 50$, $R^2 = .273$, $\bar{R}^2 = .226$.

We now find that more physicians per capita lowers infant mortality, and the estimate is statistically different from zero at the 5% level. The effect of per capita income has fallen sharply and is no longer statistically significant. In equation (9.44), infant mortality rates are higher in more populous states, and the relationship is very statistically significant. Also, much more variation in infmort is explained when D.C. is dropped from the regression. Clearly, D.C. had substantial influence on the initial estimates, and we would probably leave it out of any further analysis.

As Example 9.8 demonstrates, inspecting observations in trying to determine which are outliers, and even which ones have substantial influence on the OLS estimates, is a difficult endeavor. More advanced treatments allow more formal approaches to determine which observations are likely to be influential observations. Using matrix algebra, Belsley, Kuh, and Welsch (1980) define the leverage of an observation, which formalizes the notion that an observation has a large or small influence on the OLS estimates. These authors also provide a more in-depth discussion of standardized and studentized residuals.

9-6 Least Absolute Deviations Estimation

Rather than trying to determine which observations, if any, have undue influence on the OLS estimates, a different approach to guarding against outliers is to use an estimation method that is less sensitive to outliers than OLS. One such method, which has become popular among applied econometricians, is called least absolute deviations (LAD). The LAD estimators of the $\beta_j$ in a linear model minimize the sum of the absolute values of the residuals:

$\min_{b_0, b_1, \ldots, b_k} \sum_{i=1}^{n} |y_i - b_0 - b_1 x_{i1} - \cdots - b_k x_{ik}|$    (9.45)

Unlike OLS, which minimizes the sum of squared residuals, the LAD estimates are not available in closed form; that is, we cannot write down formulas for them. In fact, historically, solving the problem in equation (9.45) was computationally difficult, especially with large sample sizes and many explanatory variables. But with the vast improvements in computational speed over the past two decades, LAD estimates are fairly easy to obtain even for large data sets.

Figure 9.2 shows the OLS and LAD objective functions. The LAD objective function is linear on either side of zero, so that, if, say, a positive residual increases by one unit, the LAD objective function increases by one unit.
By contrast, the OLS objective function gives increasing importance to large residuals, and this makes OLS more sensitive to outlying observations.

Because LAD does not give increasing weight to larger residuals, it is much less sensitive to changes in the extreme values of the data than OLS. In fact, it is known that LAD is designed to estimate the parameters of the conditional median of $y$ given $x_1, x_2, \ldots, x_k$, rather than the conditional mean. Because the median is not affected by large changes in the extreme observations, it follows that the LAD parameter estimates are more resilient to outlying observations. (See Section A-1 for a brief discussion of the sample median.) In choosing the estimates, OLS squares each residual, and so the OLS estimates can be very sensitive to outlying observations, as we saw in Examples 9.8 and 9.10.

In addition to LAD being more computationally intensive than OLS, a second drawback of LAD is that all statistical inference involving the LAD estimators is justified only as the sample size grows. [The formulas are somewhat complicated and require matrix algebra, and we do not need them here. Koenker (2005) provides a comprehensive treatment.] Recall that, under the classical linear model assumptions, the OLS t statistics have exact t distributions, and F statistics have exact F distributions. While asymptotic versions of these statistics are available for LAD, and reported routinely by software packages that compute LAD estimates, these are justified only in large samples. Like the additional computational burden involved in computing LAD estimates, the lack of exact inference for LAD is only of minor concern, because most applications of LAD involve several hundred, if not several thousand, observations. Of course, we might be pushing it if we apply large-sample approximations in an example such as Example 9.8, with $n = 32$. In a sense, this is not very different from OLS because, more often than not, we must appeal to large-sample approximations to justify OLS inference whenever any of the CLM assumptions fail.

A more subtle but important drawback to LAD is that it does not always consistently estimate the parameters appearing in the conditional mean function, $E(y|x_1, \ldots, x_k)$. As mentioned earlier, LAD is intended to estimate the effects on the conditional median. Generally, the mean and median are the same only when the distribution of $y$ given the covariates $x_1, \ldots, x_k$ is symmetric about $\beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k$. (Equivalently, the population error term, $u$, is symmetric about zero.) Recall that OLS produces unbiased and consistent estimators of the parameters in the conditional mean whether or not the error distribution is symmetric; symmetry does not appear among the Gauss-Markov assumptions. When LAD and OLS are applied to cases with asymmetric distributions, the estimated partial effect of, say, $x_1$, obtained from LAD can be very different from the partial effect obtained from OLS. But such a difference could just reflect the difference between the median and the mean and might not have anything to do with outliers. (See Computer Exercise C9 for an example.)
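Median regression at the 0.5 quantile solves the LAD problem (9.45); statsmodels' QuantReg is one implementation. The sketch below uses a symmetric error distribution, so LAD and OLS estimate the same slope, and then plants a few gross outliers at the largest x values; all numbers are illustrative assumptions:

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(4)
n = 1000
x = rng.normal(size=n)
y = 1.0 + 0.5 * x + rng.standard_t(df=3, size=n)  # symmetric, heavy-tailed errors
y[np.argsort(x)[-5:]] += 50.0                     # gross outliers at the largest x

X = sm.add_constant(x)
lad = sm.QuantReg(y, X).fit(q=0.5)                # LAD = median regression
ols = sm.OLS(y, X).fit()
print(lad.params[1], ols.params[1])               # LAD stays near 0.5; OLS is pulled up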
If we assume that the population error $u$ in model (9.2) is independent of $(x_1, \ldots, x_k)$, then the OLS and LAD slope estimates should differ only by sampling error, whether or not the distribution of $u$ is symmetric. The intercept estimates generally will be different to reflect the fact that, if the mean of $u$ is zero, then its median is different from zero under asymmetry. Unfortunately, independence between the error and the explanatory variables is often unrealistically strong when LAD is applied. In particular, independence rules out heteroskedasticity, a problem that often arises in applications with asymmetric distributions.

An advantage that LAD has over OLS is that, because LAD estimates the median, it is easy to obtain partial effects, and predictions, using monotonic transformations. Here we consider the most common transformation, taking the natural log. Suppose that $\log(y)$ follows a linear model where the error has a zero conditional median:

$\log(y) = \beta_0 + \mathbf{x}\boldsymbol{\beta} + u$    (9.46)

$\mathrm{Med}(u|\mathbf{x}) = 0$    (9.47)

which implies that $\mathrm{Med}[\log(y)|\mathbf{x}] = \beta_0 + \mathbf{x}\boldsymbol{\beta}$.

[Figure 9.2: The OLS and LAD objective functions, plotted against the residual u.]

A well-known feature of the conditional median [see, for example, Wooldridge (2010, Chapter 12)] is that it passes through increasing functions. Therefore,

$\mathrm{Med}(y|\mathbf{x}) = \exp(\beta_0 + \mathbf{x}\boldsymbol{\beta})$    (9.48)

It follows that $\beta_j$ is the semi-elasticity of $\mathrm{Med}(y|\mathbf{x})$ with respect to $x_j$. In other words, the partial effect of $x_j$ in the linear equation (9.46) can be used to uncover the partial effect in the nonlinear model (9.48). It is important to understand that this holds for any distribution of $u$ such that (9.47) holds, and we need not assume $u$ and $\mathbf{x}$ are independent. By contrast, if we specify a linear model for $E[\log(y)|\mathbf{x}]$, then, in general, there is no way to uncover $E(y|\mathbf{x})$. If we make a full distributional assumption for $u$ given $\mathbf{x}$, then, in principle, we can recover $E(y|\mathbf{x})$. We covered the special case in equation (6.40) under the assumption that $\log(y)$ follows a classical linear model. However, in general, there is no way to find $E(y|\mathbf{x})$ from a model for $E[\log(y)|\mathbf{x}]$, even though we can always obtain $\mathrm{Med}(y|\mathbf{x})$ from $\mathrm{Med}[\log(y)|\mathbf{x}]$. (Problem 9 investigates how heteroskedasticity in a linear model for $\log(y)$ confounds our ability to find $E(y|\mathbf{x})$.)

LAD is a special case of what is often called robust regression. Unfortunately, the way "robust" is used here can be confusing. In the statistics literature, a robust regression estimator is relatively insensitive to extreme observations: effectively, observations with large residuals are given less weight than in least squares. [Berk (1990) contains an introductory treatment of estimators that are robust to outlying observations.] Based on our earlier discussion, in econometric parlance, LAD is not a robust estimator of the conditional mean, because it requires extra assumptions in order to consistently estimate the conditional mean parameters. In equation (9.2), either the distribution of $u$ given $(x_1, \ldots, x_k)$ has to be symmetric about zero, or $u$ must be independent of $(x_1, \ldots, x_k)$. Neither of these is required for OLS.
LAD is also a special case of quantile regression, which is used to estimate the effects of the $x_j$ on different parts of the distribution, not just the median (or mean). For example, in a study to see how having access to a particular pension plan affects wealth, it could be that access affects high-wealth people differently from low-wealth people, and these effects both differ from the median person. Wooldridge (2010, Chapter 12) contains a treatment and examples of quantile regression.

Summary

We have further investigated some important specification and data issues that often arise in empirical cross-sectional analysis. Misspecified functional form makes the estimated equation difficult to interpret. Nevertheless, incorrect functional form can be detected by adding quadratics, computing RESET, or testing against a nonnested alternative model using the Davidson-MacKinnon test. No additional data collection is needed.

Solving the omitted variables problem is more difficult. In Section 9-2, we discussed a possible solution based on using a proxy variable for the omitted variable. Under reasonable assumptions, including the proxy variable in an OLS regression eliminates, or at least reduces, bias. The hurdle in applying this method is that proxy variables can be difficult to find. A general possibility is to use data on a dependent variable from a prior year.

Applied economists are often concerned with measurement error. Under the classical errors-in-variables (CEV) assumptions, measurement error in the dependent variable has no effect on the statistical properties of OLS. In contrast, under the CEV assumptions for an independent variable, the OLS estimator for the coefficient on the mismeasured variable is biased toward zero. The bias in coefficients on the other variables can go either way and is difficult to determine.

Nonrandom samples from an underlying population can lead to biases in OLS. When sample selection is correlated with the error term $u$, OLS is generally biased and inconsistent. On the other hand, exogenous sample selection, which is either based on the explanatory variables or is otherwise independent of $u$, does not cause problems for OLS. Outliers in data sets can have large impacts on the OLS estimates, especially in small samples. It is important to at least informally identify outliers and to reestimate models with the suspected outliers excluded.

Least absolute deviations estimation is an alternative to OLS that is less sensitive to outliers and that delivers consistent estimates of conditional median parameters. In the past 20 years, with computational advances and improved understanding of the pros and cons of LAD and OLS, LAD is used more and more in empirical research, often as a supplement to OLS.

Key Terms

Attenuation Bias; Average Marginal Effect; Average Partial Effect (APE); Classical Errors-in-Variables (CEV); Complete Cases Estimator; Conditional Median; Davidson-MacKinnon Test; Endogenous Explanatory Variable; Endogenous Sample Selection; Exogenous Sample Selection; Functional Form Misspecification; Influential Observations; Lagged Dependent Variable; Least Absolute Deviations (LAD); Measurement Error; Missing at Random; Missing Completely at Random (MCAR); Missing Data; Multiplicative Measurement Error; Nonnested Models; Nonrandom Sample; Outliers; Plug-In Solution to the Omitted Variables Problem; Proxy Variable; Random Coefficient (Slope) Model; Regression Specification Error Test (RESET); Stratified Sampling; Studentized Residuals
Problems

1 In Problem 11 in Chapter 4, the R-squared from estimating the model

$\log(salary) = \beta_0 + \beta_1 \log(sales) + \beta_2 \log(mktval) + \beta_3 profmarg + \beta_4 ceoten + \beta_5 comten + u$,

using the data in CEOSAL2, was $R^2 = .353$ ($n = 177$). When $ceoten^2$ and $comten^2$ are added, $R^2 = .375$. Is there evidence of functional form misspecification in this model?

2 Let us modify Computer Exercise C4 in Chapter 8 by using voting outcomes in 1990 for incumbents who were elected in 1988. Candidate A was elected in 1988 and was seeking reelection in 1990; voteA90 is Candidate A's share of the two-party vote in 1990. The 1988 voting share of Candidate A is used as a proxy variable for quality of the candidate. All other variables are for the 1990 election. The following equations were estimated, using the data in VOTE2:

$\widehat{voteA90} = 75.71 + .312\,prtystrA + 4.93\,democA - .929\,\log(expendA) - 1.950\,\log(expendB)$
              (9.25)   (.046)          (1.01)        (.684)                (.281)
$n = 186$, $R^2 = .495$, $\bar{R}^2 = .483$,

and

$\widehat{voteA90} = 70.81 + .282\,prtystrA + 4.52\,democA - .839\,\log(expendA) - 1.846\,\log(expendB) + .067\,voteA88$
              (10.01)  (.052)          (1.06)        (.687)                (.292)               (.053)
$n = 186$, $R^2 = .499$, $\bar{R}^2 = .485$.

(i) Interpret the coefficient on voteA88 and discuss its statistical significance.
(ii) Does adding voteA88 have much effect on the other coefficients?

3 Let math10 denote the percentage of students at a Michigan high school receiving a passing score on a standardized math test (see also Example 4.2). We are interested in estimating the effect of per-student spending on math performance. A simple model is

$math10 = \beta_0 + \beta_1 \log(expend) + \beta_2 \log(enroll) + \beta_3 poverty + u$,

where poverty is the percentage of students living in poverty.
(i) The variable lnchprg is the percentage of students eligible for the federally funded school lunch program. Why is this a sensible proxy variable for poverty?
(ii) The table that follows contains OLS estimates, with and without lnchprg as an explanatory variable (standard errors in parentheses).

Dependent Variable: math10
Independent Variables    (1)               (2)
log(expend)              11.13 (3.30)      7.75 (3.04)
log(enroll)              .022 (.615)       -1.26 (.58)
lnchprg                  ---               -.324 (.036)
intercept                -69.24 (26.72)    -23.14 (24.99)
Observations             428               428
R-squared                .0297             .1893

Explain why the effect of expenditures on math10 is lower in column (2) than in column (1). Is the effect in column (2) still statistically greater than zero?
(iii) Does it appear that pass rates are lower at larger schools, other factors being equal? Explain.
(iv) Interpret the coefficient on lnchprg in column (2).
(v) What do you make of the substantial increase in $R^2$ from column (1) to column (2)?

4 The following equation explains weekly hours of television viewing by a child in terms of the child's age, mother's education, father's education, and number of siblings:

$tvhours^* = \beta_0 + \beta_1 age + \beta_2 age^2 + \beta_3 motheduc + \beta_4 fatheduc + \beta_5 sibs + u$.
4. The following equation explains weekly hours of television viewing by a child in terms of the child's age, mother's education, father's education, and number of siblings:

$$tvhours^{*} = \beta_0 + \beta_1 age + \beta_2 age^2 + \beta_3 motheduc + \beta_4 fatheduc + \beta_5 sibs + u.$$

We are worried that tvhours* is measured with error in our survey. Let tvhours denote the reported hours of television viewing per week.
(i) What do the classical errors-in-variables (CEV) assumptions require in this application?
(ii) Do you think the CEV assumptions are likely to hold? Explain.

5. In Example 4.4, we estimated a model relating number of campus crimes to student enrollment for a sample of colleges. The sample we used was not a random sample of colleges in the United States, because many schools in 1992 did not report campus crimes. Do you think that college failure to report crimes can be viewed as exogenous sample selection? Explain.

6. In the model (9.17), show that OLS consistently estimates $\alpha$ and $\beta$ if $a_i$ is uncorrelated with $x_i$ and $b_i$ is uncorrelated with $x_i$ and $x_i^2$, which are weaker assumptions than (9.19). [Hint: Write the equation as in (9.18) and recall from Chapter 5 that sufficient for consistency of OLS for the intercept and slope is $E(u_i) = 0$ and $Cov(x_i, u_i) = 0$.]

7. Consider the simple regression model with classical measurement error, $y = \beta_0 + \beta_1 x^{*} + u$, where we have m measures on $x^{*}$. Write these as $z_h = x^{*} + e_h$, $h = 1, \ldots, m$. Assume that $x^{*}$ is uncorrelated with $u, e_1, \ldots, e_m$, and that the measurement errors are pairwise uncorrelated and have the same variance $\sigma_e^2$. Let $w = (z_1 + \cdots + z_m)/m$ be the average of the measures on $x^{*}$, so that, for each observation i, $w_i = (z_{i1} + \cdots + z_{im})/m$ is the average of the m measures. Let $\hat\beta_1$ be the OLS estimator from the simple regression $y_i$ on $1, w_i$, $i = 1, \ldots, n$, using a random sample of data.
(i) Show that

$$\text{plim}(\hat\beta_1) = \beta_1\left\{\frac{\sigma_x^2}{\sigma_x^2 + (\sigma_e^2/m)}\right\}.$$

[Hint: The plim of $\hat\beta_1$ is $Cov(w, y)/Var(w)$.]
(ii) How does the inconsistency in $\hat\beta_1$ compare with that when only a single measure is available (that is, m = 1)? What happens as m grows? Comment. (A simulation sketch appears after Problem 8.)

8. The point of this exercise is to show that tests for functional form cannot be relied on as a general test for omitted variables. Suppose that, conditional on the explanatory variables $x_1$ and $x_2$, a linear model relating y to $x_1$ and $x_2$ satisfies the Gauss-Markov assumptions:

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + u,\quad E(u|x_1, x_2) = 0,\quad Var(u|x_1, x_2) = \sigma^2.$$

To make the question interesting, assume $\beta_2 \neq 0$. Suppose further that $x_2$ has a simple linear relationship with $x_1$:

$$x_2 = \delta_0 + \delta_1 x_1 + r,\quad E(r|x_1) = 0,\quad Var(r|x_1) = \tau^2.$$

(i) Show that $E(y|x_1) = (\beta_0 + \beta_2\delta_0) + (\beta_1 + \beta_2\delta_1)x_1$. Under random sampling, what is the probability limit of the OLS estimator from the simple regression of y on $x_1$? Is the simple regression estimator generally consistent for $\beta_1$?
(ii) If you run the regression of y on $x_1, x_1^2$, what will be the probability limit of the OLS estimator of the coefficient on $x_1^2$? Explain.
(iii) Using substitution, show that we can write $y = (\beta_0 + \beta_2\delta_0) + (\beta_1 + \beta_2\delta_1)x_1 + u + \beta_2 r$. It can be shown that, if we define $v = u + \beta_2 r$, then $E(v|x_1) = 0$ and $Var(v|x_1) = \sigma^2 + \beta_2^2\tau^2$. What consequences does this have for the t statistic on $x_1^2$ from the regression in part (ii)?
(iv) What do you conclude about adding a nonlinear function of $x_1$, in particular $x_1^2$, in an attempt to detect omission of $x_2$?
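For readers who want a numerical check of the plim formula in Problem 7(i), here is a hedged simulation sketch (not part of the original problem set); the parameter values are arbitrary.

```python
# Simulation sketch for Problem 7: averaging m noisy measures of x* shrinks
# the attenuation bias toward zero as m grows. All values are hypothetical.
import numpy as np

rng = np.random.default_rng(0)
n, beta1, sig2_x, sig2_e = 100_000, 1.0, 1.0, 1.0

xstar = rng.normal(scale=np.sqrt(sig2_x), size=n)
y = beta1 * xstar + rng.normal(size=n)

for m in (1, 2, 5, 20):
    z = xstar[:, None] + rng.normal(scale=np.sqrt(sig2_e), size=(n, m))
    w = z.mean(axis=1)                          # average of the m measures
    b1 = np.cov(w, y)[0, 1] / np.var(w, ddof=1)  # simple-regression slope
    plim = beta1 * sig2_x / (sig2_x + sig2_e / m)
    print(f"m={m:2d}  b1_hat={b1:.3f}  theory={plim:.3f}")
```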
9. Suppose that log(y) follows a linear model with a linear form of heteroskedasticity. We write this as

$$\log(y) = \beta_0 + \mathbf{x}\boldsymbol\beta + u,\quad u|\mathbf{x} \sim \text{Normal}(0, h(\mathbf{x})),$$

so that, conditional on x, u has a normal distribution with mean (and median) zero but with variance h(x) that depends on x. Because Med(u|x) = 0, equation (9.48) holds: $\text{Med}(y|\mathbf{x}) = \exp(\beta_0 + \mathbf{x}\boldsymbol\beta)$. Further, using an extension of the result from Chapter 6, it can be shown that

$$E(y|\mathbf{x}) = \exp[\beta_0 + \mathbf{x}\boldsymbol\beta + h(\mathbf{x})/2].$$

(i) Given that h(x) can be any positive function, is it possible to conclude that $\partial E(y|\mathbf{x})/\partial x_j$ is the same sign as $\beta_j$?
(ii) Suppose $h(\mathbf{x}) = \delta_0 + \mathbf{x}\boldsymbol\delta$ (and ignore the problem that linear functions are not necessarily always positive). Show that a particular variable, say $x_1$, can have a negative effect on Med(y|x) but a positive effect on E(y|x).
(iii) Consider the case covered in Section 6.4, where $h(\mathbf{x}) = \sigma^2$. How would you predict y using an estimate of E(y|x)? How would you predict y using an estimate of Med(y|x)? Which prediction is always larger?

10. This exercise shows that, in a simple regression model, adding a dummy variable for missing data on the explanatory variable produces a consistent estimator of the slope coefficient if the missingness is unrelated to both the unobservable and observable factors affecting y. Let m be a variable such that m = 1 if we do not observe x and m = 0 if we observe x. We assume that y is always observed. The population model is

$$y = \beta_0 + \beta_1 x + u,\quad E(u|x) = 0.$$

(i) Provide an interpretation of the stronger assumption E(u|x, m) = 0. In particular, what kinds of missing data schemes would cause this assumption to fail?
(ii) Show that we can always write $y = \beta_0 + \beta_1(1 - m)x + \beta_1 mx + u$.
(iii) Let $\{(x_i, y_i, m_i): i = 1, \ldots, n\}$ be random draws from the population, where $x_i$ is missing when $m_i = 1$. Explain the nature of the variable $z_i = (1 - m_i)x_i$. In particular, what does this variable equal when $x_i$ is missing?
(iv) Let $\rho = P(m = 1)$ and assume that m and x are independent. Show that $Cov[(1 - m)x, mx] = -\rho(1 - \rho)\mu_x^2$, where $\mu_x = E(x)$. What does this imply about estimating $\beta_1$ from the regression $y_i$ on $z_i$, $i = 1, \ldots, n$?
(v) If m and x are independent, it can be shown that $mx = \delta_0 + \delta_1 m + v$, where v is uncorrelated with m and $z = (1 - m)x$. Explain why this makes m a suitable proxy variable for mx. What does this mean about the coefficient on $z_i$ in the regression $y_i$ on $z_i, m_i$, $i = 1, \ldots, n$?
(vi) Suppose, for a population of children, y is a standardized test score, obtained from school records, and x is family income, which is reported voluntarily by families (and so some families do not report their income). Is it realistic to assume m and x are independent? Explain.
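The mechanics of Problem 10 can be checked numerically. The following is a simulation sketch under the problem's assumptions (m independent of both x and u); all numbers and names are hypothetical.

```python
# Simulation sketch for Problem 10: with x missing at random, regressing y on
# z = (1 - m)x and the missing-data dummy m still consistently estimates beta1.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n, beta0, beta1 = 50_000, 1.0, 0.5
x = rng.normal(loc=2.0, size=n)              # nonzero mean, so m matters
y = beta0 + beta1 * x + rng.normal(size=n)
m = rng.binomial(1, 0.3, size=n)             # 30% of x values are "missing"

z = (1 - m) * x                              # equals zero where x is unobserved
X = sm.add_constant(np.column_stack([z, m]))
res = sm.OLS(y, X).fit()
print(res.params)                            # slope on z should be near beta1
```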
Computer Exercises

C1 (i) Apply RESET from equation (9.3) to the model estimated in Computer Exercise C5 in Chapter 7. Is there evidence of functional form misspecification in the equation? (ii) Compute a heteroskedasticity-robust form of RESET. Does your conclusion from part (i) change? (A code sketch appears after C6.)

C2 Use the data set WAGE2 for this exercise.
(i) Use the variable KWW (the "knowledge of the world of work" test score) as a proxy for ability in place of IQ in Example 9.3. What is the estimated return to education in this case?
(ii) Now, use IQ and KWW together as proxy variables. What happens to the estimated return to education?
(iii) In part (ii), are IQ and KWW individually significant? Are they jointly significant?

C3 Use the data from JTRAIN for this exercise.
(i) Consider the simple regression model $\log(scrap) = \beta_0 + \beta_1 grant + u$, where scrap is the firm scrap rate and grant is a dummy variable indicating whether a firm received a job training grant. Can you think of some reasons why the unobserved factors in u might be correlated with grant?
(ii) Estimate the simple regression model using the data for 1988. (You should have 54 observations.) Does receiving a job training grant significantly lower a firm's scrap rate?
(iii) Now, add as an explanatory variable log(scrap87). How does this change the estimated effect of grant? Interpret the coefficient on grant. Is it statistically significant at the 5% level against the one-sided alternative $H_1: \beta_{grant} < 0$?
(iv) Test the null hypothesis that the parameter on log(scrap87) is one against the two-sided alternative. Report the p-value for the test.
(v) Repeat parts (iii) and (iv), using heteroskedasticity-robust standard errors, and briefly discuss any notable differences.

C4 Use the data for the year 1990 in INFMRT for this exercise.
(i) Reestimate equation (9.43), but now include a dummy variable for the observation on the District of Columbia (called DC). Interpret the coefficient on DC and comment on its size and significance.
(ii) Compare the estimates and standard errors from part (i) with those from equation (9.44). What do you conclude about including a dummy variable for a single observation?

C5 Use the data in RDCHEM to further examine the effects of outliers on OLS estimates and to see how LAD is less sensitive to outliers. The model is

$$rdintens = \beta_0 + \beta_1 sales + \beta_2 sales^2 + \beta_3 profmarg + u,$$

where you should first change sales to be in billions of dollars to make the estimates easier to interpret.
(i) Estimate the above equation by OLS, both with and without the firm having annual sales of almost $40 billion. Discuss any notable differences in the estimated coefficients.
(ii) Estimate the same equation by LAD, again with and without the largest firm. Discuss any important differences in estimated coefficients.
(iii) Based on your findings in (i) and (ii), would you say OLS or LAD is more resilient to outliers?

C6 Redo Example 4.10 by dropping schools where teacher benefits are less than 1% of salary.
(i) How many observations are lost?
(ii) Does dropping these observations have any important effects on the estimated tradeoff?
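For C1, one possible sketch in Python's statsmodels follows: it adds squared and cubed fitted values to the original regression and tests their joint significance, optionally with a heteroskedasticity-robust covariance. The formula shown is a placeholder; substitute the model from Computer Exercise C5 in Chapter 7.

```python
# A hedged sketch of RESET: augment the model with powers of fitted values and
# test their joint significance. Variable names below are placeholders only.
import statsmodels.formula.api as smf

def reset_test(formula, data, robust=False):
    base = smf.ols(formula, data=data).fit()
    aug = data.assign(fit2=base.fittedvalues**2, fit3=base.fittedvalues**3)
    res = smf.ols(formula + " + fit2 + fit3", data=aug).fit(
        cov_type="HC0" if robust else "nonrobust")
    return res.f_test("fit2 = 0, fit3 = 0")   # H0: no misspecification

# usage, assuming a DataFrame `df` with the relevant columns:
# print(reset_test("lprice ~ lotsize + sqrft + bdrms", df))          # part (i)
# print(reset_test("lprice ~ lotsize + sqrft + bdrms", df, True))    # part (ii)
```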
C7 Use the data in LOANAPP for this exercise.
(i) How many observations have obrat > 40, that is, other debt obligations more than 40% of total income?
(ii) Reestimate the model in part (iii) of Computer Exercise C8, excluding observations with obrat > 40. What happens to the estimate and t statistic on white?
(iii) Does it appear that the estimate of $\beta_{white}$ is overly sensitive to the sample used?

C8 Use the data in TWOYEAR for this exercise.
(i) The variable stotal is a standardized test variable, which can act as a proxy variable for unobserved ability. Find the sample mean and standard deviation of stotal.
(ii) Run simple regressions of jc and univ on stotal. Are both college education variables statistically related to stotal? Explain.
(iii) Add stotal to equation (4.17), and test the hypothesis that the returns to two- and four-year colleges are the same against the alternative that the return to four-year colleges is greater. How do your findings compare with those from Section 4.4?
(iv) Add stotal² to the equation estimated in part (iii). Does a quadratic in the test score variable seem necessary?
(v) Add the interaction terms stotal·jc and stotal·univ to the equation from part (iii). Are these terms jointly significant?
(vi) What would be your final model that controls for ability through the use of stotal? Justify your answer.

C9 In this exercise, you are to compare OLS and LAD estimates of the effects of 401(k) plan eligibility on net financial assets. The model is

$$nettfa = \beta_0 + \beta_1 inc + \beta_2 inc^2 + \beta_3 age + \beta_4 age^2 + \beta_5 male + \beta_6 e401k + u.$$

(i) Use the data in 401KSUBS to estimate the equation by OLS and report the results in the usual form. Interpret the coefficient on e401k.
(ii) Use the OLS residuals to test for heteroskedasticity using the Breusch-Pagan test. Is u independent of the explanatory variables?
(iii) Estimate the equation by LAD and report the results in the same form as for OLS. Interpret the LAD estimate of $\beta_6$.
(iv) Reconcile your findings from parts (i) and (iii).
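For C9, one possible workflow in statsmodels is sketched below; the CSV file name, and the assumption that its columns match the variable names above, are hypothetical.

```python
# Sketch for Computer Exercise C9: OLS, a Breusch-Pagan test, and LAD.
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.diagnostic import het_breuschpagan

df = pd.read_csv("401ksubs.csv")             # hypothetical export of 401KSUBS
f = "nettfa ~ inc + I(inc**2) + age + I(age**2) + male + e401k"

ols = smf.ols(f, data=df).fit()              # part (i)
bp_lm, bp_pval, _, _ = het_breuschpagan(ols.resid, ols.model.exog)
print(f"BP LM = {bp_lm:.2f}, p-value = {bp_pval:.4f}")   # part (ii)

lad = smf.quantreg(f, data=df).fit(q=0.5)    # part (iii): LAD = median regression
print(ols.params["e401k"], lad.params["e401k"])          # compare for part (iv)
```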
C10 You need to use two data sets for this exercise, JTRAIN2 and JTRAIN3. The former is the outcome of a job training experiment. The file JTRAIN3 contains observational data, where individuals themselves largely determine whether they participate in job training. The data sets cover the same time period.
(i) In the data set JTRAIN2, what fraction of the men received job training? What is the fraction in JTRAIN3? Why do you think there is such a big difference?
(ii) Using JTRAIN2, run a simple regression of re78 on train. What is the estimated effect of participating in job training on real earnings?
(iii) Now, add as controls to the regression in part (ii) the variables re74, re75, educ, age, black, and hisp. Does the estimated effect of job training on re78 change much? How come? (Hint: Remember that these are experimental data.)
(iv) Do the regressions in parts (ii) and (iii) using the data in JTRAIN3, reporting only the estimated coefficients on train, along with their t statistics. What is the effect now of controlling for the extra factors, and why?
(v) Define avgre = (re74 + re75)/2. Find the sample averages, standard deviations, and minimum and maximum values in the two data sets. Are these data sets representative of the same populations in 1978?
(vi) Almost 96% of men in the data set JTRAIN2 have avgre less than $10,000. Using only these men, run the regression re78 on train, re74, re75, educ, age, black, hisp, and report the training estimate and its t statistic. Run the same regression for JTRAIN3, using only men with avgre < 10. For the subsample of low-income men, how do the estimated training effects compare across the experimental and nonexperimental data sets?
(vii) Now, use each data set to run the simple regression re78 on train, but only for men who were unemployed in 1974 and 1975. How do the training estimates compare now?
(viii) Using your findings from the previous regressions, discuss the potential importance of having comparable populations underlying comparisons of experimental and nonexperimental estimates.

C11 Use the data in MURDER only for the year 1993 for this question, although you will need to first obtain the lagged murder rate, say $mrdrte_{-1}$.
(i) Run the regression of mrdrte on exec, unem. What are the coefficient and t statistic on exec? Does this regression provide any evidence for a deterrent effect of capital punishment?
(ii) How many executions are reported for Texas during 1993? (Actually, this is the sum of executions for the current and past two years.) How does this compare with the other states? Add a dummy variable for Texas to the regression in part (i). Is its t statistic unusually large? From this, does it appear Texas is an outlier?
(iii) To the regression in part (i), add the lagged murder rate. What happens to $\hat\beta_{exec}$ and its statistical significance?
(iv) For the regression in part (iii), does it appear Texas is an outlier? What is the effect on $\hat\beta_{exec}$ from dropping Texas from the regression?

C12 Use the data in ELEM94_95 to answer this question. See also Computer Exercise C10 in Chapter 4.
(i) Using all of the data, run the regression lavgsal on bs, lenrol, lstaff, and lunch. Report the coefficient on bs, along with its usual and heteroskedasticity-robust standard errors. What do you conclude about the economic and statistical significance of $\hat\beta_{bs}$?
(ii) Now, drop the four observations with bs > .5, that is, where average benefits are (supposedly) more than 50% of average salary. What is the coefficient on bs? Is it statistically significant using the heteroskedasticity-robust standard error?
(iii) Verify that the four observations with bs > .5 are 68, 1,127, 1,508, and 1,670. Define four dummy variables for each of these observations. (You might call them d68, d1127, d1508, and d1670.) Add these to the regression from part (i), and verify that the OLS coefficients and standard errors on the other variables are identical to those in part (ii). Which of the four dummies has a t statistic statistically different from zero at the 5% level?
(iv) Verify that, in this data set, the data point with the largest studentized residual (largest t statistic on the dummy variable) in part (iii) has a large influence on the OLS estimates. (That is, run OLS using all observations except the one with the large studentized residual.) Does dropping, in turn, each of the other observations with bs > .5 have important effects?
(v) What do you conclude about the sensitivity of OLS to a single observation, even with a large sample size?
(vi) Verify that the LAD estimator is not sensitive to the inclusion of the observation identified in part (iii).
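The studentized residuals needed in C12 and in C13 (which follows) are available in statsmodels. The sketch below uses simulated data with one planted outlier; with a real data set, only the model line changes.

```python
# Sketch: externally studentized residuals for outlier diagnosis.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.outliers_influence import OLSInfluence

rng = np.random.default_rng(0)
df = pd.DataFrame({"x": rng.normal(size=200)})
df["y"] = 1 + 0.5 * df["x"] + rng.normal(size=200)
df.loc[0, "y"] += 10                        # plant a single outlier

res = smf.ols("y ~ x", data=df).fit()
stri = OLSInfluence(res).resid_studentized_external
print(int((np.abs(stri) > 1.96).sum()))     # count of large studentized residuals
print(int(np.abs(stri).argmax()))           # observation with the largest |t|
```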
C13 Use the data in CEOSAL2 to answer this question.
(i) Estimate the model

$$lsalary = \beta_0 + \beta_1 lsales + \beta_2 lmktval + \beta_3 ceoten + \beta_4 ceoten^2 + u$$

by OLS, using all of the observations, where lsalary, lsales, and lmktval are all natural logarithms. Report the results in the usual form with the usual OLS standard errors. (You may verify that the heteroskedasticity-robust standard errors are similar.)
(ii) In the regression from part (i), obtain the studentized residuals; call these $str_i$. How many studentized residuals are above 1.96 in absolute value? If the studentized residuals were independent draws from a standard normal distribution, about how many would you expect to be above two in absolute value with 177 draws?
(iii) Reestimate the equation in part (i) by OLS, using only the observations with $|str_i| \leq 1.96$. How do the coefficients compare with those in part (i)?
(iv) Estimate the equation in part (i) by LAD, using all of the data. Is the estimate of $\beta_1$ closer to the OLS estimate using the full sample or the restricted sample? What about for $\beta_3$?
(v) Evaluate the following statement: "Dropping outliers based on extreme values of studentized residuals makes the resulting OLS estimates closer to the LAD estimates on the full sample."

C14 Use the data in ECONMATH to answer this question. The population model is

$$score = \beta_0 + \beta_1 act + u.$$

(i) For how many students is the ACT score missing? What is the fraction of the sample? Define a new variable, actmiss, which equals one if act is missing and zero otherwise.
(ii) Create a new variable, say act0, which is the act score when act is reported and zero when act is missing. Find the average of act0 and compare it with the average for act.
(iii) Run the simple regression of score on act, using only the complete cases. What do you obtain for the slope coefficient and its heteroskedasticity-robust standard error?
(iv) Run the simple regression of score on act0, using all of the cases. Compare the slope coefficient with that in part (iii) and comment.
(v) Now, use all of the cases and run the regression $score_i$ on $act0_i, actmiss_i$. What is the slope estimate on $act0_i$? How does it compare with the answers in parts (iii) and (iv)?
(vi) Comparing regressions (iii) and (v), does using all cases and adding the missing data estimator improve estimation of $\beta_1$?
(vii) If you add the variable colgpa to the regressions in parts (iii) and (v), does this change your answer to part (vi)?
Part 2. Regression Analysis with Time Series Data

Now that we have a solid understanding of how to use the multiple regression model for cross-sectional applications, we can turn to the econometric analysis of time series data. Since we will rely heavily on the method of ordinary least squares, most of the work concerning mechanics and inference has already been done. However, as we noted in Chapter 1, time series data have certain characteristics that cross-sectional data do not, and these can require special attention when applying OLS.

Chapter 10 covers basic regression analysis and gives attention to problems unique to time series data. We provide a set of Gauss-Markov and classical linear model assumptions for time series applications. The problems of functional form, dummy variables, trends, and seasonality are also discussed.

Because certain time series models necessarily violate the Gauss-Markov assumptions, Chapter 11 describes the nature of these violations and presents the large sample properties of ordinary least squares. As we can no longer assume random sampling, we must cover conditions that restrict the temporal correlation in a time series in order to ensure that the usual asymptotic analysis is valid.

Chapter 12 turns to an important new problem: serial correlation in the error terms in time series regressions. We discuss the consequences, ways of testing, and methods for dealing with serial correlation. Chapter 12 also contains an explanation of how heteroskedasticity can arise in time series models.

Chapter 10. Basic Regression Analysis with Time Series Data

In this chapter, we begin to study the properties of OLS for estimating linear regression models using time series data. In Section 10.1, we discuss some conceptual differences between time series and cross-sectional data. Section 10.2 provides some examples of time series regressions that are often estimated in the empirical social sciences. We then turn our attention to the finite sample properties of the OLS estimators and state the Gauss-Markov assumptions and the classical linear model assumptions for time series regression. Although these assumptions have features in common with those for the cross-sectional case, they also have some significant differences that we will need to highlight. In addition, we return to some issues that we treated in regression with cross-sectional data, such as how to use and interpret the logarithmic functional form and dummy variables. The important topics of how to incorporate trends and account for seasonality in multiple regression are taken up in Section 10.5.

10.1 The Nature of Time Series Data

An obvious characteristic of time series data that distinguishes them from cross-sectional data is temporal ordering. For example, in Chapter 1, we briefly discussed a time series data set on employment, the minimum wage, and other economic variables for Puerto Rico. In this data set, we must know that the data for 1970 immediately precede the data for 1971. For analyzing time series data in the social sciences, we must recognize that the past can affect the future, but not vice versa (unlike in the Star Trek universe). To emphasize the proper ordering of time series data, Table 10.1 gives a partial listing
of the data on U.S. inflation and unemployment rates from various editions of the Economic Report of the President, including the 2004 Report (Tables B-42 and B-64).

Another difference between cross-sectional and time series data is more subtle. In Chapters 3 and 4, we studied statistical properties of the OLS estimators based on the notion that samples were randomly drawn from the appropriate population. Understanding why cross-sectional data should be viewed as random outcomes is fairly straightforward: a different sample drawn from the population will generally yield different values of the independent and dependent variables (such as education, experience, wage, and so on). Therefore, the OLS estimates computed from different random samples will generally differ, and this is why we consider the OLS estimators to be random variables.

How should we think about randomness in time series data? Certainly, economic time series satisfy the intuitive requirements for being outcomes of random variables. For example, today we do not know what the Dow Jones Industrial Average will be at the close of the next trading day. We do not know what the annual growth in output will be in Canada during the coming year. Since the outcomes of these variables are not foreknown, they should clearly be viewed as random variables.

Formally, a sequence of random variables indexed by time is called a stochastic process or a time series process. ("Stochastic" is a synonym for random.) When we collect a time series data set, we obtain one possible outcome, or realization, of the stochastic process. We can only see a single realization, because we cannot go back in time and start the process over again. (This is analogous to cross-sectional analysis, where we can collect only one random sample.) However, if certain conditions in history had been different, we would generally obtain a different realization for the stochastic process, and this is why we think of time series data as the outcome of random variables. The set of all possible realizations of a time series process plays the role of the population in cross-sectional analysis. The sample size for a time series data set is the number of time periods over which we observe the variables of interest.

Table 10.1 Partial Listing of Data on U.S. Inflation and Unemployment Rates, 1948-2003

Year    Inflation    Unemployment
1948    8.1          3.8
1949    -1.2         5.9
1950    1.3          5.3
1951    7.9          3.3
...     ...          ...
1998    1.6          4.5
1999    2.2          4.2
2000    3.4          4.0
2001    2.8          4.7
2002    1.6          5.8
2003    2.3          6.0

10.2 Examples of Time Series Regression Models

In this section, we discuss two examples of time series models that have been useful in empirical time series analysis and that are easily estimated by ordinary least squares. We will study additional models in Chapter 11.
10.2a Static Models

Suppose that we have time series data available on two variables, say y and z, where $y_t$ and $z_t$ are dated contemporaneously. A static model relating y to z is

$$y_t = \beta_0 + \beta_1 z_t + u_t,\quad t = 1, 2, \ldots, n.\qquad(10.1)$$

The name "static model" comes from the fact that we are modeling a contemporaneous relationship between y and z. Usually, a static model is postulated when a change in z at time t is believed to have an immediate effect on y: $\Delta y_t = \beta_1\Delta z_t$, when $\Delta u_t = 0$. Static regression models are also used when we are interested in knowing the tradeoff between y and z.

An example of a static model is the static Phillips curve, given by

$$inf_t = \beta_0 + \beta_1 unem_t + u_t,\qquad(10.2)$$

where $inf_t$ is the annual inflation rate and $unem_t$ is the annual unemployment rate. This form of the Phillips curve assumes a constant natural rate of unemployment and constant inflationary expectations, and it can be used to study the contemporaneous tradeoff between inflation and unemployment. [See, for example, Mankiw (1994, Section 11.2).]

Naturally, we can have several explanatory variables in a static regression model. Let $mrdrte_t$ denote the murders per 10,000 people in a particular city during year t, let $convrte_t$ denote the murder conviction rate, let $unem_t$ be the local unemployment rate, and let $yngmle_t$ be the fraction of the population consisting of males between the ages of 18 and 25. Then, a static multiple regression model explaining murder rates is

$$mrdrte_t = \beta_0 + \beta_1 convrte_t + \beta_2 unem_t + \beta_3 yngmle_t + u_t.\qquad(10.3)$$

Using a model such as this, we can hope to estimate, for example, the ceteris paribus effect of an increase in the conviction rate on a particular criminal activity.

10.2b Finite Distributed Lag Models

In a finite distributed lag (FDL) model, we allow one or more variables to affect y with a lag. For example, for annual observations, consider the model

$$gfr_t = \alpha_0 + \delta_0 pe_t + \delta_1 pe_{t-1} + \delta_2 pe_{t-2} + u_t,\qquad(10.4)$$

where $gfr_t$ is the general fertility rate (children born per 1,000 women of childbearing age) and $pe_t$ is the real dollar value of the personal tax exemption. The idea is to see whether, in the aggregate, the decision to have children is linked to the tax value of having a child. Equation (10.4) recognizes that, for both biological and behavioral reasons, decisions to have children would not immediately result from changes in the personal exemption.

Equation (10.4) is an example of the model

$$y_t = \alpha_0 + \delta_0 z_t + \delta_1 z_{t-1} + \delta_2 z_{t-2} + u_t,\qquad(10.5)$$

which is an FDL of order two. To interpret the coefficients in (10.5), suppose that z is a constant, equal to c, in all time periods before time t. At time t, z increases by one unit to c + 1 and then reverts to its previous level at time t + 1. (That is, the increase in z is temporary.) More precisely,

$$\ldots,\ z_{t-2} = c,\ z_{t-1} = c,\ z_t = c + 1,\ z_{t+1} = c,\ z_{t+2} = c,\ \ldots$$

To focus on the ceteris paribus effect of z on y, we set the error term in each time period to zero. Then,

$$y_{t-1} = \alpha_0 + \delta_0 c + \delta_1 c + \delta_2 c,$$
$$y_t = \alpha_0 + \delta_0(c + 1) + \delta_1 c + \delta_2 c,$$
$$y_{t+1} = \alpha_0 + \delta_0 c + \delta_1(c + 1) + \delta_2 c,$$
$$y_{t+2} = \alpha_0 + \delta_0 c + \delta_1 c + \delta_2(c + 1),$$
$$y_{t+3} = \alpha_0 + \delta_0 c + \delta_1 c + \delta_2 c,$$

and so on. From the first two equations, $y_t - y_{t-1} = \delta_0$, which shows that $\delta_0$ is the immediate change in y due to the one-unit increase in z at time t. $\delta_0$ is usually called the impact propensity or impact multiplier. Similarly, $\delta_1 = y_{t+1} - y_{t-1}$ is the change in y one period after the temporary change, and $\delta_2 = y_{t+2} - y_{t-1}$ is the change in y two
periods after the change. At time t + 3, y has reverted back to its initial level: $y_{t+3} = y_{t-1}$. This is because we have assumed that only two lags of z appear in (10.5). When we graph the $\delta_j$ as a function of j, we obtain the lag distribution, which summarizes the dynamic effect that a temporary increase in z has on y. A possible lag distribution for the FDL of order two is given in Figure 10.1. (Of course, we would never know the parameters $\delta_j$; instead, we will estimate the $\delta_j$ and then plot the estimated lag distribution.)

[Figure 10.1: A lag distribution with two nonzero lags. The coefficient $\delta_j$ is plotted against the lag j; the maximum effect is at the first lag.]

The lag distribution in Figure 10.1 implies that the largest effect is at the first lag. The lag distribution has a useful interpretation: if we standardize the initial value of y at $y_{t-1} = 0$, the lag distribution traces out all subsequent values of y due to a one-unit, temporary increase in z.

We are also interested in the change in y due to a permanent increase in z. Before time t, z equals the constant c. At time t, z increases permanently to c + 1: $z_s = c$, $s < t$, and $z_s = c + 1$, $s \geq t$. Again, setting the errors to zero, we have

$$y_{t-1} = \alpha_0 + \delta_0 c + \delta_1 c + \delta_2 c,$$
$$y_t = \alpha_0 + \delta_0(c + 1) + \delta_1 c + \delta_2 c,$$
$$y_{t+1} = \alpha_0 + \delta_0(c + 1) + \delta_1(c + 1) + \delta_2 c,$$
$$y_{t+2} = \alpha_0 + \delta_0(c + 1) + \delta_1(c + 1) + \delta_2(c + 1),$$

and so on. With the permanent increase in z, after one period, y has increased by $\delta_0 + \delta_1$, and after two periods, y has increased by $\delta_0 + \delta_1 + \delta_2$. There are no further changes in y after two periods. This shows that the sum of the coefficients on current and lagged z, $\delta_0 + \delta_1 + \delta_2$, is the long-run change in y given a permanent increase in z and is called the long-run propensity (LRP) or long-run multiplier. The LRP is often of interest in distributed lag models.

As an example, in equation (10.4), $\delta_0$ measures the immediate change in fertility due to a one-dollar increase in pe. As we mentioned earlier, there are reasons to believe that $\delta_0$ is small, if not zero. But $\delta_1$ or $\delta_2$, or both, might be positive. If pe permanently increases by one dollar, then, after two years, gfr will have changed by $\delta_0 + \delta_1 + \delta_2$. This model assumes that there are no further changes after two years. Whether this is actually the case is an empirical matter.

An FDL of order q is written as

$$y_t = \alpha_0 + \delta_0 z_t + \delta_1 z_{t-1} + \cdots + \delta_q z_{t-q} + u_t.\qquad(10.6)$$

This contains the static model as a special case by setting $\delta_1, \delta_2, \ldots, \delta_q$ equal to zero. Sometimes, a primary purpose for estimating a distributed lag model is to test whether z has a lagged effect on y. The impact propensity is always the coefficient on the contemporaneous z, $\delta_0$. Occasionally, we omit $z_t$ from (10.6), in which case the impact propensity is zero. In the general case, the lag distribution can be plotted by graphing the estimated $\delta_j$ as a function of j. For any horizon h, we can define the cumulative effect as $\delta_0 + \delta_1 + \cdots + \delta_h$, which is interpreted as the change in the expected outcome h periods after a permanent, one-unit increase in z. Once the $\delta_j$ have been estimated, one may plot the estimated cumulative effects as a function of h. The LRP is the cumulative effect
after all changes have taken place; it is simply the sum of all of the coefficients on the $z_{t-j}$:

$$LRP = \delta_0 + \delta_1 + \cdots + \delta_q.\qquad(10.7)$$

Because of the often substantial correlation in z at different lags, that is, due to multicollinearity in (10.6), it can be difficult to obtain precise estimates of the individual $\delta_j$. Interestingly, even when the $\delta_j$ cannot be precisely estimated, we can often get good estimates of the LRP. We will see an example later.

We can have more than one explanatory variable appearing with lags, or we can add contemporaneous variables to an FDL model. For example, the average education level for women of childbearing age could be added to (10.4), which allows us to account for changing education levels for women.

10.2c A Convention about the Time Index

When models have lagged explanatory variables (and, as we will see in the next chapter, for models with lagged y), confusion can arise concerning the treatment of initial observations. For example, if in (10.5) we assume that the equation holds starting at t = 1, then the explanatory variables for the first time period are $z_1$, $z_0$, and $z_{-1}$. Our convention will be that these are the initial values in our sample, so that we can always start the time index at t = 1. In practice, this is not very important, because regression packages automatically keep track of the observations available for estimating models with lags. But for this and the next two chapters, we need some convention concerning the first time period being represented by the regression equation.

Exploring Further 10.1: In an equation for annual data, suppose that

$$int_t = 1.6 + .48\,inf_t - .15\,inf_{t-1} + .32\,inf_{t-2} + u_t,$$

where int is an interest rate and inf is the inflation rate. What are the impact and long-run propensities?
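Before turning to the finite sample properties of OLS, here is a minimal sketch of estimating an FDL of order two, as in (10.5), and computing the LRP in (10.7). The data are simulated with a persistent z, so the individual lag coefficients are somewhat collinear while their sum is well estimated; all values and names are hypothetical.

```python
# Sketch: estimating an FDL of order two and its long-run propensity (LRP).
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 300
e = rng.normal(size=n)
z = np.zeros(n)
for t in range(1, n):
    z[t] = 0.7 * z[t - 1] + e[t]          # persistent z, so lags are collinear

df = pd.DataFrame({"z": z})
df["z_1"], df["z_2"] = df["z"].shift(1), df["z"].shift(2)
df["y"] = (2 + 0.5 * df["z"] + 1.0 * df["z_1"] + 0.3 * df["z_2"]
           + rng.normal(scale=0.5, size=n))
df = df.dropna()                           # lose two observations to the lags

res = smf.ols("y ~ z + z_1 + z_2", data=df).fit()
lrp = res.params[["z", "z_1", "z_2"]].sum()   # estimate of (10.7)
print(res.params)
print(f"estimated LRP = {lrp:.3f} (true value 1.8)")
```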
10.3 Finite Sample Properties of OLS under Classical Assumptions

In this section, we give a complete listing of the finite sample, or small sample, properties of OLS under standard assumptions. We pay particular attention to how the assumptions must be altered from our cross-sectional analysis to cover time series regressions.

10.3a Unbiasedness of OLS

The first assumption simply states that the time series process follows a model that is linear in its parameters.

Assumption TS.1 (Linear in Parameters): The stochastic process $\{(x_{t1}, x_{t2}, \ldots, x_{tk}, y_t): t = 1, 2, \ldots, n\}$ follows the linear model

$$y_t = \beta_0 + \beta_1 x_{t1} + \cdots + \beta_k x_{tk} + u_t,\qquad(10.8)$$

where $\{u_t: t = 1, 2, \ldots, n\}$ is the sequence of errors or disturbances. Here, n is the number of observations (time periods).

In the notation $x_{tj}$, t denotes the time period, and j is, as usual, a label to indicate one of the k explanatory variables. The terminology used in cross-sectional regression applies here: $y_t$ is the dependent variable, explained variable, or regressand; the $x_{tj}$ are the independent variables, explanatory variables, or regressors.

We should think of Assumption TS.1 as being essentially the same as Assumption MLR.1 (the first cross-sectional assumption), but we are now specifying a linear model for time series data. The examples covered in Section 10.2 can be cast in the form of (10.8) by appropriately defining $x_{tj}$. For example, equation (10.5) is obtained by setting $x_{t1} = z_t$, $x_{t2} = z_{t-1}$, and $x_{t3} = z_{t-2}$.

To state and discuss several of the remaining assumptions, we let $\mathbf{x}_t = (x_{t1}, x_{t2}, \ldots, x_{tk})$ denote the set of all independent variables in the equation at time t. Further, $\mathbf{X}$ denotes the collection of all independent variables for all time periods. It is useful to think of $\mathbf{X}$ as being an array, with n rows and k columns. This reflects how time series data are stored in econometric software packages: the t-th row of $\mathbf{X}$ is $\mathbf{x}_t$, consisting of all independent variables for time period t. Therefore, the first row of $\mathbf{X}$ corresponds to t = 1, the second row to t = 2, and the last row to t = n. An example is given in Table 10.2, using n = 8 and the explanatory variables in equation (10.3).

Table 10.2 Example of X for the Explanatory Variables in Equation (10.3)

t    convrte    unem    yngmle
1    .46        .074    .12
2    .42        .071    .12
3    .42        .063    .11
4    .47        .062    .09
5    .48        .060    .10
6    .50        .059    .11
7    .55        .058    .12
8    .56        .059    .13

Naturally, as with cross-sectional regression, we need to rule out perfect collinearity among the regressors.

Assumption TS.2 (No Perfect Collinearity): In the sample (and therefore in the underlying time series process), no independent variable is constant nor a perfect linear combination of the others.

We discussed this assumption at length in the context of cross-sectional data in Chapter 3. The issues are essentially the same with time series data. Remember, Assumption TS.2 does allow the explanatory variables to be correlated, but it rules out perfect correlation in the sample.

The final assumption for unbiasedness of OLS is the time series analog of Assumption MLR.4, and it also obviates the need for random sampling in Assumption MLR.2.

Assumption TS.3 (Zero Conditional Mean): For each t, the expected value of the error $u_t$, given the explanatory variables for all time periods, is zero. Mathematically,

$$E(u_t|\mathbf{X}) = 0,\quad t = 1, 2, \ldots, n.\qquad(10.9)$$

This is a crucial assumption, and we need to have an intuitive grasp of its meaning. As in the cross-sectional case, it is easiest to view this assumption in terms of uncorrelatedness: Assumption TS.3 implies that the error at time t, $u_t$, is uncorrelated with each explanatory variable in every time period. (The fact that this is stated in terms of the conditional expectation means that we must also correctly specify the functional relationship between $y_t$ and the explanatory variables.) If $u_t$ is independent of $\mathbf{X}$ and $E(u_t) = 0$, then Assumption TS.3 automatically holds.

Given the cross-sectional analysis from Chapter 3, it is not surprising that we require $u_t$ to be uncorrelated with the explanatory variables also dated at time t; in conditional mean terms,

$$E(u_t|x_{t1}, \ldots, x_{tk}) = E(u_t|\mathbf{x}_t) = 0.\qquad(10.10)$$

When (10.10) holds, we say that the $x_{tj}$ are contemporaneously exogenous. Equation (10.10) implies that $u_t$ and the explanatory variables are contemporaneously uncorrelated: $Corr(x_{tj}, u_t) = 0$, for all j.

Assumption TS.3 requires more than contemporaneous exogeneity: $u_t$ must be uncorrelated with $x_{sj}$, even when $s \neq t$. This is a strong sense in which the explanatory variables must be exogenous, and when TS.3 holds, we say that the explanatory variables are strictly exogenous.
In Chapter 11, we will demonstrate that (10.10) is sufficient for proving consistency of the OLS estimator. But to show that OLS is unbiased, we need the strict exogeneity assumption.

In the cross-sectional case, we did not explicitly state how the error term for, say, person i, $u_i$, is related to the explanatory variables for other people in the sample. This was unnecessary because, with random sampling (Assumption MLR.2), $u_i$ is automatically independent of the explanatory variables for observations other than i. In a time series context, random sampling is almost never appropriate, so we must explicitly assume that the expected value of $u_t$ is not related to the explanatory variables in any time periods.

It is important to see that Assumption TS.3 puts no restriction on correlation in the independent variables or in the $u_t$ across time. Assumption TS.3 only says that the average value of $u_t$ is unrelated to the independent variables in all time periods.

Anything that causes the unobservables at time t to be correlated with any of the explanatory variables in any time period causes Assumption TS.3 to fail. Two leading candidates for failure are omitted variables and measurement error in some of the regressors. But the strict exogeneity assumption can also fail for other, less obvious reasons. In the simple static regression model

$$y_t = \beta_0 + \beta_1 z_t + u_t,$$

Assumption TS.3 requires not only that $u_t$ and $z_t$ are uncorrelated, but also that $u_t$ is uncorrelated with past and future values of z. This has two implications. First, z can have no lagged effect on y. If z does have a lagged effect on y, then we should estimate a distributed lag model. A more subtle point is that strict exogeneity excludes the possibility that changes in the error term today can cause future changes in z. This effectively rules out feedback from y to future values of z. For example, consider a simple static model to explain a city's murder rate in terms of police officers per capita:

$$mrdrte_t = \beta_0 + \beta_1 polpc_t + u_t.$$

It may be reasonable to assume that $u_t$ is uncorrelated with $polpc_t$ and even with past values of $polpc_t$; for the sake of argument, assume this is the case. But suppose that the city adjusts the size of its police force based on past values of the murder rate. This means that, say, $polpc_{t+1}$ might be correlated with $u_t$ (since a higher $u_t$ leads to a higher $mrdrte_t$). If this is the case, Assumption TS.3 is generally violated; a small simulation, sketched below, makes the point concrete.

There are similar considerations in distributed lag models. Usually, we do not worry that $u_t$ might be correlated with past z, because we are controlling for past z in the model. But feedback from u to future z is always an issue.
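The following simulation sketch (with hypothetical parameter values) generates data with exactly this kind of feedback and shows that the error is correlated with the next period's police variable, so strict exogeneity fails even though contemporaneous exogeneity holds.

```python
# Sketch: feedback from y to future z violates strict exogeneity (TS.3).
import numpy as np

rng = np.random.default_rng(0)
T = 10_000
u = rng.normal(size=T)
polpc = np.zeros(T)
mrdrte = np.zeros(T)
polpc[0] = 1.0
for t in range(T):
    mrdrte[t] = 5.0 - 0.5 * polpc[t] + u[t]     # static model, true beta1 < 0
    if t + 1 < T:
        polpc[t + 1] = 0.5 + 0.3 * mrdrte[t]    # city reacts to past murder rate

# u_t is clearly correlated with polpc_{t+1}, even though E(u_t | polpc_t) = 0:
print(np.corrcoef(u[:-1], polpc[1:])[0, 1])
```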
Explanatory variables that are strictly exogenous cannot react to what has happened to y in the past. A factor such as the amount of rainfall in an agricultural production function satisfies this requirement: rainfall in any future year is not influenced by the output during the current or past years. But something like the amount of labor input might not be strictly exogenous, as it is chosen by the farmer, and the farmer may adjust the amount of labor based on last year's yield. Policy variables, such as growth in the money supply, expenditures on welfare, and highway speed limits, are often influenced by what has happened to the outcome variable in the past. In the social sciences, many explanatory variables may very well violate the strict exogeneity assumption.

Even though Assumption TS.3 can be unrealistic, we begin with it in order to conclude that the OLS estimators are unbiased. Most treatments of static and FDL models assume TS.3 by making the stronger assumption that the explanatory variables are nonrandom, or fixed in repeated samples. The nonrandomness assumption is obviously false for time series observations. Assumption TS.3 has the advantage of being more realistic about the random nature of the $x_{tj}$, while it isolates the necessary assumption about how $u_t$ and the explanatory variables are related in order for OLS to be unbiased.

Theorem 10.1 (Unbiasedness of OLS): Under Assumptions TS.1, TS.2, and TS.3, the OLS estimators are unbiased conditional on $\mathbf{X}$, and therefore unconditionally as well when the expectations exist: $E(\hat\beta_j) = \beta_j$, $j = 0, 1, \ldots, k$.

The proof of this theorem is essentially the same as that for Theorem 3.1 in Chapter 3, and so we omit it. When comparing Theorem 10.1 to Theorem 3.1, we have been able to drop the random sampling assumption by assuming that, for each t, $u_t$ has zero mean given the explanatory variables at all time periods. If this assumption does not hold, OLS cannot be shown to be unbiased.

Exploring Further 10.2: In the FDL model $y_t = \alpha_0 + \delta_0 z_t + \delta_1 z_{t-1} + u_t$, what do we need to assume about the sequence $\{z_0, z_1, \ldots, z_n\}$ in order for Assumption TS.3 to hold?

The analysis of omitted variables bias, which we covered in Section 3.3, is essentially the same in the time series case. In particular, Table 3.2 and the discussion surrounding it can be used as before to determine the directions of bias due to omitted variables.

10.3b The Variances of the OLS Estimators and the Gauss-Markov Theorem

We need to add two assumptions to round out the Gauss-Markov assumptions for time series regressions. The first one is familiar from cross-sectional analysis.

Assumption TS.4 (Homoskedasticity): Conditional on $\mathbf{X}$, the variance of $u_t$ is the same for all t: $Var(u_t|\mathbf{X}) = Var(u_t) = \sigma^2$, $t = 1, 2, \ldots, n$.

This assumption means that $Var(u_t|\mathbf{X})$ cannot depend on $\mathbf{X}$ (it is sufficient that $u_t$ and $\mathbf{X}$ are independent) and that $Var(u_t)$ is constant over time. When TS.4 does not hold, we say that the errors are heteroskedastic, just as in the cross-sectional case. For example, consider an equation for determining three-month T-bill rates ($i3_t$) based on the inflation rate ($inf_t$) and the federal deficit as a percentage of gross domestic product ($def_t$):

$$i3_t = \beta_0 + \beta_1 inf_t + \beta_2 def_t + u_t.\qquad(10.11)$$

Among other things, Assumption TS.4 requires that the unobservables affecting interest rates have a constant variance over time. Since policy regime changes are known to affect the variability of interest rates, this assumption might very well be false. Further, it could be that the variability in
interest rates depends on the level of inflation or the relative size of the deficit. This would also violate the homoskedasticity assumption.

When $Var(u_t|\mathbf{X})$ does depend on $\mathbf{X}$, it often depends on the explanatory variables at time t, $\mathbf{x}_t$. In Chapter 12, we will see that the tests for heteroskedasticity from Chapter 8 can also be used for time series regressions, at least under certain assumptions.

The final Gauss-Markov assumption for time series analysis is new.

Assumption TS.5 (No Serial Correlation): Conditional on $\mathbf{X}$, the errors in two different time periods are uncorrelated: $Corr(u_t, u_s|\mathbf{X}) = 0$, for all $t \neq s$.

The easiest way to think of this assumption is to ignore the conditioning on $\mathbf{X}$. Then, Assumption TS.5 is simply

$$Corr(u_t, u_s) = 0,\quad \text{for all } t \neq s.\qquad(10.12)$$

(This is how the no serial correlation assumption is stated when $\mathbf{X}$ is treated as nonrandom.) When considering whether Assumption TS.5 is likely to hold, we focus on equation (10.12) because of its simple interpretation.

When (10.12) is false, we say that the errors in (10.8) suffer from serial correlation, or autocorrelation, because they are correlated across time. Consider the case of errors from adjacent time periods. Suppose that, when $u_{t-1} > 0$, then, on average, the error in the next time period, $u_t$, is also positive. Then, $Corr(u_t, u_{t-1}) > 0$, and the errors suffer from serial correlation. In equation (10.11), this means that, if interest rates are unexpectedly high for this period, then they are likely to be above
the explanatory variables such as property tax rates or per capita welfare payments Correlation of the explanatory variables across observations does not cause problems for verifying the GaussMarkov assumptions provided the error terms are uncorrelated across cities However in this chapter we are primarily interested in applying the GaussMarkov assumptions to time series regression problems oLs saMPLiNg VariaNCes Under the time series GaussMarkov Assumptions TS1 through TS5 the variance of b j conditional on X is Var1b j0X2 5 s23SSTj11 2 R2 j 2 4 j 5 1 p k 1013 where SSTj is the total sum of squares of xtj and R2 j is the Rsquared from the regression of xj on the other independent variables thEorEm 102 Equation 1013 is the same variance we derived in Chapter 3 under the crosssectional Gauss Markov assumptions Because the proof is very similar to the one for Theorem 32 we omit it The discussion from Chapter 3 about the factors causing large variances including multicollinearity among the explanatory variables applies immediately to the time series case The usual estimator of the error variance is also unbiased under Assumptions TS1 through TS5 and the GaussMarkov Theorem holds UNbiased estiMatioN of s2 Under Assumptions TS1 through TS5 the estimator s 2 5 SSRdf is an unbiased estimator of s2 where df 5 n 2 k 2 1 thEorEm 103 gaUssMarkoV tHeoreM Under Assumptions TS1 through TS5 the OLS estimators are the best linear unbiased estimators conditional on X thEorEm 104 Copyright 2016 Cengage Learning All Rights Reserved May not be copied scanned or duplicated in whole or in part Due to electronic rights some third party content may be suppressed from the eBook andor eChapters Editorial review has deemed that any suppressed content does not materially affect the overall learning experience Cengage Learning reserves the right to remove additional content at any time if subsequent rights restrictions require it PART 2 Regression Analysis with Time Series Data 322 The bottom line here is that OLS has the same desirable finite sample properties under TS1 through TS5 that it has under MLR1 through MLR5 103c Inference under the Classical Linear Model Assumptions In order to use the usual OLS standard errors t statistics and F statistics we need to add a final assumption that is analogous to the normality assumption we used for crosssectional analysis Assumption TS6 Normality The errors ut are independent of X and are independently and identically distributed as Normal10s22 Assumption TS6 implies TS3 TS4 and TS5 but it is stronger because of the independence and normality assumptions In the FDL model yt 5 a0 1 d0 zt 1 d1zt21 1 ut explain the nature of any multicollinearity in the explanatory variables Exploring FurthEr 103 NorMaL saMPLiNg distribUtioNs Under Assumptions TS1 through TS6 the CLM assumptions for time series the OLS estimators are normally distributed conditional on X Further under the null hypothesis each t statistic has a t distri bution and each F statistic has an F distribution The usual construction of confidence intervals is also valid thEorEm 105 The implications of Theorem 105 are of utmost importance It implies that when Assumptions TS1 through TS6 hold everything we have learned about estimation and inference for crosssectional regressions applies directly to time series regressions Thus t statistics can be used for testing statistical significance of individual explanatory variables and F statistics can be used to test for joint significance Just as in the crosssectional case the 
usual inference procedures are only as good as the underlying assumptions The classical linear model assumptions for time series data are much more restrictive than those for crosssectional datain particular the strict exogeneity and no serial correlation assumptions can be unrealistic Nevertheless the CLM framework is a good starting point for many applications exaMPLe 101 static Phillips Curve To determine whether there is a tradeoff on average between unemployment and inflation we can test H0 b1 5 0 against H1 b1 0 in equation 102 If the classical linear model assumptions hold we can use the usual OLS t statistic We use the file PHILLIPS to estimate equation 102 restricting ourselves to the data through 1996 In later exercises for example Computer Exercises C12 and C10 in Chapter 11 you are asked to use all years through 2003 In Chapter 18 we use the years 1997 through 2003 in various forecast ing exercises The simple regression estimates are inft 5 142 1 468 unemt 11722 12892 1014 n 5 49 R2 5 053 R2 5 033 Copyright 2016 Cengage Learning All Rights Reserved May not be copied scanned or duplicated in whole or in part Due to electronic rights some third party content may be suppressed from the eBook andor eChapters Editorial review has deemed that any suppressed content does not materially affect the overall learning experience Cengage Learning reserves the right to remove additional content at any time if subsequent rights restrictions require it CHAPTER 10 Basic Regression Analysis with Time Series Data 323 This equation does not suggest a tradeoff between unem and inf b 1 0 The t statistic for b 1 is about 162 which gives a pvalue against a twosided alternative of about 11 Thus if anything there is a positive relationship between inflation and unemployment There are some problems with this analysis that we cannot address in detail now In Chapter 12 we will see that the CLM assumptions do not hold In addition the static Phillips curve is probably not the best model for determining whether there is a shortrun tradeoff between inflation and unem ployment Macroeconomists generally prefer the expectations augmented Phillips curve a simple example of which is given in Chapter 11 As a second example we estimate equation 1011 using annual data on the US economy exaMPLe 102 effects of inflation and deficits on interest rates The data in INTDEF come from the 2004 Economic Report of the President Tables B73 and B79 and span the years 1948 through 2003 The variable i3 is the threemonth Tbill rate inf is the annual inflation rate based on the consumer price index CPI and def is the federal budget deficit as a per centage of GDP The estimated equation is i3t 5 173 1 606 inft 1 513 deft 10432 10822 11182 1015 n 5 56 R2 5 602 R2 5 587 These estimates show that increases in inflation or the relative size of the deficit increase shortterm interest rates both of which are expected from basic economics For example a ceteris paribus one percentage point increase in the inflation rate increases i3 by 606 points Both inf and def are very statistically significant assuming of course that the CLM assumptions hold 104 Functional Form Dummy Variables and Index Numbers All of the functional forms we learned about in earlier chapters can be used in time series regressions The most important of these is the natural logarithm time series regressions with constant percentage effects appear often in applied work exaMPLe 103 Puerto rican employment and the Minimum Wage Annual data on the Puerto Rican employment rate 
10.4 Functional Form, Dummy Variables, and Index Numbers

All of the functional forms we learned about in earlier chapters can be used in time series regressions. The most important of these is the natural logarithm: time series regressions with constant percentage effects appear often in applied work.

Example 10.3 (Puerto Rican Employment and the Minimum Wage). Annual data on the Puerto Rican employment rate, minimum wage, and other variables are used by Castillo-Freeman and Freeman (1992) to study the effects of the U.S. minimum wage on employment in Puerto Rico. A simplified version of their model is

$\log(\mathit{prepop}_t) = \beta_0 + \beta_1 \log(\mathit{mincov}_t) + \beta_2 \log(\mathit{usgnp}_t) + u_t$,   (10.16)

where $\mathit{prepop}_t$ is the employment rate in Puerto Rico during year t (ratio of those working to total population), $\mathit{usgnp}_t$ is real U.S. gross national product (in billions of dollars), and mincov measures the importance of the minimum wage relative to average wages. In particular, mincov = (avgmin/avgwage)·avgcov, where avgmin is the average minimum wage, avgwage is the average overall wage, and avgcov is the average coverage rate (the proportion of workers actually covered by the minimum wage law).

Using the data in PRMINWGE for the years 1950 through 1987 gives

$\widehat{\log(\mathit{prepop}_t)} = -1.05 - .154\,\log(\mathit{mincov}_t) - .012\,\log(\mathit{usgnp}_t)$   (10.17)
(0.77) (.065) (.089)
$n = 38$, $R^2 = .661$, $\bar{R}^2 = .641$.

The estimated elasticity of prepop with respect to mincov is $-.154$, and it is statistically significant with $t = -2.37$. Therefore, a higher minimum wage lowers the employment rate, something that classical economics predicts. The GNP variable is not statistically significant, but this changes when we account for a time trend in the next section.

We can use logarithmic functional forms in distributed lag models, too. For example, for quarterly data, suppose that money demand ($M_t$) and gross domestic product ($\mathit{GDP}_t$) are related by

$\log(M_t) = \alpha_0 + \delta_0 \log(\mathit{GDP}_t) + \delta_1 \log(\mathit{GDP}_{t-1}) + \delta_2 \log(\mathit{GDP}_{t-2}) + \delta_3 \log(\mathit{GDP}_{t-3}) + \delta_4 \log(\mathit{GDP}_{t-4}) + u_t$.

The impact propensity in this equation, $\delta_0$, is also called the short-run elasticity: it measures the immediate percentage change in money demand given a 1% increase in GDP. The LRP, $\delta_0 + \delta_1 + \cdots + \delta_4$, is sometimes called the long-run elasticity: it measures the percentage increase in money demand after four quarters given a permanent 1% increase in GDP.
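The short-run and long-run elasticities can be recovered mechanically once the lags are constructed. A sketch follows; the file name moneydem.csv and the column names M and GDP are hypothetical stand-ins for a quarterly data source.

```python
# A sketch of estimating a distributed lag model in logs and recovering the
# short-run and long-run elasticities. 'moneydem.csv', with quarterly columns
# 'M' and 'GDP', is a hypothetical data source.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("moneydem.csv")
df["lM"] = np.log(df["M"])
df["lGDP"] = np.log(df["GDP"])
for j in range(1, 5):                         # log(GDP) lagged 1 through 4 quarters
    df[f"lGDP{j}"] = df["lGDP"].shift(j)

res = smf.ols("lM ~ lGDP + lGDP1 + lGDP2 + lGDP3 + lGDP4",
              data=df.dropna()).fit()

impact = res.params["lGDP"]                   # short-run elasticity, delta_0
lrp = res.params[["lGDP", "lGDP1", "lGDP2", "lGDP3", "lGDP4"]].sum()
print(impact, lrp)                            # long-run elasticity = sum of deltas
```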
Binary or dummy independent variables are also quite useful in time series applications. Since the unit of observation is time, a dummy variable represents whether, in each time period, a certain event has occurred. For example, for annual data, we can indicate in each year whether a Democrat or a Republican is president of the United States by defining a variable $\mathit{democ}_t$, which is unity if the president is a Democrat, and zero otherwise. Or, in looking at the effects of capital punishment on murder rates in Texas, we can define a dummy variable for each year equal to one if Texas had capital punishment during that year, and zero otherwise. Often, dummy variables are used to isolate certain periods that may be systematically different from other periods covered by a data set.

Example 10.4 (Effects of Personal Exemption on Fertility Rates). The general fertility rate (gfr) is the number of children born to every 1,000 women of childbearing age. For the years 1913 through 1984, the equation

$\mathit{gfr}_t = \beta_0 + \beta_1 \mathit{pe}_t + \beta_2 \mathit{ww2}_t + \beta_3 \mathit{pill}_t + u_t$

explains gfr in terms of the average real dollar value of the personal tax exemption (pe) and two binary variables. The variable ww2 takes on the value unity during the years 1941 through 1945, when the United States was involved in World War II. The variable pill is unity from 1963 onward, when the birth control pill was made available for contraception.

Using the data in FERTIL3, which were taken from the article by Whittington, Alm, and Peters (1990),

$\widehat{\mathit{gfr}}_t = 98.68 + .083\,\mathit{pe}_t - 24.24\,\mathit{ww2}_t - 31.59\,\mathit{pill}_t$   (10.18)
(3.21) (.030) (7.46) (4.08)
$n = 72$, $R^2 = .473$, $\bar{R}^2 = .450$.

Each variable is statistically significant at the 1% level against a two-sided alternative. We see that the fertility rate was lower during World War II: given pe, there were about 24 fewer births for every 1,000 women of childbearing age, which is a large reduction. (From 1913 through 1984, gfr ranged from about 65 to 127.) Similarly, the fertility rate has been substantially lower since the introduction of the birth control pill.

The variable of economic interest is pe. The average pe over this time period is $100.40, ranging from zero to $243.83. The coefficient on pe implies that a $12.00 increase in pe increases gfr by about one birth per 1,000 women of childbearing age. This effect is hardly trivial.

In Section 10.2, we noted that the fertility rate may react to changes in pe with a lag. Estimating a distributed lag model with two lags gives

$\widehat{\mathit{gfr}}_t = 95.87 + .073\,\mathit{pe}_t - .0058\,\mathit{pe}_{t-1} + .034\,\mathit{pe}_{t-2} - 22.12\,\mathit{ww2}_t - 31.30\,\mathit{pill}_t$   (10.19)
(3.28) (.126) (.1557) (.126) (10.73) (3.98)
$n = 70$, $R^2 = .499$, $\bar{R}^2 = .459$.

In this regression, we only have 70 observations because we lose two when we lag pe twice. The coefficients on the pe variables are estimated very imprecisely, and each one is individually insignificant. It turns out that there is substantial correlation between $\mathit{pe}_t$, $\mathit{pe}_{t-1}$, and $\mathit{pe}_{t-2}$, and this multicollinearity makes it difficult to estimate the effect at each lag. However, $\mathit{pe}_t$, $\mathit{pe}_{t-1}$, and $\mathit{pe}_{t-2}$ are jointly significant: the F statistic has a p-value = .012. Thus, pe does have an effect on gfr, as we already saw in (10.18), but we do not have good enough estimates to determine whether it is contemporaneous or with a one- or two-year lag (or some of each). Actually, $\mathit{pe}_{t-1}$ and $\mathit{pe}_{t-2}$ are jointly insignificant in this equation (p-value = .95), so at this point, we would be justified in using the static model. But for illustrative purposes, let us obtain a confidence interval for the LRP in this model.

The estimated LRP in (10.19) is $.073 - .0058 + .034 \approx .101$. However, we do not have enough information in (10.19) to obtain the standard error of this estimate. To obtain the standard error of the estimated LRP, we use the trick suggested in Section 4.4. Let $\theta_0 = \delta_0 + \delta_1 + \delta_2$ denote the LRP, and write $\delta_0$ in terms of $\theta_0$, $\delta_1$, and $\delta_2$ as $\delta_0 = \theta_0 - \delta_1 - \delta_2$. Next, substitute for $\delta_0$ in the model

$\mathit{gfr}_t = \alpha_0 + \delta_0 \mathit{pe}_t + \delta_1 \mathit{pe}_{t-1} + \delta_2 \mathit{pe}_{t-2} + \cdots$

to get

$\mathit{gfr}_t = \alpha_0 + (\theta_0 - \delta_1 - \delta_2)\mathit{pe}_t + \delta_1 \mathit{pe}_{t-1} + \delta_2 \mathit{pe}_{t-2} + \cdots$
$= \alpha_0 + \theta_0 \mathit{pe}_t + \delta_1(\mathit{pe}_{t-1} - \mathit{pe}_t) + \delta_2(\mathit{pe}_{t-2} - \mathit{pe}_t) + \cdots$.

From this last equation, we can obtain $\hat{\theta}_0$ and its standard error by regressing $\mathit{gfr}_t$ on $\mathit{pe}_t$, $(\mathit{pe}_{t-1} - \mathit{pe}_t)$, $(\mathit{pe}_{t-2} - \mathit{pe}_t)$, $\mathit{ww2}_t$, and $\mathit{pill}_t$. The coefficient and associated standard error on $\mathit{pe}_t$ are what we need. Running this regression gives $\hat{\theta}_0 = .101$ as the coefficient on $\mathit{pe}_t$ (as we already knew) and $\mathrm{se}(\hat{\theta}_0) = .030$, which we could not compute from (10.19). Therefore, the t statistic for $\hat{\theta}_0$ is about 3.37, so $\hat{\theta}_0$ is statistically different from zero at small significance levels. Even though none of the $\hat{\delta}_j$ is individually significant, the LRP is very significant. The 95% confidence interval for the LRP is about .041 to .160.
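The substitution trick is easy to carry out with any regression software. Below is a minimal sketch; fertil3.csv and the column names gfr, pe, ww2, and pill are assumptions about how the FERTIL3 data might be stored.

```python
# A sketch of the Section 4.4 trick for getting the LRP and its standard error.
# Assumes 'fertil3.csv' holds the FERTIL3 data with columns 'gfr', 'pe',
# 'ww2', and 'pill' (hypothetical file and column names).
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("fertil3.csv")
df["cpe1"] = df["pe"].shift(1) - df["pe"]     # (pe_{t-1} - pe_t)
df["cpe2"] = df["pe"].shift(2) - df["pe"]     # (pe_{t-2} - pe_t)

res = smf.ols("gfr ~ pe + cpe1 + cpe2 + ww2 + pill", data=df.dropna()).fit()

# The coefficient on pe is now theta0 = delta0 + delta1 + delta2, the LRP,
# and its standard error is the one that cannot be read off (10.19).
print(res.params["pe"], res.bse["pe"])        # about .101 and .030 in the text
```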
Whittington, Alm, and Peters (1990) allow for further lags but restrict the coefficients to help alleviate the multicollinearity problem that hinders estimation of the individual $\delta_j$. (See Problem 6 for an example of how to do this.) For estimating the LRP, which would seem to be of primary interest here, such restrictions are unnecessary. Whittington, Alm, and Peters also control for additional variables, such as average female wage and the unemployment rate.

Binary explanatory variables are the key component in what is called an event study. In an event study, the goal is to see whether a particular event influences some outcome. Economists who study industrial organization have looked at the effects of certain events on firm stock prices. For example, Rose (1985) studied the effects of new trucking regulations on the stock prices of trucking companies. A simple version of an equation used for such event studies is

$R_t^f = \beta_0 + \beta_1 R_t^m + \beta_2 d_t + u_t$,

where $R_t^f$ is the stock return for firm f during period t (usually a week or a month), $R_t^m$ is the market return (usually computed for a broad stock market index), and $d_t$ is a dummy variable indicating when the event occurred. For example, if the firm is an airline, $d_t$ might denote whether the airline experienced a publicized accident or near accident during week t. Including $R_t^m$ in the equation controls for the possibility that broad market movements might coincide with airline accidents. Sometimes, multiple dummy variables are used. For example, if the event is the imposition of a new regulation that might affect a certain firm, we might include a dummy variable that is one for a few weeks before the regulation was publicly announced and a second dummy variable for a few weeks after the regulation was announced. The first dummy variable might detect the presence of inside information.
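Mechanically, the event-study equation is just OLS with a dummy regressor. The sketch below uses simulated weekly returns, purely for illustration; no real firm data are involved, and the built-in abnormal return of minus three percentage points is an assumption of the simulation.

```python
# A sketch of an event-study regression R_ft = b0 + b1*R_mt + b2*d_t + u_t,
# using simulated weekly returns (illustrative data only).
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 200
mkt = rng.normal(0.002, 0.02, n)              # market return
event = np.zeros(n)
event[100:104] = 1                            # event dummy: weeks 100-103
firm = 0.001 + 1.1 * mkt - 0.03 * event + rng.normal(0, 0.02, n)

df = pd.DataFrame({"firm": firm, "mkt": mkt, "event": event})
res = smf.ols("firm ~ mkt + event", data=df).fit()
# b2 estimates the abnormal return in event weeks, controlling for the market.
print(res.params["event"], res.bse["event"])
```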
Before we give an example of an event study, we need to discuss the notion of an index number and the difference between nominal and real economic variables. An index number typically aggregates a vast amount of information into a single quantity. Index numbers are used regularly in time series analysis, especially in macroeconomic applications. An example of an index number is the index of industrial production (IIP), computed monthly by the Board of Governors of the Federal Reserve. The IIP is a measure of production across a broad range of industries, and, as such, its magnitude in a particular year has no quantitative meaning. In order to interpret the magnitude of the IIP, we must know the base period and the base value. In the 1997 Economic Report of the President (ERP), the base year is 1987, and the base value is 100. (Setting IIP to 100 in the base period is just a convention; it makes just as much sense to set IIP = 1 in 1987, and some indexes are defined with 1 as the base value.) Because the IIP was 107.7 in 1992, we can say that industrial production was 7.7% higher in 1992 than in 1987. We can use the IIP in any two years to compute the percentage difference in industrial output during those two years. For example, because IIP = 61.4 in 1970 and IIP = 85.7 in 1979, industrial production grew by about 39.6% during the 1970s.

It is easy to change the base period for any index number, and sometimes we must do this to give index numbers reported with different base years a common base year. For example, if we want to change the base year of the IIP from 1987 to 1982, we simply divide the IIP for each year by the 1982 value and then multiply by 100 to make the base period value 100. Generally, the formula is

$\mathit{newindex}_t = 100(\mathit{oldindex}_t/\mathit{oldindex}_{\mathit{newbase}})$,   (10.20)

where $\mathit{oldindex}_{\mathit{newbase}}$ is the original value of the index in the new base year. For example, with base year 1987, the IIP in 1992 is 107.7; if we change the base year to 1982, the IIP in 1992 becomes 100(107.7/81.9) = 131.5 (because the IIP in 1982 was 81.9).

Another important example of an index number is a price index, such as the CPI. We already used the CPI to compute annual inflation rates in Example 10.1. As with the industrial production index, the CPI is only meaningful when we compare it across different years (or months, if we are using monthly data). In the 1997 ERP, CPI = 38.8 in 1970 and CPI = 130.7 in 1990. Thus, the general price level grew by almost 237% over this 20-year period. (In 1997, the CPI is defined so that its average in 1982, 1983, and 1984 equals 100; thus, the base period is listed as 1982-1984.)

In addition to being used to compute inflation rates, price indexes are necessary for turning a time series measured in nominal dollars (or current dollars) into real dollars (or constant dollars). Most economic behavior is assumed to be influenced by real, not nominal, variables. For example, classical labor economics assumes that labor supply is based on the real hourly wage, not the nominal wage. Obtaining the real wage from the nominal wage is easy if we have a price index such as the CPI. We must be a little careful to first divide the CPI by 100, so that the value in the base year is 1. Then, if w denotes the average hourly wage in nominal dollars and p = CPI/100, the real wage is simply w/p. This wage is measured in dollars for the base period of the CPI. For example, in Table B-45 in the 1997 ERP, average hourly earnings are reported in nominal terms and in 1982 dollars (which means that the CPI used in computing the real wage had the base year 1982). This table reports that the nominal hourly wage in 1960 was $2.09, but measured in 1982 dollars, the wage was $6.79. The real hourly wage had peaked in 1973, at $8.55 in 1982 dollars, and had fallen to $7.40 by 1995. Thus, there was a nontrivial decline in real wages over those 22 years. If we compare nominal wages from 1973 and 1995, we get a very misleading picture: $3.94 in 1973 and $11.44 in 1995. Because the real wage fell, the increase in the nominal wage was due entirely to inflation.
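Both calculations in this passage, rebasing an index via (10.20) and converting a nominal wage to a real wage, are one-liners. A small sketch follows; the 1960 CPI value of roughly 30.8 (on a 1982 base) is not given in the text and is backed out here from the reported wages, so treat it as an approximation.

```python
# A sketch of equation (10.20) and of the nominal-to-real conversion.
def change_base(old_index, old_index_newbase):
    # newindex_t = 100 * (oldindex_t / oldindex_newbase)
    return 100.0 * old_index / old_index_newbase

print(change_base(107.7, 81.9))    # 1992 IIP rebased from 1987 to 1982: about 131.5

# Real wage: divide the CPI by 100 so the base-year value is 1, then divide.
w_1960 = 2.09                      # nominal hourly wage in 1960
cpi_1960 = 30.8                    # approximate 1960 CPI, 1982 base (implied, not from the text)
print(w_1960 / (cpi_1960 / 100))   # about 6.79, the 1960 wage in 1982 dollars
```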
Standard measures of economic output are in real terms. The most important of these is gross domestic product, or GDP. When growth in GDP is reported in the popular press, it is always real GDP growth. In the 2012 ERP, Table B-2, GDP is reported in billions of 2005 dollars. We used a similar measure of output, real gross national product, in Example 10.3.

Interesting things happen when real dollar variables are used in combination with natural logarithms. Suppose, for example, that average weekly hours worked are related to the real wage as

$\log(\mathit{hours}) = \beta_0 + \beta_1 \log(w/p) + u$.

Using the fact that $\log(w/p) = \log(w) - \log(p)$, we can write this as

$\log(\mathit{hours}) = \beta_0 + \beta_1 \log(w) + \beta_2 \log(p) + u$,   (10.21)

but with the restriction that $\beta_2 = -\beta_1$. Therefore, the assumption that only the real wage influences labor supply imposes a restriction on the parameters of model (10.21). If $\beta_2 \neq -\beta_1$, then the price level has an effect on labor supply, something that can happen if workers do not fully understand the distinction between real and nominal wages.
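Because $\beta_2 = -\beta_1$ is a single linear restriction, it can be tested directly with a t test. A simulated sketch follows; all data here are artificial, and the restriction holds by construction in the simulation, so the test should (usually) fail to reject.

```python
# A sketch of testing the real-wage restriction b2 = -b1 in (10.21),
# using simulated data in which only the real wage affects hours.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(2)
n = 300
lw = rng.normal(2.0, 0.3, n)                 # log nominal wage
lp = rng.normal(0.5, 0.2, n)                 # log price level
lhours = 3.0 + 0.2 * (lw - lp) + rng.normal(0, 0.1, n)   # only w/p matters

df = pd.DataFrame({"lhours": lhours, "lw": lw, "lp": lp})
res = smf.ols("lhours ~ lw + lp", data=df).fit()
print(res.t_test("lw + lp = 0"))             # t test of the single restriction
```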
There are many practical aspects to the actual computation of index numbers, but it would take us too far afield to cover those here. Detailed discussions of price indexes can be found in most intermediate macroeconomic texts, such as Mankiw (1994, Chapter 2). For us, it is important to be able to use index numbers in regression analysis. As mentioned earlier, since the magnitudes of index numbers are not especially informative, they often appear in logarithmic form, so that regression coefficients have percentage change interpretations.

We now give an example of an event study that also uses index numbers.

Example 10.5 (Antidumping Filings and Chemical Imports). Krupp and Pollard (1996) analyzed the effects of antidumping filings by U.S. chemical industries on imports of various chemicals. We focus here on one industrial chemical, barium chloride, a cleaning agent used in various chemical processes and in gasoline production. The data are contained in the file BARIUM. In the early 1980s, U.S. barium chloride producers believed that China was offering its U.S. imports of barium chloride at an unfairly low price (an action known as dumping), and the barium chloride industry filed a complaint with the U.S. International Trade Commission (ITC) in October 1983. The ITC ruled in favor of the U.S. barium chloride industry in October 1984. There are several questions of interest in this case, but we will touch on only a few of them. First, were imports unusually high in the period immediately preceding the initial filing? Second, did imports change noticeably after an antidumping filing? Finally, what was the reduction in imports after a decision in favor of the U.S. industry?

To answer these questions, we follow Krupp and Pollard by defining three dummy variables: befile6 is equal to 1 during the six months before filing, affile6 indicates the six months after filing, and afdec6 denotes the six months after the positive decision. The dependent variable is the volume of imports of barium chloride from China, chnimp, which we use in logarithmic form. We include as explanatory variables, all in logarithmic form, an index of chemical production, chempi (to control for overall demand for barium chloride), the volume of gasoline production, gas (another demand variable), and an exchange rate index, rtwex, which measures the strength of the dollar against several other currencies. (The chemical production index was defined to be 100 in June 1977.) The analysis here differs somewhat from Krupp and Pollard in that we use natural logarithms of all variables (except the dummy variables, of course), and we include all three dummy variables in the same regression.

Using monthly data from February 1978 through December 1988 gives the following:

$\widehat{\log(\mathit{chnimp})} = -17.80 + 3.12\,\log(\mathit{chempi}) + .196\,\log(\mathit{gas})$
(21.05) (.48) (.907)
$+ .983\,\log(\mathit{rtwex}) + .060\,\mathit{befile6} - .032\,\mathit{affile6} - .565\,\mathit{afdec6}$   (10.22)
(.400) (.261) (.264) (.286)
$n = 131$, $R^2 = .305$, $\bar{R}^2 = .271$.

The equation shows that befile6 is statistically insignificant, so there is no evidence that Chinese imports were unusually high during the six months before the suit was filed. Further, although the estimate on affile6 is negative, the coefficient is small (indicating about a 3.2% fall in Chinese imports), and it is statistically very insignificant. The coefficient on afdec6 shows a substantial fall in Chinese imports of barium chloride after the decision in favor of the U.S. industry, which is not surprising. Since the effect is so large, we compute the exact percentage change: $100[\exp(-.565) - 1] \approx -43.2\%$. The coefficient is statistically significant at the 5% level against a two-sided alternative.

The coefficient signs on the control variables are what we expect: an increase in overall chemical production increases the demand for the cleaning agent. Gasoline production does not affect Chinese imports significantly. The coefficient on log(rtwex) shows that an increase in the value of the dollar relative to other currencies increases the demand for Chinese imports, as is predicted by economic theory. (In fact, the elasticity is not statistically different from 1. Why?)
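The exact percentage change used for afdec6 is worth keeping as a formula: for a dummy variable in a log-level equation, the exact effect is $100[\exp(\hat{\beta}) - 1]$. A two-line check of the number in the text:

```python
# The exact percentage change implied by the afdec6 coefficient in (10.22).
import numpy as np

print(100 * (np.exp(-0.565) - 1))   # about -43.2 percent, as in the text
```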
Interactions among qualitative and quantitative variables are also used in time series analysis. An example with practical importance follows.

Example 10.6 (Election Outcomes and Economic Performance). Fair (1996) summarizes his work on explaining presidential election outcomes in terms of economic performance. He explains the proportion of the two-party vote going to the Democratic candidate using data for the years 1916 through 1992 (every four years) for a total of 20 observations. We estimate a simplified version of Fair's model (using variable names that are more descriptive than his):

$\mathit{demvote} = \beta_0 + \beta_1 \mathit{partyWH} + \beta_2 \mathit{incum} + \beta_3 \mathit{partyWH}\cdot\mathit{gnews} + \beta_4 \mathit{partyWH}\cdot\mathit{inf} + u$,

where demvote is the proportion of the two-party vote going to the Democratic candidate. The explanatory variable partyWH is similar to a dummy variable, but it takes on the value 1 if a Democrat is in the White House and $-1$ if a Republican is in the White House. Fair uses this variable to impose the restriction that the effects of a Republican or a Democrat being in the White House have the same magnitude but the opposite sign. This is a natural restriction because the party shares must sum to one, by definition. It also saves two degrees of freedom, which is important with so few observations. Similarly, the variable incum is defined to be 1 if a Democratic incumbent is running, $-1$ if a Republican incumbent is running, and zero otherwise. The variable gnews is the number of quarters, during the administration's first 15 quarters, when the quarterly growth in real per capita output was above 2.9% (at an annual rate), and inf is the average annual inflation rate over the first 15 quarters of the administration. (See Fair (1996) for precise definitions.)

Economists are most interested in the interaction terms partyWH·gnews and partyWH·inf. Since partyWH equals 1 when a Democrat is in the White House, $\beta_3$ measures the effect of good economic news on the party in power; we expect $\beta_3 > 0$. Similarly, $\beta_4$ measures the effect that inflation has on the party in power. Because inflation during an administration is considered to be bad news, we expect $\beta_4 < 0$.

The estimated equation using the data in FAIR is

$\widehat{\mathit{demvote}} = .481 - .0435\,\mathit{partyWH} + .0544\,\mathit{incum}$
(.012) (.0405) (.0234)
$+ .0108\,\mathit{partyWH}\cdot\mathit{gnews} - .0077\,\mathit{partyWH}\cdot\mathit{inf}$   (10.23)
(.0041) (.0033)
$n = 20$, $R^2 = .663$, $\bar{R}^2 = .573$.

All coefficients, except that on partyWH, are statistically significant at the 5% level. Incumbency is worth about 5.4 percentage points in the share of the vote. (Remember, demvote is measured as a proportion.) Further, the economic news variable has a positive effect: one more quarter of good news is worth about 1.1 percentage points. Inflation, as expected, has a negative effect: if average annual inflation is, say, two percentage points higher, the party in power loses about 1.5 percentage points of the two-party vote.

We could have used this equation to predict the outcome of the 1996 presidential election between Bill Clinton, the Democrat, and Bob Dole, the Republican. (The independent candidate, Ross Perot, is excluded because Fair's equation is for the two-party vote only.) Because Clinton ran as an incumbent, partyWH = 1 and incum = 1. To predict the election outcome, we need the variables gnews and inf. During Clinton's first 15 quarters in office, the annual growth rate of per capita real GDP exceeded 2.9% three times, so gnews = 3. Further, using the GDP price deflator reported in Table B-4 in the 1997 ERP, the average annual inflation rate (computed using Fair's formula) from the fourth quarter in 1991 to the third quarter in 1996 was 3.019. Plugging these into (10.23) gives

$\widehat{\mathit{demvote}} = .481 - .0435 + .0544 + .0108(3) - .0077(3.019) \approx .5011$.

Therefore, based on information known before the election in November, Clinton was predicted to receive a very slight majority of the two-party vote: about 50.1%. In fact, Clinton won more handily: his share of the two-party vote was 54.65%.

10.5 Trends and Seasonality

10.5a Characterizing Trending Time Series

Many economic time series have a common tendency of growing over time. We must recognize that some series contain a time trend in order to draw causal inference using time series data. Ignoring the fact that two sequences are trending in the same or opposite directions can lead us to falsely conclude that changes in one variable are actually caused by changes in another variable. In many cases, two time series processes appear to be correlated only because they are both trending over time for reasons related to other, unobserved factors.

Figure 10.2 contains a plot of labor productivity (output per hour of work) in the United States for the years 1947 through 1987. This series displays a clear upward trend, which reflects the fact that workers have become more productive over time.
Other series, at least over certain time periods, have clear downward trends. Because positive trends are more common, we will focus on those during our discussion.

What kind of statistical models adequately capture trending behavior? One popular formulation is to write the series $\{y_t\}$ as

$y_t = \alpha_0 + \alpha_1 t + e_t$, $t = 1, 2, \ldots$,   (10.24)

where, in the simplest case, $\{e_t\}$ is an independent, identically distributed (i.i.d.) sequence with $E(e_t) = 0$ and $\mathrm{Var}(e_t) = \sigma_e^2$. Note how the parameter $\alpha_1$ multiplies time, t, resulting in a linear time trend.

Interpreting $\alpha_1$ in (10.24) is simple: holding all other factors (those in $e_t$) fixed, $\alpha_1$ measures the change in $y_t$ from one period to the next due to the passage of time. We can write this mathematically by defining the change in $e_t$ from period $t-1$ to t as $\Delta e_t = e_t - e_{t-1}$. Equation (10.24) implies that if $\Delta e_t = 0$, then $\Delta y_t = y_t - y_{t-1} = \alpha_1$.

Another way to think about a sequence that has a linear time trend is that its average value is a linear function of time:

$E(y_t) = \alpha_0 + \alpha_1 t$.   (10.25)

If $\alpha_1 > 0$, then, on average, $y_t$ is growing over time and therefore has an upward trend. If $\alpha_1 < 0$, then $y_t$ has a downward trend. The values of $y_t$ do not fall exactly on the line in (10.25) due to randomness, but the expected values are on the line. Unlike the mean, the variance of $y_t$ is constant across time: $\mathrm{Var}(y_t) = \mathrm{Var}(e_t) = \sigma_e^2$. If $\{e_t\}$ is an i.i.d. sequence, then $\{y_t\}$ is an independent, though not identically, distributed sequence. A more realistic characterization of trending time series allows $\{e_t\}$ to be correlated over time, but this does not change the flavor of a linear time trend. In fact, what is important for regression analysis under the classical linear model assumptions is that $E(y_t)$ is linear in t. When we cover large sample properties of OLS in Chapter 11, we will have to discuss how much temporal correlation in $\{e_t\}$ is allowed.

Many economic time series are better approximated by an exponential trend, which follows when a series has the same average growth rate from period to period. Figure 10.3 plots data on annual nominal imports for the United States during the years 1948 through 1995 (ERP 1997, Table B-101). In the early years, we see that the change in imports over each year is relatively small, whereas the change increases as time passes. This is consistent with a constant average growth rate: the percentage change is roughly the same in each period.

[Figure 10.2: Output per labor hour in the United States during the years 1947-1987.]

Exploring Further 10.4. In Example 10.4, we used the general fertility rate as the dependent variable in an FDL model. From 1950 through the mid-1980s, the gfr has a clear downward trend. Can a linear trend with $\alpha_1 < 0$ be realistic for all future time periods? Explain.
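Estimating (10.24) is an ordinary OLS problem with regressors 1 and t. A minimal sketch on a simulated series (all numbers below are illustrative assumptions):

```python
# A sketch of fitting the linear trend model (10.24) by OLS on a simulated series.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
n = 60
t = np.arange(1, n + 1)
y = 10.0 + 0.5 * t + rng.normal(0, 2, n)    # y_t = a0 + a1*t + e_t

res = sm.OLS(y, sm.add_constant(t)).fit()
print(res.params)   # estimates of a0 and a1; a1_hat is the per-period change in y
```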
In practice, an exponential trend in a time series is captured by modeling the natural logarithm of the series as a linear trend (assuming that $y_t > 0$):

$\log(y_t) = \beta_0 + \beta_1 t + e_t$, $t = 1, 2, \ldots$.   (10.26)

Exponentiating shows that $y_t$ itself has an exponential trend: $y_t = \exp(\beta_0 + \beta_1 t + e_t)$. Because we will want to use exponentially trending time series in linear regression models, (10.26) turns out to be the most convenient way for representing such series.

How do we interpret $\beta_1$ in (10.26)? Remember that, for small changes, $\Delta \log(y_t) = \log(y_t) - \log(y_{t-1})$ is approximately the proportionate change in $y_t$:

$\Delta \log(y_t) \approx (y_t - y_{t-1})/y_{t-1}$.   (10.27)

The right-hand side of (10.27) is also called the growth rate in y from period $t-1$ to period t. To turn the growth rate into a percentage, we simply multiply by 100. If $y_t$ follows (10.26), then, taking changes and setting $\Delta e_t = 0$,

$\Delta \log(y_t) = \beta_1$, for all t.   (10.28)

In other words, $\beta_1$ is approximately the average per period growth rate in $y_t$. For example, if t denotes year and $\beta_1 = .027$, then $y_t$ grows about 2.7% per year, on average.

Although linear and exponential trends are the most common, time trends can be more complicated. For example, instead of the linear trend model in (10.24), we might have a quadratic time trend:

$y_t = \alpha_0 + \alpha_1 t + \alpha_2 t^2 + e_t$.   (10.29)

If $\alpha_1$ and $\alpha_2$ are positive, then the slope of the trend is increasing, as is easily seen by computing the approximate slope (holding $e_t$ fixed):

$\Delta y_t/\Delta t \approx \alpha_1 + 2\alpha_2 t$.   (10.30)

(If you are familiar with calculus, you recognize the right-hand side of (10.30) as the derivative of $\alpha_0 + \alpha_1 t + \alpha_2 t^2$ with respect to t.) If $\alpha_1 > 0$ but $\alpha_2 < 0$, the trend has a hump shape. This may not be a very good description of certain trending series because it requires an increasing trend to be followed, eventually, by a decreasing trend. Nevertheless, over a given time span, it can be a flexible way of modeling time series that have more complicated trends than either (10.24) or (10.26).

[Figure 10.3: Nominal U.S. imports during the years 1948-1995 (in billions of U.S. dollars).]

10.5b Using Trending Variables in Regression Analysis

Accounting for explained or explanatory variables that are trending is fairly straightforward in regression analysis. First, nothing about trending variables necessarily violates the classical linear model Assumptions TS.1 through TS.6. However, we must be careful to allow for the fact that unobserved, trending factors that affect $y_t$ might also be correlated with the explanatory variables. If we ignore this possibility, we may find a spurious relationship between $y_t$ and one or more explanatory variables. The phenomenon of finding a relationship between two or more trending variables simply because each is growing over time is an example of a spurious regression problem. Fortunately, adding a time trend eliminates this problem.
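A quick simulation makes the spurious regression problem, and the time-trend fix, concrete. The two series below are generated independently, sharing nothing but linear trends; all parameter values are arbitrary choices for illustration.

```python
# A sketch of the spurious regression problem: y and x are unrelated apart
# from independent linear trends, yet x looks "significant" without a trend.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(4)
n = 100
t = np.arange(1, n + 1)
y = 1.0 + 0.05 * t + rng.normal(0, 1, n)
x = 2.0 + 0.03 * t + rng.normal(0, 1, n)

no_trend = sm.OLS(y, sm.add_constant(x)).fit()
with_trend = sm.OLS(y, sm.add_constant(np.column_stack([x, t]))).fit()

print(no_trend.tvalues[1])     # large t statistic on x: a spurious relationship
print(with_trend.tvalues[1])   # close to zero once the trend is included
```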
For concreteness, consider a model where two observed factors, $x_{t1}$ and $x_{t2}$, affect $y_t$. In addition, there are unobserved factors that are systematically growing or shrinking over time. A model that captures this is

$y_t = \beta_0 + \beta_1 x_{t1} + \beta_2 x_{t2} + \beta_3 t + u_t$.   (10.31)

This fits into the multiple linear regression framework with $x_{t3} = t$. Allowing for the trend in this equation explicitly recognizes that $y_t$ may be growing ($\beta_3 > 0$) or shrinking ($\beta_3 < 0$) over time for reasons essentially unrelated to $x_{t1}$ and $x_{t2}$. If (10.31) satisfies Assumptions TS.1, TS.2, and TS.3, then omitting t from the regression and regressing $y_t$ on $x_{t1}$, $x_{t2}$ will generally yield biased estimators of $\beta_1$ and $\beta_2$: we have effectively omitted an important variable, t, from the regression. This is especially true if $x_{t1}$ and $x_{t2}$ are themselves trending, because they can then be highly correlated with t. The next example shows how omitting a time trend can result in a spurious regression.

Example 10.7 (Housing Investment and Prices). The data in HSEINV are annual observations on housing investment and a housing price index in the United States for 1947 through 1988. Let invpc denote real per capita housing investment (in thousands of dollars) and let price denote a housing price index (equal to 1 in 1982). A simple regression in constant elasticity form, which can be thought of as a supply equation for housing stock, gives

$\widehat{\log(\mathit{invpc})} = -.550 + 1.241\,\log(\mathit{price})$   (10.32)
(.043) (.382)
$n = 42$, $R^2 = .208$, $\bar{R}^2 = .189$.

The elasticity of per capita investment with respect to price is very large and statistically significant; it is not statistically different from one. We must be careful here. Both invpc and price have upward trends. In particular, if we regress log(invpc) on t, we obtain a coefficient on the trend equal to .0081 (standard error = .0018); the regression of log(price) on t yields a trend coefficient equal to .0044 (standard error = .0004). Although the standard errors on the trend coefficients are not necessarily reliable (these regressions tend to contain substantial serial correlation), the coefficient estimates do reveal upward trends.

To account for the trending behavior of the variables, we add a time trend:

$\widehat{\log(\mathit{invpc})} = -.913 - .381\,\log(\mathit{price}) + .0098\,t$   (10.33)
(.136) (.679) (.0035)
$n = 42$, $R^2 = .341$, $\bar{R}^2 = .307$.

The story is much different now: the estimated price elasticity is negative and not statistically different from zero. The time trend is statistically significant, and its coefficient implies an approximate 1% increase in invpc per year, on average. From this analysis, we cannot conclude that real per capita housing investment is influenced at all by price. There are other factors, captured in the time trend, that affect invpc, but we have not modeled these. The results in (10.32) show a spurious relationship between invpc and price due to the fact that price is also trending upward over time.
In some cases, adding a time trend can make a key explanatory variable more significant. This can happen if the dependent and independent variables have different kinds of trends (say, one upward and one downward), but movement in the independent variable about its trend line causes movement in the dependent variable away from its trend line.

Example 10.8 (Fertility Equation). If we add a linear time trend to the fertility equation (10.18), we obtain

$\widehat{\mathit{gfr}}_t = 111.77 + .279\,\mathit{pe}_t - 35.59\,\mathit{ww2}_t + .997\,\mathit{pill}_t - 1.15\,t$   (10.34)
(3.36) (.040) (6.30) (6.626) (.19)
$n = 72$, $R^2 = .662$, $\bar{R}^2 = .642$.

The coefficient on pe is more than triple the estimate from (10.18), and it is much more statistically significant. Interestingly, pill is not significant once an allowance is made for a linear trend. As can be seen by the estimate, gfr was falling, on average, over this period, other factors being equal.

Since the general fertility rate exhibited both upward and downward trends during the period from 1913 through 1984, we can see how robust the estimated effect of pe is when we use a quadratic trend:

$\widehat{\mathit{gfr}}_t = 124.09 + .348\,\mathit{pe}_t - 35.88\,\mathit{ww2}_t - 10.12\,\mathit{pill}_t - 2.53\,t + .0196\,t^2$   (10.35)
(4.36) (.040) (5.71) (6.34) (.39) (.0050)
$n = 72$, $R^2 = .727$, $\bar{R}^2 = .706$.

The coefficient on pe is even larger and more statistically significant. Now, pill has the expected negative effect and is marginally significant, and both trend terms are statistically significant. The quadratic trend is a flexible way to account for the unusual trending behavior of gfr.

You might be wondering, in Example 10.8, why stop at a quadratic trend? Nothing prevents us from adding, say, $t^3$ as an independent variable, and, in fact, this might be warranted (see Computer Exercise C6). But we have to be careful not to get carried away when including trend terms in a model. We want relatively simple trends that capture broad movements in the dependent variable that are not explained by the independent variables in the model. If we include enough polynomial terms in t, then we can track any series pretty well. But this offers little help in finding which explanatory variables affect $y_t$.

10.5c A Detrending Interpretation of Regressions with a Time Trend

Including a time trend in a regression model creates a nice interpretation in terms of detrending the original data series before using them in regression analysis. For concreteness, we focus on model (10.31), but our conclusions are much more general. When we regress $y_t$ on $x_{t1}$, $x_{t2}$, and t, we obtain the fitted equation

$\hat{y}_t = \hat{\beta}_0 + \hat{\beta}_1 x_{t1} + \hat{\beta}_2 x_{t2} + \hat{\beta}_3 t$.   (10.36)

We can extend the Frisch-Waugh result on the partialling out interpretation of OLS that we covered in Section 3.2 to show that $\hat{\beta}_1$ and $\hat{\beta}_2$ can be obtained as follows:

(i) Regress each of $y_t$, $x_{t1}$, and $x_{t2}$ on a constant and the time trend t, and save the residuals, say, $\ddot{y}_t$, $\ddot{x}_{t1}$, $\ddot{x}_{t2}$, $t = 1, 2, \ldots, n$. For example, $\ddot{y}_t = y_t - \hat{\alpha}_0 - \hat{\alpha}_1 t$. Thus, we can think of $\ddot{y}_t$ as being linearly detrended. In detrending $y_t$, we have estimated the model

$y_t = \alpha_0 + \alpha_1 t + e_t$

by OLS; the residuals from this regression, $\hat{e}_t = \ddot{y}_t$, have the time trend removed (at least in the sample). A similar interpretation holds for $\ddot{x}_{t1}$ and $\ddot{x}_{t2}$.

(ii) Run the regression of

$\ddot{y}_t$ on $\ddot{x}_{t1}$, $\ddot{x}_{t2}$.   (10.37)

(No intercept is necessary, but including an intercept affects nothing: the intercept will be estimated to be zero.) This regression exactly yields $\hat{\beta}_1$ and $\hat{\beta}_2$ from (10.36).
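The two-step detrending procedure can be verified numerically in a few lines. The sketch below uses simulated data (all parameter values are illustrative assumptions) and checks that the slope on $x_{t1}$ is the same either way.

```python
# A sketch verifying the Frisch-Waugh detrending interpretation of (10.36):
# the slope on x1 from the full regression equals the slope from the
# detrended regression (10.37). Simulated data for illustration.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(5)
n = 120
t = np.arange(1, n + 1)
x1 = 0.02 * t + rng.normal(0, 1, n)
x2 = -0.01 * t + rng.normal(0, 1, n)
y = 1.0 + 0.5 * x1 - 0.3 * x2 + 0.03 * t + rng.normal(0, 1, n)

T = sm.add_constant(t)
y_dt = sm.OLS(y, T).fit().resid            # linearly detrended series
x1_dt = sm.OLS(x1, T).fit().resid
x2_dt = sm.OLS(x2, T).fit().resid

full = sm.OLS(y, sm.add_constant(np.column_stack([x1, x2, t]))).fit()
part = sm.OLS(y_dt, np.column_stack([x1_dt, x2_dt])).fit()   # no intercept needed
print(full.params[1], part.params[0])      # the two estimates of beta1 coincide
```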
This means that the estimates of primary interest, $\hat{\beta}_1$ and $\hat{\beta}_2$, can be interpreted as coming from a regression without a time trend, but where we first detrend the dependent variable and all other independent variables. The same conclusion holds with any number of independent variables and if the trend is quadratic or of some other polynomial degree.

If t is omitted from (10.36), then no detrending occurs, and $y_t$ might seem to be related to one or more of the $x_{tj}$ simply because each contains a trend; we saw this in Example 10.7. If the trend term is statistically significant, and the results change in important ways when a time trend is added to a regression, then the initial results without a trend should be treated with suspicion.

The interpretation of $\hat{\beta}_1$ and $\hat{\beta}_2$ shows that it is a good idea to include a trend in the regression if any independent variable is trending, even if $y_t$ is not. If $y_t$ has no noticeable trend, but, say, $x_{t1}$ is growing over time, then excluding a trend from the regression may make it look as if $x_{t1}$ has no effect on $y_t$, even though movements of $x_{t1}$ about its trend may affect $y_t$. This will be captured if t is included in the regression.

Example 10.9 (Puerto Rican Employment). When we add a linear trend to equation (10.17), the estimates are

$\widehat{\log(\mathit{prepop}_t)} = -8.70 - .169\,\log(\mathit{mincov}_t) + 1.06\,\log(\mathit{usgnp}_t) - .032\,t$   (10.38)
(1.30) (.044) (0.18) (.005)
$n = 38$, $R^2 = .847$, $\bar{R}^2 = .834$.

The coefficient on log(usgnp) has changed dramatically: from $-.012$ and insignificant to 1.06 and very significant. The coefficient on the minimum wage has changed only slightly, although the standard error is notably smaller, making log(mincov) more significant than before.

The variable $\mathit{prepop}_t$ displays no clear upward or downward trend, but log($\mathit{usgnp}_t$) has an upward, linear trend. (A regression of log($\mathit{usgnp}_t$) on t gives an estimate of about .03, so that usgnp is growing by about 3% per year over the period.) We can think of the estimate 1.06 as follows: when usgnp increases by 1% above its long-run trend, prepop increases by about 1.06%.

10.5d Computing R-Squared When the Dependent Variable Is Trending

R-squareds in time series regressions are often very high, especially compared with typical R-squareds for cross-sectional data. Does this mean that we learn more about factors affecting y from time series data? Not necessarily. On one hand, time series data often come in aggregate form (such as average hourly wages in the U.S. economy), and aggregates are often easier to explain than outcomes on individuals, families, or firms, which is often the nature of cross-sectional data. But the usual and adjusted R-squareds for time series regressions can be artificially high when the dependent variable is trending. Remember that $R^2$ is a measure of how large the error variance is relative to the variance of y. The formula for the adjusted R-squared shows this directly:

$\bar{R}^2 = 1 - (\hat{\sigma}_u^2/\hat{\sigma}_y^2)$,

where $\hat{\sigma}_u^2$ is the unbiased estimator of the error variance, $\hat{\sigma}_y^2 = \mathrm{SST}/(n-1)$, and $\mathrm{SST} = \sum_{t=1}^{n}(y_t - \bar{y})^2$. Now, estimating the error variance when $y_t$ is trending is no problem, provided a time trend is included in the regression.
However, when $E(y_t)$ follows, say, a linear time trend (see (10.24)), $\mathrm{SST}/(n-1)$ is no longer an unbiased or consistent estimator of $\mathrm{Var}(y_t)$. In fact, $\mathrm{SST}/(n-1)$ can substantially overestimate the variance in $y_t$, because it does not account for the trend in $y_t$.

When the dependent variable satisfies linear, quadratic, or any other polynomial trends, it is easy to compute a goodness-of-fit measure that first nets out the effect of any time trend on $y_t$. The simplest method is to compute the usual R-squared in a regression where the dependent variable has already been detrended. For example, if the model is (10.31), then we first regress $y_t$ on t and obtain the residuals $\ddot{y}_t$. Then, we regress

$\ddot{y}_t$ on $x_{t1}$, $x_{t2}$, and t.   (10.39)

The R-squared from this regression is

$1 - \mathrm{SSR}/\sum_{t=1}^{n}\ddot{y}_t^2$,   (10.40)

where SSR is identical to the sum of squared residuals from (10.36). Since $\sum_{t=1}^{n}\ddot{y}_t^2 \leq \sum_{t=1}^{n}(y_t - \bar{y})^2$, and usually the inequality is strict, the R-squared from (10.40) is no greater than, and usually less than, the R-squared from (10.36). (The sum of squared residuals is identical in both regressions.) When $y_t$ contains a strong linear time trend, (10.40) can be much less than the usual R-squared.

The R-squared in (10.40) better reflects how well $x_{t1}$ and $x_{t2}$ explain $y_t$, because it nets out the effect of the time trend. After all, we can always explain a trending variable with some sort of trend, but this does not mean we have uncovered any factors that cause movements in $y_t$. An adjusted R-squared can also be computed based on (10.40): divide SSR by $(n-4)$, because this is the df in (10.36), and divide $\sum_{t=1}^{n}\ddot{y}_t^2$ by $(n-2)$, as there are two trend parameters estimated in detrending $y_t$. In general, SSR is divided by the df in the usual regression (that includes any time trends), and $\sum_{t=1}^{n}\ddot{y}_t^2$ is divided by $(n-p)$, where p is the number of trend parameters estimated in detrending $y_t$. Wooldridge (1991a) provides detailed suggestions for degrees-of-freedom corrections, but a computationally simple approach is fine as an approximation: use the adjusted R-squared from the regression of $\ddot{y}_t$ on $t, t^2, \ldots, t^p, x_{t1}, \ldots, x_{tk}$. This requires us only to remove the trend from $y_t$ to obtain $\ddot{y}_t$, and then we can use $\ddot{y}_t$ to compute the usual kinds of goodness-of-fit measures.

Example 10.10 (Housing Investment). In Example 10.7, we saw that including a linear time trend along with log(price) in the housing investment equation had a substantial effect on the price elasticity. But the R-squared from regression (10.33), taken literally, says that we are "explaining" 34.1% of the variation in log(invpc). This is misleading. If we first detrend log(invpc) and regress the detrended variable on log(price) and t, the R-squared becomes .008, and the adjusted R-squared is actually negative. Thus, movements in log(price) about its trend have virtually no explanatory power for movements in log(invpc) about its trend. This is consistent with the fact that the t statistic on log(price) in equation (10.33) is very small.
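Computing (10.40) takes only one extra regression, since the SSR is shared with (10.36). A sketch on simulated data (all generated values are illustrative assumptions):

```python
# A sketch of the detrended R-squared in (10.40): SSR from the full
# regression (10.36), total variation from the detrended dependent variable.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(6)
n = 120
t = np.arange(1, n + 1)
x1 = rng.normal(0, 1, n)
x2 = rng.normal(0, 1, n)
y = 1.0 + 0.5 * x1 - 0.3 * x2 + 0.05 * t + rng.normal(0, 1, n)  # trending y

y_dt = sm.OLS(y, sm.add_constant(t)).fit().resid                # detrend y first
full = sm.OLS(y, sm.add_constant(np.column_stack([x1, x2, t]))).fit()

r2_detrended = 1 - full.ssr / np.sum(y_dt ** 2)                 # equation (10.40)
print(full.rsquared, r2_detrended)        # the detrended R-squared is smaller
```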
Before leaving this subsection, we must make a final point. In computing the R-squared form of an F statistic for testing multiple hypotheses, we just use the usual R-squareds, without any detrending. Remember, the R-squared form of the F statistic is just a computational device, and so the usual formula is always appropriate.

10.5e Seasonality

If a time series is observed at monthly or quarterly intervals (or even weekly or daily), it may exhibit seasonality. For example, monthly housing starts in the Midwest are strongly influenced by weather. Although weather patterns are somewhat random, we can be sure that the weather during January will usually be more inclement than in June, and so housing starts are generally higher in June than in January. One way to model this phenomenon is to allow the expected value of the series, $y_t$, to be different in each month. As another example, retail sales in the fourth quarter are typically higher than in the previous three quarters because of the Christmas holiday. Again, this can be captured by allowing the average retail sales to differ over the course of a year. This is in addition to possibly allowing for a trending mean. For example, retail sales in the most recent first quarter were higher than retail sales in the fourth quarter from 30 years ago, because retail sales have been steadily growing. Nevertheless, if we compare average sales within a typical year, the seasonal holiday factor tends to make sales larger in the fourth quarter.

Even though many monthly and quarterly data series display seasonal patterns, not all of them do. For example, there is no noticeable seasonal pattern in monthly interest or inflation rates. In addition, series that do display seasonal patterns are often seasonally adjusted before they are reported for public use. A seasonally adjusted series is one that, in principle, has had the seasonal factors removed from it. Seasonal adjustment can be done in a variety of ways, and a careful discussion is beyond the scope of this text. (See Harvey (1990) and Hylleberg (1992) for detailed treatments.)

Seasonal adjustment has become so common that it is not possible to get seasonally unadjusted data in many cases. Quarterly U.S. GDP is a leading example. In the annual Economic Report of the President, many macroeconomic data sets reported at monthly frequencies (at least for the most recent years), and those that display seasonal patterns, are all seasonally adjusted. The major sources for macroeconomic time series, including Citibase, also seasonally adjust many of the series. Thus, the scope for using our own seasonal adjustment is often limited.

Sometimes, we do work with seasonally unadjusted data, and it is useful to know that simple methods are available for dealing with seasonality in regression models.
since weather patterns are broadly similar across years housing starts in the Midwest will be higher on average during the summer months than the winter months A general model for monthly data that captures these phenomena is yt 5 b0 1 d1febt 1 d2mart 1 d3aprt 1 p 1 d11dect 1041 1 b1xt1 1 p 1 bkxtk 1 ut where febt mart p dect are dummy variables indi cating whether time period t corresponds to the appropriate month In this formulation January is the base month and b0 is the intercept for January If there is no seasonality in yt once the xtj have been controlled for then d1 through d11 are all zero This is easily tested via an F test exaMPLe 1011 effects of antidumping filings In Example 105 we used monthly data in the file BARIUM that have not been seasonally adjusted Therefore we should add seasonal dummy variables to make sure none of the important conclusions change It could be that the months just before the suit was filed are months where imports are higher or lower on average than in other months When we add the 11 monthly dummy variables as in 1041 and test their joint significance we obtain pvalue 5 59 and so the seasonal dummies are jointly insignificant In addition nothing important changes in the estimates once statistical signifi cance is taken into account Krupp and Pollard 1996 actually used three dummy variables for the seasons fall spring and summer with winter as the base season rather than a full set of monthly dummies the outcome is essentially the same If the data are quarterly then we would include dummy variables for three of the four quarters with the omitted category being the base quarter Sometimes it is useful to interact seasonal dummies with some of the xtj to allow the effect of xtj on yt to differ across the year Just as including a time trend in a regression has the interpretation of initially detrending the data including seasonal dummies in a regression can be interpreted as deseasonalizing the data For concreteness consider equation 1041 with k 5 2 The OLS slope coefficients b 1 and b 2 on x1 and x2 can be obtained as follows i Regress each of yt xt1 and xt2 on a constant and the monthly dummies febt mart p dect and save the residuals say y t x t1 and x t2 for all t 5 1 2 p n For example y t 5 yt 2 a 0 2 a 1 febt 2 a 2mart 2 p 2 a 11dect This is one method of deseasonalizing a monthly time series A similar interpretation holds for x t1 and x t2 ii Run the regression without the monthly dummies of y t on x t1 and x t2 just as in 1037 This gives b 1 and b 2 In some cases if yt has pronounced seasonality a better goodnessoffit measure is an Rsquared based on the deseasonalized yt This nets out any seasonal effects that are not explained by the xtj In equation 1041 what is the intercept for March Explain why seasonal dummy variables satisfy the strict exogeneity assumption Exploring FurthEr 105 Copyright 2016 Cengage Learning All Rights Reserved May not be copied scanned or duplicated in whole or in part Due to electronic rights some third party content may be suppressed from the eBook andor eChapters Editorial review has deemed that any suppressed content does not materially affect the overall learning experience Cengage Learning reserves the right to remove additional content at any time if subsequent rights restrictions require it 338 Wooldridge 1991a suggests specific degreesoffreedom adjustments or one may simply use the adjusted Rsquared where the dependent variable has been deseasonalized Time series exhibiting seasonal patterns can be trending as well 
Time series exhibiting seasonal patterns can be trending as well, in which case we should estimate a regression model with a time trend and seasonal dummy variables. The regressions can then be interpreted as regressions using both detrended and deseasonalized series. Goodness-of-fit statistics are discussed in Wooldridge (1991a); essentially, we detrend and deseasonalize $y_t$ by regressing on both a time trend and seasonal dummies before computing R-squared or adjusted R-squared.

Summary

In this chapter, we have covered basic regression analysis with time series data. Under assumptions that parallel those for cross-sectional analysis, OLS is unbiased (under TS.1 through TS.3), OLS is BLUE (under TS.1 through TS.5), and the usual OLS standard errors, t statistics, and F statistics can be used for statistical inference (under TS.1 through TS.6). Because of the temporal correlation in most time series data, we must explicitly make assumptions about how the errors are related to the explanatory variables in all time periods and about the temporal correlation in the errors themselves. The classical linear model assumptions can be pretty restrictive for time series applications, but they are a natural starting point. We have applied them to both static regression and finite distributed lag models.

Logarithms and dummy variables are used regularly in time series applications and in event studies. We also discussed index numbers and time series measured in terms of nominal and real dollars.

Trends and seasonality can be easily handled in a multiple regression framework by including time and seasonal dummy variables in our regression equations. We presented problems with the usual R-squared as a goodness-of-fit measure and suggested some simple alternatives based on detrending or deseasonalizing.

Classical Linear Model Assumptions for Time Series Regression

Following is a summary of the six classical linear model (CLM) assumptions for time series regression applications. Assumptions TS.1 through TS.5 are the time series versions of the Gauss-Markov assumptions, which implies that OLS is BLUE and has the usual sampling variances. We only needed TS.1, TS.2, and TS.3 to establish unbiasedness of OLS. As in the case of cross-sectional regression, the normality assumption, TS.6, was used so that we could perform exact statistical inference for any sample size.

Assumption TS.1 (Linear in Parameters). The stochastic process $\{(x_{t1}, x_{t2}, \ldots, x_{tk}, y_t)\colon t = 1, 2, \ldots, n\}$ follows the linear model

$y_t = \beta_0 + \beta_1 x_{t1} + \beta_2 x_{t2} + \cdots + \beta_k x_{tk} + u_t$,

where $\{u_t\colon t = 1, 2, \ldots, n\}$ is the sequence of errors or disturbances. Here, n is the number of observations (time periods).

Assumption TS.2 (No Perfect Collinearity). In the sample (and therefore in the underlying time series process), no independent variable is constant nor a perfect linear combination of the others.

Assumption TS.3 (Zero Conditional Mean). For each t, the expected value of the error, $u_t$, given the explanatory variables for all time periods, is zero. Mathematically, $E(u_t \mid X) = 0$, $t = 1, 2, \ldots, n$.

Assumption TS.3 replaces MLR.4 for cross-sectional regression, and it also means we do not have to make the random sampling assumption MLR.2. Remember, Assumption TS.3 implies that the error in each time period, t, is uncorrelated with all explanatory variables in all time periods (including, of course, time period t).
Assumption TS.4 (Homoskedasticity). Conditional on X, the variance of $u_t$ is the same for all t: $\mathrm{Var}(u_t \mid X) = \mathrm{Var}(u_t) = \sigma^2$, $t = 1, 2, \ldots, n$.

Assumption TS.5 (No Serial Correlation). Conditional on X, the errors in two different time periods are uncorrelated: $\mathrm{Corr}(u_t, u_s \mid X) = 0$ for all $t \neq s$.

Recall that we added the no serial correlation assumption, along with the homoskedasticity assumption, to obtain the same variance formulas that we derived for cross-sectional regression under random sampling. As we will see in Chapter 12, Assumption TS.5 is often violated in ways that can make the usual statistical inference very unreliable.

Assumption TS.6 (Normality). The errors $u_t$ are independent of X and are independently and identically distributed as $\mathrm{Normal}(0, \sigma^2)$.

Key Terms

Autocorrelation; Base Period; Base Value; Contemporaneously Exogenous; Cumulative Effect; Deseasonalizing; Detrending; Event Study; Exponential Trend; Finite Distributed Lag (FDL) Model; Growth Rate; Impact Multiplier; Impact Propensity; Index Number; Lag Distribution; Linear Time Trend; Long-Run Elasticity; Long-Run Multiplier; Long-Run Propensity (LRP); Seasonal Dummy Variables; Seasonality; Seasonally Adjusted; Serial Correlation; Short-Run Elasticity; Spurious Regression Problem; Static Model; Stochastic Process; Strictly Exogenous; Time Series Process; Time Trend

Problems

1. Decide if you agree or disagree with each of the following statements, and give a brief explanation of your decision:
(i) Like cross-sectional observations, we can assume that most time series observations are independently distributed.
(ii) The OLS estimator in a time series regression is unbiased under the first three Gauss-Markov assumptions.
(iii) A trending variable cannot be used as the dependent variable in multiple regression analysis.
(iv) Seasonality is not an issue when using annual time series observations.

2. Let $\mathit{gGDP}_t$ denote the annual percentage change in gross domestic product and let $\mathit{int}_t$ denote a short-term interest rate. Suppose that $\mathit{gGDP}_t$ is related to interest rates by

$\mathit{gGDP}_t = \alpha_0 + \delta_0 \mathit{int}_t + \delta_1 \mathit{int}_{t-1} + u_t$,

where $u_t$ is uncorrelated with $\mathit{int}_t$, $\mathit{int}_{t-1}$, and all other past values of interest rates. Suppose that the Federal Reserve follows the policy rule

$\mathit{int}_t = \gamma_0 + \gamma_1(\mathit{gGDP}_{t-1} - 3) + v_t$,

where $\gamma_1 > 0$. (When last year's GDP growth is above 3%, the Fed increases interest rates to prevent an "overheated" economy.) If $v_t$ is uncorrelated with all past values of $\mathit{int}_t$ and $u_t$, argue that $\mathit{int}_t$ must be correlated with $u_{t-1}$. (Hint: Lag the first equation for one time period and substitute for $\mathit{gGDP}_{t-1}$ in the second equation.) Which Gauss-Markov assumption does this violate?

3. Suppose $y_t$ follows a second order FDL model:

$y_t = \alpha_0 + \delta_0 z_t + \delta_1 z_{t-1} + \delta_2 z_{t-2} + u_t$.

Let $z^*$ denote the equilibrium value of $z_t$ and let $y^*$ be the equilibrium value of $y_t$, such that

$y^* = \alpha_0 + \delta_0 z^* + \delta_1 z^* + \delta_2 z^*$.

Show that the change in $y^*$, due to a change in $z^*$, equals the long-run propensity
3. Suppose $y_t$ follows a second order FDL model:
$$y_t = \alpha_0 + \delta_0 z_t + \delta_1 z_{t-1} + \delta_2 z_{t-2} + u_t.$$
Let $z^*$ denote the equilibrium value of $z_t$ and let $y^*$ be the equilibrium value of $y_t$, such that
$$y^* = \alpha_0 + \delta_0 z^* + \delta_1 z^* + \delta_2 z^*.$$
Show that the change in $y^*$, due to a change in $z^*$, equals the long-run propensity times the change in $z^*$: $\Delta y^* = LRP \cdot \Delta z^*$. This gives an alternative way of interpreting the LRP.

4. When the three event indicators befile6, affile6, and afdec6 are dropped from equation (10.22), we obtain $R^2 = .281$ and $\bar{R}^2 = .264$. Are the event indicators jointly significant at the 10% level?

5. Suppose you have quarterly data on new housing starts, interest rates, and real per capita income. Specify a model for housing starts that accounts for possible trends and seasonality in the variables.

6. In Example 10.4, we saw that our estimates of the individual lag coefficients in a distributed lag model were very imprecise. One way to alleviate the multicollinearity problem is to assume that the $\delta_j$ follow a relatively simple pattern. For concreteness, consider a model with four lags:
$$y_t = \alpha_0 + \delta_0 z_t + \delta_1 z_{t-1} + \delta_2 z_{t-2} + \delta_3 z_{t-3} + \delta_4 z_{t-4} + u_t.$$
Now, let us assume that the $\delta_j$ follow a quadratic in the lag, $j$:
$$\delta_j = \gamma_0 + \gamma_1 j + \gamma_2 j^2,$$
for parameters $\gamma_0$, $\gamma_1$, and $\gamma_2$. This is an example of a polynomial distributed lag (PDL) model. (A code sketch of this substitution follows Problem 8.)
(i) Plug the formula for each $\delta_j$ into the distributed lag model and write the model in terms of the parameters $\gamma_h$, for $h = 0, 1, 2$.
(ii) Explain the regression you would run to estimate the $\gamma_h$.
(iii) The polynomial distributed lag model is a restricted version of the general model. How many restrictions are imposed? How would you test these? (Hint: Think F test.)

7. In Example 10.4, we wrote the model that explicitly contains the long-run propensity, $\theta_0$, as
$$gfr_t = \alpha_0 + \theta_0 pe_t + \delta_1(pe_{t-1} - pe_t) + \delta_2(pe_{t-2} - pe_t) + u_t,$$
where we omit the other explanatory variables for simplicity. As always with multiple regression analysis, $\theta_0$ should have a ceteris paribus interpretation. Namely, if $pe_t$ increases by one (dollar), holding $(pe_{t-1} - pe_t)$ and $(pe_{t-2} - pe_t)$ fixed, $gfr_t$ should change by $\theta_0$.
(i) If $(pe_{t-1} - pe_t)$ and $(pe_{t-2} - pe_t)$ are held fixed but $pe_t$ is increasing, what must be true about changes in $pe_{t-1}$ and $pe_{t-2}$?
(ii) How does your answer in part (i) help you to interpret $\theta_0$ in the above equation as the LRP?

8. In the linear model given in equation (10.8), the explanatory variables $x_t = (x_{t1}, \ldots, x_{tk})$ are said to be sequentially exogenous (sometimes called "weakly exogenous") if
$$E(u_t|x_t, x_{t-1}, \ldots, x_1) = 0, \quad t = 1, 2, \ldots,$$
so that the errors are unpredictable given current and all past values of the explanatory variables.
(i) Explain why sequential exogeneity is implied by strict exogeneity.
(ii) Explain why contemporaneous exogeneity is implied by sequential exogeneity.
(iii) Are the OLS estimators generally unbiased under the sequential exogeneity assumption? Explain.
(iv) Consider a model to explain the annual rate of HIV infections (HIVrate) as a distributed lag of per capita condom usage (pccon) for a state, region, or province:
$$E(HIVrate_t|pccon_t, pccon_{t-1}, \ldots) = \alpha_0 + \delta_0 pccon_t + \delta_1 pccon_{t-1} + \delta_2 pccon_{t-2} + \delta_3 pccon_{t-3}.$$
Explain why this model satisfies the sequential exogeneity assumption. Does it seem likely that strict exogeneity holds, too?
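Before turning to the computer exercises, here is a sketch of the polynomial distributed lag substitution from Problem 6, using a simulated series z (the data and parameter values are invented). Plugging $\delta_j = \gamma_0 + \gamma_1 j + \gamma_2 j^2$ into the lag model collapses the five lag coefficients into three constructed regressors whose slopes are the $\gamma_h$.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(2)
T = 200
z = pd.Series(rng.normal(size=T))                   # hypothetical regressor
g0, g1, g2 = 1.0, -0.4, 0.05                        # invented gamma values
d = [g0 + g1 * j + g2 * j ** 2 for j in range(5)]   # implied lag coefficients

lags = pd.concat([z.shift(j) for j in range(5)], axis=1)
y = 0.5 + lags.to_numpy() @ np.array(d) + rng.normal(size=T)

# PDL substitution: y on x0 = sum_j z_{t-j}, x1 = sum_j j*z_{t-j},
# x2 = sum_j j^2*z_{t-j}; this imposes 2 restrictions (5 deltas -> 3 gammas).
x0 = sum(z.shift(j) for j in range(5))
x1 = sum(j * z.shift(j) for j in range(5))
x2 = sum(j ** 2 * z.shift(j) for j in range(5))
df = pd.DataFrame({"y": y, "x0": x0, "x1": x1, "x2": x2}).dropna()
res = sm.OLS(df["y"], sm.add_constant(df[["x0", "x1", "x2"]])).fit()
print(res.params)   # intercept plus estimates of g0, g1, g2
```

The restricted (PDL) and unrestricted models can then be compared with the usual F statistic, as part (iii) asks.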
Computer Exercises

C1. In October 1979, the Federal Reserve changed its policy of using finely tuned interest rate adjustments and instead began targeting the money supply. Using the data in INTDEF, define a dummy variable equal to 1 for years after 1979. Include this dummy in equation (10.15) to see if there is a shift in the interest rate equation after 1979. What do you conclude?

C2. Use the data in BARIUM for this exercise.
(i) Add a linear time trend to equation (10.22). Are any variables, other than the trend, statistically significant?
(ii) In the equation estimated in part (i), test for joint significance of all variables except the time trend. What do you conclude?
(iii) Add monthly dummy variables to this equation and test for seasonality. Does including the monthly dummies change any other estimates or their standard errors in important ways?

C3. Add the variable log(prgnp) to the minimum wage equation in (10.38). Is this variable significant? Interpret the coefficient. How does adding log(prgnp) affect the estimated minimum wage effect?

C4. Use the data in FERTIL3 to verify that the standard error for the LRP in equation (10.19) is about .030. (A sketch of a reparameterization that delivers this standard error directly follows Computer Exercise C7.)

C5. Use the data in EZANDERS for this exercise. The data are on monthly unemployment claims in Anderson Township in Indiana, from January 1980 through November 1988. In 1984, an enterprise zone (EZ) was located in Anderson (as well as in other cities in Indiana). [See Papke (1994) for details.]
(i) Regress log(uclms) on a linear time trend and 11 monthly dummy variables. What was the overall trend in unemployment claims over this period? (Interpret the coefficient on the time trend.) Is there evidence of seasonality in unemployment claims?
(ii) Add ez, a dummy variable equal to one in the months Anderson had an EZ, to the regression in part (i). Does having the enterprise zone seem to decrease unemployment claims? By how much? [You should use formula (7.10) from Chapter 7.]
(iii) What assumptions do you need to make to attribute the effect in part (ii) to the creation of an EZ?

C6. Use the data in FERTIL3 for this exercise.
(i) Regress $gfr_t$ on $t$ and $t^2$ and save the residuals. This gives a detrended $gfr_t$, say $\ddot{gfr}_t$.
(ii) Regress $\ddot{gfr}_t$ on all of the variables in equation (10.35), including $t$ and $t^2$. Compare the R-squared with that from (10.35). What do you conclude?
(iii) Reestimate equation (10.35) but add $t^3$ to the equation. Is this additional term statistically significant?

C7. Use the data set CONSUMP for this exercise.
(i) Estimate a simple regression model relating the growth in real per capita consumption (of nondurables and services) to the growth in real per capita disposable income. Use the change in the logarithms in both cases. Report the results in the usual form. Interpret the equation and discuss statistical significance.
(ii) Add a lag of the growth in real per capita disposable income to the equation from part (i). What do you conclude about adjustment lags in consumption growth?
(iii) Add the real interest rate to the equation in part (i). Does it affect consumption growth?
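Computer Exercises C4 and C8(ii) ask for the standard error of the LRP. One way to obtain it, sketched below on simulated data standing in for FERTIL3 (all numbers are invented), is the reparameterization from Problem 7: regress the dependent variable on $pe_t$, $(pe_{t-1} - pe_t)$, and $(pe_{t-2} - pe_t)$, so that the coefficient on $pe_t$ is the LRP and its standard error is reported directly by the software.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(3)
T = 150
pe = pd.Series(rng.normal(0, 5, size=T)).cumsum() * 0.1 + 100  # invented series
d0, d1, d2 = 0.07, -0.01, 0.03                                 # invented lag effects
gfr = 85 + d0 * pe + d1 * pe.shift(1) + d2 * pe.shift(2) \
      + pd.Series(rng.normal(0, 1, size=T))

# Reparameterize: gfr on pe_t, (pe_{t-1} - pe_t), (pe_{t-2} - pe_t).
# The coefficient on pe_t is theta0 = d0 + d1 + d2, the LRP.
df = pd.DataFrame({
    "gfr": gfr,
    "pe": pe,
    "dif1": pe.shift(1) - pe,
    "dif2": pe.shift(2) - pe,
}).dropna()
res = sm.OLS(df["gfr"], sm.add_constant(df[["pe", "dif1", "dif2"]])).fit()
print(res.params["pe"], res.bse["pe"])   # LRP estimate and its standard error
```

Run on the actual FERTIL3 data (with the other regressors from equation 10.19 included), the printed standard error should be about .030.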
C8. Use the data in FERTIL3 for this exercise.
(i) Add $pe_{t-3}$ and $pe_{t-4}$ to equation (10.19). Test for joint significance of these lags.
(ii) Find the estimated long-run propensity and its standard error in the model from part (i). Compare these with those obtained from equation (10.19).
(iii) Estimate the polynomial distributed lag model from Problem 6. Find the estimated LRP and compare this with what is obtained from the unrestricted model.

C9. Use the data in VOLAT for this exercise. The variable rsp500 is the monthly return on the Standard & Poor's 500 stock market index, at an annual rate. (This includes price changes as well as dividends.) The variable i3 is the return on three-month T-bills, and pcip is the percentage change in industrial production; these are also at an annual rate.
(i) Consider the equation
$$rsp500_t = \beta_0 + \beta_1 pcip_t + \beta_2 i3_t + u_t.$$
What signs do you think $\beta_1$ and $\beta_2$ should have?
(ii) Estimate the previous equation by OLS, reporting the results in standard form. Interpret the signs and magnitudes of the coefficients.
(iii) Which of the variables is statistically significant?
(iv) Does your finding from part (iii) imply that the return on the S&P 500 is predictable? Explain.

C10. Consider the model estimated in (10.15); use the data in INTDEF.
(i) Find the correlation between inf and def over this sample period and comment.
(ii) Add a single lag of inf and def to the equation and report the results in the usual form.
(iii) Compare the estimated LRP for the effect of inflation with that in equation (10.15). Are they vastly different?
(iv) Are the two lags in the model jointly significant at the 5% level?

C11. The file TRAFFIC2 contains 108 monthly observations on automobile accidents, traffic laws, and some other variables for California from January 1981 through December 1989. Use this data set to answer the following questions.
(i) During what month and year did California's seat belt law take effect? When did the highway speed limit increase to 65 miles per hour?
(ii) Regress the variable log(totacc) on a linear time trend and 11 monthly dummy variables, using January as the base month. Interpret the coefficient estimate on the time trend. Would you say there is seasonality in total accidents?
(iii) Add to the regression from part (ii) the variables wkends, unem, spdlaw, and beltlaw. Discuss the coefficient on the unemployment variable. Does its sign and magnitude make sense to you?
(iv) In the regression from part (iii), interpret the coefficients on spdlaw and beltlaw. Are the estimated effects what you expected? Explain.
(v) The variable prcfat is the percentage of accidents resulting in at least one fatality. Note that this variable is a percentage, not a proportion. What is the average of prcfat over this period? Does the magnitude seem about right?
(vi) Run the regression in part (iii) but use prcfat as the dependent variable in place of log(totacc). Discuss the estimated effects and significance of the speed and seat belt law variables.
C12. (i) Estimate equation (10.2) using all the data in PHILLIPS and report the results in the usual form. How many observations do you have now?
(ii) Compare the estimates from part (i) with those in equation (10.14). In particular, does adding the extra years help in obtaining an estimated tradeoff between inflation and unemployment? Explain.
(iii) Now run the regression using only the years 1997 through 2003. How do these estimates differ from those in equation (10.14)? Are the estimates using the most recent seven years precise enough to draw any firm conclusions? Explain.
(iv) Consider a simple regression setup in which we start with $n$ time series observations and then split them into an early time period and a later time period. In the first time period we have $n_1$ observations and in the second period $n_2$ observations. Draw on the previous parts of this exercise to evaluate the following statement: "Generally, we can expect the slope estimate using all $n$ observations to be roughly equal to a weighted average of the slope estimates on the early and later subsamples, where the weights are $n_1/n$ and $n_2/n$, respectively."

C13. Use the data in MINWAGE for this exercise. In particular, use the employment and wage series for sector 232 (Men's and Boys' Furnishings). The variable gwage232 is the monthly growth (change in logs) in the average wage in sector 232, gemp232 is the growth in employment in sector 232, gmwage is the growth in the federal minimum wage, and gcpi is the growth in the (urban) Consumer Price Index.
(i) Run the regression gwage232 on gmwage, gcpi. Do the sign and magnitude of $\hat{\beta}_{gmwage}$ make sense to you? Explain. Is gmwage statistically significant?
(ii) Add lags 1 through 12 of gmwage to the equation in part (i). Do you think it is necessary to include these lags to estimate the long-run effect of minimum wage growth on wage growth in sector 232? Explain.
(iii) Run the regression gemp232 on gmwage, gcpi. Does minimum wage growth appear to have a contemporaneous effect on gemp232?
(iv) Add lags 1 through 12 to the employment growth equation. Does growth in the minimum wage have a statistically significant effect on employment growth, either in the short run or long run? Explain.

C14. Use the data in APPROVAL to answer the following questions. The data set consists of 78 months of data during the presidency of George W. Bush. (The data end in July 2007, before Bush left office.) In addition to economic variables and binary indicators of various events, it includes an approval rate, approve, collected by Gallup. [Caution: One should also attempt Computer Exercise C14 in Chapter 11 to gain a more complete understanding of the econometric issues involved in analyzing these data.]
(i) What is the range of the variable approve? What is its average value?
(ii) Estimate the model
$$approve_t = \beta_0 + \beta_1 lcpifood_t + \beta_2 lrgasprice_t + \beta_3 unemploy_t + u_t,$$
where the first two variables are in logarithmic form, and report the estimates in the usual way.
(iii) Interpret the coefficients in the estimates from part (ii). Comment on the signs and sizes of the effects, as well as statistical significance.
(iv) Add the binary variables sep11 and iraqinvade to the equation from part (ii). Interpret the coefficients on the dummy variables. Are they statistically significant?
(v) Does adding the dummy variables in part (iv) change the other estimates much? Are any of the coefficients in part (iv) hard to rationalize?
(vi) Add lsp500 to the regression in part (iv). Controlling for other factors, does the stock market have an important effect on the presidential approval rating?
Chapter 11. Further Issues in Using OLS with Time Series Data

In Chapter 10, we discussed the finite sample properties of OLS for time series data under increasingly stronger sets of assumptions. Under the full set of classical linear model assumptions for time series, TS.1 through TS.6, OLS has exactly the same desirable properties that we derived for cross-sectional data. Likewise, statistical inference is carried out in the same way as it was for cross-sectional analysis.

From our cross-sectional analysis in Chapter 5, we know that there are good reasons for studying the large sample properties of OLS. For example, if the error terms are not drawn from a normal distribution, then we must rely on the central limit theorem (CLT) to justify the usual OLS test statistics and confidence intervals.

Large sample analysis is even more important in time series contexts. (This is somewhat ironic given that large time series samples can be difficult to come by, but we often have no choice other than to rely on large sample approximations.) In Section 10.3, we explained how the strict exogeneity assumption (TS.3) might be violated in static and distributed lag models. As we will show in Section 11.2, models with lagged dependent variables must violate Assumption TS.3.

Unfortunately, large sample analysis for time series problems is fraught with many more difficulties than it was for cross-sectional analysis. In Chapter 5, we obtained the large sample properties of OLS in the context of random sampling. Things are more complicated when we allow the observations to be correlated across time. Nevertheless, the major limit theorems hold for certain, although not all, time series processes. The key is whether the correlation between the variables at different time periods tends to zero quickly enough. Time series that have substantial temporal correlation require special attention in regression analysis. This chapter will alert you to certain issues pertaining to such series in regression analysis.

11.1 Stationary and Weakly Dependent Time Series

In this section, we present the key concepts that are needed to apply the usual large sample approximations in regression analysis with time series data. The details are not as important as a general understanding of the issues.

11.1a Stationary and Nonstationary Time Series

Historically, the notion of a stationary process has played an important role in the analysis of time series. A stationary time series process is one whose probability distributions are stable over time in the following sense: if we take any collection of random variables in the sequence and then shift that sequence ahead $h$ time periods, the joint probability distribution must remain unchanged. A formal definition of stationarity follows.

Stationary Stochastic Process. The stochastic process $\{x_t: t = 1, 2, \ldots\}$ is stationary if for every collection of time indices $1 \le t_1 < t_2 < \cdots < t_m$, the joint distribution of $(x_{t_1}, x_{t_2}, \ldots, x_{t_m})$ is the same as the joint distribution of $(x_{t_1+h}, x_{t_2+h}, \ldots, x_{t_m+h})$ for all integers $h \ge 1$.

This definition is a little abstract, but its meaning is pretty straightforward.
One implication (by choosing $m = 1$ and $t_1 = 1$) is that $x_t$ has the same distribution as $x_1$ for all $t = 2, 3, \ldots$. In other words, the sequence $\{x_t: t = 1, 2, \ldots\}$ is identically distributed. Stationarity requires even more. For example, the joint distribution of $(x_1, x_2)$ (the first two terms in the sequence) must be the same as the joint distribution of $(x_t, x_{t+1})$ for any $t \ge 1$. Again, this places no restrictions on how $x_t$ and $x_{t+1}$ are related to one another; indeed, they may be highly correlated. Stationarity does require that the nature of any correlation between adjacent terms is the same across all time periods.

A stochastic process that is not stationary is said to be a nonstationary process. Since stationarity is an aspect of the underlying stochastic process and not of the available single realization, it can be very difficult to determine whether the data we have collected were generated by a stationary process. However, it is easy to spot certain sequences that are not stationary. A process with a time trend of the type covered in Section 10.5 is clearly nonstationary: at a minimum, its mean changes over time.

Sometimes, a weaker form of stationarity suffices. If $\{x_t: t = 1, 2, \ldots\}$ has a finite second moment, that is, $E(x_t^2) < \infty$ for all $t$, then the following definition applies.

Covariance Stationary Process. A stochastic process $\{x_t: t = 1, 2, \ldots\}$ with a finite second moment $[E(x_t^2) < \infty]$ is covariance stationary if (i) $E(x_t)$ is constant; (ii) $\mathrm{Var}(x_t)$ is constant; and (iii) for any $t, h \ge 1$, $\mathrm{Cov}(x_t, x_{t+h})$ depends only on $h$ and not on $t$.

Covariance stationarity focuses only on the first two moments of a stochastic process: the mean and variance of the process are constant across time, and the covariance between $x_t$ and $x_{t+h}$ depends only on the distance between the two terms, $h$, and not on the location of the initial time period, $t$. It follows immediately that the correlation between $x_t$ and $x_{t+h}$ also depends only on $h$.

If a stationary process has a finite second moment, then it must be covariance stationary, but the converse is certainly not true. Sometimes, to emphasize that stationarity is a stronger requirement than covariance stationarity, the former is referred to as strict stationarity. Because strict stationarity simplifies the statements of some of our subsequent assumptions, "stationarity" for us will always mean the strict form.

Exploring Further 11.1: Suppose that $\{y_t: t = 1, 2, \ldots\}$ is generated by $y_t = \delta_0 + \delta_1 t + e_t$, where $\delta_1 \neq 0$ and $\{e_t: t = 1, 2, \ldots\}$ is an i.i.d. sequence with mean zero and variance $\sigma_e^2$. (i) Is $\{y_t\}$ covariance stationary? (ii) Is $y_t - E(y_t)$ covariance stationary?
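The process in Exploring Further 11.1 can be examined by simulation. The sketch below (with invented values for $\delta_0$, $\delta_1$, and the error variance) draws many realizations of $y_t = \delta_0 + \delta_1 t + e_t$: the mean of $y_t$ clearly moves with $t$, so the process is not covariance stationary, while the deviations $y_t - E(y_t) = e_t$ are i.i.d. and therefore covariance stationary.

```python
import numpy as np

rng = np.random.default_rng(4)
d0, d1, n_rep, T = 2.0, 0.5, 50_000, 100   # invented settings
e = rng.normal(0, 1, size=(n_rep, T))
t = np.arange(1, T + 1)
y = d0 + d1 * t + e                        # many realizations of the process

# E(y_t) = d0 + d1*t changes with t, so {y_t} is not covariance stationary:
print(y[:, 4].mean(), y[:, 49].mean())     # about 4.5 versus about 27.0

# But y_t - E(y_t) = e_t is i.i.d., hence covariance stationary:
dev = y - (d0 + d1 * t)
print(dev[:, 4].mean(), dev[:, 49].mean()) # both about 0, constant variance
```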
How is stationarity used in time series econometrics? On a technical level, stationarity simplifies statements of the law of large numbers (LLN) and the CLT, although we will not worry about formal statements in this chapter. On a practical level, if we want to understand the relationship between two or more variables using regression analysis, we need to assume some sort of stability over time. If we allow the relationship between two variables (say, $y_t$ and $x_t$) to change arbitrarily in each time period, then we cannot hope to learn much about how a change in one variable affects the other variable if we only have access to a single time series realization. In stating a multiple regression model for time series data, we are assuming a certain form of stationarity in that the $\beta_j$ do not change over time. Further, Assumptions TS.4 and TS.5 imply that the variance of the error process is constant over time and that the correlation between errors in two adjacent periods is equal to zero, which is clearly constant over time.

11.1b Weakly Dependent Time Series

Stationarity has to do with the joint distributions of a process as it moves through time. A very different concept is that of weak dependence, which places restrictions on how strongly related the random variables $x_t$ and $x_{t+h}$ can be as the time distance between them, $h$, gets large. The notion of weak dependence is most easily discussed for a stationary time series: loosely speaking, a stationary time series process $\{x_t: t = 1, 2, \ldots\}$ is said to be weakly dependent if $x_t$ and $x_{t+h}$ are "almost independent" as $h$ increases without bound. A similar statement holds true if the sequence is nonstationary, but then we must assume that the concept of being almost independent does not depend on the starting point, $t$.

The description of weak dependence given in the previous paragraph is necessarily vague. We cannot formally define weak dependence because there is no definition that covers all cases of interest. There are many specific forms of weak dependence that are formally defined, but these are well beyond the scope of this text. [See White (1984), Hamilton (1994), and Wooldridge (1994b) for advanced treatments of these concepts.] For our purposes, an intuitive notion of the meaning of weak dependence is sufficient.

Covariance stationary sequences can be characterized in terms of correlations: a covariance stationary time series is weakly dependent if the correlation between $x_t$ and $x_{t+h}$ goes to zero "sufficiently quickly" as $h \to \infty$. (Because of covariance stationarity, the correlation does not depend on the starting point, $t$.) In other words, as the variables get farther apart in time, the correlation between them becomes smaller and smaller. Covariance stationary sequences where $\mathrm{Corr}(x_t, x_{t+h}) \to 0$ as $h \to \infty$ are said to be asymptotically uncorrelated. Intuitively, this is how we will usually characterize weak dependence. Technically, we need to assume that the correlation converges to zero fast enough, but we will gloss over this.

Why is weak dependence important for regression analysis? Essentially, it replaces the assumption of random sampling in implying that the LLN and the CLT hold. The most well-known CLT for time series data requires stationarity and some form of weak dependence: thus, stationary, weakly dependent time series are ideal for use in multiple regression analysis. In Section 11.2, we will argue that OLS can be justified quite generally by appealing to the LLN and the CLT. Time series that are not weakly dependent (examples of which we will see in Section 11.3) do not generally satisfy the CLT, which is why their use in multiple regression analysis can be tricky.

The simplest example of a weakly dependent time series is an independent, identically distributed sequence: a sequence that is independent is trivially weakly dependent. A more interesting example of a weakly dependent sequence is
$$x_t = e_t + \alpha_1 e_{t-1}, \quad t = 1, 2, \ldots, \qquad [11.1]$$
where $\{e_t: t = 0, 1, \ldots\}$ is an i.i.d. sequence with zero mean and variance $\sigma_e^2$.
The process $\{x_t\}$ in (11.1) is called a moving average process of order one [MA(1)]: $x_t$ is a weighted average of $e_t$ and $e_{t-1}$; in the next period, we drop $e_{t-1}$, and then $x_{t+1}$ depends on $e_{t+1}$ and $e_t$. Setting the coefficient of $e_t$ to 1 in (11.1) is done without loss of generality. [In equation (11.1), we use $x_t$ and $e_t$ as generic labels for time series processes. They need have nothing to do with the explanatory variables or errors in a time series regression model, although both the explanatory variables and errors could be MA(1) processes.]

Why is an MA(1) process weakly dependent? Adjacent terms in the sequence are correlated: because $x_{t+1} = e_{t+1} + \alpha_1 e_t$, $\mathrm{Cov}(x_t, x_{t+1}) = \alpha_1 \mathrm{Var}(e_t) = \alpha_1 \sigma_e^2$. Because $\mathrm{Var}(x_t) = (1 + \alpha_1^2)\sigma_e^2$, $\mathrm{Corr}(x_t, x_{t+1}) = \alpha_1/(1 + \alpha_1^2)$. For example, if $\alpha_1 = .5$, then $\mathrm{Corr}(x_t, x_{t+1}) = .4$. [The maximum positive correlation occurs when $\alpha_1 = 1$, in which case $\mathrm{Corr}(x_t, x_{t+1}) = .5$.] However, once we look at variables in the sequence that are two or more time periods apart, these variables are uncorrelated because they are independent. For example, $x_{t+2} = e_{t+2} + \alpha_1 e_{t+1}$ is independent of $x_t$ because $\{e_t\}$ is independent across $t$. Due to the identical distribution assumption on the $e_t$, $\{x_t\}$ in (11.1) is actually stationary. Thus, an MA(1) is a stationary, weakly dependent sequence, and the LLN and the CLT can be applied to $\{x_t\}$.

A more popular example is the process
$$y_t = \rho_1 y_{t-1} + e_t, \quad t = 1, 2, \ldots. \qquad [11.2]$$
The starting point in the sequence is $y_0$ (at $t = 0$), and $\{e_t: t = 1, 2, \ldots\}$ is an i.i.d. sequence with zero mean and variance $\sigma_e^2$. We also assume that the $e_t$ are independent of $y_0$ and that $E(y_0) = 0$. This is called an autoregressive process of order one [AR(1)].

The crucial assumption for weak dependence of an AR(1) process is the stability condition $|\rho_1| < 1$. Then, we say that $\{y_t\}$ is a stable AR(1) process.

To see that a stable AR(1) process is asymptotically uncorrelated, it is useful to assume that the process is covariance stationary. (In fact, it can generally be shown that $\{y_t\}$ is strictly stationary, but the proof is somewhat technical.) Then, we know that $E(y_t) = E(y_{t-1})$, and from (11.2) with $\rho_1 \neq 1$, this can happen only if $E(y_t) = 0$. Taking the variance of (11.2) and using the fact that $e_t$ and $y_{t-1}$ are independent (and therefore uncorrelated), $\mathrm{Var}(y_t) = \rho_1^2 \mathrm{Var}(y_{t-1}) + \mathrm{Var}(e_t)$, and so, under covariance stationarity, we must have $\sigma_y^2 = \rho_1^2 \sigma_y^2 + \sigma_e^2$. Since $\rho_1^2 < 1$ by the stability condition, we can easily solve for $\sigma_y^2$:
$$\sigma_y^2 = \sigma_e^2/(1 - \rho_1^2). \qquad [11.3]$$

Now, we can find the covariance between $y_t$ and $y_{t+h}$ for $h \ge 1$. Using repeated substitution,
$$y_{t+h} = \rho_1 y_{t+h-1} + e_{t+h} = \rho_1(\rho_1 y_{t+h-2} + e_{t+h-1}) + e_{t+h} = \rho_1^2 y_{t+h-2} + \rho_1 e_{t+h-1} + e_{t+h} = \cdots = \rho_1^h y_t + \rho_1^{h-1} e_{t+1} + \cdots + \rho_1 e_{t+h-1} + e_{t+h}.$$
Because $E(y_t) = 0$ for all $t$, we can multiply this last equation by $y_t$ and take expectations to obtain $\mathrm{Cov}(y_t, y_{t+h})$. Using the fact that $e_{t+j}$ is uncorrelated with $y_t$ for all $j \ge 1$ gives
$$\mathrm{Cov}(y_t, y_{t+h}) = E(y_t y_{t+h}) = \rho_1^h E(y_t^2) + \rho_1^{h-1} E(y_t e_{t+1}) + \cdots + E(y_t e_{t+h}) = \rho_1^h E(y_t^2) = \rho_1^h \sigma_y^2.$$
Because $\sigma_y$ is the standard deviation of both $y_t$ and $y_{t+h}$, we can easily find the correlation between $y_t$ and $y_{t+h}$ for any $h \ge 1$:
$$\mathrm{Corr}(y_t, y_{t+h}) = \mathrm{Cov}(y_t, y_{t+h})/(\sigma_y \sigma_y) = \rho_1^h. \qquad [11.4]$$
In particular, $\mathrm{Corr}(y_t, y_{t+1}) = \rho_1$, so $\rho_1$ is the correlation coefficient between any two adjacent terms in the sequence.
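A quick simulation can confirm equation (11.4). The sketch below (sample sizes and seed are arbitrary) generates many realizations of a stable AR(1) with $\rho_1 = .9$ and compares the sample correlation between $y_t$ and $y_{t+h}$ with $\rho_1^h$, at a date $t$ far from the start so the process has essentially reached its stationary distribution.

```python
import numpy as np

rng = np.random.default_rng(5)
rho, T, n_rep = 0.9, 600, 20_000
e = rng.normal(0, 1, size=(n_rep, T))
y = np.zeros((n_rep, T))
for t in range(1, T):
    y[:, t] = rho * y[:, t - 1] + e[:, t]   # stable AR(1), |rho| < 1

t0 = 500  # long burn-in, so the start-up at y_0 = 0 no longer matters
for h in (1, 5, 10, 20):
    c = np.corrcoef(y[:, t0], y[:, t0 + h])[0, 1]
    print(h, round(c, 3), round(rho ** h, 3))  # sample corr vs rho**h
```

The printed pairs should agree closely, reproducing the .591, .349, and .122 figures discussed next.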
Equation (11.4) is important because it shows that, although $y_t$ and $y_{t+h}$ are correlated for any $h \ge 1$, this correlation gets very small for large $h$: because $|\rho_1| < 1$, $\rho_1^h \to 0$ as $h \to \infty$. Even when $\rho_1$ is large, say, .9, which implies a very high positive correlation between adjacent terms, the correlation between $y_t$ and $y_{t+h}$ tends to zero fairly rapidly. For example, $\mathrm{Corr}(y_t, y_{t+5}) = .591$, $\mathrm{Corr}(y_t, y_{t+10}) = .349$, and $\mathrm{Corr}(y_t, y_{t+20}) = .122$. If $t$ indexes year, this means that the correlation between the outcomes of two $y$ that are 20 years apart is about .122. When $\rho_1$ is smaller, the correlation dies out much more quickly. (You might try $\rho_1 = .5$ to verify this.)

This analysis heuristically demonstrates that a stable AR(1) process is weakly dependent. The AR(1) model is especially important in multiple regression analysis with time series data. We will cover additional applications in Chapter 12 and the use of it for forecasting in Chapter 18.

There are many other types of weakly dependent time series, including hybrids of autoregressive and moving average processes. But the previous examples work well for our purposes.

Before ending this section, we must emphasize one point that often causes confusion in time series econometrics. A trending series, though certainly nonstationary, can be weakly dependent. In fact, in the simple linear time trend model in Chapter 10 [see equation (10.24)], the series $\{y_t\}$ was actually independent. A series that is stationary about its time trend, as well as weakly dependent, is often called a trend-stationary process. (Notice that the name is not completely descriptive because we assume weak dependence along with stationarity.) Such processes can be used in regression analysis just as in Chapter 10, provided appropriate time trends are included in the model.

11.2 Asymptotic Properties of OLS

In Chapter 10, we saw some cases in which the classical linear model assumptions are not satisfied for certain time series problems. In such cases, we must appeal to large sample properties of OLS, just as with cross-sectional analysis. In this section, we state the assumptions and main results that justify OLS more generally. The proofs of the theorems in this chapter are somewhat difficult and therefore omitted. See Wooldridge (1994b).

Assumption TS.1′ (Linearity and Weak Dependence)
We assume the model is exactly as in Assumption TS.1, but now we add the assumption that $\{(x_t, y_t): t = 1, 2, \ldots\}$ is stationary and weakly dependent. In particular, the LLN and the CLT can be applied to sample averages.

The linear in parameters requirement again means that we can write the model as
$$y_t = \beta_0 + \beta_1 x_{t1} + \cdots + \beta_k x_{tk} + u_t, \qquad [11.5]$$
where the $\beta_j$ are the parameters to be estimated. Unlike in Chapter 10, the $x_{tj}$ can include lags of the dependent variable. As usual, lags of explanatory variables are also allowed.

We have included stationarity in Assumption TS.1′ for convenience in stating and interpreting assumptions. If we were carefully working through the asymptotic properties of OLS, as we do in Appendix E, stationarity would also simplify those derivations.
But stationarity is not at all critical for OLS to have its standard asymptotic properties. (As mentioned in Section 11.1, by assuming the $\beta_j$ are constant across time, we are already assuming some form of stability in the distributions over time.) The important extra restriction in Assumption TS.1′ as compared with Assumption TS.1 is the weak dependence assumption. In Section 11.1, we spent some effort discussing weak dependence for a time series process because it is by no means an innocuous assumption. Technically, Assumption TS.1′ requires weak dependence on multiple time series, $y_t$ and the elements of $x_t$, and this entails putting restrictions on the joint distribution across time. The details are not particularly important and are, anyway, beyond the scope of this text; see Wooldridge (1994). It is more important to understand the kinds of persistent time series processes that violate the weak dependence requirement, and we will turn to that in the next section. There, we also discuss the use of persistent processes in multiple regression models.

Naturally, we still rule out perfect collinearity.

Assumption TS.2′ (No Perfect Collinearity)
Same as Assumption TS.2.

Assumption TS.3′ (Zero Conditional Mean)
The explanatory variables $x_t = (x_{t1}, x_{t2}, \ldots, x_{tk})$ are contemporaneously exogenous, as in equation (10.10): $E(u_t|x_t) = 0$.

This is the most natural assumption concerning the relationship between $u_t$ and the explanatory variables. It is much weaker than Assumption TS.3 because it puts no restrictions on how $u_t$ is related to the explanatory variables in other time periods. We will see examples that satisfy TS.3′ shortly. By stationarity, if contemporaneous exogeneity holds for one time period, it holds for them all. Relaxing stationarity would simply require us to assume the condition holds for all $t = 1, 2, \ldots$.

For certain purposes, it is useful to know that the following consistency result only requires $u_t$ to have zero unconditional mean and to be uncorrelated with each $x_{tj}$:
$$E(u_t) = 0, \quad \mathrm{Cov}(x_{tj}, u_t) = 0, \quad j = 1, \ldots, k. \qquad [11.6]$$
We will work mostly with the zero conditional mean assumption because it leads to the most straightforward asymptotic analysis.

Theorem 11.1 (Consistency of OLS)
Under TS.1′, TS.2′, and TS.3′, the OLS estimators are consistent: $\mathrm{plim}\ \hat{\beta}_j = \beta_j$, $j = 0, 1, \ldots, k$.

There are some key practical differences between Theorems 10.1 and 11.1. First, in Theorem 11.1, we conclude that the OLS estimators are consistent, but not necessarily unbiased. Second, in Theorem 11.1, we have weakened the sense in which the explanatory variables must be exogenous, but weak dependence is required in the underlying time series. Weak dependence is also crucial in obtaining approximate distributional results, which we cover later.

Example 11.1 (Static Model)
Consider a static model with two explanatory variables:
$$y_t = \beta_0 + \beta_1 z_{t1} + \beta_2 z_{t2} + u_t. \qquad [11.7]$$
Under weak dependence, the condition sufficient for consistency of OLS is
$$E(u_t|z_{t1}, z_{t2}) = 0. \qquad [11.8]$$
This rules out omitted variables that are in $u_t$ and are correlated with either $z_{t1}$ or $z_{t2}$. Also, no function of $z_{t1}$ or $z_{t2}$ can be correlated with $u_t$, and so Assumption TS.3′ rules out misspecified functional form, just as in the cross-sectional case. Other problems, such as measurement error in the variables $z_{t1}$ or $z_{t2}$, can cause (11.8) to fail.
Importantly, Assumption TS.3′ does not rule out correlation between, say, $u_{t-1}$ and $z_{t1}$. This type of correlation could arise if $z_{t1}$ is related to past $y_{t-1}$, such as
$$z_{t1} = \delta_0 + \delta_1 y_{t-1} + v_t. \qquad [11.9]$$
For example, $z_{t1}$ might be a policy variable, such as the monthly percentage change in the money supply, and this change might depend on last month's rate of inflation ($y_{t-1}$). Such a mechanism generally causes $z_{t1}$ and $u_{t-1}$ to be correlated (as can be seen by plugging in for $y_{t-1}$). This kind of feedback is allowed under Assumption TS.3′.

Example 11.2 (Finite Distributed Lag Model)
In the finite distributed lag model,
$$y_t = \alpha_0 + \delta_0 z_t + \delta_1 z_{t-1} + \delta_2 z_{t-2} + u_t, \qquad [11.10]$$
a very natural assumption is that the expected value of $u_t$, given current and all past values of $z$, is zero:
$$E(u_t|z_t, z_{t-1}, z_{t-2}, z_{t-3}, \ldots) = 0. \qquad [11.11]$$
This means that, once $z_t$, $z_{t-1}$, and $z_{t-2}$ are included, no further lags of $z$ affect $E(y_t|z_t, z_{t-1}, z_{t-2}, z_{t-3}, \ldots)$; if this were not true, we would put further lags into the equation. For example, $y_t$ could be the annual percentage change in investment and $z_t$ a measure of interest rates during year $t$. When we set $x_t = (z_t, z_{t-1}, z_{t-2})$, Assumption TS.3′ is then satisfied: OLS will be consistent. As in the previous example, TS.3′ does not rule out feedback from $y$ to future values of $z$.

The previous two examples do not necessarily require asymptotic theory because the explanatory variables could be strictly exogenous. The next example clearly violates the strict exogeneity assumption; therefore, we can only appeal to large sample properties of OLS.

Example 11.3 [AR(1) Model]
Consider the AR(1) model,
$$y_t = \beta_0 + \beta_1 y_{t-1} + u_t, \qquad [11.12]$$
where the error $u_t$ has a zero expected value, given all past values of $y$:
$$E(u_t|y_{t-1}, y_{t-2}, \ldots) = 0. \qquad [11.13]$$
Combined, these two equations imply that
$$E(y_t|y_{t-1}, y_{t-2}, \ldots) = E(y_t|y_{t-1}) = \beta_0 + \beta_1 y_{t-1}. \qquad [11.14]$$
This result is very important. First, it means that, once $y$ lagged one period has been controlled for, no further lags of $y$ affect the expected value of $y_t$. (This is where the name "first order" originates.) Second, the relationship is assumed to be linear.

Because $x_t$ contains only $y_{t-1}$, equation (11.13) implies that Assumption TS.3′ holds. By contrast, the strict exogeneity assumption needed for unbiasedness, Assumption TS.3, does not hold. Since the set of explanatory variables for all time periods includes all of the values on $y$ except the last ($y_0, y_1, \ldots, y_{n-1}$), Assumption TS.3 requires that, for all $t$, $u_t$ is uncorrelated with each of $y_0, y_1, \ldots, y_{n-1}$. This cannot be true. In fact, because $u_t$ is uncorrelated with $y_{t-1}$ under (11.13), $u_t$ and $y_t$ must be correlated: it is easily seen that $\mathrm{Cov}(y_t, u_t) = \mathrm{Var}(u_t) > 0$. Therefore, a model with a lagged dependent variable cannot satisfy the strict exogeneity Assumption TS.3.

For the weak dependence condition to hold, we must assume that $|\beta_1| < 1$, as we discussed in Section 11.1. If this condition holds, then Theorem 11.1 implies that the OLS estimator from the regression of $y_t$ on $y_{t-1}$ produces consistent estimators of $\beta_0$ and $\beta_1$. Unfortunately, $\hat{\beta}_1$ is biased, and this bias can be large if the sample size is small or if $\beta_1$ is near 1. (For $\beta_1$ near 1, $\hat{\beta}_1$ can have a severe downward bias.) In moderate to large samples, $\hat{\beta}_1$ should be a good estimator of $\beta_1$.
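The contrast between bias and consistency in Example 11.3 is easy to see in a small Monte Carlo study. The sketch below (all settings are invented) simulates stable AR(1) data with $\beta_1 = .9$ and reports the average OLS slope estimate: noticeably below .9 in small samples, but approaching .9 as $n$ grows, in line with Theorem 11.1.

```python
import numpy as np

rng = np.random.default_rng(6)
beta1, n_rep = 0.9, 5_000

def mean_ols_slope(n):
    """Average OLS slope from regressing y_t on y_{t-1}, over n_rep draws."""
    e = rng.normal(0, 1, size=(n_rep, n + 1))
    y = np.zeros((n_rep, n + 1))
    for t in range(1, n + 1):
        y[:, t] = beta1 * y[:, t - 1] + e[:, t]     # stable AR(1)
    x, yy = y[:, :-1], y[:, 1:]                     # regressor and regressand
    xd = x - x.mean(axis=1, keepdims=True)
    yd = yy - yy.mean(axis=1, keepdims=True)
    slopes = (xd * yd).sum(axis=1) / (xd ** 2).sum(axis=1)
    return slopes.mean()

for n in (25, 100, 800):
    print(n, round(mean_ols_slope(n), 3))   # downward bias shrinks with n
```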
When using the standard inference procedures, we need to impose versions of the homoskedasticity and no serial correlation assumptions. These are less restrictive than their classical linear model counterparts from Chapter 10.

Assumption TS.4′ (Homoskedasticity)
The errors are contemporaneously homoskedastic, that is, $\mathrm{Var}(u_t|x_t) = \sigma^2$.

Assumption TS.5′ (No Serial Correlation)
For all $t \neq s$, $E(u_t u_s|x_t, x_s) = 0$.

In TS.4′, note how we condition only on the explanatory variables at time $t$ (compare to TS.4). In TS.5′, we condition only on the explanatory variables in the time periods coinciding with $u_t$ and $u_s$. As stated, this assumption is a little difficult to interpret, but it is the right condition for studying the large sample properties of OLS in a variety of time series regressions. When considering TS.5′, we often ignore the conditioning on $x_t$ and $x_s$, and we think about whether $u_t$ and $u_s$ are uncorrelated, for all $t \neq s$.

Serial correlation is often a problem in static and finite distributed lag regression models: nothing guarantees that the unobservables $u_t$ are uncorrelated over time. Importantly, Assumption TS.5′ does hold in the AR(1) model stated in equations (11.12) and (11.13). Since the explanatory variable at time $t$ is $y_{t-1}$, we must show that $E(u_t u_s|y_{t-1}, y_{s-1}) = 0$ for all $t \neq s$. To see this, suppose that $s < t$. (The other case follows by symmetry.) Then, since $u_s = y_s - \beta_0 - \beta_1 y_{s-1}$, $u_s$ is a function of $y$ dated before time $t$. But by (11.13), $E(u_t|u_s, y_{t-1}, y_{s-1}) = 0$, and so $E(u_t u_s|u_s, y_{t-1}, y_{s-1}) = u_s E(u_t|u_s, y_{t-1}, y_{s-1}) = 0$. By the law of iterated expectations (see Appendix B), $E(u_t u_s|y_{t-1}, y_{s-1}) = 0$. This is very important: as long as only one lag belongs in (11.12), the errors must be serially uncorrelated. We will discuss this feature of dynamic models more generally in Section 11.4.

We now obtain an asymptotic result that is practically identical to the cross-sectional case.

Theorem 11.2 (Asymptotic Normality of OLS)
Under TS.1′ through TS.5′, the OLS estimators are asymptotically normally distributed. Further, the usual OLS standard errors, t statistics, F statistics, and LM statistics are asymptotically valid.

This theorem provides additional justification for at least some of the examples estimated in Chapter 10: even if the classical linear model assumptions do not hold, OLS is still consistent, and the usual inference procedures are valid. Of course, this hinges on TS.1′ through TS.5′ being true. In the next section, we discuss ways in which the weak dependence assumption can fail. The problems of serial correlation and heteroskedasticity are treated in Chapter 12.
Example 11.4 (Efficient Markets Hypothesis)
We can use asymptotic analysis to test a version of the efficient markets hypothesis (EMH). Let $y_t$ be the weekly percentage return (from Wednesday close to Wednesday close) on the New York Stock Exchange composite index. A strict form of the EMH states that information observable to the market prior to week $t$ should not help to predict the return during week $t$. If we use only past information on $y$, the EMH is stated as
$$E(y_t|y_{t-1}, y_{t-2}, \ldots) = E(y_t). \qquad [11.15]$$
If (11.15) is false, then we could use information on past weekly returns to predict the current return. The EMH presumes that such investment opportunities will be noticed and will disappear almost instantaneously.

One simple way to test (11.15) is to specify the AR(1) model in (11.12) as the alternative model. Then, the null hypothesis is easily stated as H$_0$: $\beta_1 = 0$. Under the null hypothesis, Assumption TS.3′ is true by (11.15), and, as we discussed earlier, serial correlation is not an issue. The homoskedasticity assumption is $\mathrm{Var}(y_t|y_{t-1}) = \mathrm{Var}(y_t) = \sigma^2$, which we just assume is true for now. Under the null hypothesis, stock returns are serially uncorrelated, so we can safely assume that they are weakly dependent. Then, Theorem 11.2 says we can use the usual OLS t statistic for $\hat{\beta}_1$ to test H$_0$: $\beta_1 = 0$ against H$_1$: $\beta_1 \neq 0$.

The weekly returns in NYSE are computed using data from January 1976 through March 1989. (In the rare case that Wednesday was a holiday, the close at the next trading day was used.) The average weekly return over this period was .196 in percentage form, with the largest weekly return being 8.45% and the smallest being $-15.32$% (during the stock market crash of October 1987). Estimation of the AR(1) model gives
$$\widehat{return}_t = .180 + .059\, return_{t-1}$$
$$\phantom{\widehat{return}_t =}\ (.081)\quad (.038) \qquad [11.16]$$
$$n = 689,\ R^2 = .0035,\ \bar{R}^2 = .0020.$$
The t statistic for the coefficient on $return_{t-1}$ is about 1.55, and so H$_0$: $\beta_1 = 0$ cannot be rejected against the two-sided alternative, even at the 10% significance level. The estimate does suggest a slight positive correlation in the NYSE return from one week to the next, but it is not strong enough to warrant rejection of the EMH.

In the previous example, using an AR(1) model to test the EMH might not detect correlation between weekly returns that are more than one week apart. It is easy to estimate models with more than one lag. For example, an autoregressive model of order two, or AR(2) model, is
$$y_t = \beta_0 + \beta_1 y_{t-1} + \beta_2 y_{t-2} + u_t, \qquad [11.17]$$
$$E(u_t|y_{t-1}, y_{t-2}, \ldots) = 0.$$
There are stability conditions on $\beta_1$ and $\beta_2$ that are needed to ensure that the AR(2) process is weakly dependent, but this is not an issue here because the null hypothesis states that the EMH holds:
$$\text{H}_0: \beta_1 = \beta_2 = 0. \qquad [11.18]$$
If we add the homoskedasticity assumption $\mathrm{Var}(u_t|y_{t-1}, y_{t-2}) = \sigma^2$, we can use a standard F statistic to test (11.18). If we estimate an AR(2) model for $return_t$, we obtain
$$\widehat{return}_t = .186 + .060\, return_{t-1} - .038\, return_{t-2}$$
$$\phantom{\widehat{return}_t =}\ (.081)\quad (.038) \qquad\qquad (.038)$$
$$n = 688,\ R^2 = .0048,\ \bar{R}^2 = .0019,$$
where we lose one more observation because of the additional lag in the equation. The two lags are individually insignificant at the 10% level. They are also jointly insignificant: using $R^2 = .0048$, we find the F statistic is approximately $F = 1.65$; the p-value for this F statistic (with 2 and 685 degrees of freedom) is about .193. Thus, we do not reject (11.18) at even the 15% significance level.
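The NYSE data set is not reproduced here, so the following sketch runs the same AR(1) and AR(2) tests on a placeholder series of simulated weekly returns (white noise, as the EMH null implies). With the actual returns in place of the simulated series, the same two regressions would reproduce the tests behind (11.16) and the F statistic above.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Placeholder data: under the EMH null, returns are unpredictable noise.
rng = np.random.default_rng(7)
ret = pd.Series(rng.normal(0.2, 2.1, size=689), name="ret")  # weekly % returns

df = pd.DataFrame({"ret": ret,
                   "lag1": ret.shift(1),
                   "lag2": ret.shift(2)})

# AR(1) test of the EMH: t statistic on the first lag
ar1 = sm.OLS(df["ret"], sm.add_constant(df[["lag1"]]), missing="drop").fit()
print(ar1.tvalues["lag1"])

# AR(2) test: H0: beta1 = beta2 = 0 via the usual F statistic
ar2 = sm.OLS(df["ret"], sm.add_constant(df[["lag1", "lag2"]]),
             missing="drop").fit()
R = np.array([[0.0, 1.0, 0.0],   # each row picks out one lag coefficient
              [0.0, 0.0, 1.0]])
print(ar2.f_test(R))
```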
Example 11.5 (Expectations Augmented Phillips Curve)
A linear version of the expectations augmented Phillips curve can be written as
$$inf_t - inf_t^e = \beta_1(unem_t - \mu_0) + e_t,$$
where $\mu_0$ is the natural rate of unemployment and $inf_t^e$ is the expected rate of inflation formed in year $t-1$. This model assumes that the natural rate is constant, something that macroeconomists question. The difference between actual unemployment and the natural rate is called cyclical unemployment, while the difference between actual and expected inflation is called unanticipated inflation. The error term, $e_t$, is called a supply shock by macroeconomists. If there is a tradeoff between unanticipated inflation and cyclical unemployment, then $\beta_1 < 0$. [For a detailed discussion of the expectations augmented Phillips curve, see Mankiw (1994, Section 11.2).]

To complete this model, we need to make an assumption about inflationary expectations. Under adaptive expectations, the expected value of current inflation depends on recently observed inflation. A particularly simple formulation is that expected inflation this year is last year's inflation: $inf_t^e = inf_{t-1}$. (See Section 18.1 for an alternative formulation of adaptive expectations.) Under this assumption, we can write
$$inf_t - inf_{t-1} = \beta_0 + \beta_1 unem_t + e_t$$
or
$$\Delta inf_t = \beta_0 + \beta_1 unem_t + e_t,$$
where $\Delta inf_t = inf_t - inf_{t-1}$ and $\beta_0 = -\beta_1 \mu_0$. ($\beta_0$ is expected to be positive, since $\beta_1 < 0$ and $\mu_0 > 0$.) Therefore, under adaptive expectations, the expectations augmented Phillips curve relates the change in inflation to the level of unemployment and a supply shock, $e_t$. If $e_t$ is uncorrelated with $unem_t$, as is typically assumed, then we can consistently estimate $\beta_0$ and $\beta_1$ by OLS. (We do not have to assume that, say, future unemployment rates are unaffected by the current supply shock.) We assume that TS.1′ through TS.5′ hold. Using the data through 1996 in PHILLIPS, we estimate
$$\widehat{\Delta inf}_t = 3.03 - .543\, unem_t$$
$$\phantom{\widehat{\Delta inf}_t =}\ (1.38)\quad (.230) \qquad [11.19]$$
$$n = 48,\ R^2 = .108,\ \bar{R}^2 = .088.$$
The tradeoff between cyclical unemployment and unanticipated inflation is pronounced in equation (11.19): a one-point increase in unem lowers unanticipated inflation by over one-half of a point. The effect is statistically significant (two-sided p-value = .023). We can contrast this with the static Phillips curve in Example 10.1, where we found a slightly positive relationship between inflation and unemployment.

Because we can write the natural rate as $\mu_0 = \beta_0/(-\beta_1)$, we can use (11.19) to obtain our own estimate of the natural rate: $\hat{\mu}_0 = \hat{\beta}_0/(-\hat{\beta}_1) = 3.03/.543 \approx 5.58$. Thus, we estimate the natural rate to be about 5.6, which is well within the range suggested by macroeconomists: historically, 5% to 6% is a common range cited for the natural rate of unemployment. A standard error of this estimate is difficult to obtain because we have a nonlinear function of the OLS estimators. [Wooldridge (2010, Chapter 3) contains the theory for general nonlinear functions.] In the current application, the standard error is .657, which leads to an asymptotic 95% confidence interval (based on the standard normal distribution) of about 4.29 to 6.87 for the natural rate.
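The standard error for the natural rate mentioned above can be computed from any fitted regression with the delta method. The sketch below uses simulated data in place of PHILLIPS (so the printed numbers will not match .657), but the delta-method step itself, applied to the real data, is one way such a standard error for $\hat{\mu}_0 = \hat{\beta}_0/(-\hat{\beta}_1)$ can be obtained.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Illustrative data standing in for the PHILLIPS series (invented numbers).
rng = np.random.default_rng(8)
unem = pd.Series(rng.normal(5.6, 1.0, size=48), name="unem")
dinf = 3.03 - 0.543 * unem + rng.normal(0, 1.5, size=48)

res = sm.OLS(dinf, sm.add_constant(unem)).fit()
b0, b1 = res.params                    # intercept and slope
mu0 = b0 / (-b1)                       # estimated natural rate

# Delta method: gradient of g(b0, b1) = -b0/b1, then se = sqrt(grad' V grad)
grad = np.array([-1.0 / b1, b0 / b1 ** 2])
se = np.sqrt(grad @ res.cov_params().values @ grad)
print(round(mu0, 2), round(se, 3))
```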
Under Assumptions TS.1′ through TS.5′, we can show that the OLS estimators are asymptotically efficient in the class of estimators described in Theorem 5.3, but we replace the cross-sectional observation index $i$ with the time series index $t$. Finally, models with trending explanatory variables can effectively satisfy Assumptions TS.1′ through TS.5′, provided they are trend stationary. As long as time trends are included in the equations when needed, the usual inference procedures are asymptotically valid.

Exploring Further 11.2: Suppose that expectations are formed as $inf_t^e = (1/2)inf_{t-1} + (1/2)inf_{t-2}$. What regression would you run to estimate the expectations augmented Phillips curve?

11.3 Using Highly Persistent Time Series in Regression Analysis

The previous section shows that, provided the time series we use are weakly dependent, usual OLS inference procedures are valid under assumptions weaker than the classical linear model assumptions. Unfortunately, many economic time series cannot be characterized by weak dependence. Using time series with strong dependence in regression analysis poses no problem, if the CLM assumptions in Chapter 10 hold. But the usual inference procedures are very susceptible to violation of these assumptions when the data are not weakly dependent, because then we cannot appeal to the LLN and the CLT. In this section, we provide some examples of highly persistent (or strongly dependent) time series and show how they can be transformed for use in regression analysis.

11.3a Highly Persistent Time Series

In the simple AR(1) model (11.2), the assumption $|\rho_1| < 1$ is crucial for the series to be weakly dependent. It turns out that many economic time series are better characterized by the AR(1) model with $\rho_1 = 1$. In this case, we can write
$$y_t = y_{t-1} + e_t, \quad t = 1, 2, \ldots, \qquad [11.20]$$
where we again assume that $\{e_t: t = 1, 2, \ldots\}$ is independent and identically distributed with mean zero and variance $\sigma_e^2$. We assume that the initial value, $y_0$, is independent of $e_t$ for all $t \ge 1$.

The process in (11.20) is called a random walk. The name comes from the fact that $y$ at time $t$ is obtained by starting at the previous value, $y_{t-1}$, and adding a zero mean random variable that is independent of $y_{t-1}$. Sometimes, a random walk is defined differently by assuming different properties of the innovations, $e_t$ (such as lack of correlation rather than independence), but the current definition suffices for our purposes.

First, we find the expected value of $y_t$. This is most easily done by using repeated substitution to get
$$y_t = e_t + e_{t-1} + \cdots + e_1 + y_0.$$
Taking the expected value of both sides gives
$$E(y_t) = E(e_t) + E(e_{t-1}) + \cdots + E(e_1) + E(y_0) = E(y_0), \quad \text{for all } t \ge 1.$$
Therefore, the expected value of a random walk does not depend on $t$. A popular assumption is that $y_0 = 0$ (the process begins at zero at time zero), in which case $E(y_t) = 0$ for all $t$.

By contrast, the variance of a random walk does change with $t$. To compute the variance of a random walk, for simplicity we assume that $y_0$ is nonrandom, so that $\mathrm{Var}(y_0) = 0$; this does not affect any important conclusions. Then, by the i.i.d. assumption for $\{e_t\}$,
$$\mathrm{Var}(y_t) = \mathrm{Var}(e_t) + \mathrm{Var}(e_{t-1}) + \cdots + \mathrm{Var}(e_1) = \sigma_e^2 t. \qquad [11.21]$$
In other words, the variance of a random walk increases as a linear function of time. This shows that the process cannot be stationary.

Even more importantly, a random walk displays highly persistent behavior in the sense that the value of $y$ today is important for determining the value of $y$ in the very distant future. To see this, write for $h$ periods hence,
$$y_{t+h} = e_{t+h} + e_{t+h-1} + \cdots + e_{t+1} + y_t.$$
Now, suppose at time $t$, we want to compute the expected value of $y_{t+h}$ given the current value $y_t$. Since the expected value of $e_{t+j}$, given $y_t$, is zero for all $j \ge 1$, we have
$$E(y_{t+h}|y_t) = y_t, \quad \text{for all } h \ge 1. \qquad [11.22]$$
This means that, no matter how far in the future we look, our best prediction of $y_{t+h}$ is today's value, $y_t$. We can contrast this with the stable AR(1) case, where a similar argument can be used to show that
$$E(y_{t+h}|y_t) = \rho_1^h y_t, \quad \text{for all } h \ge 1.$$
Under stability, $|\rho_1| < 1$, and so $E(y_{t+h}|y_t)$ approaches zero as $h \to \infty$: the value of $y_t$ becomes less and less important, and $E(y_{t+h}|y_t)$ gets closer and closer to the unconditional expected value, $E(y_t) = 0$.

When $h = 1$, equation (11.22) is reminiscent of the adaptive expectations assumption we used for the inflation rate in Example 11.5: if inflation follows a random walk, then the expected value of $inf_t$, given past values of inflation, is simply $inf_{t-1}$. Thus, a random walk model for inflation justifies the use of adaptive expectations.

We can also see that the correlation between $y_t$ and $y_{t+h}$ is close to one for large $t$ when $\{y_t\}$ follows a random walk. If $\mathrm{Var}(y_0) = 0$, it can be shown that
$$\mathrm{Corr}(y_t, y_{t+h}) = \sqrt{t/(t+h)}.$$
Thus, the correlation depends on the starting point, $t$ (so that $\{y_t\}$ is not covariance stationary). Further, although for fixed $t$ the correlation tends to zero as $h \to \infty$, it does not do so very quickly. In fact, the larger $t$ is, the more slowly the correlation tends to zero as $h$ gets large. If we choose $h$ to be something large, say, $h = 100$, we can always choose a large enough $t$ such that the correlation between $y_t$ and $y_{t+h}$ is arbitrarily close to one. (If $h = 100$ and we want the correlation to be greater than .95, then $t \ge 1{,}000$ does the trick.) Therefore, a random walk does not satisfy the requirement of an asymptotically uncorrelated sequence.

Figure 11.1 plots two realizations of a random walk, generated from a computer, with initial value $y_0 = 0$ and $e_t \sim$ Normal(0, 1). Generally, it is not easy to look at a time series plot and determine whether it is a random walk. Next, we will discuss an informal method for making the distinction between weakly and highly dependent sequences; we will study formal statistical tests in Chapter 18.

A series that is generally thought to be well characterized by a random walk is the three-month T-bill rate. Annual data are plotted in Figure 11.2 for the years 1948 through 1996.

A random walk is a special case of what is known as a unit root process. The name comes from the fact that $\rho_1 = 1$ in the AR(1) model. A more general class of unit root processes is generated as in (11.20), but $\{e_t\}$ is now allowed to be a general, weakly dependent series. [For example, $\{e_t\}$ could itself follow an MA(1) or a stable AR(1) process.] When $\{e_t\}$ is not an i.i.d. sequence, the properties of the random walk we derived earlier no longer hold. But the key feature of $\{y_t\}$ is preserved: the value of $y$ today is highly correlated with $y$ even in the distant future.
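Both properties of the random walk, the linearly growing variance in (11.21) and the slowly decaying correlation $\mathrm{Corr}(y_t, y_{t+h}) = \sqrt{t/(t+h)}$, are easy to verify by simulation. The sketch below (sample sizes arbitrary) uses $t = 1{,}000$ and $h = 100$, the case discussed above, where the correlation should be about .95.

```python
import numpy as np

rng = np.random.default_rng(9)
n_rep, T = 10_000, 1_100
y = rng.normal(0, 1, size=(n_rep, T)).cumsum(axis=1)   # random walks, y0 = 0

t, h = 1_000, 100
print(round(y[:, t - 1].var(), 1))                     # about sigma_e^2 * t = 1000
corr = np.corrcoef(y[:, t - 1], y[:, t + h - 1])[0, 1]
print(round(corr, 3), round(np.sqrt(t / (t + h)), 3))  # both about .953
```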
From a policy perspective, it is often important to know whether an economic time series is highly persistent or not. Consider the case of gross domestic product in the United States. If GDP is asymptotically uncorrelated, then the level of GDP in the coming year is at best weakly related to what GDP was, say, 30 years ago. This means a policy that affected GDP long ago has very little lasting impact. On the other hand, if GDP is strongly dependent, then next year's GDP can be highly correlated with the GDP from many years ago. Then, we should recognize that a policy that causes a discrete change in GDP can have long-lasting effects.

[Figure 11.1: Two realizations of the random walk $y_t = y_{t-1} + e_t$, with $y_0 = 0$, $e_t \sim$ Normal(0, 1), and $n = 50$.]

[Figure 11.2: The U.S. three-month T-bill rate, for the years 1948-1996.]

It is extremely important not to confuse trending and highly persistent behaviors. A series can be trending but not highly persistent, as we saw in Chapter 10. Further, factors such as interest rates, inflation rates, and unemployment rates are thought by many to be highly persistent, but they have no obvious upward or downward trend. However, it is often the case that a highly persistent series also contains a clear trend. One model that leads to this behavior is the random walk with drift:
$$y_t = \alpha_0 + y_{t-1} + e_t, \quad t = 1, 2, \ldots, \qquad [11.23]$$
where $\{e_t: t = 1, 2, \ldots\}$ and $y_0$ satisfy the same properties as in the random walk model. What is new is the parameter $\alpha_0$, which is called the drift term. Essentially, to generate $y_t$, the constant $\alpha_0$ is added along with the random noise $e_t$ to the previous value $y_{t-1}$. We can show that the expected value of $y_t$ follows a linear time trend by using repeated substitution:
$$y_t = \alpha_0 t + e_t + e_{t-1} + \cdots + e_1 + y_0.$$
Therefore, if $y_0 = 0$, $E(y_t) = \alpha_0 t$: the expected value of $y_t$ is growing over time if $\alpha_0 > 0$ and shrinking over time if $\alpha_0 < 0$. By reasoning as we did in the pure random walk case, we can show that $E(y_{t+h}|y_t) = \alpha_0 h + y_t$, and so the best prediction of $y_{t+h}$ at time $t$ is $y_t$ plus the drift $\alpha_0 h$. The variance of $y_t$ is the same as it was in the pure random walk case.

Figure 11.3 contains a realization of a random walk with drift, where $n = 50$, $y_0 = 0$, $\alpha_0 = 2$, and the $e_t$ are Normal(0, 9) random variables. As can be seen from this graph, $y_t$ tends to grow over time, but the series does not regularly return to the trend line.

[Figure 11.3: A realization of the random walk with drift $y_t = 2 + y_{t-1} + e_t$, with $y_0 = 0$, $e_t \sim$ Normal(0, 9), and $n = 50$. The dashed line is the expected value of $y_t$, $E(y_t) = 2t$.]
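A realization like the one in Figure 11.3 can be generated in a few lines. The sketch below uses the same settings as the figure ($\alpha_0 = 2$, $e_t \sim$ Normal(0, 9), $n = 50$), though, of course, a different random draw produces a different path.

```python
import numpy as np

rng = np.random.default_rng(10)
a0, n = 2.0, 50
e = rng.normal(0, 3, size=n)                 # sd 3, so Var(e_t) = 9
y = a0 * np.arange(1, n + 1) + e.cumsum()    # y_t = a0*t + e_1 + ... + e_t, y0 = 0

# Deviations from the trend line E(y_t) = 2t are the cumulated shocks, so
# the series wanders away from the trend for long stretches.
dev = y - a0 * np.arange(1, n + 1)
print(y.round(1))
print(dev.round(1))
```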
A random walk with drift is another example of a unit root process, because it is the special case ρ_1 = 1 in an AR(1) model with an intercept:

y_t = α_0 + ρ_1 y_{t-1} + e_t.

When ρ_1 = 1 and {e_t} is any weakly dependent process, we obtain a whole class of highly persistent time series processes that also have linearly trending means.

11-3b Transformations on Highly Persistent Time Series

Using time series with strong persistence of the type displayed by a unit root process in a regression equation can lead to very misleading results if the CLM assumptions are violated. We will study the spurious regression problem in more detail in Chapter 18, but for now we must be aware of potential problems. Fortunately, simple transformations are available that render a unit root process weakly dependent.

Weakly dependent processes are said to be integrated of order zero, or I(0). Practically, this means that nothing needs to be done to such series before using them in regression analysis: averages of such sequences already satisfy the standard limit theorems. Unit root processes, such as a random walk (with or without drift), are said to be integrated of order one, or I(1). This means that the first difference of the process is weakly dependent (and often stationary). A time series that is I(1) is often said to be a difference-stationary process, although the name is somewhat misleading with its emphasis on stationarity after differencing rather than weak dependence in the differences.

The concept of an I(1) process is easiest to see for a random walk. With {y_t} generated as in (11.20) for t = 1, 2, …,

Δy_t = y_t − y_{t-1} = e_t, t = 2, 3, …;   (11.24)

therefore, the first-differenced series {Δy_t: t = 2, 3, …} is actually an i.i.d. sequence. More generally, if {y_t} is generated by (11.20), where {e_t} is any weakly dependent process, then {Δy_t} is weakly dependent. Thus, when we suspect processes are integrated of order one, we often first difference in order to use them in regression analysis; we will see some examples later. (Incidentally, the symbol Δ can mean "change" as well as "difference." In actual data sets, if an original variable is named y, then its change or difference is often denoted cy or dy. For example, the change in price might be denoted cprice.)

Many time series y_t that are strictly positive are such that log(y_t) is integrated of order one. In this case, we can use the first difference in the logs, Δlog(y_t) = log(y_t) − log(y_{t-1}), in regression analysis. Alternatively, since

Δlog(y_t) ≈ (y_t − y_{t-1})/y_{t-1},   (11.25)

we can use the proportionate or percentage change in y_t directly; this is what we did in Example 11.4, where, rather than stating the efficient markets hypothesis in terms of the stock price, p_t, we used the weekly percentage change, return_t = 100[(p_t − p_{t-1})/p_{t-1}]. The quantity in equation (11.25) is often called the growth rate, measured as a proportionate change. When using a particular data set, it is important to know how the growth rates are measured, whether as a proportionate or a percentage change.
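In practice, these transformations are one-line operations on the data. A minimal sketch (NumPy; the series y is made-up illustrative data, not from any of the book's data sets):

```python
import numpy as np

# First differences and (log) growth rates for a strictly positive series y,
# the transformations in equations (11.24) and (11.25).
y = np.array([100.0, 103.0, 105.0, 104.0, 108.0])

dy = np.diff(y)                    # Delta y_t = y_t - y_{t-1}
gy_log = np.diff(np.log(y))        # Delta log(y_t), approx. proportionate change
gy_pct = np.diff(y) / y[:-1]       # exact proportionate change (y_t - y_{t-1})/y_{t-1}

print(np.round(gy_log, 4))   # [ 0.0296  0.0192 -0.0096  0.0377]
print(np.round(gy_pct, 4))   # [ 0.03    0.0194 -0.0095  0.0385]
```

As the output shows, the log difference and the exact proportionate change are close for small changes, which is why either can be used as a growth rate.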
Sometimes, if an original variable is y, its growth rate is denoted gy, so that, for each t, gy_t = log(y_t) − log(y_{t-1}) or gy_t = (y_t − y_{t-1})/y_{t-1}. Often these quantities are multiplied by 100 to turn a proportionate change into a percentage change.

Differencing time series before using them in regression analysis has another benefit: it removes any linear time trend. This is easily seen by writing a linearly trending variable as

y_t = γ_0 + γ_1 t + v_t,

where v_t has a zero mean. Then Δy_t = γ_1 + Δv_t, and so E(Δy_t) = γ_1 + E(Δv_t) = γ_1. In other words, E(Δy_t) is constant. The same argument works for Δlog(y_t) when log(y_t) follows a linear time trend. Therefore, rather than including a time trend in a regression, we can instead difference those variables that show obvious trends.

11-3c Deciding Whether a Time Series Is I(1)

Determining whether a particular time series realization is the outcome of an I(1) versus an I(0) process can be quite difficult. Statistical tests can be used for this purpose, but these are more advanced; we provide an introductory treatment in Chapter 18.

There are informal methods that provide useful guidance about whether a time series process is roughly characterized by weak dependence. A very simple tool is motivated by the AR(1) model: if |ρ_1| < 1, then the process is I(0), but it is I(1) if ρ_1 = 1. Earlier, we showed that, when the AR(1) process is stable, ρ_1 = Corr(y_t, y_{t-1}). Therefore, we can estimate ρ_1 from the sample correlation between y_t and y_{t-1}. This sample correlation coefficient is called the first order autocorrelation of {y_t}; we denote it by ρ̂_1. By applying the law of large numbers, ρ̂_1 can be shown to be consistent for ρ_1, provided |ρ_1| < 1. (However, ρ̂_1 is not an unbiased estimator of ρ_1.)

We can use the value of ρ̂_1 to help decide whether the process is I(1) or I(0). Unfortunately, because ρ̂_1 is an estimate, we can never know for sure whether ρ_1 < 1. Ideally, we could compute a confidence interval for ρ_1 to see if it excludes the value ρ_1 = 1, but this turns out to be rather difficult: the sampling distributions of the estimator ρ̂_1 are extremely different when ρ_1 is close to one and when ρ_1 is much less than one. (In fact, when ρ_1 is close to one, ρ̂_1 can have a severe downward bias.) In Chapter 18, we will show how to test H_0: ρ_1 = 1 against H_1: ρ_1 < 1. For now, we can only use ρ̂_1 as a rough guide for determining whether a series needs to be differenced. No hard-and-fast rule exists for making this choice. Most economists think that differencing is warranted if ρ̂_1 > .9; some would difference when ρ̂_1 > .8.
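The diagnostic just described, the first order autocorrelation, with and without linear detrending, can be computed as follows. This is a sketch under stated assumptions (NumPy; simulated data and an arbitrary seed), an informal check rather than a formal test:

```python
import numpy as np

# Sample first order autocorrelation rho_hat_1, computed before and after
# linear detrending -- the informal I(1) diagnostic discussed above.
def rho1(y):
    """Sample correlation between y_t and y_{t-1}."""
    return np.corrcoef(y[1:], y[:-1])[0, 1]

def detrend(y):
    """Residuals from regressing y on a constant and a linear time trend."""
    t = np.arange(len(y))
    X = np.column_stack([np.ones_like(t), t])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return y - X @ beta

rng = np.random.default_rng(2)
y = np.cumsum(rng.standard_normal(200))       # a unit root (random walk) process
print(round(rho1(y), 3), round(rho1(detrend(y)), 3))   # both typically near 1
```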
Example 11.6. Fertility Equation

In Example 10.4, we explained the general fertility rate, gfr, in terms of the value of the personal exemption, pe. The first order autocorrelations for these series are very large: ρ̂_1 = .977 for gfr and ρ̂_1 = .964 for pe. These autocorrelations are highly suggestive of unit root behavior, and they raise serious questions about our use of the usual OLS t statistics for this example back in Chapter 10. Remember, the t statistics only have exact t distributions under the full set of classical linear model assumptions. To relax those assumptions in any way and apply asymptotics, we generally need the underlying series to be I(0) processes.

We now estimate the equation using first differences (and drop the dummy variable, for simplicity). The estimated equation, with standard errors in parentheses, is

Δgfr = −.785 − .043 Δpe
       (.502)  (.028)   (11.26)
n = 71, R² = .032, adjusted R² = .018.

Now, an increase in pe is estimated to lower gfr contemporaneously, although the estimate is not statistically different from zero at the 5% level. This gives very different results than when we estimated the model in levels, and it casts doubt on our earlier analysis.

If we add two lags of Δpe, things improve:

Δgfr = −.964 − .036 Δpe − .014 Δpe_{-1} + .110 Δpe_{-2}
       (.468)  (.027)      (.028)          (.027)   (11.27)
n = 69, R² = .233, adjusted R² = .197.

Even though Δpe and Δpe_{-1} have negative coefficients, their coefficients are small and jointly insignificant (p-value = .28). The second lag is very significant and indicates a positive relationship between changes in pe and subsequent changes in gfr two years hence. This makes more sense than having a contemporaneous effect. See Computer Exercise C5 for further analysis of the equation in first differences.

When the series in question has an obvious upward or downward trend, it makes more sense to obtain the first order autocorrelation after detrending. If the data are not detrended, the autoregressive correlation tends to be overestimated, which biases toward finding a unit root in a trending process.

Example 11.7. Wages and Productivity

The variable hrwage is average hourly wage in the U.S. economy, and outphr is output per hour. One way to estimate the elasticity of hourly wage with respect to output per hour is to estimate the equation

log(hrwage_t) = β_0 + β_1 log(outphr_t) + β_2 t + u_t,

where the time trend is included because log(hrwage_t) and log(outphr_t) both display clear, upward, linear trends. Using the data in EARNS for the years 1947 through 1987, we obtain

log(hrwage_t) = −5.33 + 1.64 log(outphr_t) − .018 t
                (.37)   (.09)                (.002)   (11.28)
n = 41, R² = .971, adjusted R² = .970.

(We have reported the usual goodness-of-fit measures here; it would be better to report those based on the detrended dependent variable, as in Section 10-5.) The estimated elasticity seems too large: a 1% increase in productivity increases real wages by about 1.64%. Because the standard error is so small, the 95% confidence interval easily excludes a unit elasticity. U.S. workers would probably have trouble believing that their wages increase by more than 1.5% for every 1% increase in productivity.

The regression results in (11.28) must be viewed with caution. Even after linearly detrending log(hrwage), the first order autocorrelation is .967, and for detrended log(outphr), ρ̂_1 = .945. These suggest that both series have unit roots, so we reestimate the equation in first differences (and we no longer need a time trend):

Δlog(hrwage_t) = −.0036 + .809 Δlog(outphr_t)
                 (.0042)  (.173)   (11.29)
n = 40, R² = .364, adjusted R² = .348.

Now, a 1% increase in productivity is estimated to increase real wages by about .81%, and the estimate is not statistically different from one. The adjusted R-squared shows that the growth in output explains about 35% of the growth in real wages. See Computer Exercise C2 for a simple distributed lag version of the model in first differences.
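A first-difference regression such as (11.29) can be estimated with any OLS routine. The sketch below uses the statsmodels package and simulated stand-ins for the wage and productivity growth series; the variable names, the true elasticity of 0.8, and the noise scales are illustrative assumptions, not the EARNS data:

```python
import numpy as np
import statsmodels.api as sm

# Sketch of a first-difference elasticity regression in the spirit of (11.29),
# using simulated growth rates in place of the EARNS file.
rng = np.random.default_rng(3)
n = 41
goutphr = 0.02 + 0.01 * rng.standard_normal(n)            # productivity growth
ghrwage = 0.8 * goutphr + 0.005 * rng.standard_normal(n)  # wage growth

X = sm.add_constant(goutphr)
res = sm.OLS(ghrwage, X).fit()
print(res.params)   # slope should be near the assumed true elasticity of 0.8
print(res.bse)      # usual OLS standard errors
```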
In the previous two examples, both the dependent and independent variables appear to have unit roots. In other cases, we might have a mixture of processes with unit roots and those that are weakly dependent (though possibly trending). An example is given in Computer Exercise C1.

11-4 Dynamically Complete Models and the Absence of Serial Correlation

In the AR(1) model (11.12), we showed that, under assumption (11.13), the errors {u_t} must be serially uncorrelated in the sense that Assumption TS.5′ is satisfied: assuming that no serial correlation exists is practically the same thing as assuming that only one lag of y appears in E(y_t | y_{t-1}, y_{t-2}, …).

Can we make a similar statement for other regression models? The answer is yes, although the assumptions required for the errors to be serially uncorrelated might be implausible. Consider, for example, the simple static regression model

y_t = β_0 + β_1 z_t + u_t,   (11.30)

where y_t and z_t are contemporaneously dated. For consistency of OLS, we only need E(u_t | z_t) = 0. Generally, the {u_t} will be serially correlated. However, if we assume that

E(u_t | z_t, y_{t-1}, z_{t-1}, …) = 0,   (11.31)

then (as we will show generally later) Assumption TS.5′ holds. In particular, the {u_t} are serially uncorrelated. Naturally, assumption (11.31) implies that z_t is contemporaneously exogenous, that is, E(u_t | z_t) = 0.

To gain insight into the meaning of (11.31), we can write (11.30) and (11.31) equivalently as

E(y_t | z_t, y_{t-1}, z_{t-1}, …) = E(y_t | z_t) = β_0 + β_1 z_t,   (11.32)

where the first equality is the one of current interest. It says that, once z_t has been controlled for, no lags of either y or z help to explain current y. This is a strong requirement and is implausible when the lagged dependent variable has predictive power, which is often the case. If it is false, then we can expect the errors to be serially correlated.

Next, consider a finite distributed lag model with two lags:

y_t = β_0 + β_1 z_t + β_2 z_{t-1} + β_3 z_{t-2} + u_t.   (11.33)

Since we are hoping to capture the lagged effects that z has on y, we would naturally assume that (11.33) captures the distributed lag dynamics:

E(y_t | z_t, z_{t-1}, z_{t-2}, z_{t-3}, …) = E(y_t | z_t, z_{t-1}, z_{t-2});   (11.34)

that is, at most two lags of z matter. If (11.31) holds, we can make a stronger statement: once we have controlled for z and its two lags, no lags of y or additional lags of z affect current y:

E(y_t | z_t, y_{t-1}, z_{t-1}, …) = E(y_t | z_t, z_{t-1}, z_{t-2}).   (11.35)

Equation (11.35) is more likely than (11.32), but it still rules out lagged y having extra predictive power for current y.

Next, consider a model with one lag of both y and z:

y_t = β_0 + β_1 z_t + β_2 y_{t-1} + β_3 z_{t-1} + u_t.

Since this model includes a lagged dependent variable, (11.31) is a natural assumption, as it implies that

E(y_t | z_t, y_{t-1}, z_{t-1}, y_{t-2}, …) = E(y_t | z_t, y_{t-1}, z_{t-1});

in other words, once z_t, y_{t-1}, and z_{t-1} have been controlled for, no further lags of either y or z affect current y.

In the general model

y_t = β_0 + β_1 x_{t1} + … + β_k x_{tk} + u_t,   (11.36)
where the explanatory variables x_t = (x_{t1}, …, x_{tk}) may or may not contain lags of y or z, (11.31) becomes

E(u_t | x_t, y_{t-1}, x_{t-1}, …) = 0.   (11.37)

Written in terms of y_t,

E(y_t | x_t, y_{t-1}, x_{t-1}, …) = E(y_t | x_t).   (11.38)

In other words, whatever is in x_t, enough lags have been included so that further lags of y and the explanatory variables do not matter for explaining y_t. When this condition holds, we have a dynamically complete model. As we saw earlier, dynamic completeness can be a very strong assumption for static and finite distributed lag models. Once we start putting lagged y as explanatory variables, we often think that the model should be dynamically complete. (We will touch on some exceptions to this claim in Chapter 18.)

Since (11.37) is equivalent to

E(u_t | x_t, u_{t-1}, x_{t-1}, u_{t-2}, …) = 0,   (11.39)

we can show that a dynamically complete model must satisfy Assumption TS.5′. (This derivation is not crucial and can be skipped without loss of continuity.) For concreteness, take s < t. Then, by the law of iterated expectations (see Appendix B),

E(u_t u_s | x_t, x_s) = E[E(u_t u_s | x_t, x_s, u_s) | x_t, x_s]
                      = E[u_s E(u_t | x_t, x_s, u_s) | x_t, x_s],

where the second equality follows from E(u_t u_s | x_t, x_s, u_s) = u_s E(u_t | x_t, x_s, u_s). Now, since s < t, (x_t, x_s, u_s) is a subset of the conditioning set in (11.39). Therefore, (11.39) implies that E(u_t | x_t, x_s, u_s) = 0, and so

E(u_t u_s | x_t, x_s) = E(u_s · 0 | x_t, x_s) = 0,

which says that Assumption TS.5′ holds.

Since specifying a dynamically complete model means that there is no serial correlation, does it follow that all models should be dynamically complete? As we will see in Chapter 18, for forecasting purposes, the answer is yes. Some think that all models should be dynamically complete and that serial correlation in the errors of a model is a sign of misspecification. This stance is too rigid. Sometimes, we really are interested in a static model (such as a Phillips curve) or a finite distributed lag model (such as measuring the long-run percentage change in wages given a 1% increase in productivity). In the next chapter, we will show how to detect and correct for serial correlation in such models.

Example 11.8. Fertility Equation

In equation (11.27), we estimated a distributed lag model for Δgfr on Δpe, allowing for two lags of Δpe. For this model to be dynamically complete in the sense of (11.38), neither lags of Δgfr nor further lags of Δpe should appear in the equation. We can easily see that this is false by adding Δgfr_{-1}: the coefficient estimate is .300, and its t statistic is 2.84. Thus, the model is not dynamically complete in the sense of (11.38).

Exploring Further 11.3: If (11.33) holds with u_t = e_t + α_1 e_{t-1}, where {e_t} is an i.i.d. sequence with mean zero and variance σ²_e, can equation (11.33) be dynamically complete?
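The check used in Example 11.8, re-estimating with the lagged dependent variable added and inspecting its t statistic, can be sketched as follows. This uses statsmodels and NumPy with simulated data; dy and dz are hypothetical stand-ins for differenced series, and the parameter values are illustrative, not the fertility data:

```python
import numpy as np
import statsmodels.api as sm

# A rough check of dynamic completeness: add the lagged dependent variable
# and look at its t statistic, as in Example 11.8.
rng = np.random.default_rng(4)
n = 70
dz = rng.standard_normal(n)
dy = np.empty(n)
dy[0] = rng.standard_normal()
for t in range(1, n):
    dy[t] = 0.3 * dy[t - 1] + 0.5 * dz[t] + rng.standard_normal()

X = sm.add_constant(np.column_stack([dz[1:], dy[:-1]]))   # dz_t and dy_{t-1}
res = sm.OLS(dy[1:], X).fit()
print(res.tvalues[2])   # a large t statistic on dy_{t-1} signals incompleteness
```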
What should we make of this? We will postpone an interpretation of general models with lagged dependent variables until Chapter 18. But the fact that (11.27) is not dynamically complete suggests that there may be serial correlation in the errors. We will see how to test and correct for this in Chapter 12.

The notion of dynamic completeness should not be confused with a weaker assumption concerning including the appropriate lags in a model. In the model (11.36), the explanatory variables x_t are said to be sequentially exogenous if

E(u_t | x_t, x_{t-1}, …) = E(u_t) = 0, t = 1, 2, ….   (11.40)

As discussed in Problem 8 in Chapter 10, sequential exogeneity is implied by strict exogeneity, and sequential exogeneity implies contemporaneous exogeneity. Further, because (x_t, x_{t-1}, …) is a subset of (x_t, y_{t-1}, x_{t-1}, …), sequential exogeneity is implied by dynamic completeness. If x_t contains y_{t-1}, dynamic completeness and sequential exogeneity are the same condition. The key point is that, when x_t does not contain y_{t-1}, sequential exogeneity allows for the possibility that the dynamics are not complete in the sense of capturing the relationship between y_t and all past values of y and other explanatory variables. But, in finite distributed lag models, such as that estimated in equation (11.27), we may not care whether past y has predictive power for current y. We are primarily interested in whether we have included enough lags of the explanatory variables to capture the distributed lag dynamics. For example, if we assume

E(y_t | z_t, z_{t-1}, z_{t-2}, z_{t-3}, …) = E(y_t | z_t, z_{t-1}, z_{t-2}) = α_0 + δ_0 z_t + δ_1 z_{t-1} + δ_2 z_{t-2},

then the regressors x_t = (z_t, z_{t-1}, z_{t-2}) are sequentially exogenous because we have assumed that two lags suffice for the distributed lag dynamics. But typically the model would not be dynamically complete in the sense that

E(y_t | z_t, y_{t-1}, z_{t-1}, y_{t-2}, z_{t-2}, …) = E(y_t | z_t, z_{t-1}, z_{t-2}),

and we may not care. In addition, the explanatory variables in an FDL model may or may not be strictly exogenous.

11-5 The Homoskedasticity Assumption for Time Series Models

The homoskedasticity assumption for time series regressions, particularly TS.4′, looks very similar to that for cross-sectional regressions. However, since x_t can contain lagged y as well as lagged explanatory variables, we briefly discuss the meaning of the homoskedasticity assumption for different time series regressions.

In the simple static model, say,

y_t = β_0 + β_1 z_t + u_t,   (11.41)

Assumption TS.4′ requires that Var(u_t | z_t) = σ². Therefore, even though E(y_t | z_t) is a linear function of z_t, Var(y_t | z_t) must be constant. This is pretty straightforward.

In Example 11.4, we saw that, for the AR(1) model (11.12), the homoskedasticity assumption is

Var(u_t | y_{t-1}) = Var(y_t | y_{t-1}) = σ²:

even though E(y_t | y_{t-1}) depends on y_{t-1}, Var(y_t | y_{t-1}) does not. Thus, the spread in the distribution of y_t cannot depend on y_{t-1}.

Hopefully, the pattern is clear now. If we have the model

y_t = β_0 + β_1 z_t + β_2 y_{t-1} + β_3 z_{t-1} + u_t,

the homoskedasticity assumption is

Var(u_t | z_t, y_{t-1}, z_{t-1}) = Var(y_t | z_t, y_{t-1}, z_{t-1}) = σ²,

so that the variance of u_t cannot depend on z_t, y_{t-1}, or z_{t-1} (or some other function of time).
Generally, whatever explanatory variables appear in the model, we must assume that the variance of y_t given these explanatory variables is constant. If the model contains lagged y or lagged explanatory variables, then we are explicitly ruling out dynamic forms of heteroskedasticity (something we study in Chapter 12). But, in a static model, we are only concerned with Var(y_t | z_t). In equation (11.41), no direct restrictions are placed on, say, Var(y_t | y_{t-1}).

Summary

In this chapter, we have argued that OLS can be justified using asymptotic analysis, provided certain conditions are met. Ideally, the time series processes are stationary and weakly dependent, although stationarity is not crucial. Weak dependence is necessary for applying the standard large sample results, particularly the central limit theorem.

Processes with deterministic trends that are weakly dependent can be used directly in regression analysis, provided time trends are included in the model (as in Section 10-5). A similar statement holds for processes with seasonality.

When the time series are highly persistent (they have unit roots), we must exercise extreme caution in using them directly in regression models, unless we are convinced the CLM assumptions from Chapter 10 hold. An alternative to using the levels is to use the first differences of the variables. For most highly persistent economic time series, the first difference is weakly dependent. Using first differences changes the nature of the model, but this method is often as informative as a model in levels. When data are highly persistent, we usually have more faith in first-difference results. In Chapter 18, we will cover some recent, more advanced methods for using I(1) variables in multiple regression analysis.

When models have complete dynamics in the sense that no further lags of any variable are needed in the equation, we have seen that the errors will be serially uncorrelated. This is useful because certain models, such as autoregressive models, are assumed to have complete dynamics. In static and distributed lag models, the dynamically complete assumption is often false, which generally means the errors will be serially correlated. We will see how to address this problem in Chapter 12.

The Asymptotic Gauss-Markov Assumptions for Time Series Regression

Following is a summary of the five assumptions that we used in this chapter to perform large-sample inference for time series regressions. Recall that we introduced this new set of assumptions because the time series versions of the classical linear model assumptions are often violated, especially the strict exogeneity, no serial correlation, and normality assumptions. A key point in this chapter is that some sort of weak dependence is required to ensure that the central limit theorem applies. We used Assumptions TS.1′ through TS.3′ only for consistency (not unbiasedness) of OLS. When we add TS.4′ and TS.5′, we can use the usual confidence intervals, t statistics, and F statistics as being approximately valid in large samples. Unlike the Gauss-Markov and classical linear model assumptions, there is no historically significant name attached to Assumptions TS.1′ to TS.5′. Nevertheless, the assumptions are the analogs to the Gauss-Markov assumptions that allow us to use standard inference. As usual for large-sample analysis, we dispense with the normality assumption entirely.
Assumption TS.1′ (Linearity and Weak Dependence). The stochastic process {(x_{t1}, x_{t2}, …, x_{tk}, y_t): t = 1, 2, …, n} is stationary, weakly dependent, and follows the linear model

y_t = β_0 + β_1 x_{t1} + β_2 x_{t2} + … + β_k x_{tk} + u_t,

where {u_t: t = 1, 2, …, n} is the sequence of errors or disturbances. Here, n is the number of observations (time periods).

Assumption TS.2′ (No Perfect Collinearity). In the sample (and therefore in the underlying time series process), no independent variable is constant nor a perfect linear combination of the others.

Assumption TS.3′ (Zero Conditional Mean). The explanatory variables are contemporaneously exogenous, that is, E(u_t | x_{t1}, …, x_{tk}) = 0. Remember, TS.3′ is notably weaker than the strict exogeneity Assumption TS.3.

Assumption TS.4′ (Homoskedasticity). The errors are contemporaneously homoskedastic, that is, Var(u_t | x_t) = σ², where x_t is shorthand for (x_{t1}, x_{t2}, …, x_{tk}).

Assumption TS.5′ (No Serial Correlation). For all t ≠ s, E(u_t u_s | x_t, x_s) = 0.

Key Terms

Asymptotically Uncorrelated; Autoregressive Process of Order One [AR(1)]; Contemporaneously Exogenous; Contemporaneously Homoskedastic; Covariance Stationary; Difference-Stationary Process; Dynamically Complete Model; First Difference; First Order Autocorrelation; Growth Rate; Highly Persistent; Integrated of Order One [I(1)]; Integrated of Order Zero [I(0)]; Moving Average Process of Order One [MA(1)]; Nonstationary Process; Random Walk; Random Walk with Drift; Sequentially Exogenous; Serially Uncorrelated; Stable AR(1) Process; Stationary Process; Strongly Dependent; Trend-Stationary Process; Unit Root Process; Weakly Dependent.

Problems

1. Let {x_t: t = 1, 2, …} be a covariance stationary process and define γ_h = Cov(x_t, x_{t+h}) for h ≥ 0. (Therefore, γ_0 = Var(x_t).) Show that Corr(x_t, x_{t+h}) = γ_h/γ_0.

2. Let {e_t: t = −1, 0, 1, …} be a sequence of independent, identically distributed random variables with mean zero and variance one. Define a stochastic process by

x_t = e_t − (1/2)e_{t-1} + (1/2)e_{t-2}, t = 1, 2, ….

(i) Find E(x_t) and Var(x_t). Do either of these depend on t?
(ii) Show that Corr(x_t, x_{t+1}) = −1/2 and Corr(x_t, x_{t+2}) = 1/3. (Hint: It is easiest to use the formula in Problem 1.)
(iii) What is Corr(x_t, x_{t+h}) for h > 2?
(iv) Is {x_t} an asymptotically uncorrelated process?
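The autocorrelations asserted in Problem 2 can be checked numerically before deriving them. A quick simulation sketch (NumPy; the sample size and seed are arbitrary):

```python
import numpy as np

# Numerical check of Problem 2: for x_t = e_t - (1/2)e_{t-1} + (1/2)e_{t-2}
# with i.i.d. e_t, the autocorrelations should be -1/2 at lag 1, 1/3 at lag 2,
# and 0 beyond lag 2.
rng = np.random.default_rng(5)
e = rng.standard_normal(1_000_000)
x = e[2:] - 0.5 * e[1:-1] + 0.5 * e[:-2]

def corr(x, h):
    return np.corrcoef(x[h:], x[:-h])[0, 1]

print(round(corr(x, 1), 3), round(corr(x, 2), 3), round(corr(x, 3), 3))
# approximately -0.5, 0.333, 0.0
```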
3. Suppose that a time series process {y_t} is generated by y_t = z + e_t, for all t = 1, 2, …, where {e_t} is an i.i.d. sequence with mean zero and variance σ²_e. The random variable z does not change over time; it has mean zero and variance σ²_z. Assume that each e_t is uncorrelated with z.
(i) Find the expected value and variance of y_t. Do your answers depend on t?
(ii) Find Cov(y_t, y_{t+h}) for any t and h. Is {y_t} covariance stationary?
(iii) Use parts (i) and (ii) to show that Corr(y_t, y_{t+h}) = σ²_z/(σ²_z + σ²_e) for all t and h.
(iv) Does y_t satisfy the intuitive requirement for being asymptotically uncorrelated? Explain.

4. Let {y_t: t = 1, 2, …} follow a random walk, as in (11.20), with y_0 = 0. Show that Corr(y_t, y_{t+h}) = √[t/(t + h)] for t ≥ 1, h > 0.

5. For the U.S. economy, let gprice denote the monthly growth in the overall price level and let gwage be the monthly growth in hourly wages. [These are both obtained as differences of logarithms: gprice = Δlog(price) and gwage = Δlog(wage).] Using the monthly data in WAGEPRC, we estimate the following distributed lag model (standard errors in parentheses):

gprice = −.00093 + .119 gwage + .097 gwage_{-1} + .040 gwage_{-2}
         (.00057)  (.052)       (.039)            (.039)
       + .038 gwage_{-3} + .081 gwage_{-4} + .107 gwage_{-5} + .095 gwage_{-6}
         (.039)            (.039)            (.039)            (.039)
       + .104 gwage_{-7} + .103 gwage_{-8} + .159 gwage_{-9} + .110 gwage_{-10}
         (.039)            (.039)            (.039)            (.039)
       + .103 gwage_{-11} + .016 gwage_{-12}
         (.039)             (.052)
n = 273, R² = .317, adjusted R² = .283.

(i) Sketch the estimated lag distribution. At what lag is the effect of gwage on gprice largest? Which lag has the smallest coefficient?
(ii) For which lags are the t statistics less than two?
(iii) What is the estimated long-run propensity? Is it much different than one? Explain what the LRP tells us in this example.
(iv) What regression would you run to obtain the standard error of the LRP directly?
(v) How would you test the joint significance of six more lags of gwage? What would be the dfs in the F distribution? (Be careful here; you lose six more observations.)

6. Let hy6_t denote the three-month holding yield (in percent) from buying a six-month T-bill at time (t − 1) and selling it at time t (three months hence) as a three-month T-bill. Let hy3_{t-1} be the three-month holding yield from buying a three-month T-bill at time (t − 1). At time (t − 1), hy3_{t-1} is known, whereas hy6_t is unknown because p3_t (the price of three-month T-bills) is unknown at time (t − 1). The expectations hypothesis (EH) says that these two different three-month investments should be the same, on average. Mathematically, we can write this as a conditional expectation:

E(hy6_t | I_{t-1}) = hy3_{t-1},

where I_{t-1} denotes all observable information up through time t − 1. This suggests estimating the model

hy6_t = β_0 + β_1 hy3_{t-1} + u_t,

and testing H_0: β_1 = 1. (We can also test H_0: β_0 = 0, but we often allow for a term premium for buying assets with different maturities, so that β_0 ≠ 0.)
(i) Estimating the previous equation by OLS using the data in INTQRT (spaced every three months) gives

hy6 = −.058 + 1.104 hy3_{t-1}
      (.070)  (.039)
n = 123, R² = .866.

Do you reject H_0: β_1 = 1 against H_1: β_1 ≠ 1 at the 1% significance level? Does the estimate seem practically different from one?
(ii) Another implication of the EH is that no other variables dated as t − 1 or earlier should help explain hy6_t, once hy3_{t-1} has been controlled for. Including one lag of the spread between six-month and three-month T-bill rates gives

hy6 = −.123 + 1.053 hy3_{t-1} + .480 (r6_{t-1} − r3_{t-1})
      (.067)  (.039)            (.109)
n = 123, R² = .885.

Now, is the coefficient on hy3_{t-1} statistically different from one? Is the lagged spread term significant? According to this equation, if, at time t − 1, r6 is above r3, should you invest in six-month or three-month T-bills?
(iii) The sample correlation between hy3_t and hy3_{t-1} is .914. Why might this raise some concerns with the previous analysis?
(iv) How would you test for seasonality in the equation estimated in part (ii)?

7. A partial adjustment model is

y*_t = γ_0 + γ_1 x_t + e_t
y_t − y_{t-1} = λ(y*_t − y_{t-1}) + a_t,

where y*_t is the desired or optimal level of y and y_t is the actual (observed) level. For example, y*_t is the desired growth in firm inventories, and x_t is growth in firm sales. The parameter γ_1 measures the effect of x_t on y*_t. The second equation describes how the actual y adjusts depending on the relationship between the desired y in time t and the actual y in time t − 1. The parameter λ measures the speed of adjustment and satisfies 0 < λ < 1.

(i) Plug the first equation for y*_t into the second equation and show that we can write

y_t = β_0 + β_1 y_{t-1} + β_2 x_t + u_t.

In particular, find the β_j in terms of the γ_j and λ, and find u_t in terms of e_t and a_t. Therefore, the partial adjustment model leads to a model with a lagged dependent variable and a contemporaneous x.
(ii) If E(e_t | x_t, y_{t-1}, x_{t-1}, …) = E(a_t | x_t, y_{t-1}, x_{t-1}, …) = 0 and all series are weakly dependent, how would you estimate the β_j?
(iii) If β̂_1 = .7 and β̂_2 = .2, what are the estimates of γ_1 and λ?

8. Suppose that the equation

y_t = α + δt + β_1 x_{t1} + … + β_k x_{tk} + u_t

satisfies the sequential exogeneity assumption in equation (11.40).
(i) Suppose you difference the equation to obtain

Δy_t = δ + β_1 Δx_{t1} + … + β_k Δx_{tk} + Δu_t.

Why does applying OLS to the differenced equation not generally result in consistent estimators of the β_j?
(ii) What assumption on the explanatory variables in the original equation would ensure that OLS on the differences consistently estimates the β_j?
(iii) Let z_{t1}, …, z_{tk} be a set of explanatory variables dated contemporaneously with y_t. If we specify the static regression model y_t = β_0 + β_1 z_{t1} + … + β_k z_{tk} + u_t, describe what we need to assume for x_t = z_t to be sequentially exogenous. Do you think the assumptions are likely to hold in economic applications?

Computer Exercises

C1. Use the data in HSEINV for this exercise.
(i) Find the first order autocorrelation in log(invpc). Now, find the autocorrelation after linearly detrending log(invpc). Do the same for log(price). Which of the two series may have a unit root?
(ii) Based on your findings in part (i), estimate the equation

log(invpc_t) = β_0 + β_1 Δlog(price_t) + β_2 t + u_t

and report the results in standard form. Interpret the coefficient β̂_1 and determine whether it is statistically significant.
(iii) Linearly detrend log(invpc_t) and use the detrended version as the dependent variable in the regression from part (ii) (see Section 10-5). What happens to R²?
(iv) Now use Δlog(invpc_t) as the dependent variable. How do your results change from part (ii)? Is the time trend still significant? Why or why not?

C2. In Example 11.7, define the growth in hourly wage and output per hour as the change in the natural log: ghrwage = Δlog(hrwage) and goutphr = Δlog(outphr). Consider a simple extension of the model estimated in (11.29):

ghrwage_t = β_0 + β_1 goutphr_t + β_2 goutphr_{t-1} + u_t.

This allows an increase in productivity growth to have both a current and lagged effect on wage growth.
(i) Estimate the equation using the data in EARNS and report the results in standard form. Is the lagged value of goutphr statistically significant?
(ii) If β_1 + β_2 = 1, a permanent increase in productivity growth is fully passed on in higher wage growth after one year. Test H_0: β_1 + β_2 = 1 against the two-sided alternative. (Remember, one way to do this is to write the equation so that θ = β_1 + β_2 appears directly in the model, as in Example 10.4 from Chapter 10.)
(iii) Does goutphr_{t-2} need to be in the model? Explain.

C3. (i) In Example 11.4, it may be that the expected value of the return at time t, given past returns, is a quadratic function of return_{t-1}. To check this possibility, use the data in NYSE to estimate

return_t = β_0 + β_1 return_{t-1} + β_2 return²_{t-1} + u_t;

report the results in standard form.
(ii) State and test the null hypothesis that E(return_t | return_{t-1}) does not depend on return_{t-1}. (Hint: there are two restrictions to test here.) What do you conclude?
(iii) Drop return²_{t-1} from the model, but add the interaction term return_{t-1}·return_{t-2}. Now test the efficient markets hypothesis.
(iv) What do you conclude about predicting weekly stock returns based on past stock returns?

C4. Use the data in PHILLIPS for this exercise, but only through 1996.
(i) In Example 11.5, we assumed that the natural rate of unemployment is constant. An alternative form of the expectations augmented Phillips curve allows the natural rate of unemployment to depend on past levels of unemployment. In the simplest case, the natural rate at time t equals unem_{t-1}. If we assume adaptive expectations, we obtain a Phillips curve where inflation and unemployment are in first differences:

Δinf = β_0 + β_1 Δunem + u.

Estimate this model, report the results in the usual form, and discuss the sign, size, and statistical significance of β̂_1.
(ii) Which model fits the data better, (11.19) or the model from part (i)? Explain.

C5. (i) Add a linear time trend to equation (11.27). Is a time trend necessary in the first-difference equation?
(ii) Drop the time trend and add the variables ww2 and pill to (11.27) (do not difference these dummy variables). Are these variables jointly significant at the 5% level?
(iii) Add the linear time trend, ww2, and pill all to equation (11.27). What happens to the magnitude and statistical significance of the time trend as compared with that in part (i)? What about the coefficient on pill as compared with that in part (ii)?
(iv) Using the model from part (iii), estimate the LRP and obtain its standard error. Compare this to (10.19), where gfr and pe appeared in levels rather than in first differences. Would you say that the link between fertility and the value of the personal exemption is a particularly robust finding?

C6. Let inven_t be the real value of inventories in the United States during year t, let GDP_t denote real gross domestic product, and let r3_t denote the ex post real interest rate on three-month T-bills. The ex post real interest rate is (approximately) r3_t = i3_t − inf_t, where i3_t is the rate on three-month T-bills and inf_t is the annual inflation rate [see Mankiw (1994), Section 6.4]. The change in inventories, cinven_t, is the inventory investment for the year.
The accelerator model of inventory investment relates cinven to cGDP, the change in GDP:

cinven_t = β_0 + β_1 cGDP_t + u_t,

where β_1 > 0 [see, for example, Mankiw (1994), Chapter 17].
(i) Use the data in INVEN to estimate the accelerator model. Report the results in the usual form and interpret the equation. Is β̂_1 statistically greater than zero?
(ii) If the real interest rate rises, then the opportunity cost of holding inventories rises, and so an increase in the real interest rate should decrease inventories. Add the real interest rate to the accelerator model and discuss the results.
(iii) Does the level of the real interest rate work better than the first difference, cr3_t?

C7. Use CONSUMP for this exercise. One version of the permanent income hypothesis (PIH) of consumption is that the growth in consumption is unpredictable. [Another version is that the change in consumption itself is unpredictable; see Mankiw (1994), Chapter 15, for discussion of the PIH.] Let gc_t = log(c_t) − log(c_{t-1}) be the growth in real per capita consumption (of nondurables and services). Then the PIH implies that E(gc_t | I_{t-1}) = E(gc_t), where I_{t-1} denotes information known at time (t − 1); in this case, t denotes a year.
(i) Test the PIH by estimating gc_t = β_0 + β_1 gc_{t-1} + u_t. Clearly state the null and alternative hypotheses. What do you conclude?
(ii) To the regression in part (i), add the variables gy_{t-1}, i3_{t-1}, and inf_{t-1}. Are these new variables individually or jointly significant at the 5% level? (Be sure to report the appropriate p-values.)
(iii) In the regression from part (ii), what happens to the p-value for the t statistic on gc_{t-1}? Does this mean the PIH hypothesis is now supported by the data?
(iv) In the regression from part (ii), what is the F statistic and its associated p-value for joint significance of the four explanatory variables? Does your conclusion about the PIH now agree with what you found in part (i)?

C8. Use the data in PHILLIPS for this exercise.
(i) Estimate an AR(1) model for the unemployment rate. Use this equation to predict the unemployment rate for 2004. Compare this with the actual unemployment rate for 2004. (You can find this information in a recent Economic Report of the President.)
(ii) Add a lag of inflation to the AR(1) model from part (i). Is inf_{t-1} statistically significant?
(iii) Use the equation from part (ii) to predict the unemployment rate for 2004. Is the result better or worse than in the model from part (i)?
(iv) Use the method from Section 6-4 to construct a 95% prediction interval for the 2004 unemployment rate. Is the 2004 unemployment rate in the interval?

C9. Use the data in TRAFFIC2 for this exercise. Computer Exercise C11 in Chapter 10 previously asked for an analysis of these data.
(i) Compute the first order autocorrelation coefficient for the variable prcfat. Are you concerned that prcfat contains a unit root? Do the same for the unemployment rate.
(ii) Estimate a multiple regression model relating the first difference of prcfat, Δprcfat, to the same variables in part (vi) of Computer Exercise C11 in Chapter 10, except you should first difference the unemployment rate, too. Then include a linear time trend, monthly dummy variables, the weekend variable, and the two policy variables (do not difference these). Do you find any interesting results?
(iii) Comment on the following statement: "We should always first difference any time series we suspect of having a unit root before doing multiple regression, because it is the safe strategy and should give results similar to using the levels." (In answering this, you may want to do the regression from part (vi) of Computer Exercise C11 in Chapter 10, if you have not already.)

C10. Use all the data in PHILLIPS to answer this question. You should now use 56 years of data.
(i) Reestimate equation (11.19) and report the results in the usual form. Do the intercept and slope estimates change notably when you add the recent years of data?
(ii) Obtain a new estimate of the natural rate of unemployment. Compare this new estimate with that reported in Example 11.5.
(iii) Compute the first order autocorrelation for unem. In your opinion, is the root close to one?
(iv) Use cunem as the explanatory variable instead of unem. Which explanatory variable gives a higher R-squared?

C11. Okun's Law [see, for example, Mankiw (1994), Chapter 2] implies the following relationship between the annual percentage change in real GDP, pcrgdp, and the change in the annual unemployment rate, cunem:

pcrgdp = 3 − 2·cunem.

If the unemployment rate is stable, real GDP grows at 3% annually. For each percentage point increase in the unemployment rate, real GDP grows by two percentage points less. (This should not be interpreted in any causal sense; it is more like a statistical description.) To see if the data on the U.S. economy support Okun's Law, we specify a model that allows deviations via an error term:

pcrgdp_t = β_0 + β_1 cunem_t + u_t.

(i) Use the data in OKUN to estimate the equation. Do you get exactly 3 for the intercept and −2 for the slope? Did you expect to?
(ii) Find the t statistic for testing H_0: β_1 = −2. Do you reject H_0 against the two-sided alternative at any reasonable significance level?
(iii) Find the t statistic for testing H_0: β_0 = 3. Do you reject H_0 at the 5% level against the two-sided alternative? Is it a "strong" rejection?
(iv) Find the F statistic and p-value for testing H_0: β_0 = 3, β_1 = −2 against the alternative that H_0 is false. Does the test reject at the 10% level? Overall, would you say the data reject or tend to support Okun's Law?

C12. Use the data in MINWAGE for this exercise, focusing on the wage and employment series for sector 232 (Men's and Boys' Furnishings). The variable gwage232 is the monthly growth (change in logs) in the average wage in sector 232; gemp232 is the growth in employment in sector 232; gmwage is the growth in the federal minimum wage; and gcpi is the growth in the (urban) Consumer Price Index.
(i) Find the first order autocorrelation in gwage232. Does this series appear to be weakly dependent?
(ii) Estimate the dynamic model

gwage232_t = β_0 + β_1 gwage232_{t-1} + β_2 gmwage_t + β_3 gcpi_t + u_t

by OLS. Holding fixed last month's growth in wages and the growth in the CPI, does an increase in the federal minimum wage result in a contemporaneous increase in gwage232_t? Explain.
(iii) Now add the lagged growth in employment, gemp232_{t-1}, to the equation in part (ii). Is it statistically significant?
(iv) Compared with the model without gwage232_{t-1} and gemp232_{t-1}, does adding the two lagged variables have much of an effect on the gmwage coefficient?
(v) Run the regression of gmwage_t on gwage232_{t-1} and gemp232_{t-1}, and report the R-squared. Comment on how the value of R-squared helps explain your answer to part (iv).

C13. Use the data in BEVERIDGE to answer this question. The data set includes monthly observations on vacancy rates and unemployment rates for the United States from December 2000 through February 2012.
(i) Find the correlation between urate and urate_{-1}. Would you say the correlation points more toward a unit root process or a weakly dependent process?
(ii) Repeat part (i), but with the vacancy rate, vrate.
(iii) The Beveridge Curve relates the unemployment rate to the vacancy rate, with the simplest relationship being linear:

urate_t = β_0 + β_1 vrate_t + u_t,

where β_1 < 0 is expected. Estimate β_0 and β_1 by OLS and report the results in the usual form. Do you find a negative relationship?
(iv) Explain why you cannot trust the confidence interval for β_1 reported by the OLS output in part (iii). (The tools needed to study regressions of this type are presented in Chapter 18.)
(v) If you difference urate and vrate before running the regression, how does the estimated slope coefficient compare with part (iii)? Is it statistically different from zero? (This example shows that differencing before running an OLS regression is not always a sensible strategy. But we cannot say more until Chapter 18.)

C14. Use the data in APPROVAL to answer the following questions. (See also Computer Exercise C14 in Chapter 10.)
(i) Compute the first order autocorrelations for the variables approve and lrgasprice. Do they seem close enough to unity to worry about unit roots?
(ii) Consider the model

approve_t = β_0 + β_1 lcpifood_t + β_2 lrgasprice_t + β_3 unemploy_t + β_4 sep11_t + β_5 iraqinvade_t + u_t,

where the first two variables are in logarithmic form. Given what you found in part (i), why might you hesitate to estimate this model by OLS?
(iii) Estimate the equation in part (ii) by differencing all variables (including the dummy variables). How do you interpret your estimate of β_2? Is it statistically significant? (Report the p-value.)
(iv) Interpret your estimate of β_4 and discuss its statistical significance.
(v) Add lsp500 to the model in part (ii) and estimate the equation by first differencing. Discuss what you find for the stock market variable.

Chapter 12. Serial Correlation and Heteroskedasticity in Time Series Regressions

In this chapter, we discuss the critical problem of serial correlation in the error terms of a multiple regression model. We saw in Chapter 11 that, when, in an appropriate sense, the dynamics of a model have been completely specified, the errors will not be serially correlated. Thus, testing for serial correlation can be used to detect dynamic misspecification. Furthermore, static and finite distributed lag models often have serially correlated errors even if there is no underlying misspecification of the model. Therefore, it is important to know the consequences of, and remedies for, serial correlation for these useful classes of models.
In Section 12-1, we present the properties of OLS when the errors contain serial correlation. In Section 12-2, we demonstrate how to test for serial correlation. We cover tests that apply to models with strictly exogenous regressors and tests that are asymptotically valid with general regressors, including lagged dependent variables. Section 12-3 explains how to correct for serial correlation under the assumption of strictly exogenous explanatory variables, while Section 12-4 shows how using differenced data often eliminates serial correlation in the errors. Section 12-5 covers more recent advances on how to adjust the usual OLS standard errors and test statistics in the presence of very general serial correlation.

In Chapter 8, we discussed testing and correcting for heteroskedasticity in cross-sectional applications. In Section 12-6, we show how the methods used in the cross-sectional case can be extended to the time series case. The mechanics are essentially the same, but there are a few subtleties associated with the temporal correlation in time series observations that must be addressed. In addition, we briefly touch on the consequences of dynamic forms of heteroskedasticity.

12-1 Properties of OLS with Serially Correlated Errors

12-1a Unbiasedness and Consistency

In Chapter 10, we proved unbiasedness of the OLS estimator under the first three Gauss-Markov assumptions for time series regressions (TS.1 through TS.3). In particular, Theorem 10.1 assumed nothing about serial correlation in the errors. It follows that, as long as the explanatory variables are strictly exogenous, the β̂_j are unbiased, regardless of the degree of serial correlation in the errors. This is analogous to the observation that heteroskedasticity in the errors does not cause bias in the β̂_j.

In Chapter 11, we relaxed the strict exogeneity assumption to E(u_t | x_t) = 0 and showed that, when the data are weakly dependent, the β̂_j are still consistent (although not necessarily unbiased). This result did not hinge on any assumption about serial correlation in the errors.

12-1b Efficiency and Inference

Because the Gauss-Markov Theorem (Theorem 10.4) requires both homoskedasticity and serially uncorrelated errors, OLS is no longer BLUE in the presence of serial correlation. Even more importantly, the usual OLS standard errors and test statistics are not valid, even asymptotically. We can see this by computing the variance of the OLS estimator under the first four Gauss-Markov assumptions and the AR(1) serial correlation model for the error terms. More precisely, we assume that

u_t = ρu_{t-1} + e_t, t = 1, 2, …, n,   (12.1)
|ρ| < 1,   (12.2)

where the e_t are uncorrelated random variables with mean zero and variance σ²_e; recall from Chapter 11 that assumption (12.2) is the stability condition.

We consider the variance of the OLS slope estimator in the simple regression model

y_t = β_0 + β_1 x_t + u_t,

and, just to simplify the formula, we assume that the sample average of the x_t is zero (x̄ = 0).
Then the OLS estimator β̂_1 of β_1 can be written as

β̂_1 = β_1 + SST_x^{-1} Σ_{t=1}^{n} x_t u_t,   (12.3)

where SST_x = Σ_{t=1}^{n} x_t². Now, in computing the variance of β̂_1 (conditional on X), we must account for the serial correlation in the u_t:

Var(β̂_1) = SST_x^{-2} Var(Σ_{t=1}^{n} x_t u_t)
          = SST_x^{-2} [Σ_{t=1}^{n} x_t² Var(u_t) + 2 Σ_{t=1}^{n-1} Σ_{j=1}^{n-t} x_t x_{t+j} E(u_t u_{t+j})]
          = σ²/SST_x + 2(σ²/SST_x²) Σ_{t=1}^{n-1} Σ_{j=1}^{n-t} ρ^j x_t x_{t+j},   (12.4)

where σ² = Var(u_t) and we have used the fact that E(u_t u_{t+j}) = Cov(u_t, u_{t+j}) = ρ^j σ² [see equation (11.4)].

The first term in equation (12.4), σ²/SST_x, is the variance of β̂_1 when ρ = 0, which is the familiar OLS variance under the Gauss-Markov assumptions. If we ignore the serial correlation and estimate the variance in the usual way, the variance estimator will usually be biased when ρ ≠ 0, because it ignores the second term in (12.4). As we will see through later examples, ρ > 0 is most common, in which case ρ^j > 0 for all j. Further, the independent variables in regression models are often positively correlated over time, so that x_t x_{t+j} is positive for most pairs t and t + j. Therefore, in most economic applications, the term Σ_{t=1}^{n-1} Σ_{j=1}^{n-t} ρ^j x_t x_{t+j} is positive, and so the usual OLS variance formula σ²/SST_x understates the true variance of the OLS estimator. If ρ is large, or x_t has a high degree of positive serial correlation (a common case), the bias in the usual OLS variance estimator can be substantial: we will tend to think the OLS slope estimator is more precise than it actually is.

When ρ < 0, ρ^j is negative when j is odd and positive when j is even, and so it is difficult to determine the sign of Σ_{t=1}^{n-1} Σ_{j=1}^{n-t} ρ^j x_t x_{t+j}. In fact, it is possible that the usual OLS variance formula actually overstates the true variance of β̂_1. In either case, the usual variance estimator will be biased for Var(β̂_1) in the presence of serial correlation.

Because the standard error of β̂_1 is an estimate of the standard deviation of β̂_1, using the usual OLS standard error in the presence of serial correlation is invalid. Therefore, t statistics are no longer valid for testing single hypotheses. Since a smaller standard error means a larger t statistic, the usual t statistics will often be too large when ρ > 0. The usual F and LM statistics for testing multiple hypotheses are also invalid.
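The understatement described above is easy to see in a small Monte Carlo experiment. The following sketch (NumPy; all parameter values are illustrative assumptions) holds a positively autocorrelated regressor fixed, draws AR(1) errors repeatedly, and compares the simulated variance of the slope estimator with the usual formula σ²/SST_x:

```python
import numpy as np

# Monte Carlo sketch: with AR(1) errors (rho > 0) and a positively
# autocorrelated x, sigma^2/SST_x understates Var(beta_1_hat), cf. (12.4).
rng = np.random.default_rng(6)
n, rho, reps = 100, 0.8, 5_000

x = np.empty(n)
x[0] = rng.standard_normal()
for t in range(1, n):                      # one fixed, autocorrelated regressor
    x[t] = 0.8 * x[t - 1] + rng.standard_normal()
x -= x.mean()                              # center x so that (12.3) applies directly
sst_x = (x ** 2).sum()

b1_err = np.empty(reps)
for r in range(reps):
    e = rng.standard_normal(n)
    u = np.empty(n)
    u[0] = e[0] / np.sqrt(1 - rho ** 2)    # start from the stationary distribution
    for t in range(1, n):
        u[t] = rho * u[t - 1] + e[t]
    b1_err[r] = (x * u).sum() / sst_x      # slope error: beta_1_hat - beta_1

sigma2 = 1.0 / (1.0 - rho ** 2)            # Var(u_t) for this stationary AR(1)
print(b1_err.var(), sigma2 / sst_x)        # true variance is typically much larger
```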
12-1c Goodness of Fit

Sometimes one sees the claim that serial correlation in the errors of a time series regression model invalidates our usual goodness-of-fit measures, R-squared and adjusted R-squared. Fortunately, this is not the case, provided the data are stationary and weakly dependent. To see why these measures are still valid, recall that we defined the population R-squared in a cross-sectional context to be 1 − σ²_u/σ²_y (see Section 6-3). This definition is still appropriate in the context of time series regressions with stationary, weakly dependent data: the variances of both the error and the dependent variable do not change over time. By the law of large numbers, R² and adjusted R² both consistently estimate the population R-squared. The argument is essentially the same as in the cross-sectional case in the presence of heteroskedasticity (see Section 8-1). Because there is never an unbiased estimator of the population R-squared, it makes no sense to talk about bias in R² caused by serial correlation. All we can really say is that our goodness-of-fit measures are still consistent estimators of the population parameter. This argument does not go through if {y_t} is an I(1) process, because Var(y_t) grows with t; goodness of fit does not make much sense in this case. As we discussed in Section 10-5, trends in the mean of y_t, or seasonality, can and should be accounted for in computing an R-squared. Other departures from stationarity do not cause difficulty in interpreting R² and adjusted R² in the usual ways.

Exploring Further 12.1: Suppose that, rather than the AR(1) model, u_t follows the MA(1) model u_t = e_t + αe_{t-1}. Find Var(β̂_1) and show that it is different from the usual formula if α ≠ 0.

12-1d Serial Correlation in the Presence of Lagged Dependent Variables

Beginners in econometrics are often warned of the dangers of serially correlated errors in the presence of lagged dependent variables. Almost every textbook on econometrics contains some form of the statement "OLS is inconsistent in the presence of lagged dependent variables and serially correlated errors." Unfortunately, as a general assertion, this statement is false. There is a version of the statement that is correct, but it is important to be very precise.

To illustrate, suppose that the expected value of y_t, given y_{t-1}, is linear:

E(y_t | y_{t-1}) = β_0 + β_1 y_{t-1},   (12.5)

where we assume stability, |β_1| < 1. We know we can always write this with an error term as

y_t = β_0 + β_1 y_{t-1} + u_t,   (12.6)
E(u_t | y_{t-1}) = 0.   (12.7)

By construction, this model satisfies the key zero conditional mean Assumption TS.3′ for consistency of OLS; therefore, the OLS estimators β̂_0 and β̂_1 are consistent. It is important to see that, without further assumptions, the errors {u_t} can be serially correlated. Condition (12.7) ensures that u_t is uncorrelated with y_{t-1}, but u_t and y_{t-2} could be correlated. Then, because u_{t-1} = y_{t-1} − β_0 − β_1 y_{t-2}, the covariance between u_t and u_{t-1} is −β_1 Cov(u_t, y_{t-2}), which is not necessarily zero. Thus, the errors exhibit serial correlation and the model contains a lagged dependent variable, but OLS consistently estimates β_0 and β_1, because these are the parameters in the conditional expectation (12.5). The serial correlation in the errors will cause the usual OLS statistics to be invalid for testing purposes, but it will not affect consistency.

So when is OLS inconsistent if the errors are serially correlated and the regressors contain a lagged dependent variable? This happens when we write the model in error form, exactly as in (12.6), but then assume that {u_t} follows a stable AR(1) model, as in (12.1) and (12.2), where

E(e_t | u_{t-1}, u_{t-2}, …) = E(e_t | y_{t-1}, y_{t-2}, …) = 0.   (12.8)

Because e_t is uncorrelated with y_{t-1} by assumption, Cov(y_{t-1}, u_t) = ρCov(y_{t-1}, u_{t-1}), which is not zero unless ρ = 0. This causes the OLS estimators of β_0 and β_1 from the regression of y_t on y_{t-1} to be inconsistent.
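The inconsistency can be illustrated by simulation. In the sketch below (NumPy; the parameter values are illustrative assumptions), the data are generated from (12.6) with AR(1) errors; even with a very large sample, the OLS slope from regressing y_t on y_{t-1} settles near the linear projection coefficient (about .8 here, by the Yule-Walker equations for the implied AR(2)), not near β_1 = .5:

```python
import numpy as np

# Sketch: y_t = b0 + b1*y_{t-1} + u_t with AR(1) errors u_t = rho*u_{t-1} + e_t.
# OLS of y_t on y_{t-1} does not converge to b1 when rho != 0.
rng = np.random.default_rng(7)
n, b0, b1, rho = 200_000, 0.0, 0.5, 0.5

e = rng.standard_normal(n)
u = np.empty(n); y = np.empty(n)
u[0], y[0] = e[0], e[0]
for t in range(1, n):
    u[t] = rho * u[t - 1] + e[t]
    y[t] = b0 + b1 * y[t - 1] + u[t]

yl = y[:-1] - y[:-1].mean()                 # demeaned lagged y
slope = (yl * (y[1:] - y[1:].mean())).sum() / (yl ** 2).sum()
print(slope)    # close to 0.8 (the first autocorrelation of the AR(2)), not 0.5
```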
We now see that OLS estimation of (12.6), when the errors $u_t$ also follow an AR(1) model, leads to inconsistent estimators. However, the correctness of this statement makes it no less wrongheaded. We have to ask: what would be the point of estimating the parameters in (12.6) when the errors follow an AR(1) model? It is difficult to think of cases where this would be interesting. At least in (12.5) the parameters tell us the expected value of $y_t$ given $y_{t-1}$. When we combine (12.6) and (12.1), we see that $y_t$ really follows a second order autoregressive, or AR(2), model. To see this, write $u_{t-1} = y_{t-1} - \beta_0 - \beta_1 y_{t-2}$ and plug this into $u_t = \rho u_{t-1} + e_t$. Then, (12.6) can be rewritten as

$$y_t = \beta_0 + \beta_1 y_{t-1} + \rho(y_{t-1} - \beta_0 - \beta_1 y_{t-2}) + e_t$$
$$= \beta_0(1 - \rho) + (\beta_1 + \rho)y_{t-1} - \rho\beta_1 y_{t-2} + e_t$$
$$= \alpha_0 + \alpha_1 y_{t-1} + \alpha_2 y_{t-2} + e_t,$$

where $\alpha_0 = \beta_0(1 - \rho)$, $\alpha_1 = \beta_1 + \rho$, and $\alpha_2 = -\rho\beta_1$. Given (12.8), it follows that

$$\mathrm{E}(y_t \mid y_{t-1}, y_{t-2}, \ldots) = \mathrm{E}(y_t \mid y_{t-1}, y_{t-2}) = \alpha_0 + \alpha_1 y_{t-1} + \alpha_2 y_{t-2}. \qquad (12.9)$$

This means that the expected value of $y_t$, given all past $y$, depends on two lags of $y$. It is equation (12.9) that we would be interested in using for any practical purpose, including forecasting, as we will see in Chapter 18. We are especially interested in the parameters $\alpha_j$. Under the appropriate stability conditions for an AR(2) model—which we will cover in Section 12.3—OLS estimation of (12.9) produces consistent and asymptotically normal estimators of the $\alpha_j$.

The bottom line is that you need a good reason for having both a lagged dependent variable in a model and a particular model of serial correlation in the errors. Often, serial correlation in the errors of a dynamic model simply indicates that the dynamic regression function has not been completely specified: in the previous example, we should add $y_{t-2}$ to the equation. In Chapter 18, we will see examples of models with lagged dependent variables where the errors are serially correlated and are also correlated with $y_{t-1}$. But even in these cases, the errors do not follow an autoregressive process.
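A short simulation makes the AR(2) rewriting tangible. This is a sketch under assumed parameter values ($\beta_1 = .5$, $\rho = .4$, so $\alpha_1 = .9$ and $\alpha_2 = -.2$), not a procedure from the text:

```python
import numpy as np

# Sketch: model (12.6) with AR(1) errors is really an AR(2) in y. OLS of y_t on
# y_{t-1} alone is inconsistent, but adding y_{t-2} recovers the alpha_j.
rng = np.random.default_rng(1)
n, b0, b1, rho = 100_000, 1.0, 0.5, 0.4

y = np.zeros(n)
u_prev = 0.0
for t in range(1, n):
    u = rho * u_prev + rng.standard_normal()   # AR(1) error
    y[t] = b0 + b1 * y[t - 1] + u
    u_prev = u

Y = y[2:]
X1 = np.column_stack([np.ones(n - 2), y[1:-1]])
print(np.linalg.lstsq(X1, Y, rcond=None)[0])   # slope is not close to b1 = 0.5

X2 = np.column_stack([np.ones(n - 2), y[1:-1], y[:-2]])
print(np.linalg.lstsq(X2, Y, rcond=None)[0])   # approx (0.6, 0.9, -0.2)
```

The second regression recovers $\alpha_0 = \beta_0(1-\rho) = .6$, $\alpha_1 = .9$, and $\alpha_2 = -.2$, exactly as equation (12.9) predicts.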
12.2 Testing for Serial Correlation

In this section, we discuss several methods of testing for serial correlation in the error terms of the multiple linear regression model

$$y_t = \beta_0 + \beta_1 x_{t1} + \cdots + \beta_k x_{tk} + u_t.$$

We first consider the case in which the regressors are strictly exogenous. Recall that this requires the error, $u_t$, to be uncorrelated with the regressors in all time periods (see Section 10.3); so, among other things, it rules out models with lagged dependent variables.

12.2a A t Test for AR(1) Serial Correlation with Strictly Exogenous Regressors

Although there are numerous ways in which the error terms in a multiple regression model can be serially correlated, the most popular model—and the simplest to work with—is the AR(1) model in equations (12.1) and (12.2). In the previous section, we explained the implications of performing OLS when the errors are serially correlated in general, and we derived the variance of the OLS slope estimator in a simple regression model with AR(1) errors. We now show how to test for the presence of AR(1) serial correlation. The null hypothesis is that there is no serial correlation. Therefore, just as with tests for heteroskedasticity, we assume the best and require the data to provide reasonably strong evidence that the ideal assumption of no serial correlation is violated.

We first derive a large-sample test under the assumption that the explanatory variables are strictly exogenous: the expected value of $u_t$, given the entire history of the independent variables, is zero. In addition, in (12.1), we must assume that

$$\mathrm{E}(e_t \mid u_{t-1}, u_{t-2}, \ldots) = 0 \qquad (12.10)$$

and

$$\mathrm{Var}(e_t \mid u_{t-1}) = \mathrm{Var}(e_t) = \sigma_e^2. \qquad (12.11)$$

These are standard assumptions in the AR(1) model (they follow when $\{e_t\}$ is an i.i.d. sequence), and they allow us to apply the large-sample results from Chapter 11 for dynamic regression.

As with testing for heteroskedasticity, the null hypothesis is that the appropriate Gauss-Markov assumption is true. In the AR(1) model, the null hypothesis that the errors are serially uncorrelated is

$$H_0\!: \rho = 0. \qquad (12.12)$$

How can we test this hypothesis? If the $u_t$ were observed, then, under (12.10) and (12.11), we could immediately apply the asymptotic normality results from Theorem 11.2 to the dynamic regression model

$$u_t = \rho u_{t-1} + e_t, \quad t = 2, \ldots, n. \qquad (12.13)$$

(Under the null hypothesis $\rho = 0$, $\{u_t\}$ is clearly weakly dependent.) In other words, we could estimate $\rho$ from the regression of $u_t$ on $u_{t-1}$, for all $t = 2, \ldots, n$, without an intercept, and use the usual t statistic for $\hat{\rho}$. This does not work because the errors $u_t$ are not observed. Nevertheless, just as with testing for heteroskedasticity, we can replace $u_t$ with the corresponding OLS residual, $\hat{u}_t$. Since $\hat{u}_t$ depends on the OLS estimators $\hat{\beta}_0, \hat{\beta}_1, \ldots, \hat{\beta}_k$, it is not obvious that using $\hat{u}_t$ for $u_t$ in the regression has no effect on the distribution of the t statistic. Fortunately, it turns out that, because of the strict exogeneity assumption, the large-sample distribution of the t statistic is not affected by using the OLS residuals in place of the errors. A proof is well beyond the scope of this text, but it follows from the work of Wooldridge (1991b).

We can summarize the asymptotic test for AR(1) serial correlation very simply.

Testing for AR(1) Serial Correlation with Strictly Exogenous Regressors:
(i) Run the OLS regression of $y_t$ on $x_{t1}, \ldots, x_{tk}$ and obtain the OLS residuals, $\hat{u}_t$, for all $t = 1, 2, \ldots, n$.
(ii) Run the regression of

$$\hat{u}_t \text{ on } \hat{u}_{t-1}, \quad \text{for all } t = 2, \ldots, n, \qquad (12.14)$$

obtaining the coefficient $\hat{\rho}$ on $\hat{u}_{t-1}$ and its t statistic, $t_{\hat{\rho}}$. (This regression may or may not contain an intercept; the t statistic for $\hat{\rho}$ will be slightly affected, but it is asymptotically valid either way.)
(iii) Use $t_{\hat{\rho}}$ to test $H_0\!: \rho = 0$ against $H_1\!: \rho \neq 0$ in the usual way. (Actually, since $\rho > 0$ is often expected a priori, the alternative can be $H_1\!: \rho > 0$.) Typically, we conclude that serial correlation is a problem to be dealt with only if $H_0$ is rejected at the 5% level. As always, it is best to report the p-value for the test.

In deciding whether serial correlation needs to be addressed, we should remember the difference between practical and statistical significance. With a large sample size, it is possible to find serial correlation even though $\hat{\rho}$ is practically small; when $\hat{\rho}$ is close to zero, the usual OLS inference procedures will not be far off [see equation (12.4)]. Such outcomes are somewhat rare in time series applications, because time series data sets are usually small.
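The three steps above translate directly into a few lines of code. A minimal sketch assuming statsmodels is available; the function name is mine, and $X$ is assumed to contain a constant column:

```python
import numpy as np
import statsmodels.api as sm

def ar1_ttest(y, X):
    """t test for AR(1) serial correlation, steps (i)-(iii) around (12.14).
    Assumes strictly exogenous regressors; X should include a constant."""
    uhat = sm.OLS(y, X).fit().resid                            # step (i)
    aux = sm.OLS(uhat[1:], sm.add_constant(uhat[:-1])).fit()   # step (ii)
    return aux.params[1], aux.tvalues[1], aux.pvalues[1]       # step (iii)
```

Here the auxiliary regression includes an intercept; as the text notes, the t statistic is asymptotically valid either way.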
Example 12.1: Testing for AR(1) Serial Correlation in the Phillips Curve

In Chapter 10, we estimated a static Phillips curve that explained the inflation-unemployment tradeoff in the United States (see Example 10.1). In Chapter 11, we studied a particular expectations augmented Phillips curve, where we assumed adaptive expectations (see Example 11.5). We now test the error term in each equation for serial correlation. Since the expectations augmented curve uses $\Delta inf_t = inf_t - inf_{t-1}$ as the dependent variable, we have one fewer observation.

For the static Phillips curve, the regression in (12.14) yields $\hat{\rho} = .573$, $t = 4.93$, and p-value = .000 (with 48 observations through 1996). This is very strong evidence of positive, first order serial correlation. One consequence is that the standard errors and t statistics from Chapter 10 are not valid. By contrast, the test for AR(1) serial correlation in the expectations augmented curve gives $\hat{\rho} = -.036$, $t = -.287$, and p-value = .775 (with 47 observations): there is no evidence of AR(1) serial correlation in the expectations augmented Phillips curve.

Although the test from (12.14) is derived from the AR(1) model, the test can detect other kinds of serial correlation. Remember, $\hat{\rho}$ is a consistent estimator of the correlation between $u_t$ and $u_{t-1}$. Any serial correlation that causes adjacent errors to be correlated can be picked up by this test. On the other hand, it does not detect serial correlation where adjacent errors are uncorrelated, $\mathrm{Corr}(u_t, u_{t-1}) = 0$. (For example, $u_t$ and $u_{t-2}$ could be correlated.)

In using the usual t statistic from (12.14), we must assume that the errors in (12.13) satisfy the appropriate homoskedasticity assumption, (12.11). In fact, it is easy to make the test robust to heteroskedasticity in $e_t$: we simply use the usual heteroskedasticity-robust t statistic from Chapter 8. For the static Phillips curve in Example 12.1, the heteroskedasticity-robust t statistic is 4.03, which is smaller than the nonrobust t statistic but still very significant. In Section 12.6, we further discuss heteroskedasticity in time series regressions, including its dynamic forms.

[Exploring Further 12.2: How would you use regression (12.14) to construct an approximate 95% confidence interval for $\rho$?]

12.2b The Durbin-Watson Test under Classical Assumptions

Another test for AR(1) serial correlation is the Durbin-Watson test. The Durbin-Watson (DW) statistic is also based on the OLS residuals:

$$DW = \frac{\sum_{t=2}^{n}(\hat{u}_t - \hat{u}_{t-1})^2}{\sum_{t=1}^{n}\hat{u}_t^2}. \qquad (12.15)$$

Simple algebra shows that DW and $\hat{\rho}$ from (12.14) are closely linked:

$$DW \approx 2(1 - \hat{\rho}). \qquad (12.16)$$

One reason this relationship is not exact is that $\hat{\rho}$ has $\sum_{t=2}^{n}\hat{u}_{t-1}^2$ in its denominator, while the DW statistic has the sum of squares of all OLS residuals in its denominator. Even with moderate sample sizes, the approximation in (12.16) is often pretty close. Therefore, tests based on DW and the t test based on $\hat{\rho}$ are conceptually the same.

Durbin and Watson (1950) derive the distribution of DW (conditional on $X$),
something that requires the full set of classical linear model assumptions, including normality of the error terms. Unfortunately, this distribution depends on the values of the independent variables. (It also depends on the sample size, the number of regressors, and whether the regression contains an intercept.) Although some econometrics packages tabulate critical values and p-values for DW, many do not; in any case, they depend on the full set of CLM assumptions.

Several econometrics texts report upper and lower bounds for the critical values that depend on the desired significance level, the alternative hypothesis, the number of observations, and the number of regressors. (We assume that an intercept is included in the model.) Usually, the DW test is computed for the alternative

$$H_1\!: \rho > 0. \qquad (12.17)$$

From the approximation in (12.16), $\hat{\rho} \approx 0$ implies that $DW \approx 2$, and $\hat{\rho} > 0$ implies that $DW < 2$. Thus, to reject the null hypothesis (12.12) in favor of (12.17), we are looking for a value of DW that is significantly less than two. Unfortunately, because of the problems in obtaining the null distribution of DW, we must compare DW with two sets of critical values. These are usually labeled as $d_U$ (for upper) and $d_L$ (for lower). If $DW < d_L$, then we reject $H_0$ in favor of (12.17); if $DW > d_U$, we fail to reject $H_0$; if $d_L \le DW \le d_U$, the test is inconclusive.

As an example, if we choose a 5% significance level with $n = 45$ and $k = 4$, then $d_U = 1.720$ and $d_L = 1.336$ [see Savin and White (1977)]. If $DW < 1.336$, we reject the null of no serial correlation at the 5% level; if $DW > 1.720$, we fail to reject $H_0$; if $1.336 \le DW \le 1.720$, the test is inconclusive.

In Example 12.1, for the static Phillips curve, DW is computed to be $DW = .80$. We can obtain the lower 1% critical value from Savin and White (1977) for $k = 1$ and $n = 50$: $d_L = 1.32$. Therefore, we reject the null of no serial correlation against the alternative of positive serial correlation at the 1% level. (Using the previous t test, we can conclude that the p-value equals zero to three decimal places.) For the expectations augmented Phillips curve, $DW = 1.77$, which is well within the fail-to-reject region at even the 5% level ($d_U = 1.59$).

The fact that an exact sampling distribution for DW can be tabulated is the only advantage that DW has over the t test from (12.14). Given that the tabulated critical values are exactly valid only under the full set of CLM assumptions, and that they can lead to a wide inconclusive region, the practical disadvantages of the DW statistic are substantial. The t statistic from (12.14) is simple to compute and asymptotically valid without normally distributed errors. It is also valid in the presence of heteroskedasticity that depends on the $x_{tj}$, and it is easy to make it robust to any form of heteroskedasticity.
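Computing DW itself is mechanical. A minimal sketch using statsmodels' `durbin_watson` helper (the simulated data and $\rho = .5$ are illustrative, not from the text):

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson

# Illustrative data with AR(1) errors (rho = 0.5), so DW should fall below 2.
rng = np.random.default_rng(0)
n = 60
x = rng.standard_normal(n)
u = np.zeros(n)
for t in range(1, n):
    u[t] = 0.5 * u[t - 1] + rng.standard_normal()
y = 1.0 + 2.0 * x + u

res = sm.OLS(y, sm.add_constant(x)).fit()
dw = durbin_watson(res.resid)     # equation (12.15)
print(dw, 1.0 - dw / 2.0)         # DW and the implied rho_hat from (12.16)
```

Judging significance still requires the $d_L$/$d_U$ bounds from printed tables, which is one of the practical disadvantages noted above.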
12.2c Testing for AR(1) Serial Correlation without Strictly Exogenous Regressors

When the explanatory variables are not strictly exogenous, so that one or more $x_{tj}$ are correlated with $u_{t-1}$, neither the t test from regression (12.14) nor the Durbin-Watson statistic is valid, even in large samples. The leading case of nonstrictly exogenous regressors occurs when the model contains a lagged dependent variable: $y_{t-1}$ and $u_{t-1}$ are obviously correlated. Durbin (1970) suggested two alternatives to the DW statistic when the model contains a lagged dependent variable and the other regressors are nonrandom (or, more generally, strictly exogenous). The first is called Durbin's h statistic. This statistic has a practical drawback in that it cannot always be computed, so we do not cover it here.

Durbin's alternative statistic is simple to compute and is valid when there are any number of nonstrictly exogenous explanatory variables. The test also works if the explanatory variables happen to be strictly exogenous.

Testing for Serial Correlation with General Regressors:
(i) Run the OLS regression of $y_t$ on $x_{t1}, \ldots, x_{tk}$ and obtain the OLS residuals, $\hat{u}_t$, for all $t = 1, 2, \ldots, n$.
(ii) Run the regression of

$$\hat{u}_t \text{ on } x_{t1}, x_{t2}, \ldots, x_{tk}, \hat{u}_{t-1}, \quad \text{for all } t = 2, \ldots, n, \qquad (12.18)$$

to obtain the coefficient $\hat{\rho}$ on $\hat{u}_{t-1}$ and its t statistic, $t_{\hat{\rho}}$.
(iii) Use $t_{\hat{\rho}}$ to test $H_0\!: \rho = 0$ against $H_1\!: \rho \neq 0$ in the usual way (or use a one-sided alternative).

In equation (12.18), we regress the OLS residuals on all independent variables, including an intercept, and the lagged residual. The t statistic on the lagged residual is a valid test of (12.12) in the AR(1) model (12.13) [when we add $\mathrm{Var}(u_t \mid x_t, u_{t-1}) = \sigma^2$ under $H_0$]. Any number of lagged dependent variables may appear among the $x_{tj}$, and other nonstrictly exogenous explanatory variables are allowed as well. The inclusion of $x_{t1}, \ldots, x_{tk}$ explicitly allows each $x_{tj}$ to be correlated with $u_{t-1}$, and this ensures that $t_{\hat{\rho}}$ has an approximate t distribution in large samples. (The t statistic from (12.14) ignores possible correlation between the $x_{tj}$ and $u_{t-1}$, so it is not valid without strictly exogenous regressors.) Incidentally, because $\hat{u}_t = y_t - \hat{\beta}_0 - \hat{\beta}_1 x_{t1} - \cdots - \hat{\beta}_k x_{tk}$, it can be shown that the t statistic on $\hat{u}_{t-1}$ is the same if $y_t$ is used in place of $\hat{u}_t$ as the dependent variable in (12.18).

The t statistic from (12.18) is easily made robust to heteroskedasticity of unknown form: in particular, when $\mathrm{Var}(u_t \mid x_t, u_{t-1})$ is not constant, just use the heteroskedasticity-robust t statistic on $\hat{u}_{t-1}$.
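A sketch of regression (12.18), again assuming statsmodels; the function name is mine, and $X$ is assumed to already contain a constant column:

```python
import numpy as np
import statsmodels.api as sm

def serial_corr_test_general(y, X):
    """Regression (12.18): regress the OLS residuals on all regressors and
    the lagged residual. Valid with or without strictly exogenous regressors.
    X should already contain a constant column."""
    uhat = sm.OLS(y, X).fit().resid
    Z = np.column_stack([X[1:], uhat[:-1]])    # x_t's plus u_hat_{t-1}
    aux = sm.OLS(uhat[1:], Z).fit()
    # the last column of Z is the lagged residual
    return aux.params[-1], aux.tvalues[-1], aux.pvalues[-1]
```

For a heteroskedasticity-robust version, one could instead report the robust t statistic, e.g. by fitting the auxiliary regression with a robust covariance option.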
Example 12.2: Testing for AR(1) Serial Correlation in the Minimum Wage Equation

In Chapter 10 (see Example 10.9), we estimated the effect of the minimum wage on the Puerto Rican employment rate. We now check whether the errors appear to contain serial correlation, using the test that does not assume strict exogeneity of the minimum wage or GNP variables. [We add the log of Puerto Rican real GNP to equation (10.38), as in Computer Exercise C3 in Chapter 10.] We are assuming that the underlying stochastic processes are weakly dependent, but we allow them to contain a linear time trend by including $t$ in the regression.

Letting $\hat{u}_t$ denote the OLS residuals, we run the regression of $\hat{u}_t$ on $\log(mincov_t)$, $\log(prgnp_t)$, $\log(usgnp_t)$, $t$, and $\hat{u}_{t-1}$, using the 37 available observations. The estimated coefficient on $\hat{u}_{t-1}$ is $\hat{\rho} = .481$, with $t = 2.89$ (two-sided p-value = .007). Therefore, there is strong evidence of AR(1) serial correlation in the errors, which means the t statistics for the $\hat{\beta}_j$ that we obtained before are not valid for inference. Remember, though, the $\hat{\beta}_j$ are still consistent if $u_t$ is contemporaneously uncorrelated with each explanatory variable. Incidentally, if we use regression (12.14) instead, we obtain $\hat{\rho} = .417$ and $t = 2.63$, so the outcome of the test is similar in this case.

12.2d Testing for Higher Order Serial Correlation

The test from (12.18) is easily extended to higher orders of serial correlation. For example, suppose that we wish to test

$$H_0\!: \rho_1 = 0,\ \rho_2 = 0 \qquad (12.19)$$

in the AR(2) model

$$u_t = \rho_1 u_{t-1} + \rho_2 u_{t-2} + e_t.$$

This alternative model of serial correlation allows us to test for second order serial correlation. As always, we estimate the model by OLS and obtain the OLS residuals, $\hat{u}_t$. Then, we can run the regression of

$$\hat{u}_t \text{ on } x_{t1}, x_{t2}, \ldots, x_{tk}, \hat{u}_{t-1}, \hat{u}_{t-2}, \quad \text{for all } t = 3, \ldots, n,$$

to obtain the F test for joint significance of $\hat{u}_{t-1}$ and $\hat{u}_{t-2}$. If these two lags are jointly significant at a small enough level, say, 5%, then we reject (12.19) and conclude that the errors are serially correlated.

More generally, we can test for serial correlation in the autoregressive model of order $q$:

$$u_t = \rho_1 u_{t-1} + \rho_2 u_{t-2} + \cdots + \rho_q u_{t-q} + e_t. \qquad (12.20)$$

The null hypothesis is

$$H_0\!: \rho_1 = 0,\ \rho_2 = 0,\ \ldots,\ \rho_q = 0. \qquad (12.21)$$

Testing for AR(q) Serial Correlation:
(i) Run the OLS regression of $y_t$ on $x_{t1}, \ldots, x_{tk}$ and obtain the OLS residuals, $\hat{u}_t$, for all $t = 1, 2, \ldots, n$.
(ii) Run the regression of

$$\hat{u}_t \text{ on } x_{t1}, x_{t2}, \ldots, x_{tk}, \hat{u}_{t-1}, \hat{u}_{t-2}, \ldots, \hat{u}_{t-q}, \quad \text{for all } t = (q+1), \ldots, n. \qquad (12.22)$$

(iii) Compute the F test for joint significance of $\hat{u}_{t-1}, \hat{u}_{t-2}, \ldots, \hat{u}_{t-q}$ in (12.22). [The F statistic with $y_t$ as the dependent variable in (12.22) can also be used, as it gives an identical answer.]

If the $x_{tj}$ are assumed to be strictly exogenous, so that each $x_{tj}$ is uncorrelated with $u_{t-1}, u_{t-2}, \ldots, u_{t-q}$, then the $x_{tj}$ can be omitted from (12.22). Including the $x_{tj}$ in the regression makes the test valid with or without the strict exogeneity assumption. The test requires the homoskedasticity assumption

$$\mathrm{Var}(u_t \mid x_t, u_{t-1}, \ldots, u_{t-q}) = \sigma^2. \qquad (12.23)$$

A heteroskedasticity-robust version can be computed as described in Chapter 8.

An alternative to computing the F test is to use the Lagrange multiplier (LM) form of the statistic. (We covered the LM statistic for testing exclusion restrictions in Chapter 5 for cross-sectional analysis.) The LM statistic for testing (12.21) is simply

$$LM = (n - q)R_{\hat{u}}^2, \qquad (12.24)$$

where $R_{\hat{u}}^2$ is just the usual R-squared from regression (12.22). Under the null hypothesis, $LM \overset{a}{\sim} \chi_q^2$. This is usually called the Breusch-Godfrey test for AR(q) serial correlation. The LM statistic also requires (12.23), but it can be made robust to heteroskedasticity. [For details, see Wooldridge (1991b).]
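statsmodels packages this test as `acorr_breusch_godfrey`, which returns both the LM and F forms of the statistic. A minimal sketch with illustrative simulated data (note: the package's LM statistic may use a slightly different degrees-of-freedom convention than the $(n - q)R_{\hat{u}}^2$ in (12.24)):

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import acorr_breusch_godfrey

# Illustrative data: AR(1) errors, so the AR(3) test should reject.
rng = np.random.default_rng(0)
n = 200
x = rng.standard_normal(n)
u = np.zeros(n)
for t in range(1, n):
    u[t] = 0.6 * u[t - 1] + rng.standard_normal()
y = 1.0 + 2.0 * x + u

res = sm.OLS(y, sm.add_constant(x)).fit()
lm, lm_pval, fval, f_pval = acorr_breusch_godfrey(res, nlags=3)
print(fval, f_pval)   # F test for the joint significance of the residual lags
```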
Example 12.3: Testing for AR(3) Serial Correlation

In the event study of the barium chloride industry (see Example 10.5), we used monthly data, so we may wish to test for higher orders of serial correlation. For illustration purposes, we test for AR(3) serial correlation in the errors underlying equation (10.22). Using regression (12.22), the F statistic for joint significance of $\hat{u}_{t-1}$, $\hat{u}_{t-2}$, and $\hat{u}_{t-3}$ is $F = 5.12$. Originally, we had $n = 131$, and we lose three observations in the auxiliary regression (12.22). Because we estimate 10 parameters in (12.22) for this example, the df in the F statistic are 3 and 118. The p-value of the F statistic is .0023, so there is strong evidence of AR(3) serial correlation.

With quarterly or monthly data that have not been seasonally adjusted, we sometimes wish to test for seasonal forms of serial correlation. For example, with quarterly data, we might postulate the autoregressive model

$$u_t = \rho_4 u_{t-4} + e_t. \qquad (12.25)$$

From the AR(1) serial correlation tests, it is pretty clear how to proceed. When the regressors are strictly exogenous, we can use a t test on $\hat{u}_{t-4}$ in the regression of $\hat{u}_t$ on $\hat{u}_{t-4}$, for all $t = 5, \ldots, n$. A modification of the Durbin-Watson statistic is also available [see Wallis (1972)]. When the $x_{tj}$ are not strictly exogenous, we can use the regression in (12.18), with $\hat{u}_{t-4}$ replacing $\hat{u}_{t-1}$.

In Example 12.3, the data are monthly and are not seasonally adjusted. Therefore, it makes sense to test for correlation between $u_t$ and $u_{t-12}$. A regression of $\hat{u}_t$ on $\hat{u}_{t-12}$ yields $\hat{\rho}_{12} = -.187$ and p-value = .028, so there is evidence of negative seasonal autocorrelation. (Including the regressors changes things only modestly: $\hat{\rho}_{12} = -.170$ and p-value = .052.) This is somewhat unusual and does not have an obvious explanation.

[Exploring Further 12.3: Suppose you have quarterly data and you want to test for the presence of first order or fourth order serial correlation. With strictly exogenous regressors, how would you proceed?]

12.3 Correcting for Serial Correlation with Strictly Exogenous Regressors

If we detect serial correlation after applying one of the tests in Section 12.2, we have to do something about it. If our goal is to estimate a model with complete dynamics, we need to respecify the model. In applications where our goal is not to estimate a fully dynamic model, we need to find a way to carry out statistical inference: as we saw in Section 12.1, the usual OLS test statistics are no longer valid. In this section, we begin with the important case of AR(1) serial correlation. The traditional approach to this problem assumes fixed regressors; what are actually needed are strictly exogenous regressors. Therefore, at a minimum, we should not use these corrections when the explanatory variables include lagged dependent variables.

12.3a Obtaining the Best Linear Unbiased Estimator in the AR(1) Model

We assume the Gauss-Markov assumptions TS.1 through TS.4, but we relax Assumption TS.5. In particular, we assume that the errors follow the AR(1) model

$$u_t = \rho u_{t-1} + e_t, \quad \text{for all } t = 1, 2, \ldots. \qquad (12.26)$$

Remember that Assumption TS.3 implies that $u_t$ has a zero mean conditional on $X$. In the following analysis, we let the conditioning on $X$ be implied in order to simplify the notation. Thus, we write the variance of $u_t$ as

$$\mathrm{Var}(u_t) = \sigma_e^2/(1 - \rho^2). \qquad (12.27)$$

For simplicity, consider the case with a single explanatory variable:

$$y_t = \beta_0 + \beta_1 x_t + u_t, \quad \text{for all } t = 1, 2, \ldots, n.$$

Because the problem in this equation is serial correlation in the $u_t$, it makes sense to transform the equation to eliminate the serial correlation. For $t \ge 2$, we write

$$y_{t-1} = \beta_0 + \beta_1 x_{t-1} + u_{t-1},$$
$$y_t = \beta_0 + \beta_1 x_t + u_t.$$
Now, if we multiply the first equation by $\rho$ and subtract it from the second equation, we get

$$y_t - \rho y_{t-1} = (1 - \rho)\beta_0 + \beta_1(x_t - \rho x_{t-1}) + e_t, \quad t \ge 2,$$

where we have used the fact that $e_t = u_t - \rho u_{t-1}$. We can write this as

$$\tilde{y}_t = (1 - \rho)\beta_0 + \beta_1\tilde{x}_t + e_t, \quad t \ge 2, \qquad (12.28)$$

where

$$\tilde{y}_t = y_t - \rho y_{t-1}, \quad \tilde{x}_t = x_t - \rho x_{t-1} \qquad (12.29)$$

are called the quasi-differenced data. (If $\rho = 1$, these are differenced data, but remember we are assuming $|\rho| < 1$.) The error terms in (12.28) are serially uncorrelated; in fact, this equation satisfies all of the Gauss-Markov assumptions. This means that, if we knew $\rho$, we could estimate $\beta_0$ and $\beta_1$ by regressing $\tilde{y}_t$ on $\tilde{x}_t$, provided we divide the estimated intercept by $(1 - \rho)$.

The OLS estimators from (12.28) are not quite BLUE because they do not use the first time period. This is easily fixed by writing the equation for $t = 1$ as

$$y_1 = \beta_0 + \beta_1 x_1 + u_1. \qquad (12.30)$$

Since each $e_t$ is uncorrelated with $u_1$, we can add (12.30) to (12.28) and still have serially uncorrelated errors. However, using (12.27), $\mathrm{Var}(u_1) = \sigma_e^2/(1 - \rho^2) > \sigma_e^2 = \mathrm{Var}(e_t)$. [Equation (12.27) clearly does not hold when $|\rho| \ge 1$, which is why we assume the stability condition.] Thus, we must multiply (12.30) by $(1 - \rho^2)^{1/2}$ to get errors with the same variance:

$$(1 - \rho^2)^{1/2}y_1 = (1 - \rho^2)^{1/2}\beta_0 + \beta_1(1 - \rho^2)^{1/2}x_1 + (1 - \rho^2)^{1/2}u_1$$

or

$$\tilde{y}_1 = (1 - \rho^2)^{1/2}\beta_0 + \beta_1\tilde{x}_1 + \tilde{u}_1, \qquad (12.31)$$

where $\tilde{u}_1 = (1 - \rho^2)^{1/2}u_1$, $\tilde{y}_1 = (1 - \rho^2)^{1/2}y_1$, and so on. The error in (12.31) has variance $\mathrm{Var}(\tilde{u}_1) = (1 - \rho^2)\mathrm{Var}(u_1) = \sigma_e^2$, so we can use (12.31) along with (12.28) in an OLS regression. This gives the BLUE estimators of $\beta_0$ and $\beta_1$ under Assumptions TS.1 through TS.4 and the AR(1) model for $u_t$. This is another example of a generalized least squares (or GLS) estimator. (We saw other GLS estimators in the context of heteroskedasticity in Chapter 8.)

Adding more regressors changes very little. For $t \ge 2$, we use the equation

$$\tilde{y}_t = (1 - \rho)\beta_0 + \beta_1\tilde{x}_{t1} + \cdots + \beta_k\tilde{x}_{tk} + e_t, \qquad (12.32)$$

where $\tilde{x}_{tj} = x_{tj} - \rho x_{t-1,j}$. For $t = 1$, we have $\tilde{y}_1 = (1 - \rho^2)^{1/2}y_1$, $\tilde{x}_{1j} = (1 - \rho^2)^{1/2}x_{1j}$, and the intercept is $(1 - \rho^2)^{1/2}\beta_0$. For given $\rho$, it is fairly easy to transform the data and to carry out OLS. Unless $\rho = 0$, the GLS estimator, that is, OLS on the transformed data, will generally be different from the original OLS estimator. The GLS estimator turns out to be BLUE, and, since the errors in the transformed equation are serially uncorrelated and homoskedastic, t and F statistics from the transformed equation are valid (at least asymptotically, and exactly if the errors $e_t$ are normally distributed).
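For known $\rho$, the transformation in (12.28) through (12.32) is easy to code directly. A minimal sketch in numpy; the function names are mine:

```python
import numpy as np

def quasi_difference(z, rho):
    """Prais-Winsten transform of one data column for known rho:
    z~_1 = (1 - rho**2)**0.5 * z_1 and z~_t = z_t - rho * z_{t-1} for t >= 2,
    as in equations (12.29) and (12.31)."""
    z = np.asarray(z, dtype=float)
    zt = np.empty_like(z)
    zt[0] = np.sqrt(1.0 - rho ** 2) * z[0]
    zt[1:] = z[1:] - rho * z[:-1]
    return zt

def transformed_intercept(n, rho):
    """The 'intercept column' for (12.32): (1 - rho) for t >= 2 and
    (1 - rho**2)**0.5 for t = 1."""
    c = np.full(n, 1.0 - rho)
    c[0] = np.sqrt(1.0 - rho ** 2)
    return c
```

Running OLS of the transformed $y$ on the transformed intercept column and transformed regressors then yields the GLS estimates, with no further division needed for the intercept because the transformed constant already carries the $(1 - \rho)$ factor.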
12.3b Feasible GLS Estimation with AR(1) Errors

The problem with the GLS estimator is that $\rho$ is rarely known in practice. However, we already know how to get a consistent estimator of $\rho$: we simply regress the OLS residuals on their lagged counterparts, exactly as in equation (12.14). Next, we use this estimate, $\hat{\rho}$, in place of $\rho$ to obtain the quasi-differenced variables. We then use OLS on the equation

$$\tilde{y}_t = \beta_0\tilde{x}_{t0} + \beta_1\tilde{x}_{t1} + \cdots + \beta_k\tilde{x}_{tk} + \mathrm{error}_t, \qquad (12.33)$$

where $\tilde{x}_{t0} = (1 - \hat{\rho})$ for $t \ge 2$ and $\tilde{x}_{10} = (1 - \hat{\rho}^2)^{1/2}$. This results in the feasible GLS (FGLS) estimator of the $\beta_j$. The error term in (12.33) contains $e_t$ and also the terms involving the estimation error in $\hat{\rho}$. Fortunately, the estimation error in $\hat{\rho}$ does not affect the asymptotic distribution of the FGLS estimators.

Feasible GLS Estimation of the AR(1) Model:
(i) Run the OLS regression of $y_t$ on $x_{t1}, \ldots, x_{tk}$ and obtain the OLS residuals, $\hat{u}_t$, $t = 1, 2, \ldots, n$.
(ii) Run the regression in equation (12.14) and obtain $\hat{\rho}$.
(iii) Apply OLS to equation (12.33) to estimate $\beta_0, \beta_1, \ldots, \beta_k$. The usual standard errors, t statistics, and F statistics are asymptotically valid.

The cost of using $\hat{\rho}$ in place of $\rho$ is that the FGLS estimator has no tractable finite sample properties. In particular, it is not unbiased, although it is consistent when the data are weakly dependent. Further, even if $e_t$ in (12.32) is normally distributed, the t and F statistics are only approximately t and F distributed because of the estimation error in $\hat{\rho}$. This is fine for most purposes, although we must be careful with small sample sizes. Since the FGLS estimator is not unbiased, we certainly cannot say it is BLUE. Nevertheless, it is asymptotically more efficient than the OLS estimator when the AR(1) model for serial correlation holds (and the explanatory variables are strictly exogenous). Again, this statement assumes that the time series are weakly dependent.

There are several names for FGLS estimation of the AR(1) model that come from different methods of estimating $\rho$ and different treatments of the first observation. Cochrane-Orcutt (CO) estimation omits the first observation and uses $\hat{\rho}$ from (12.14), whereas Prais-Winsten (PW) estimation uses the first observation in the previously suggested way. Asymptotically, it makes no difference whether or not the first observation is used, but many time series samples are small, so the differences can be notable in applications.

In practice, both the Cochrane-Orcutt and Prais-Winsten methods are used in an iterative scheme. That is, once the FGLS estimator is found using $\hat{\rho}$ from (12.14), we can compute a new set of residuals, obtain a new estimator of $\rho$ from (12.14), transform the data using the new estimate of $\rho$, and estimate (12.33) by OLS. We can repeat the whole process many times, until the estimate of $\rho$ changes by very little from the previous iteration. Many regression packages implement an iterative procedure automatically, so there is no additional work for us. It is difficult to say whether more than one iteration helps. It seems to be helpful in some cases, but, theoretically, the large-sample properties of the iterated estimator are the same as the estimator that uses only the first iteration. For details on these and other methods, see Davidson and MacKinnon (1993, Chapter 10).
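As one illustration of a packaged iterative FGLS routine (not the software used in the text), statsmodels' `GLSAR` class implements the loop just described; the simulated data and parameter values here are purely illustrative:

```python
import numpy as np
import statsmodels.api as sm

# Illustrative data with AR(1) errors (rho = 0.7); parameter values are made up.
rng = np.random.default_rng(0)
n = 120
x = rng.standard_normal(n)
u = np.zeros(n)
for t in range(1, n):
    u[t] = 0.7 * u[t - 1] + rng.standard_normal()
y = 1.0 + 2.0 * x + u
X = sm.add_constant(x)

# GLSAR iterates: OLS residuals -> rho_hat -> quasi-difference -> re-estimate.
# Its whitening drops the initial observation, so it behaves like iterated
# Cochrane-Orcutt rather than Prais-Winsten.
model = sm.GLSAR(y, X, rho=1)        # rho=1 sets one AR lag (not rho equal to 1)
results = model.iterative_fit(maxiter=8)
print(model.rho, results.params)     # estimated rho and the FGLS coefficients
```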
Example 12.4: Prais-Winsten Estimation in the Event Study

Again using the data in BARIUM, we estimate the equation in Example 10.5 using iterated Prais-Winsten estimation. For comparison, we also present the OLS results in Table 12.1. The coefficients that are statistically significant in the Prais-Winsten estimation do not differ by much from the OLS estimates [in particular, the coefficients on log(chempi), log(rtwex), and afdec6]. It is not surprising for statistically insignificant coefficients to change, perhaps markedly, across different estimation methods.

Table 12.1 Dependent Variable: log(chnimp) (standard errors in parentheses)

| Coefficient  | OLS            | Prais-Winsten  |
|--------------|----------------|----------------|
| log(chempi)  | 3.12 (.48)     | 2.94 (.63)     |
| log(gas)     | .196 (.907)    | 1.05 (.98)     |
| log(rtwex)   | .983 (.400)    | 1.13 (.51)     |
| befile6      | .060 (.261)    | −.016 (.322)   |
| affile6      | −.032 (.264)   | −.033 (.322)   |
| afdec6       | −.565 (.286)   | −.577 (.342)   |
| intercept    | −17.80 (21.05) | −37.08 (22.78) |
| ρ̂           | —              | .293           |
| Observations | 131            | 131            |
| R-squared    | .305           | .202           |

Notice how the standard errors in the second column are uniformly higher than the standard errors in the first column. This is common. The Prais-Winsten standard errors account for serial correlation; the OLS standard errors do not. As we saw in Section 12.1, the OLS standard errors usually understate the actual sampling variation in the OLS estimates and should not be relied upon when significant serial correlation is present. Therefore, the effect on Chinese imports after the International Trade Commission's decision is now less statistically significant than we thought ($t_{afdec6} = -1.69$).

Finally, an R-squared is reported for the PW estimation that is well below the R-squared for the OLS estimation in this case. However, these R-squareds should not be compared. For OLS, the R-squared, as usual, is based on the regression with the untransformed dependent and independent variables. For PW, the R-squared comes from the final regression of the transformed dependent variable on the transformed independent variables. It is not clear what this $R^2$ is actually measuring; nevertheless, it is traditionally reported.

12.3c Comparing OLS and FGLS

In some applications of the Cochrane-Orcutt or Prais-Winsten methods, the FGLS estimates differ in practically important ways from the OLS estimates. (This was not the case in Example 12.4.) Typically, this has been interpreted as a verification of FGLS's superiority over OLS. Unfortunately, things are not so simple. To see why, consider the regression model

$$y_t = \beta_0 + \beta_1 x_t + u_t,$$

where the time series processes are stationary. Now, assuming that the law of large numbers holds, consistency of OLS for $\beta_1$ holds if

$$\mathrm{Cov}(x_t, u_t) = 0. \qquad (12.34)$$

Earlier, we asserted that FGLS was consistent under the strict exogeneity assumption, which is more restrictive than (12.34). In fact, it can be shown that the weakest assumption that must hold for FGLS to be consistent, in addition to (12.34), is that the sum of $x_{t-1}$ and $x_{t+1}$ is uncorrelated with $u_t$:

$$\mathrm{Cov}[(x_{t-1} + x_{t+1}), u_t] = 0. \qquad (12.35)$$

Practically speaking, consistency of FGLS requires $u_t$ to be uncorrelated with $x_{t-1}$, $x_t$, and $x_{t+1}$.

How can we show that condition (12.35) is needed along with (12.34)? The argument is simple if we assume $\rho$ is known and drop the first time period, as in Cochrane-Orcutt. (The argument when we use $\hat{\rho}$ is technically harder and yields no additional insights. Since one observation cannot affect the asymptotic properties of an estimator, dropping it does not affect the argument.) Now, with known $\rho$, the GLS estimator uses $x_t - \rho x_{t-1}$ as the regressor in an equation where $u_t - \rho u_{t-1}$ is the error. From Theorem 11.1, we know the key condition for consistency of OLS is that the error and the regressor are uncorrelated. In this case, we need
$\mathrm{E}[(x_t - \rho x_{t-1})(u_t - \rho u_{t-1})] = 0$. If we expand the expectation, we get

$$\mathrm{E}[(x_t - \rho x_{t-1})(u_t - \rho u_{t-1})] = \mathrm{E}(x_t u_t) - \rho\,\mathrm{E}(x_{t-1}u_t) - \rho\,\mathrm{E}(x_t u_{t-1}) + \rho^2\,\mathrm{E}(x_{t-1}u_{t-1})$$
$$= -\rho[\mathrm{E}(x_{t-1}u_t) + \mathrm{E}(x_t u_{t-1})],$$

because $\mathrm{E}(x_t u_t) = \mathrm{E}(x_{t-1}u_{t-1}) = 0$ by assumption (12.34). Now, under stationarity, $\mathrm{E}(x_t u_{t-1}) = \mathrm{E}(x_{t+1}u_t)$, because we are just shifting the time index one period forward. Therefore,

$$\mathrm{E}(x_{t-1}u_t) + \mathrm{E}(x_t u_{t-1}) = \mathrm{E}[(x_{t-1} + x_{t+1})u_t],$$

and the last expectation is the covariance in equation (12.35) because $\mathrm{E}(u_t) = 0$. We have shown that (12.35) is necessary, along with (12.34), for GLS to be consistent for $\beta_1$. (Of course, if $\rho = 0$, we do not need (12.35), because we are back to doing OLS.)

Our derivation shows that OLS and FGLS might give significantly different estimates because (12.35) fails. In this case, OLS—which is still consistent under (12.34)—is preferred to FGLS (which is inconsistent). If $x$ has a lagged effect on $y$, or $x_{t+1}$ reacts to changes in $u_t$, FGLS can produce misleading results.

Because OLS and FGLS are different estimation procedures, we never expect them to give the same estimates. If they provide similar estimates of the $\beta_j$, then FGLS is preferred if there is evidence of serial correlation, because the estimator is more efficient and the FGLS test statistics are at least asymptotically valid. A more difficult problem arises when there are practical differences in the OLS and FGLS estimates: it is hard to determine whether such differences are statistically significant. The general method proposed by Hausman (1978) can be used, but it is beyond the scope of this text. The next example gives a case where OLS and FGLS are different in practically important ways.

Example 12.5: Static Phillips Curve

Table 12.2 presents OLS and iterated Prais-Winsten estimates of the static Phillips curve from Example 10.1, using the observations through 1996.

Table 12.2 Dependent Variable: inf (standard errors in parentheses)

| Coefficient  | OLS           | Prais-Winsten |
|--------------|---------------|---------------|
| unem         | .468 (.289)   | −.716 (.313)  |
| intercept    | 1.424 (1.719) | 8.296 (2.231) |
| ρ̂           | —             | .781          |
| Observations | 49            | 49            |
| R-squared    | .053          | .136          |

The coefficient of interest is on unem, and it differs markedly between PW and OLS. Because the PW estimate is consistent with the inflation-unemployment tradeoff, our tendency is to focus on the PW estimates. In fact, these estimates are fairly close to what is obtained by first differencing both inf and unem (see Computer Exercise C4 in Chapter 11), which makes sense because the quasi-differencing used in PW with $\hat{\rho} = .781$ is similar to first differencing. It may just be that inf and unem are not related in levels, but they have a negative relationship in first differences.

Examples like the static Phillips curve can pose difficult problems for empirical researchers. On the one hand, if we are truly interested in a static relationship, and if unemployment and inflation are I(0) processes, then OLS produces consistent estimators without additional assumptions. But it could be that unemployment, inflation, or both have unit roots, in which case OLS need not have its usual desirable properties; we discuss this further in Chapter 18. In Example 12.5, FGLS gives more economically sensible estimates because it is similar to first
differencing; FGLS has the advantage of approximately eliminating unit roots.

12.3d Correcting for Higher Order Serial Correlation

It is also possible to correct for higher orders of serial correlation. A general treatment is given in Harvey (1990). Here, we illustrate the approach for AR(2) serial correlation,

$$u_t = \rho_1 u_{t-1} + \rho_2 u_{t-2} + e_t,$$

where $\{e_t\}$ satisfies the assumptions stated for the AR(1) model. The stability conditions are more complicated now. They can be shown to be [see Harvey (1990)]

$$\rho_2 > -1, \quad \rho_2 - \rho_1 < 1, \quad \text{and} \quad \rho_1 + \rho_2 < 1.$$

For example, the model is stable if $\rho_1 = .8$ and $\rho_2 = -.3$; the model is unstable if $\rho_1 = .7$ and $\rho_2 = .4$.

Assuming the stability conditions hold, we can obtain the transformation that eliminates the serial correlation. In the simple regression model, this is easy when $t > 2$:

$$y_t - \rho_1 y_{t-1} - \rho_2 y_{t-2} = \beta_0(1 - \rho_1 - \rho_2) + \beta_1(x_t - \rho_1 x_{t-1} - \rho_2 x_{t-2}) + e_t$$

or

$$\tilde{y}_t = \beta_0(1 - \rho_1 - \rho_2) + \beta_1\tilde{x}_t + e_t, \quad t = 3, 4, \ldots, n. \qquad (12.36)$$

If we know $\rho_1$ and $\rho_2$, we can easily estimate this equation by OLS after obtaining the transformed variables. Since we rarely know $\rho_1$ and $\rho_2$, we have to estimate them. As usual, we can use the OLS residuals, $\hat{u}_t$: obtain $\hat{\rho}_1$ and $\hat{\rho}_2$ from the regression of

$$\hat{u}_t \text{ on } \hat{u}_{t-1}, \hat{u}_{t-2}, \quad t = 3, \ldots, n.$$

[This is the same regression used to test for AR(2) serial correlation with strictly exogenous regressors.] Then, we use $\hat{\rho}_1$ and $\hat{\rho}_2$ in place of $\rho_1$ and $\rho_2$ to obtain the transformed variables. This gives one version of the FGLS estimator. If we have multiple explanatory variables, then each one is transformed by $\tilde{x}_{tj} = x_{tj} - \hat{\rho}_1 x_{t-1,j} - \hat{\rho}_2 x_{t-2,j}$, when $t > 2$.

The treatment of the first two observations is a little tricky. It can be shown that the dependent variable and each independent variable (including the intercept) should be transformed by

$$\tilde{z}_1 = \{(1 + \rho_2)[(1 - \rho_2)^2 - \rho_1^2]/(1 - \rho_2)\}^{1/2}z_1,$$
$$\tilde{z}_2 = (1 - \rho_2^2)^{1/2}z_2 - [\rho_1(1 - \rho_2^2)^{1/2}/(1 - \rho_2)]z_1,$$

where $z_1$ and $z_2$ denote either the dependent or an independent variable at $t = 1$ and $t = 2$, respectively. We will not derive these transformations. Briefly, they eliminate the serial correlation between the first two observations and make their error variances equal to $\sigma_e^2$. Fortunately, econometrics packages geared toward time series analysis easily estimate models with general AR(q) errors; we rarely need to directly compute the transformed variables ourselves.
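For completeness, the AR(2) analogue of the quasi-differencing transform for $t \ge 3$ is a one-liner; this sketch (function name mine) deliberately drops the first two observations, in the spirit of Cochrane-Orcutt, rather than applying the more involved weights just given:

```python
import numpy as np

def ar2_quasi_difference(z, r1, r2):
    """AR(2) quasi-differencing, dropping t = 1 and t = 2:
    z~_t = z_t - r1 * z_{t-1} - r2 * z_{t-2}, t = 3, ..., n, as in (12.36).
    (A full Prais-Winsten-style treatment would also transform the first
    two observations using the weights given in the text.)"""
    z = np.asarray(z, dtype=float)
    return z[2:] - r1 * z[1:-1] - r2 * z[:-2]
```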
12.4 Differencing and Serial Correlation

In Chapter 11, we presented differencing as a transformation for making an integrated process weakly dependent. There is another way to see the merits of differencing when dealing with highly persistent data. Suppose that we start with the simple regression model

$$y_t = \beta_0 + \beta_1 x_t + u_t, \quad t = 1, 2, \ldots, \qquad (12.37)$$

where $u_t$ follows the AR(1) process in (12.26). As we mentioned in Section 11.3, and as we will discuss more fully in Chapter 18, the usual OLS inference procedures can be very misleading when the variables $y_t$ and $x_t$ are integrated of order one, or I(1). In the extreme case where the errors $\{u_t\}$ in (12.37) follow a random walk, the equation makes no sense because, among other things, the variance of $u_t$ grows with $t$. It is more logical to difference the equation:

$$\Delta y_t = \beta_1\Delta x_t + \Delta u_t, \quad t = 2, \ldots, n. \qquad (12.38)$$

If $u_t$ follows a random walk, then $e_t \equiv \Delta u_t$ has zero mean and a constant variance and is serially uncorrelated. Thus, assuming that $e_t$ and $\Delta x_t$ are uncorrelated, we can estimate (12.38) by OLS, where we lose the first observation.

Even if $u_t$ does not follow a random walk, but $\rho$ is positive and large, first differencing is often a good idea: it will eliminate most of the serial correlation. Of course, equation (12.38) is different from (12.37), but at least we can have more faith in the OLS standard errors and t statistics in (12.38). Allowing for multiple explanatory variables does not change anything.

Example 12.6: Differencing the Interest Rate Equation

In Example 10.2, we estimated an equation relating the three-month T-bill rate to inflation and the federal deficit [see equation (10.15)]. If we obtain the residuals from estimating (10.15) and regress them on a single lag, we obtain $\hat{\rho} = .623$ (.110), which is large and very statistically significant. Therefore, at a minimum, serial correlation is a problem in this equation.

If we difference the data and run the regression, we obtain

$$\widehat{\Delta i3}_t = .042 + .149\,\Delta inf_t - .181\,\Delta def_t \qquad (12.39)$$

(standard errors .171, .092, and .148, respectively), with $n = 55$, $R^2 = .176$, and $\bar{R}^2 = .145$. The coefficients from this regression are very different from the equation in levels, suggesting either that the explanatory variables are not strictly exogenous or that one or more of the variables has a unit root. (In fact, the correlation between $i3_t$ and $i3_{t-1}$ is about .885, which may indicate a problem with interpreting (10.15) as a meaningful regression.) Plus, the regression in differences has essentially no serial correlation: a regression of $\hat{e}_t$ on $\hat{e}_{t-1}$ gives $\hat{\rho} = .072$ (.134). Because first differencing eliminates possible unit roots as well as serial correlation, we probably have more faith in the estimates and standard errors from (12.39) than from (10.15). The equation in differences shows that annual changes in interest rates are only weakly, positively related to annual changes in inflation, and the coefficient on $\Delta def_t$ is actually negative (though not statistically significant at even the 20% significance level against a two-sided alternative).
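Mechanically, estimating an equation like (12.38) amounts to differencing each series and running OLS on what remains. A minimal sketch, assuming statsmodels; the function name is mine, and including an intercept in differences (as in (12.39)) allows for a linear trend in the levels:

```python
import numpy as np
import statsmodels.api as sm

def difference_regression(y, X):
    """OLS on first differences, as in equation (12.38).
    y is (n,), X is (n, k) of explanatory variables without a constant;
    the first observation is lost in differencing."""
    dy = np.diff(y)                           # y_t - y_{t-1}
    dX = sm.add_constant(np.diff(X, axis=0))  # differenced regressors + constant
    return sm.OLS(dy, dX).fit()
```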
As we explained in Chapter 11, the decision of whether or not to difference is a tough one. But this discussion points out another benefit of differencing: it removes serial correlation. We will come back to this issue in Chapter 18.

12.5 Serial Correlation-Robust Inference after OLS

In recent years, it has become more popular to estimate models by OLS but to correct the standard errors for fairly arbitrary forms of serial correlation (and heteroskedasticity). Even though we know OLS will be inefficient, there are some good reasons for taking this approach. First, the explanatory variables may not be strictly exogenous; in this case, FGLS is not even consistent, let alone efficient. Second, in most applications of FGLS, the errors are assumed to follow an AR(1) model. It may be better to compute standard errors for the OLS estimates that are robust to more general forms of serial correlation.

To get the idea, consider equation (12.4), which is the variance of the OLS slope estimator in a simple regression model with AR(1) errors. We can estimate this variance very simply by plugging in our standard estimators of $\rho$ and $\sigma^2$. The only problems with this are that it assumes the AR(1) model holds and also assumes homoskedasticity. It is possible to relax both of these assumptions.

[Exploring Further 12.4: Suppose, after estimating a model by OLS, that you estimate $\rho$ from regression (12.14) and you obtain $\hat{\rho} = .92$. What would you do about this?]

A general treatment of standard errors that are robust to both heteroskedasticity and serial correlation is given in Davidson and MacKinnon (1993). Here, we provide a simple method to compute the robust standard error of any OLS coefficient; our treatment follows Wooldridge (1989).

Consider the standard multiple linear regression model

$$y_t = \beta_0 + \beta_1 x_{t1} + \cdots + \beta_k x_{tk} + u_t, \quad t = 1, 2, \ldots, n, \qquad (12.40)$$

which we have estimated by OLS. For concreteness, we are interested in obtaining a serial correlation-robust standard error for $\hat{\beta}_1$. This turns out to be fairly easy. Write $x_{t1}$ as a linear function of the remaining independent variables and an error term,

$$x_{t1} = \delta_0 + \delta_2 x_{t2} + \cdots + \delta_k x_{tk} + r_t,$$

where the error $r_t$ has zero mean and is uncorrelated with $x_{t2}, x_{t3}, \ldots, x_{tk}$. Then, it can be shown that the asymptotic variance of the OLS estimator $\hat{\beta}_1$ is

$$\mathrm{AVar}(\hat{\beta}_1) = \left(\sum_{t=1}^{n}\mathrm{E}(r_t^2)\right)^{-2}\mathrm{Var}\!\left(\sum_{t=1}^{n}r_t u_t\right).$$

Under the no serial correlation Assumption TS.5′, $\{a_t \equiv r_t u_t\}$ is serially uncorrelated, so either the usual OLS standard errors (under homoskedasticity) or the heteroskedasticity-robust standard errors will be valid. But if TS.5′ fails, our expression for $\mathrm{AVar}(\hat{\beta}_1)$ must account for the correlation between $a_t$ and $a_s$ when $t \neq s$. In practice, it is common to assume that, once the terms are farther apart than a few periods, the correlation is essentially zero. Remember that, under weak dependence, the correlation must be approaching zero, so this is a reasonable approach.

Following the general framework of Newey and West (1987), Wooldridge (1989) shows that $\mathrm{AVar}(\hat{\beta}_1)$ can be estimated as follows. Let "se$(\hat{\beta}_1)$" denote the usual (but incorrect) OLS standard error, and let $\hat{\sigma}$ be the usual standard error of the regression (or root mean squared error) from estimating (12.40) by OLS. Let $\hat{r}_t$ denote the residuals from the auxiliary regression of

$$x_{t1} \text{ on } x_{t2}, x_{t3}, \ldots, x_{tk} \qquad (12.41)$$

(including a constant, as usual). For a chosen integer $g > 0$, define

$$\hat{\nu} = \sum_{t=1}^{n}\hat{a}_t^2 + 2\sum_{h=1}^{g}[1 - h/(g+1)]\left(\sum_{t=h+1}^{n}\hat{a}_t\hat{a}_{t-h}\right), \qquad (12.42)$$

where $\hat{a}_t = \hat{r}_t\hat{u}_t$, $t = 1, 2, \ldots, n$. This looks somewhat complicated, but in practice it is easy to obtain. The integer $g$ in (12.42) controls how much serial correlation we are allowing in computing the standard error. Once we have $\hat{\nu}$, the serial correlation-robust standard error of $\hat{\beta}_1$ is simply

$$\mathrm{se}(\hat{\beta}_1) = [\text{"se}(\hat{\beta}_1)\text{"}/\hat{\sigma}]^2\sqrt{\hat{\nu}}. \qquad (12.43)$$

In other words, we take the usual OLS standard error of $\hat{\beta}_1$, divide it by $\hat{\sigma}$, square the result, and then multiply that by the square root of $\hat{\nu}$. This can be used to construct confidence intervals and t statistics for $\hat{\beta}_1$.
It is useful to see what $\hat{\nu}$ looks like in some simple cases. When $g = 1$,

$$\hat{\nu} = \sum_{t=1}^{n}\hat{a}_t^2 + \sum_{t=2}^{n}\hat{a}_t\hat{a}_{t-1}, \qquad (12.44)$$

and when $g = 2$,

$$\hat{\nu} = \sum_{t=1}^{n}\hat{a}_t^2 + (4/3)\left(\sum_{t=2}^{n}\hat{a}_t\hat{a}_{t-1}\right) + (2/3)\left(\sum_{t=3}^{n}\hat{a}_t\hat{a}_{t-2}\right). \qquad (12.45)$$

The larger that $g$ is, the more terms are included to correct for serial correlation. The purpose of the factor $[1 - h/(g+1)]$ in (12.42) is to ensure that $\hat{\nu}$ is, in fact, nonnegative [Newey and West (1987) verify this]. We clearly need $\hat{\nu} \ge 0$, since $\hat{\nu}$ is estimating a variance and the square root of $\hat{\nu}$ appears in (12.43).

The standard error in (12.43) is also robust to arbitrary heteroskedasticity. (In the time series literature, the serial correlation-robust standard errors are sometimes called heteroskedasticity and autocorrelation consistent, or HAC, standard errors.) In fact, if we drop the second term in (12.42), then (12.43) becomes the usual heteroskedasticity-robust standard error that we discussed in Chapter 8 (without the degrees of freedom adjustment).

The theory underlying the standard error in (12.43) is technical and somewhat subtle. Remember, we started off by claiming we do not know the form of serial correlation. If this is the case, how can we select the integer $g$? Theory states that (12.43) works for fairly arbitrary forms of serial correlation, provided $g$ grows with sample size $n$. The idea is that, with larger sample sizes, we can be more flexible about the amount of correlation in (12.42). There has been much recent work on the relationship between $g$ and $n$, but we will not go into that here. For annual data, choosing a small $g$, such as $g = 1$ or $g = 2$, is likely to account for most of the serial correlation. For quarterly or monthly data, $g$ should probably be larger ($g = 4$ or 8 for quarterly and $g = 12$ or 24 for monthly), assuming that we have enough data. Newey and West (1987) recommend taking $g$ to be the integer part of $4(n/100)^{2/9}$; others have suggested the integer part of $n^{1/4}$. (The Newey-West suggestion is implemented by the econometrics program Eviews.) For, say, $n = 50$ (which is reasonable for annual, postwar data from World War II), the Newey-West rule gives $g = 3$; the integer part of $n^{1/4}$ gives $g = 2$.

We summarize how to obtain a serial correlation-robust standard error for $\hat{\beta}_1$. (Of course, since we can list any independent variable first, the following procedure works for computing a standard error for any slope coefficient.)

Serial Correlation-Robust Standard Error for $\hat{\beta}_1$:
(i) Estimate (12.40) by OLS, which yields "se$(\hat{\beta}_1)$", $\hat{\sigma}$, and the OLS residuals $\{\hat{u}_t: t = 1, \ldots, n\}$.
(ii) Compute the residuals $\{\hat{r}_t: t = 1, \ldots, n\}$ from the auxiliary regression (12.41). Then, form $\hat{a}_t = \hat{r}_t\hat{u}_t$ for each $t$.
(iii) For your choice of $g$, compute $\hat{\nu}$ as in (12.42).
(iv) Compute se$(\hat{\beta}_1)$ from (12.43).
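Steps (i) through (iv) can be coded directly. A minimal sketch, assuming statsmodels; the function name and the generalization to an arbitrary column $j$ are mine:

```python
import numpy as np
import statsmodels.api as sm

def sc_robust_se(y, X, j, g):
    """SC-robust standard error for the coefficient on column j of X,
    following steps (i)-(iv) and equations (12.41)-(12.43).
    X must contain a constant column."""
    res = sm.OLS(y, X).fit()                       # step (i)
    uhat = res.resid
    sigma = np.sqrt(res.scale)                     # standard error of the regression
    se_usual = res.bse[j]                          # the usual (incorrect) se
    others = np.delete(X, j, axis=1)
    rhat = sm.OLS(X[:, j], others).fit().resid     # step (ii), regression (12.41)
    a = rhat * uhat
    v = np.sum(a ** 2)                             # step (iii), equation (12.42)
    for h in range(1, g + 1):
        v += 2.0 * (1.0 - h / (g + 1.0)) * np.sum(a[h:] * a[:-h])
    return (se_usual / sigma) ** 2 * np.sqrt(v)    # step (iv), equation (12.43)

# A packaged alternative (Newey-West/HAC, which may differ slightly in
# small-sample conventions):
# res_hac = sm.OLS(y, X).fit(cov_type='HAC', cov_kwds={'maxlags': g})
```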
Empirically, the serial correlation-robust standard errors are typically larger than the usual OLS standard errors when there is serial correlation. This is true because, in most cases, the errors are positively serially correlated. However, it is possible to have substantial serial correlation in $\{u_t\}$ but also to have similarities in the usual and serial correlation-robust (SC-robust) standard errors of some coefficients: it is the sample autocorrelations of $\hat{a}_t = \hat{r}_t\hat{u}_t$ that determine the robust standard error for $\hat{\beta}_1$.

The use of SC-robust standard errors has somewhat lagged behind the use of standard errors robust only to heteroskedasticity, for several reasons. First, large cross sections, where the heteroskedasticity-robust standard errors will have good properties, are more common than large time series. The SC-robust standard errors can be poorly behaved when there is substantial serial correlation and the sample size is small (where small can even be as large as, say, 100). Second, since we must choose the integer $g$ in equation (12.42), computation of the SC-robust standard errors is not automatic. As mentioned earlier, some econometrics packages have automated the selection, but you still have to abide by the choice. Another important reason SC-robust standard errors are not yet routinely reported is that, in the presence of severe serial correlation, OLS can be very inefficient, especially in small sample sizes. After performing OLS and correcting the standard errors for serial correlation, we often find the coefficients are insignificant, or at least less significant, than they were with the usual OLS standard errors.

If we are confident that the explanatory variables are strictly exogenous, yet are skeptical about the errors following an AR(1) process, we can still get estimators more efficient than OLS by using a standard FGLS estimator, such as Prais-Winsten or Cochrane-Orcutt. With substantial serial correlation, the quasi-differencing transformation used by PW and CO is likely to be better than doing nothing and just using OLS. But if the errors do not follow an AR(1) model, then the standard errors reported from PW or CO estimation will be incorrect. Nevertheless, we can manually quasi-difference the data after estimating $\rho$, use pooled OLS on the transformed data, and then use SC-robust standard errors in the transformed equation. Computing an SC-robust standard error after quasi-differencing ensures that any extra serial correlation is accounted for in statistical inference. In fact, the SC-robust standard errors probably work better after much serial correlation has been eliminated using quasi-differencing (or some other transformation, such as that used for AR(2) serial correlation). Such an approach is analogous to using weighted least squares in the presence of heteroskedasticity but then computing standard errors that are robust to having the variance function incorrectly specified (see Section 8.4).

The SC-robust standard errors after OLS estimation are most useful when we have doubts about some of the explanatory variables being strictly exogenous, so that methods such as Prais-Winsten and Cochrane-Orcutt are not even consistent. It is also valid to use the SC-robust standard errors in models with lagged dependent variables (assuming, of course, that there is good reason for allowing serial correlation in such models).
Example 12.7: The Puerto Rican Minimum Wage

We obtain an SC-robust standard error for the minimum wage effect in the Puerto Rican employment equation. In Example 12.2, we found pretty strong evidence of AR(1) serial correlation. As in that example, we use as additional controls $\log(usgnp)$, $\log(prgnp)$, and a linear time trend.

The OLS estimate of the elasticity of the employment rate with respect to the minimum wage is $\hat{\beta}_1 = -.2123$, and the usual OLS standard error is "se$(\hat{\beta}_1)$" = .0402. The standard error of the regression is $\hat{\sigma} = .0328$. Further, using the previous procedure with $g = 2$ [see (12.45)], we obtain $\hat{\nu} = .000805$. This gives the SC-robust standard error as se$(\hat{\beta}_1) = [(.0402/.0328)^2]\sqrt{.000805} \approx .0426$. Interestingly, the robust standard error is only slightly greater than the usual OLS standard error. The robust t statistic is about $-4.98$, and so the estimated elasticity is still very statistically significant.

For comparison, the iterated PW estimate of $\beta_1$ is $-.1477$, with a standard error of .0458. Thus, the FGLS estimate is closer to zero than the OLS estimate, and we might suspect violation of the strict exogeneity assumption. Or, the difference in the OLS and FGLS estimates might be explainable by sampling error. It is very difficult to tell.

Kiefer and Vogelsang (2005) provide a different way to obtain valid inference in the presence of arbitrary serial correlation. Rather than worry about the rate at which $g$ is allowed to grow as a function of $n$ (in order for the t statistics to have asymptotic standard normal distributions), Kiefer and Vogelsang derive the large-sample distribution of the t statistic when $b \equiv (g+1)/n$ is allowed to settle down to a nonzero fraction. (In the Newey-West setup, $g/n$ always converges to zero.) For example, when $b = 1$, $g = n - 1$, which means that we include every covariance term in equation (12.42). The resulting t statistic does not have a large-sample standard normal distribution, but Kiefer and Vogelsang show that it does have an asymptotic distribution, and they tabulate the appropriate critical values. For a two-sided, 5% level test, the critical value is 4.771, and for a two-sided, 10% level test, the critical value is 3.764. Compared with the critical values from the standard normal distribution, we need a t statistic substantially larger. But we do not have to worry about choosing the number of covariances in (12.42).

Before leaving this section, we note that it is possible to construct SC-robust, F-type statistics for testing multiple hypotheses, but these are too advanced to cover here. [See Wooldridge (1991b, 1995) and Davidson and MacKinnon (1993) for treatments.]

12.6 Heteroskedasticity in Time Series Regressions

We discussed testing and correcting for heteroskedasticity in cross-sectional applications in Chapter 8. Heteroskedasticity can also occur in time series regression models, and its presence, while not causing bias or inconsistency in the $\hat{\beta}_j$, does invalidate the usual standard errors, t statistics, and F statistics. This is just as in the cross-sectional case.
Because the usual OLS statistics are asymptotically valid under Assumptions TS.1′ through TS.5′, we are interested in what happens when the homoskedasticity assumption, TS.4′, does not hold. Assumption TS.3′ rules out misspecifications such as omitted variables and certain kinds of measurement error, while TS.5′ rules out serial correlation in the errors. It is important to remember that serially correlated errors cause problems that adjustments for heteroskedasticity are not able to address.

12-6a Heteroskedasticity-Robust Statistics

In studying heteroskedasticity for cross-sectional regressions, we noted how it has no bearing on the unbiasedness or consistency of the OLS estimators. Exactly the same conclusions hold in the time series case, as we can see by reviewing the assumptions needed for unbiasedness (Theorem 10.1) and consistency (Theorem 11.1).

In Section 8-2, we discussed how the usual OLS standard errors, t statistics, and F statistics can be adjusted to allow for the presence of heteroskedasticity of unknown form. These same adjustments work for time series regressions under Assumptions TS.1′, TS.2′, TS.3′, and TS.5′. Thus, provided the only assumption violated is the homoskedasticity assumption, valid inference is easily obtained in most econometric packages.

12-6b Testing for Heteroskedasticity

Sometimes, we wish to test for heteroskedasticity in time series regressions, especially if we are concerned about the performance of heteroskedasticity-robust statistics in relatively small sample sizes. The tests we covered in Chapter 8 can be applied directly, but with a few caveats. First, the errors $u_t$ should not be serially correlated; any serial correlation will generally invalidate a test for heteroskedasticity. Thus, it makes sense to test for serial correlation first, using a heteroskedasticity-robust test if heteroskedasticity is suspected. Then, after something has been done to correct for serial correlation, we can test for heteroskedasticity.

Second, consider the equation used to motivate the Breusch-Pagan test for heteroskedasticity:

$$u_t^2 = \delta_0 + \delta_1 x_{t1} + \dots + \delta_k x_{tk} + v_t, \qquad (12.46)$$

where the null hypothesis is $H_0\colon \delta_1 = \delta_2 = \dots = \delta_k = 0$. For the F statistic—with $\hat{u}_t^2$ replacing $u_t^2$ as the dependent variable—to be valid, we must assume that the errors $\{v_t\}$ are themselves homoskedastic (as in the cross-sectional case) and serially uncorrelated. These are implicitly assumed in computing all standard tests for heteroskedasticity, including the version of the White test we covered in Section 8-3. Assuming that the $\{v_t\}$ are serially uncorrelated rules out certain forms of dynamic heteroskedasticity, something we will treat in the next subsection.

If heteroskedasticity is found in the $u_t$ (and the $u_t$ are not serially correlated), then the heteroskedasticity-robust test statistics can be used. An alternative is to use weighted least squares, as in Section 8-4. The mechanics of weighted least squares for the time series case are identical to those for the cross-sectional case.

Exploring Further 12.5: How would you compute the White test for heteroskedasticity in equation (12.47)?
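As a concrete illustration of the Breusch-Pagan test described above, the following is a minimal Python sketch using statsmodels. The data are simulated with heteroskedastic but serially uncorrelated errors (the caveat emphasized in the text), so all names and numbers are illustrative.

```python
# Sketch: Breusch-Pagan test for heteroskedasticity in a time series
# regression, in the spirit of equation (12.46); simulated data.
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan

rng = np.random.default_rng(1)
n = 200
x = rng.normal(size=n)
# Heteroskedastic, serially uncorrelated errors: Var(u_t) depends on x_t.
u = rng.normal(size=n) * np.sqrt(np.exp(0.8 * x))
y = 1.0 + 1.5 * x + u

X = sm.add_constant(x)
res = sm.OLS(y, X).fit()

# LM and F versions of the BP test: internally, the squared OLS residuals
# are regressed on the regressors.
lm_stat, lm_pval, f_stat, f_pval = het_breuschpagan(res.resid, X)
print(f"BP LM statistic = {lm_stat:.2f}, p-value = {lm_pval:.4f}")
print(f"BP F statistic  = {f_stat:.2f}, p-value = {f_pval:.4f}")
```

Remember that this test is only reliable after serial correlation has been ruled out or corrected for, as discussed above.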
Example 12.8 Heteroskedasticity and the Efficient Markets Hypothesis

In Example 11.4, we estimated the simple model

$$return_t = \beta_0 + \beta_1 return_{t-1} + u_t. \qquad (12.47)$$

The EMH states that $\beta_1 = 0$. When we tested this hypothesis using the data in NYSE, we obtained $t_{\hat{\beta}_1} = 1.55$ with n = 689. With such a large sample, this is not much evidence against the EMH. Although the EMH states that the expected return given past observable information should be constant, it says nothing about the conditional variance. In fact, the Breusch-Pagan test for heteroskedasticity entails regressing the squared OLS residuals $\hat{u}_t^2$ on $return_{t-1}$:

$$\hat{u}_t^2 = 4.66 - 1.104\,return_{t-1} + \text{residual}_t$$
$$\phantom{xxx}(0.43)\ \ (0.201) \qquad (12.48)$$
$$n = 689,\ R^2 = .042.$$

The t statistic on $return_{t-1}$ is about −5.5, indicating strong evidence of heteroskedasticity. Because the coefficient on $return_{t-1}$ is negative, we have the interesting finding that volatility in stock returns is lower when the previous return was high, and vice versa. Therefore, we have found what is common in many financial studies: the expected value of stock returns does not depend on past returns, but the variance of returns does.

12-6c Autoregressive Conditional Heteroskedasticity

In recent years, economists have become interested in dynamic forms of heteroskedasticity. Of course, if $x_t$ contains a lagged dependent variable, then heteroskedasticity as in (12.46) is dynamic. But dynamic forms of heteroskedasticity can appear even in models with no dynamics in the regression equation. To see this, consider a simple static regression model:

$$y_t = \beta_0 + \beta_1 z_t + u_t,$$

and assume that the Gauss-Markov assumptions hold. This means that the OLS estimators are BLUE. The homoskedasticity assumption says that $\text{Var}(u_t|Z)$ is constant, where Z denotes all n outcomes of $z_t$. Even if the variance of $u_t$ given Z is constant, there are other ways that heteroskedasticity can arise. Engle (1982) suggested looking at the conditional variance of $u_t$ given past errors (where the conditioning on Z is left implicit). Engle suggested what is known as the autoregressive conditional heteroskedasticity (ARCH) model. The first-order ARCH model is

$$E(u_t^2|u_{t-1}, u_{t-2}, \dots) = E(u_t^2|u_{t-1}) = \alpha_0 + \alpha_1 u_{t-1}^2, \qquad (12.49)$$

where we leave the conditioning on Z implicit. This equation represents the conditional variance of $u_t$ given past $u_t$ only if $E(u_t|u_{t-1}, u_{t-2}, \dots) = 0$, which means that the errors are serially uncorrelated. Since conditional variances must be positive, this model only makes sense if $\alpha_0 > 0$ and $\alpha_1 \geq 0$; if $\alpha_1 = 0$, there are no dynamics in the variance equation.

It is instructive to write (12.49) as

$$u_t^2 = \alpha_0 + \alpha_1 u_{t-1}^2 + v_t, \qquad (12.50)$$

where the expected value of $v_t$ (given $u_{t-1}, u_{t-2}, \dots$) is zero by definition. (However, the $v_t$ are not independent of past $u_t$ because of the constraint $v_t \geq -\alpha_0 - \alpha_1 u_{t-1}^2$.) Equation (12.50) looks like an autoregressive model in $u_t^2$ (hence the name ARCH). The stability condition for this equation is $\alpha_1 < 1$, just as in the usual AR(1) model. When $\alpha_1 > 0$, the squared errors contain (positive) serial correlation even though the $u_t$ themselves do not.
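The last point—that ARCH errors are serially uncorrelated in levels but positively autocorrelated in squares—is easy to verify by simulation. The following sketch is illustrative only; the parameter values are assumptions chosen to satisfy $\alpha_0 > 0$ and $0 \leq \alpha_1 < 1$.

```python
# Sketch: simulate ARCH(1) errors as in (12.49)-(12.50) and check that u_t
# is (approximately) serially uncorrelated while u_t^2 is autocorrelated.
import numpy as np

rng = np.random.default_rng(2)
n = 5000
alpha0, alpha1 = 1.0, 0.5          # need alpha0 > 0, 0 <= alpha1 < 1

u = np.zeros(n)
u[0] = rng.normal()
for t in range(1, n):
    cond_var = alpha0 + alpha1 * u[t - 1] ** 2   # E(u_t^2 | u_{t-1})
    u[t] = np.sqrt(cond_var) * rng.normal()

def first_autocorr(z):
    z = z - z.mean()
    return (z[1:] * z[:-1]).sum() / (z ** 2).sum()

print("corr(u_t, u_{t-1})     =", round(first_autocorr(u), 3))       # ~ 0
print("corr(u_t^2, u_{t-1}^2) =", round(first_autocorr(u ** 2), 3))  # > 0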
What implications does (12.50) have for OLS? Because we began by assuming the Gauss-Markov assumptions hold, OLS is BLUE. Further, even if $u_t$ is not normally distributed, we know that the usual OLS test statistics are asymptotically valid under Assumptions TS.1′ through TS.5′, which are satisfied by static and distributed lag models with ARCH errors.

If OLS still has desirable properties under ARCH, why should we care about ARCH forms of heteroskedasticity in static and distributed lag models? We should be concerned for two reasons. First, it is possible to get consistent (but not unbiased) estimators of the $\beta_j$ that are asymptotically more efficient than the OLS estimators. A weighted least squares procedure, based on estimating (12.50), will do the trick. A maximum likelihood procedure also works under the assumption that the errors $u_t$ have a conditional normal distribution. Second, economists in various fields have become interested in dynamics in the conditional variance. Engle's original application was to the variance of United Kingdom inflation, where he found that a larger magnitude of the error in the previous time period (larger $u_{t-1}^2$) was associated with a larger error variance in the current period. Since variance is often used to measure volatility, and volatility is a key element in asset pricing theories, ARCH models have become important in empirical finance.

ARCH models also apply when there are dynamics in the conditional mean. Suppose we have the dependent variable $y_t$, a contemporaneous exogenous variable $z_t$, and

$$E(y_t|z_t, y_{t-1}, z_{t-1}, y_{t-2}, \dots) = \beta_0 + \beta_1 z_t + \beta_2 y_{t-1} + \beta_3 z_{t-1},$$

so that at most one lag of y and z appears in the dynamic regression. The typical approach is to assume that $\text{Var}(y_t|z_t, y_{t-1}, z_{t-1}, y_{t-2}, \dots)$ is constant, as we discussed in Chapter 11. But this variance could follow an ARCH model:

$$\text{Var}(y_t|z_t, y_{t-1}, z_{t-1}, y_{t-2}, \dots) = \text{Var}(u_t|z_t, y_{t-1}, z_{t-1}, y_{t-2}, \dots) = \alpha_0 + \alpha_1 u_{t-1}^2,$$

where $u_t = y_t - E(y_t|z_t, y_{t-1}, z_{t-1}, y_{t-2}, \dots)$. As we know from Chapter 11, the presence of ARCH does not affect consistency of OLS, and the usual heteroskedasticity-robust standard errors and test statistics are valid. (Remember, these are valid for any form of heteroskedasticity, and ARCH is just one particular form of heteroskedasticity.)

If you are interested in the ARCH model and its extensions, see Bollerslev, Chou, and Kroner (1992) and Bollerslev, Engle, and Nelson (1994) for recent surveys.

Example 12.9 ARCH in Stock Returns

In Example 12.8, we saw that there was heteroskedasticity in weekly stock returns. This heteroskedasticity is actually better characterized by the ARCH model in (12.50). If we compute the OLS residuals from (12.47), square these, and regress them on the lagged squared residual, we obtain

$$\hat{u}_t^2 = 2.95 + .337\,\hat{u}_{t-1}^2 + \text{residual}_t$$
$$\phantom{xxx}(.44)\ \ (.036) \qquad (12.51)$$
$$n = 688,\ R^2 = .114.$$

The t statistic on $\hat{u}_{t-1}^2$ is over nine, indicating strong ARCH. As we discussed earlier, a larger error at time t − 1 implies a larger variance in stock returns today.

It is important to see that, though the squared OLS residuals are autocorrelated, the OLS residuals themselves are not (as is consistent with the EMH). Regressing $\hat{u}_t$ on $\hat{u}_{t-1}$ gives $\hat{\rho} = .0014$ with $t_{\hat{\rho}} = .038$.
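The regression used in Example 12.9 is simple to code. The sketch below packages it as a small function; `resid` stands in for the OLS residuals from a fitted returns equation (the NYSE data are not reproduced here), and for concreteness we feed it a simulated ARCH(1) series, so the printed numbers are illustrative rather than the values reported in (12.51).

```python
# Sketch: the ARCH(1) test of Example 12.9 -- regress squared OLS residuals
# on their own lag: u_t^2 = a0 + a1 u_{t-1}^2 + v_t.
import numpy as np
import statsmodels.api as sm

def arch1_test(resid):
    """Return (alpha1_hat, t statistic) from the squared-residual regression."""
    usq = np.asarray(resid) ** 2
    y, ylag = usq[1:], usq[:-1]
    fit = sm.OLS(y, sm.add_constant(ylag)).fit()
    return fit.params[1], fit.tvalues[1]

# Simulated ARCH(1) residuals, in place of residuals from a real regression.
rng = np.random.default_rng(3)
e = rng.normal(size=1000)
u = np.zeros(1000)
for t in range(1, 1000):
    u[t] = np.sqrt(1.0 + 0.5 * u[t - 1] ** 2) * e[t]

a1_hat, t_stat = arch1_test(u)
print(f"alpha1_hat = {a1_hat:.3f}, t statistic = {t_stat:.2f}")
```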
12-6d Heteroskedasticity and Serial Correlation in Regression Models

Nothing rules out the possibility of both heteroskedasticity and serial correlation being present in a regression model. If we are unsure, we can always use OLS and compute fully robust standard errors, as described in Section 12-5.

Much of the time, serial correlation is viewed as the most important problem, because it usually has a larger impact on standard errors and the efficiency of estimators than does heteroskedasticity. As we concluded in Section 12-2, obtaining tests for serial correlation that are robust to arbitrary heteroskedasticity is fairly straightforward. If we detect serial correlation using such a test, we can employ the Cochrane-Orcutt (or Prais-Winsten) transformation [see equation (12.32)] and, in the transformed equation, use heteroskedasticity-robust standard errors and test statistics. Or, we can even test for heteroskedasticity in (12.32) using the Breusch-Pagan or White tests.

Alternatively, we can model heteroskedasticity and serial correlation and correct for both through a combined weighted least squares AR(1) procedure. Specifically, consider the model

$$y_t = \beta_0 + \beta_1 x_{t1} + \dots + \beta_k x_{tk} + u_t$$
$$u_t = \sqrt{h_t}\, v_t \qquad (12.52)$$
$$v_t = \rho v_{t-1} + e_t,\quad |\rho| < 1,$$

where the explanatory variables X are independent of $e_t$ for all t, and $h_t$ is a function of the $x_{tj}$. The process $\{e_t\}$ has zero mean and constant variance $\sigma_e^2$ and is serially uncorrelated. Therefore, $\{v_t\}$ satisfies a stable AR(1) process. The error $u_t$ is heteroskedastic, in addition to containing serial correlation:

$$\text{Var}(u_t|x_t) = \sigma_v^2 h_t,$$

where $\sigma_v^2 = \sigma_e^2/(1 - \rho^2)$. But $v_t = u_t/\sqrt{h_t}$ is homoskedastic and follows a stable AR(1) model. Therefore, the transformed equation

$$y_t/\sqrt{h_t} = \beta_0(1/\sqrt{h_t}) + \beta_1(x_{t1}/\sqrt{h_t}) + \dots + \beta_k(x_{tk}/\sqrt{h_t}) + v_t \qquad (12.53)$$

has AR(1) errors. Now, if we have a particular kind of heteroskedasticity in mind—that is, we know $h_t$—we can estimate (12.53) using standard CO or PW methods.

In most cases, we have to estimate $h_t$ first. The following method combines the weighted least squares method from Section 8-4 with the AR(1) serial correlation correction from Section 12-3.

Feasible GLS with Heteroskedasticity and AR(1) Serial Correlation:
(i) Estimate (12.52) by OLS and save the residuals, $\hat{u}_t$.
(ii) Regress $\log(\hat{u}_t^2)$ on $x_{t1}, \dots, x_{tk}$ (or on $\hat{y}_t$, $\hat{y}_t^2$) and obtain the fitted values, say $\hat{g}_t$.
(iii) Obtain the estimates of $h_t$: $\hat{h}_t = \exp(\hat{g}_t)$.
(iv) Estimate the transformed equation

$$\hat{h}_t^{-1/2} y_t = \hat{h}_t^{-1/2}\beta_0 + \beta_1 \hat{h}_t^{-1/2} x_{t1} + \dots + \beta_k \hat{h}_t^{-1/2} x_{tk} + \text{error}_t \qquad (12.54)$$

by standard Cochrane-Orcutt or Prais-Winsten methods.

The FGLS estimators obtained from the procedure are asymptotically efficient, provided the assumptions in model (12.52) hold. More importantly, all standard errors and test statistics from the CO or PW estimation are asymptotically valid. If we allow the variance function to be misspecified, or allow the possibility that any serial correlation does not follow an AR(1) model, then we can apply quasi-differencing to (12.54), estimating the resulting equation by OLS, and then obtain the Newey-West standard errors. By doing so, we would be using a procedure that could be asymptotically efficient while ensuring that our inference is valid (asymptotically) if we have misspecified our model of either heteroskedasticity or serial correlation.
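A minimal sketch of the four-step procedure follows, again with simulated data. Steps (i)–(iii) estimate the variance function $\hat{h}_t = \exp(\hat{g}_t)$ exactly as described above; for step (iv), statsmodels' `GLSAR` class, which performs a Cochrane-Orcutt-style iterative AR(1) correction, is used as a stand-in for "standard CO or PW methods." The true variance function and AR parameter in the simulation are assumptions for illustration.

```python
# Sketch: feasible GLS with heteroskedasticity and AR(1) serial correlation.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(4)
n = 300
x = rng.normal(size=n)
h = np.exp(0.6 * x)              # true variance function (unknown in practice)
e = rng.normal(size=n)
v = np.zeros(n)
for t in range(1, n):            # v_t = 0.5 v_{t-1} + e_t
    v[t] = 0.5 * v[t - 1] + e[t]
u = np.sqrt(h) * v
y = 1.0 + 2.0 * x + u
X = sm.add_constant(x)

# (i) OLS, save residuals.
uhat = sm.OLS(y, X).fit().resid
# (ii) Regress log(uhat^2) on the regressors; get fitted values g_t.
ghat = sm.OLS(np.log(uhat ** 2), X).fit().fittedvalues
# (iii) hhat_t = exp(ghat_t).
hhat = np.exp(ghat)
# (iv) Weight all variables (including the constant) by hhat^(-1/2), then
#      apply an iterative AR(1) (Cochrane-Orcutt-style) fit.
w = 1.0 / np.sqrt(hhat)
res = sm.GLSAR(y * w, X * w[:, None], rho=1).iterative_fit(maxiter=20)
print(res.params, res.bse)
```

Note that the constant column is weighted along with the other regressors, exactly as in equation (12.54), where $\beta_0$ multiplies $\hat{h}_t^{-1/2}$.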
Summary

We have covered the important problem of serial correlation in the errors of multiple regression models. Positive correlation between adjacent errors is common, especially in static and finite distributed lag models. This causes the usual OLS standard errors and statistics to be misleading (although the $\hat{\beta}_j$ can still be unbiased, or at least consistent). Typically, the OLS standard errors underestimate the true uncertainty in the parameter estimates.

The most popular model of serial correlation is the AR(1) model. Using this as the starting point, it is easy to test for the presence of AR(1) serial correlation using the OLS residuals. An asymptotically valid t statistic is obtained by regressing the OLS residuals on the lagged residuals, assuming the regressors are strictly exogenous and a homoskedasticity assumption holds. Making the test robust to heteroskedasticity is simple. The Durbin-Watson statistic is available under the classical linear model assumptions, but it can lead to an inconclusive outcome, and it has little to offer over the t test.

For models with a lagged dependent variable or other nonstrictly exogenous regressors, the standard t test on $\hat{u}_{t-1}$ is still valid, provided all independent variables are included as regressors along with $\hat{u}_{t-1}$. We can use an F or an LM statistic to test for higher-order serial correlation.

In models with strictly exogenous regressors, we can use a feasible GLS procedure—Cochrane-Orcutt or Prais-Winsten—to correct for AR(1) serial correlation. This gives estimates that are different from the OLS estimates: the FGLS estimates are obtained from OLS on quasi-differenced variables. All of the usual test statistics from the transformed equation are asymptotically valid. Almost all regression packages have built-in features for estimating models with AR(1) errors.

Another way to deal with serial correlation, especially when the strict exogeneity assumption might fail, is to use OLS but to compute serial correlation-robust standard errors (that are also robust to heteroskedasticity). Many regression packages follow a method suggested by Newey and West (1987); it is also possible to use standard regression packages to obtain one standard error at a time.

Finally, we discussed some special features of heteroskedasticity in time series models. As in the cross-sectional case, the most important kind of heteroskedasticity is that which depends on the explanatory variables; this is what determines whether the usual OLS statistics are valid. The Breusch-Pagan and White tests covered in Chapter 8 can be applied directly, with the caveat that the errors should not be serially correlated. In recent years, economists—especially those who study the financial markets—have become interested in dynamic forms of heteroskedasticity. The ARCH model is the leading example.

Key Terms

AR(1) Serial Correlation; Autoregressive Conditional Heteroskedasticity (ARCH); Breusch-Godfrey Test; Cochrane-Orcutt (CO) Estimation; Durbin-Watson (DW) Statistic; Feasible GLS (FGLS); Prais-Winsten (PW) Estimation; Quasi-Differenced Data; Serial Correlation-Robust Standard Error; Weighted Least Squares

Problems

1 When the errors in a regression model have AR(1) serial correlation, why do the OLS standard errors tend to underestimate the sampling variation in the $\hat{\beta}_j$? Is it always true that the OLS standard errors are too small?

2 Explain what is wrong with the following statement: "The Cochrane-Orcutt and Prais-Winsten methods are both used to obtain valid standard errors for the OLS estimates when there is serial correlation."
3 In Example 10.6, we used the data in FAIR to estimate a variant on Fair's model for predicting presidential election outcomes in the United States.
(i) What argument can be made for the error term in this equation being serially uncorrelated? (Hint: How often do presidential elections take place?)
(ii) When the OLS residuals from (10.23) are regressed on the lagged residuals, we obtain $\hat{\rho} = -.068$ and $\text{se}(\hat{\rho}) = .240$. What do you conclude about serial correlation in the $u_t$?
(iii) Does the small sample size in this application worry you in testing for serial correlation?

4 True or false: "If the errors in a regression model contain ARCH, they must be serially correlated."

5 (i) In the enterprise zone event study in Computer Exercise C5 in Chapter 10, a regression of the OLS residuals on the lagged residuals produces $\hat{\rho} = .841$ and $\text{se}(\hat{\rho}) = .053$. What implications does this have for OLS?
(ii) If you want to use OLS but also want to obtain a valid standard error for the EZ coefficient, what would you do?

6 In Example 12.8, we found evidence of heteroskedasticity in $u_t$ in equation (12.47). Thus, we compute the heteroskedasticity-robust standard errors (in [·]) along with the usual standard errors:

$$\widehat{return}_t = 1.80 + .059\,return_{t-1}$$
$$\phantom{xxx}(.81)\ \ (.038)$$
$$\phantom{xxx}[.85]\ \ [.069]$$
$$n = 689,\ R^2 = .0035,\ \bar{R}^2 = .0020.$$

What does using the heteroskedasticity-robust t statistic do to the significance of $return_{t-1}$?

7 Consider a standard multiple linear regression model with time series data: $y_t = \beta_0 + \beta_1 x_{t1} + \dots + \beta_k x_{tk} + u_t$. Assume that Assumptions TS.1, TS.2, TS.3, and TS.4 all hold.
(i) Suppose we think that the errors $\{u_t\}$ follow an AR(1) model with parameter $\rho$, and so we apply the Prais-Winsten method. If the errors do not follow an AR(1) model—for example, suppose they follow an AR(2) model, or an MA(1) model—why will the usual Prais-Winsten standard errors be incorrect?
(ii) Can you think of a way to use the Newey-West procedure, in conjunction with Prais-Winsten estimation, to obtain valid standard errors? Be very specific about the steps you would follow. [Hint: It may help to study equation (12.32) and note that, if $\{u_t\}$ does not follow an AR(1) process, $e_t$ generally should be replaced by $u_t - \rho u_{t-1}$, where $\rho$ is the probability limit of the estimator $\hat{\rho}$. Now, is the error $\{u_t - \rho u_{t-1}\}$ serially uncorrelated in general? What can you do if it is not?]
(iii) Explain why your answer to part (ii) should not change if we drop Assumption TS.4.

Computer Exercises

C1 In Example 11.6, we estimated a finite DL model in first differences (changes):

$$cgfr_t = \gamma_0 + \delta_0 cpe_t + \delta_1 cpe_{t-1} + \delta_2 cpe_{t-2} + u_t.$$

Use the data in FERTIL3 to test whether there is AR(1) serial correlation in the errors.

C2 (i) Using the data in WAGEPRC, estimate the distributed lag model from Problem 5 in Chapter 11. Use regression (12.14) to test for AR(1) serial correlation.
(ii) Reestimate the model using iterated Cochrane-Orcutt estimation. What is your new estimate of the long-run propensity?
(iii) Using iterated CO, find the standard error for the LRP. (This requires you to estimate a modified equation.) Determine whether the estimated LRP is statistically different from one at the 5% level.

C3 (i) In part (i) of Computer Exercise C6 in Chapter 11, you were asked to estimate the accelerator model for inventory investment. Test this equation for AR(1) serial correlation.
(ii) If you find evidence of serial correlation, reestimate the equation by Cochrane-Orcutt and compare the results.

C4 (i) Use NYSE to estimate equation (12.48). Let $\hat{h}_t$ be the fitted values from this equation (the estimates of the conditional variance). How many $\hat{h}_t$ are negative?
(ii) Add $return_{t-1}^2$ to (12.48) and again compute the fitted values, $\hat{h}_t$. Are any $\hat{h}_t$ negative?
(iii) Use the $\hat{h}_t$ from part (ii) to estimate (12.47) by weighted least squares (as in Section 8-4). Compare your estimate of $\beta_1$ with that in equation (11.16). Test $H_0\colon \beta_1 = 0$ and compare the outcome when OLS is used.
(iv) Now, estimate (12.47) by WLS, using the estimated ARCH model in (12.51) to obtain the $\hat{h}_t$. Does this change your findings from part (iii)?

C5 Consider the version of Fair's model in Example 10.6. Now, rather than predicting the proportion of the two-party vote received by the Democrat, estimate a linear probability model for whether or not the Democrat wins.
(i) Use the binary variable demwins in place of demvote in (10.23) and report the results in standard form. Which factors affect the probability of winning? Use the data only through 1992.
(ii) How many fitted values are less than zero? How many are greater than one?
(iii) Use the following prediction rule: if $\widehat{demwins} > .5$, you predict the Democrat wins; otherwise, the Republican wins. Using this rule, determine how many of the 20 elections are correctly predicted by the model.
(iv) Plug in the values of the explanatory variables for 1996. What is the predicted probability that Clinton would win the election? Clinton did win; did you get the correct prediction?
(v) Use a heteroskedasticity-robust t test for AR(1) serial correlation in the errors. What do you find?
(vi) Obtain the heteroskedasticity-robust standard errors for the estimates in part (i). Are there notable changes in any t statistics?

C6 (i) In Computer Exercise C7 in Chapter 10, you estimated a simple relationship between consumption growth and growth in disposable income. Test the equation for AR(1) serial correlation (using CONSUMP).
(ii) In Computer Exercise C7 in Chapter 11, you tested the permanent income hypothesis by regressing the growth in consumption on one lag. After running this regression, test for heteroskedasticity by regressing the squared residuals on $gc_{t-1}$ and $gc_{t-1}^2$. What do you conclude?

C7 (i) For Example 12.4, using the data in BARIUM, obtain the iterative Cochrane-Orcutt estimates.
(ii) Are the Prais-Winsten and Cochrane-Orcutt estimates similar? Did you expect them to be?

C8 Use the data in TRAFFIC2 for this exercise.
(i) Run an OLS regression of prcfat on a linear time trend, monthly dummy variables, and the variables wkends, unem, spdlaw, and beltlaw. Test the errors for AR(1) serial correlation, using the regression in equation (12.14). Does it make sense to use the test that assumes strict exogeneity of the regressors?
(ii) Obtain serial correlation- and heteroskedasticity-robust standard errors for the coefficients on spdlaw and beltlaw, using four lags in the Newey-West estimator. How does this affect the statistical significance of the two policy variables?
(iii) Now, estimate the model using iterative Prais-Winsten and compare the estimates with the OLS estimates. Are there important changes in the policy variable coefficients or their statistical significance?

C9 The file FISH contains 97 daily price and quantity observations on fish prices at the Fulton Fish Market in New York City. Use the variable log(avgprc) as the dependent variable.
(i) Regress log(avgprc) on four daily dummy variables, with Friday as the base. Include a linear time trend. Is there evidence that price varies systematically within a week?
(ii) Now, add the variables wave2 and wave3, which are measures of wave heights over the past several days. Are these variables individually significant? Describe a mechanism by which stormy seas would increase the price of fish.
(iii) What happened to the time trend when wave2 and wave3 were added to the regression? What must be going on?
(iv) Explain why all explanatory variables in the regression are safely assumed to be strictly exogenous.
(v) Test the errors for AR(1) serial correlation.
(vi) Obtain the Newey-West standard errors using four lags. What happens to the t statistics on wave2 and wave3? Did you expect a bigger or smaller change compared with the usual OLS t statistics?
(vii) Now, obtain the Prais-Winsten estimates for the model estimated in part (ii). Are wave2 and wave3 jointly statistically significant?

C10 Use the data in PHILLIPS to answer these questions.
(i) Using the entire data set, estimate the static Phillips curve equation $inf_t = \beta_0 + \beta_1 unem_t + u_t$ by OLS and report the results in the usual form.
(ii) Obtain the OLS residuals from part (i), $\hat{u}_t$, and obtain $\hat{\rho}$ from the regression $\hat{u}_t$ on $\hat{u}_{t-1}$. (It is fine to include an intercept in this regression.) Is there strong evidence of serial correlation?
(iii) Now, estimate the static Phillips curve model by iterative Prais-Winsten. Compare the estimate of $\beta_1$ with that obtained in Table 12.2. Is there much difference in the estimate when the later years are added?
(iv) Rather than using Prais-Winsten, use iterative Cochrane-Orcutt. How similar are the final estimates of $\rho$? How similar are the PW and CO estimates of $\beta_1$?

C11 Use the data in NYSE to answer these questions.
(i) Estimate the model in equation (12.47) and obtain the squared OLS residuals. Find the average, minimum, and maximum values of $\hat{u}_t^2$ over the sample.
(ii) Use the squared OLS residuals to estimate the following model of heteroskedasticity:

$$\text{Var}(u_t|return_{t-1}, return_{t-2}, \dots) = \text{Var}(u_t|return_{t-1}) = \delta_0 + \delta_1 return_{t-1} + \delta_2 return_{t-1}^2.$$

Report the estimated coefficients, the reported standard errors, the R-squared, and the adjusted R-squared.
(iii) Sketch the conditional variance as a function of the lagged return. For what value of the lagged return is the variance the smallest, and what is the variance?
(iv) For predicting the dynamic variance, does the model in part (ii) produce any negative variance estimates?
(v) Does the model in part (ii) seem to fit better or worse than the ARCH(1) model in Example 12.9? Explain.
(vi) To the ARCH(1) regression in equation (12.51), add the second lag, $\hat{u}_{t-2}^2$. Does this lag seem important? Does the ARCH(2) model fit better than the model in part (ii)?

C12 Use the data in INVEN for this exercise; see also Computer Exercise C6 in Chapter 11.
(i) Obtain the OLS residuals from the accelerator model $\Delta inven_t = \beta_0 + \beta_1 \Delta GDP_t + u_t$ and use the regression $\hat{u}_t$ on $\hat{u}_{t-1}$ to test for serial correlation. What is the estimate of $\rho$? How big a problem does serial correlation seem to be?
(ii) Estimate the accelerator model by PW, and compare the estimate of $\beta_1$ to the OLS estimate. Why do you expect them to be similar?

C13 Use the data in OKUN to answer this question; see also Computer Exercise C11 in Chapter 11.
(i) Estimate the equation $pcrgdp_t = \beta_0 + \beta_1 cunem_t + u_t$ and test the errors for AR(1) serial correlation, without assuming $\{cunem_t\colon t = 1, 2, \dots\}$ is strictly exogenous. What do you conclude?
(ii) Regress the squared residuals, $\hat{u}_t^2$, on $cunem_t$ (this is the Breusch-Pagan test for heteroskedasticity in the simple regression case). What do you conclude?
(iii) Obtain the heteroskedasticity-robust standard error for the OLS estimate $\hat{\beta}_1$. Is it substantially different from the usual OLS standard error?

C14 Use the data in MINWAGE for this exercise, focusing on sector 232.
(i) Estimate the equation $gwage232_t = \beta_0 + \beta_1 gmwage_t + \beta_2 gcpi_t + u_t$ and test the errors for AR(1) serial correlation. Does it matter whether you assume $gmwage_t$ and $gcpi_t$ are strictly exogenous? What do you conclude overall?
(ii) Obtain the Newey-West standard errors for the OLS estimates in part (i), using a lag of 12. How do the Newey-West standard errors compare to the usual OLS standard errors?
(iii) Now, obtain the heteroskedasticity-robust standard errors for OLS, and compare them with the usual standard errors and the Newey-West standard errors. Does it appear that serial correlation or heteroskedasticity is more of a problem in this application?
(iv) Use the Breusch-Pagan test in the original equation to verify that the errors exhibit strong heteroskedasticity.
(v) Add lags 1 through 12 of gmwage to the equation in part (i). Obtain the p-value for the joint F test for lags 1 through 12, and compare it with the p-value for the heteroskedasticity-robust test. How does adjusting for heteroskedasticity affect the significance of the lags?
(vi) Obtain the p-value for the joint significance test in part (v) using the Newey-West approach. What do you conclude now?
(vii) If you leave out the lags of gmwage, is the estimate of the long-run propensity much different?

C15 Use the data in BARIUM to answer this question.
(i) In Table 12.1, the reported standard errors for OLS are uniformly below those of the corresponding standard errors for GLS (Prais-Winsten). Explain why comparing the OLS and GLS standard errors is flawed.
(ii) Reestimate the equation represented by the column labeled OLS in Table 12.1 by OLS, but now find the Newey-West standard errors using a window g = 4 (four months). How does the Newey-West standard error on lchempi compare to the usual OLS standard error? How does it compare to the PW standard error? Make the same comparisons for the afdec6 variable.
(iii) Redo part (ii), now using a window g = 12. What happens to the standard errors on lchempi and afdec6 when the window increases from 4 to 12?

C16 Use the data in APPROVAL to answer the following questions; see also Computer Exercise C14 in Chapter 11.
(i) Estimate the equation

$$approve_t = \beta_0 + \beta_1 lcpifood_t + \beta_2 lrgasprice_t + \beta_3 unemploy_t + \beta_4 sep11_t + \beta_5 iraqinvade_t + u_t$$

using first differencing, and test the errors in the first-differenced (FD) equation for AR(1) serial correlation. In particular, let $\hat{e}_t$ be the OLS residuals in the FD estimation and regress $\hat{e}_t$ on $\hat{e}_{t-1}$; report the p-value of the test. What is the estimate of $\rho$?
(ii) Estimate the FD equation using Prais-Winsten. How does the estimate of $\beta_2$ compare with the OLS estimate on the FD equation? What about its statistical significance?
(iii) Return to estimating the FD equation by OLS. Now, obtain the Newey-West standard errors using lags of one, four, and eight. Discuss the statistical significance of the estimate of $\beta_2$ using each of the three standard errors.

Part 3 Advanced Topics

We now turn to some more specialized topics that are not usually covered in a one-term introductory course. Some of these topics require few more mathematical skills than the multiple regression analysis did in Parts 1 and 2.

In Chapter 13, we show how to apply multiple regression to independently pooled cross sections. The issues raised are very similar to standard cross-sectional analysis, except that we can study how relationships change over time by including time dummy variables. We also illustrate how panel data sets can be analyzed in a regression framework. Chapter 14 covers more advanced panel data methods that are nevertheless used routinely in applied work.

Chapters 15 and 16 investigate the problem of endogenous explanatory variables. In Chapter 15, we introduce the method of instrumental variables as a way of solving the omitted variable problem as well as the measurement error problem. The method of two-stage least squares is used quite often in empirical economics and is indispensable for estimating simultaneous equation models, a topic we turn to in Chapter 16.

Chapter 17 covers some fairly advanced topics that are typically used in cross-sectional analysis, including models for limited dependent variables and methods for correcting sample selection bias. Chapter 18 heads in a different direction by covering some recent advances in time series econometrics that have proven to be useful in estimating dynamic relationships.

Chapter 19 should be helpful to students who must write either a term paper or some other paper in the applied social sciences. The chapter offers suggestions for how to select a topic, collect and analyze the data, and write the paper.
Chapter 13 Pooling Cross Sections across Time: Simple Panel Data Methods

Until now, we have covered multiple regression analysis using pure cross-sectional or pure time series data. Although these two cases arise often in applications, data sets that have both cross-sectional and time series dimensions are being used more and more often in empirical research. Multiple regression methods can still be used on such data sets. In fact, data with cross-sectional and time series aspects can often shed light on important policy questions. We will see several examples in this chapter.

We will analyze two kinds of data sets in this chapter. An independently pooled cross section is obtained by sampling randomly from a large population at different points in time (usually, but not necessarily, different years). For instance, in each year, we can draw a random sample on hourly wages, education, experience, and so on, from the population of working people in the United States. Or, in every other year, we draw a random sample on the selling price, square footage, number of bathrooms, and so on, of houses sold in a particular metropolitan area. From a statistical standpoint, these data sets have an important feature: they consist of independently sampled observations. This was also a key aspect in our analysis of cross-sectional data: among other things, it rules out correlation in the error terms across different observations.

An independently pooled cross section differs from a single random sample in that sampling from the population at different points in time likely leads to observations that are not identically distributed. For example, distributions of wages and education have changed over time in most countries. As we will see, this is easy to deal with in practice by allowing the intercept in a multiple regression model, and in some cases the slopes, to change over time. We cover such models in Section 13-1. In Section 13-2, we discuss how pooling cross sections over time can be used to evaluate policy changes.

A panel data set, while having both a cross-sectional and a time series dimension, differs in some important respects from an independently pooled cross section. To collect panel data—sometimes called longitudinal data—we follow (or attempt to follow) the same individuals, families, firms, cities, states, or whatever, across time. For example, a panel data set on individual wages, hours, education, and other factors is collected by randomly selecting people from a population at a given point in time. Then, these same people are reinterviewed at several subsequent points in time. This gives us data on wages, hours, education, and so on, for the same group of people in different years.

Panel data sets are fairly easy to collect for school districts, cities, counties, states, and countries, and policy analysis is greatly enhanced by using panel data sets; we will see some examples in the following discussion.
For the econometric analysis of panel data, we cannot assume that the observations are independently distributed across time. For example, unobserved factors (such as ability) that affect someone's wage in 1990 will also affect that person's wage in 1991; unobserved factors that affect a city's crime rate in 1985 will also affect that city's crime rate in 1990. For this reason, special models and methods have been developed to analyze panel data. In Sections 13-3, 13-4, and 13-5, we describe the straightforward method of differencing to remove time-constant, unobserved attributes of the units being studied. Because panel data methods are somewhat more advanced, we will rely mostly on intuition in describing the statistical properties of the estimation procedures, leaving detailed assumptions to the chapter appendix. We follow the same strategy in Chapter 14, which covers more complicated panel data methods.

13-1 Pooling Independent Cross Sections across Time

Many surveys of individuals, families, and firms are repeated at regular intervals, often each year. An example is the Current Population Survey (or CPS), which randomly samples households each year. (See, for example, CPS78_85, which contains data from the 1978 and 1985 CPS.) If a random sample is drawn at each time period, pooling the resulting random samples gives us an independently pooled cross section.

One reason for using independently pooled cross sections is to increase the sample size. By pooling random samples drawn from the same population, but at different points in time, we can get more precise estimators and test statistics with more power. Pooling is helpful in this regard only insofar as the relationship between the dependent variable and at least some of the independent variables remains constant over time.

As mentioned in the introduction, using pooled cross sections raises only minor statistical complications. Typically, to reflect the fact that the population may have different distributions in different time periods, we allow the intercept to differ across periods, usually years. This is easily accomplished by including dummy variables for all but one year, where the earliest year in the sample is usually chosen as the base year. It is also possible that the error variance changes over time, something we discuss later.

Sometimes, the pattern of coefficients on the year dummy variables is itself of interest. For example, a demographer may be interested in the following question: After controlling for education, has the pattern of fertility among women over age 35 changed between 1972 and 1984? The following example illustrates how this question is simply answered by using multiple regression analysis with year dummy variables.
Example 13.1 Women's Fertility over Time

The data set in FERTIL1, which is similar to that used by Sander (1992), comes from the National Opinion Research Center's General Social Survey for the even years from 1972 to 1984, inclusively. We use these data to estimate a model explaining the total number of kids born to a woman (kids).

One question of interest is: After controlling for other observable factors, what has happened to fertility rates over time? The factors we control for are years of education, age, race, region of the country where living at age 16, and living environment at age 16. The estimates are given in Table 13.1.

Table 13.1 Determinants of Women's Fertility
Dependent Variable: kids

Independent Variables    Coefficient    Standard Error
educ                     -.128          .018
age                      .532           .138
age^2                    -.0058         .0016
black                    1.076          .174
east                     .217           .133
northcen                 .363           .121
west                     .198           .167
farm                     -.053          .147
othrural                 -.163          .175
town                     .084           .124
smcity                   .212           .160
y74                      .268           .173
y76                      -.097          .179
y78                      -.069          .182
y80                      -.071          .183
y82                      -.522          .172
y84                      -.545          .175
constant                 -7.742         3.052
n = 1,129; R^2 = .1295; adjusted R^2 = .1162

The base year is 1972. The coefficients on the year dummy variables show a sharp drop in fertility in the early 1980s. For example, the coefficient on y82 implies that, holding education, age, and other factors fixed, a woman had, on average, .52 less children, or about one-half a child, in 1982 than in 1972. This is a very large drop: holding educ, age, and the other factors fixed, 100 women in 1982 are predicted to have about 52 fewer children than 100 comparable women in 1972. Since we are controlling for education, this drop is separate from the decline in fertility that is due to the increase in average education levels. (The average years of education are 12.2 for 1972 and 13.3 for 1984.) The coefficients on y82 and y84 represent drops in fertility for reasons that are not captured in the explanatory variables.

Given that the 1982 and 1984 year dummies are individually quite significant, it is not surprising that as a group the year dummies are jointly very significant: the R-squared for the regression without the year dummies is .1019, and this leads to $F_{6,1111} = 5.87$ and p-value ≈ 0.

Women with more education have fewer children, and the estimate is very statistically significant. Other things being equal, 100 women with a college education will have about 51 fewer children, on average, than 100 women with only a high school education: .128(4) = .512. Age has a diminishing effect on fertility. (The turning point in the quadratic is at about age = 46, by which time most women have finished having children.)

The model estimated in Table 13.1 assumes that the effect of each explanatory variable, particularly education, has remained constant. This may or may not be true; you will be asked to explore this issue in Computer Exercise C1.

Finally, there may be heteroskedasticity in the error term underlying the estimated equation. This can be dealt with using the methods in Chapter 8. There is one interesting difference here: now, the error variance may change over time even if it does not change with the values of educ, age, black, and so on. The heteroskedasticity-robust standard errors and test statistics are nevertheless valid. The Breusch-Pagan test would be obtained by regressing the squared OLS residuals on all of the independent variables in Table 13.1, including the year dummies. (For the special case of the White statistic, the fitted values $\widehat{kids}$ and the squared fitted values are used as the independent variables, as always.) A weighted least squares procedure should account for variances that possibly change over time. (In the procedure discussed in Section 8-4, year dummies would be included in equation (8.32).)
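A regression of this form—a pooled cross section with a full set of year dummies—is easy to set up with the statsmodels formula interface. The following is a minimal sketch with simulated data standing in for FERTIL1, so the variable list is abbreviated and the coefficients are illustrative; `C(year)` automatically creates a dummy for every year except the base (earliest) year.

```python
# Sketch: pooling independent cross sections with year dummies.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(5)
n = 600
df = pd.DataFrame({
    "year": rng.choice([1972, 1978, 1984], size=n),
    "educ": rng.integers(8, 18, size=n),
    "age": rng.integers(35, 55, size=n),
})
# Simulated outcome with a downward shift in the later year.
df["kids"] = (3.0 - 0.1 * df["educ"] - 0.4 * (df["year"] == 1984)
              + rng.normal(scale=1.2, size=n))

res = smf.ols("kids ~ educ + age + I(age**2) + C(year)", data=df).fit()
print(res.summary().tables[1])
```

The coefficients on the `C(year)[...]` terms are the year-intercept shifts relative to the base year, which is exactly how the y74–y84 coefficients in Table 13.1 are interpreted.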
Exploring Further 13.1: In reading Table 13.1, someone claims that, if everything else is equal in the table, a black woman is expected to have one more child than a nonblack woman. Do you agree with this claim?

We can also interact a year dummy variable with key explanatory variables to see if the effect of that variable has changed over a certain time period. The next example examines how the return to education and the gender gap have changed from 1978 to 1985.

Example 13.2 Changes in the Return to Education and the Gender Wage Gap

A log(wage) equation (where wage is hourly wage) pooled across the years 1978 (the base year) and 1985 is

$$\log(wage) = \beta_0 + \delta_0 y85 + \beta_1 educ + \delta_1 y85 \cdot educ + \beta_2 exper + \beta_3 exper^2 + \beta_4 union + \beta_5 female + \delta_5 y85 \cdot female + u, \qquad (13.1)$$

where most explanatory variables should by now be familiar. The variable union is a dummy variable equal to one if the person belongs to a union, and zero otherwise. The variable y85 is a dummy variable equal to one if the observation comes from 1985 and zero if it comes from 1978. There are 550 people in the sample in 1978 and a different set of 534 people in 1985.

The intercept for 1978 is $\beta_0$, and the intercept for 1985 is $\beta_0 + \delta_0$. The return to education in 1978 is $\beta_1$, and the return to education in 1985 is $\beta_1 + \delta_1$. Therefore, $\delta_1$ measures how the return to another year of education has changed over the seven-year period. Finally, in 1978, the log(wage) differential between women and men is $\beta_5$; the differential in 1985 is $\beta_5 + \delta_5$. Thus, we can test the null hypothesis that nothing has happened to the gender differential over this seven-year period by testing $H_0\colon \delta_5 = 0$. The alternative that the gender differential has been reduced is $H_1\colon \delta_5 > 0$. For simplicity, we have assumed that experience and union membership have the same effect on wages in both time periods.

Before we present the estimates, there is one other issue we need to address—namely, hourly wage here is in nominal (or current) dollars. Since nominal wages grow simply due to inflation, we are really interested in the effect of each explanatory variable on real wages. Suppose that we settle on measuring wages in 1978 dollars. This requires deflating 1985 wages to 1978 dollars.
P852 5 log1wagei2 2 log1P852 Now while wagei differs across people P85 does not Therefore logP85 will be absorbed into the intercept for 1985 This conclusion would change if for example we used a different price index for people living in different parts of the country The bottom line is that for studying how the return to education or the gender gap has changed we do not need to turn nominal wages into real wages in equation 131 Computer Exercise C2 asks you to verify this for the current example If we forget to allow different intercepts in 1978 and 1985 the use of nominal wages can produce seriously misleading results If we use wage rather than logwage as the dependent variable it is important to use the real wage and to include a year dummy The previous discussion generally holds when using dollar values for either the dependent or independent variables Provided the dollar amounts appear in logarithmic form and dummy variables are used for all time periods except of course the base period the use of aggregate price deflators will only affect the intercepts none of the slope estimates will change Now we use the data in CPS7885 to estimate the equation log1wage2 5 459 1 118 y85 1 0747 educ 1 0185 y85 educ 10932 11242 100672 100942 1 0296 exper 2 00040 exper2 1 202 union 100362 1000082 10302 132 2 317 female 1 085 y85 female 10372 10512 n 5 1084 R2 5 426 R2 5 422 The return to education in 1978 is estimated to be about 75 the return to education in 1985 is about 185 percentage points higher or about 935 Because the t statistic on the interaction term is 01850094 197 the difference in the return to education is statistically significant at the 5 level against a twosided alternative What about the gender gap In 1978 other things being equal a woman earned about 317 less than a man 272 is the more accurate estimate In 1985 the gap in logwage is 2317 1 085 5 2232 Therefore the gender gap appears to have fallen from 1978 to 1985 by about 85 percentage points The t statistic on the interaction term is about 167 which means it is significant at the 5 level against the positive onesided alternative What happens if we interact all independent variables with y85 in equation 132 This is identi cal to estimating two separate equations one for 1978 and one for 1985 Sometimes this is desirable For example in Chapter 7 we discussed a study by Krueger 1993 in which he estimated the return to using a computer on the job Krueger estimates two separate equations one using the 1984 CPS and the other using the 1989 CPS By comparing how the return to education changes across time and whether or not computer usage is controlled for he estimates that onethird to onehalf of the observed increase in the return to education over the fiveyear period can be attributed to increased computer usage See Tables VIII and IX in Krueger 1993 Copyright 2016 Cengage Learning All Rights Reserved May not be copied scanned or duplicated in whole or in part Due to electronic rights some third party content may be suppressed from the eBook andor eChapters Editorial review has deemed that any suppressed content does not materially affect the overall learning experience Cengage Learning reserves the right to remove additional content at any time if subsequent rights restrictions require it CHAPTER 13 Pooling Cross Sections across Time Simple Panel Data Methods 407 131a The Chow Test for Structural Change across Time In Chapter 7 we discussed how the Chow testwhich is simply an F testcan be used to determine whether a multiple 
13-1a The Chow Test for Structural Change across Time

In Chapter 7, we discussed how the Chow test—which is simply an F test—can be used to determine whether a multiple regression function differs across two groups. We can apply that test to two different time periods as well. One form of the test obtains the sum of squared residuals from the pooled estimation as the restricted SSR. The unrestricted SSR is the sum of the SSRs for the two separately estimated time periods. The mechanics of computing the statistic are exactly as they were in Section 7-4. A heteroskedasticity-robust version is also available (see Section 8-2).

Example 13.2 suggests another way to compute the Chow test for two time periods: by interacting each variable with a year dummy for one of the two years and testing for joint significance of the year dummy and all of the interaction terms. Since the intercept in a regression model often changes over time (due to, say, inflation in the housing price example), this full-blown Chow test can detect such changes. It is usually more interesting to allow for an intercept difference and then to test whether certain slope coefficients change over time (as we did in Example 13.2).

A Chow test can also be computed for more than two time periods. Just as in the two-period case, it is usually more interesting to allow the intercepts to change over time and then test whether the slope coefficients have changed over time. We can test the constancy of slope coefficients generally by interacting all of the time-period dummies (except the one defining the base group) with one, several, or all of the explanatory variables and testing the joint significance of the interaction terms. Computer Exercises C1 and C2 are examples.

For many time periods and explanatory variables, constructing a full set of interactions can be tedious. Alternatively, we can adapt the approach described in part (vi) of Computer Exercise C11 in Chapter 7. First, estimate the restricted model by doing a pooled regression allowing for different time intercepts; this gives $SSR_r$. Then, run a regression for each of the, say, T time periods and obtain the sum of squared residuals for each time period. The unrestricted sum of squared residuals is obtained as $SSR_{ur} = SSR_1 + SSR_2 + \dots + SSR_T$. If there are k explanatory variables (not including the intercept or the time dummies) with T time periods, then we are testing (T − 1)k restrictions, and there are T + Tk parameters estimated in the unrestricted model. So, if $n = n_1 + n_2 + \dots + n_T$ is the total number of observations, then the df of the F test are (T − 1)k and n − T − Tk. We compute the F statistic as usual:

$$F = \frac{(SSR_r - SSR_{ur})}{SSR_{ur}} \cdot \frac{(n - T - Tk)}{(T - 1)k}.$$

Unfortunately, as with any F test based on sums of squared residuals or R-squareds, this test is not robust to heteroskedasticity (including changing variances across time). To obtain a heteroskedasticity-robust test, we must construct the interaction terms and do a pooled regression.
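The SSR form of the Chow test described above takes only a few lines to compute. This sketch uses simulated data with T = 3 periods; the data and parameter values are illustrative, and, as just noted, the test is not robust to heteroskedasticity.

```python
# Sketch: Chow test for constant slopes across T time periods, allowing
# different time intercepts (the restricted model pools with C(t) dummies).
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from scipy import stats

rng = np.random.default_rng(7)
T, n_per = 3, 150
df = pd.DataFrame({
    "t": np.repeat(np.arange(T), n_per),
    "x": rng.normal(size=T * n_per),
})
df["y"] = 1.0 + 0.5 * df["x"] + rng.normal(size=T * n_per)

k = 1                                               # slopes, excluding dummies
n = len(df)
ssr_r = smf.ols("y ~ x + C(t)", data=df).fit().ssr  # restricted: common slope
ssr_ur = sum(smf.ols("y ~ x", data=g).fit().ssr     # unrestricted: one
             for _, g in df.groupby("t"))           # regression per period

df1, df2 = (T - 1) * k, n - T - T * k
F = (ssr_r - ssr_ur) / ssr_ur * df2 / df1
pval = stats.f.sf(F, df1, df2)
print(f"F = {F:.3f}, p-value = {pval:.3f}")
```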
13-2 Policy Analysis with Pooled Cross Sections

Pooled cross sections can be very useful for evaluating the impact of a certain event or policy. The following example of an event study shows how two cross-sectional data sets, collected before and after the occurrence of an event, can be used to determine the effect on economic outcomes.

Example 13.3 Effect of a Garbage Incinerator's Location on Housing Prices

Kiel and McClain (1995) studied the effect that a new garbage incinerator had on housing values in North Andover, Massachusetts. They used many years of data and a fairly complicated econometric analysis. We will use two years of data and some simplified models, but our analysis is similar.

The rumor that a new incinerator would be built in North Andover began after 1978, and construction began in 1981. The incinerator was expected to be in operation soon after the start of construction; the incinerator actually began operating in 1985. We will use data on prices of houses that sold in 1978 and another sample on those that sold in 1981. The hypothesis is that the price of houses located near the incinerator would fall relative to the price of more distant houses.

For illustration, we define a house to be near the incinerator if it is within three miles. (In Computer Exercise C3, you are instead asked to use the actual distance from the house to the incinerator, as in Kiel and McClain (1995).) We will start by looking at the dollar effect on housing prices. This requires us to measure price in constant dollars. We measure all housing prices in 1978 dollars, using the Boston housing price index. Let rprice denote the house price in real terms.

A naive analyst would use only the 1981 data and estimate a very simple model:

$$rprice = \gamma_0 + \gamma_1 nearinc + u, \qquad (13.3)$$

where nearinc is a binary variable equal to one if the house is near the incinerator, and zero otherwise. Estimating this equation using the data in KIELMC gives

$$\widehat{rprice} = 101{,}307.5 - 30{,}688.27\,nearinc$$
$$\phantom{xxx}(3{,}093.0)\quad (5{,}827.71) \qquad (13.4)$$
$$n = 142,\ R^2 = .165.$$

Since this is a simple regression on a single dummy variable, the intercept is the average selling price for homes not near the incinerator, and the coefficient on nearinc is the difference in the average selling price between homes near the incinerator and those that are not. The estimate shows that the average selling price for the former group was $30,688.27 less than for the latter group. The t statistic is greater than five in absolute value, so we can strongly reject the hypothesis that the average values for homes near and far from the incinerator are the same.

Unfortunately, equation (13.4) does not imply that the siting of the incinerator is causing the lower housing values. In fact, if we run the same regression for 1978 (before the incinerator was even rumored), we obtain

$$\widehat{rprice} = 82{,}517.23 - 18{,}824.37\,nearinc$$
$$\phantom{xxx}(2{,}653.79)\quad (4{,}744.59) \qquad (13.5)$$
$$n = 179,\ R^2 = .082.$$

Therefore, even before there was any talk of an incinerator, the average value of a home near the site was $18,824.37 less than the average value of a home not near the site ($82,517.23); the difference is statistically significant, as well. This is consistent with the view that the incinerator was built in an area with lower housing values.

How, then, can we tell whether building a new incinerator depresses housing values? The key is to look at how the coefficient on nearinc changed between 1978 and 1981. The difference in average housing value was much larger in 1981 than in 1978 ($30,688.27 versus $18,824.37), even as a percentage of the average value of homes not near the incinerator site. The difference in the two coefficients on nearinc is

$$\hat{\delta}_1 = -30{,}688.27 - (-18{,}824.37) = -11{,}863.9.$$

This is our estimate of the effect of the incinerator on values of homes near the incinerator site. In empirical economics, $\hat{\delta}_1$ has become known as the difference-in-differences estimator because it can be expressed as

$$\hat{\delta}_1 = (\overline{rprice}_{81,nr} - \overline{rprice}_{81,fr}) - (\overline{rprice}_{78,nr} - \overline{rprice}_{78,fr}), \qquad (13.6)$$

where nr stands for "near the incinerator site" and fr stands for "farther away from the site." In other words, $\hat{\delta}_1$ is the difference over time in the average difference of housing prices in the two locations.
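The difference-in-differences estimator in (13.6) can be computed either from the four group means or, equivalently, from the interaction regression in (13.7) below. The following sketch shows both routes; the data are simulated stand-ins for the pooled 1978/1981 housing samples, so the numbers are illustrative.

```python
# Sketch: difference-in-differences from group means and from a regression.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(8)
n = 400
df = pd.DataFrame({
    "y81": rng.integers(0, 2, size=n),
    "nearinc": rng.integers(0, 2, size=n),
})
df["rprice"] = (100000 - 19000 * df["nearinc"] + 18000 * df["y81"]
                - 12000 * df["y81"] * df["nearinc"]
                + rng.normal(scale=15000, size=n))

# Route 1: the four group means, combined as in equation (13.6).
m = df.groupby(["y81", "nearinc"])["rprice"].mean()
did = (m.loc[(1, 1)] - m.loc[(1, 0)]) - (m.loc[(0, 1)] - m.loc[(0, 0)])
print(f"difference-in-differences: {did:,.0f}")

# Route 2: the interaction regression; same estimate, plus a standard error.
res = smf.ols("rprice ~ y81 + nearinc + y81:nearinc", data=df).fit()
print(res.params["y81:nearinc"], res.bse["y81:nearinc"])
```

The regression route is what the text turns to next, precisely because it delivers a standard error for $\hat{\delta}_1$.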
$$\hat{\delta}_1 = (\overline{rprice}_{81,nr} - \overline{rprice}_{81,fr}) - (\overline{rprice}_{78,nr} - \overline{rprice}_{78,fr}), \qquad (13.6)$$

where nr stands for "near the incinerator site" and fr stands for "farther away from the site." In other words, $\hat{\delta}_1$ is the difference over time in the average difference of housing prices in the two locations.

To test whether $\hat{\delta}_1$ is statistically different from zero, we need to find its standard error by using a regression analysis. In fact, $\hat{\delta}_1$ can be obtained by estimating

$$rprice = \beta_0 + \delta_0 y81 + \beta_1 nearinc + \delta_1 y81 \cdot nearinc + u, \qquad (13.7)$$

using the data pooled over both years. The intercept, β0, is the average price of a home not near the incinerator in 1978. The parameter δ0 captures changes in all housing values in North Andover from 1978 to 1981. [A comparison of equations (13.4) and (13.5) shows that housing values in North Andover, relative to the Boston housing price index, increased sharply over this period.] The coefficient on nearinc, β1, measures the location effect that is not due to the presence of the incinerator: as we saw in equation (13.5), even in 1978, homes near the incinerator site sold for less than homes farther away from the site.

The parameter of interest is on the interaction term y81·nearinc: δ1 measures the decline in housing values due to the new incinerator, provided we assume that houses both near and far from the site did not appreciate at different rates for other reasons.

The estimates of equation (13.7) are given in column (1) of Table 13.2. The only number we could not obtain from equations (13.4) and (13.5) is the standard error of $\hat{\delta}_1$. The t statistic on $\hat{\delta}_1$ is about −1.59, which is marginally significant against a one-sided alternative (p-value = .057).

Kiel and McClain (1995) included various housing characteristics in their analysis of the incinerator siting. There are two good reasons for doing this. First, the kinds of homes selling near the incinerator in 1981 might have been systematically different than those selling near the incinerator in 1978; if so, it can be important to control for such characteristics. Second, even if the relevant house characteristics did not change, including them can greatly reduce the error variance, which can then shrink the standard error of $\hat{\delta}_1$. (See Section 6-3 for discussion.)

In column (2), we control for the age of the houses, using a quadratic. This substantially increases the R-squared by reducing the residual variance. The coefficient on y81·nearinc is now much larger in magnitude, and its standard error is lower. In addition to the age variables, column (3) controls for distance to the interstate in feet (intst), land area in feet (land), house area in feet (area), number of rooms (rooms), and number of baths (baths). This produces an estimate on y81·nearinc closer to that without any controls, but it yields a much smaller standard error: the t statistic for $\hat{\delta}_1$ is about −2.84. Therefore, we find a much more significant effect in column (3) than in column (1). The column (3) estimates are preferred because they control for the most factors and have the smallest standard errors (except in the constant, which is not important here).
The fact that nearinc has a much smaller coefficient and is insignificant in column (3) indicates that the characteristics included in column (3) largely capture the housing characteristics that are most important for determining housing prices.

Table 13.2: Effects of Incinerator Location on Housing Prices. Dependent variable: rprice (standard errors in parentheses).

Independent Variable     (1)            (2)            (3)
constant                 82,517.23      89,116.54      13,807.67
                         (2,726.91)     (2,406.05)     (11,166.59)
y81                      18,790.29      21,321.04      13,928.48
                         (4,050.07)     (3,443.63)     (2,798.75)
nearinc                  −18,824.37     −9,397.94      3,780.34
                         (4,875.32)     (4,812.22)     (4,453.42)
y81·nearinc              −11,863.90     −21,920.27     −14,177.93
                         (7,456.65)     (6,359.75)     (4,987.27)
Other controls           No             age, age²      Full set
Observations             321            321            321
R-squared                .174           .414           .660

For the purpose of introducing the method, we used the level of real housing prices in Table 13.2. It makes more sense to use log(price) (or log(rprice)) in the analysis in order to get an approximate percentage effect. The basic model becomes

$$\log(price) = \beta_0 + \delta_0 y81 + \beta_1 nearinc + \delta_1 y81 \cdot nearinc + u. \qquad (13.8)$$

Now, 100·δ1 is the approximate percentage reduction in housing value due to the incinerator. [Just as in Example 13.2, using log(price) versus log(rprice) only affects the coefficient on y81.] Using the same 321 pooled observations gives

$$\widehat{\log(price)} = \underset{(.31)}{11.29} + \underset{(.045)}{.457}\, y81 - \underset{(.055)}{.340}\, nearinc - \underset{(.083)}{.063}\, y81 \cdot nearinc \qquad (13.9)$$
$$n = 321,\ R^2 = .409.$$

The coefficient on the interaction term implies that, because of the new incinerator, houses near the incinerator lost about 6.3% in value. However, this estimate is not statistically different from zero. But when we use a full set of controls, as in column (3) of Table 13.2 (but with intst, land, and area appearing in logarithmic form), the coefficient on y81·nearinc becomes −.132, with a t statistic of about −2.53. Again, controlling for other factors turns out to be important. Using the logarithmic form, we estimate that houses near the incinerator were devalued by about 13.2%.
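A regression such as (13.8) is straightforward to estimate with standard software. The following is a minimal sketch in Python, assuming the KIELMC data have been exported to a hypothetical CSV file kielmc.csv with columns price, y81, and nearinc (file and column names are illustrative, not part of the original data set):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical export of the KIELMC data set; column names assumed.
df = pd.read_csv("kielmc.csv")

# Pooled OLS with a year dummy, a location dummy, and their interaction,
# as in equation (13.8); the interaction coefficient is the DiD estimate.
res = smf.ols("np.log(price) ~ y81 + nearinc + y81:nearinc", data=df).fit()
print(res.summary())

# 100 * coefficient on the interaction is the approximate percentage effect.
print(100 * res.params["y81:nearinc"])
```

The point of the interaction specification is that its standard error comes free with the regression, which is exactly why (13.7) and (13.8) are preferred to computing the four averages by hand.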
The methodology used in the previous example has numerous applications, especially when the data arise from a natural experiment (or a quasi-experiment). A natural experiment occurs when some exogenous event (often a change in government policy) changes the environment in which individuals, families, firms, or cities operate. A natural experiment always has a control group, which is not affected by the policy change, and a treatment group, which is thought to be affected by the policy change. Unlike a true experiment, in which treatment and control groups are randomly and explicitly chosen, the control and treatment groups in natural experiments arise from the particular policy change. To control for systematic differences between the control and treatment groups, we need two years of data, one before the policy change and one after the change. Thus, our sample is usefully broken down into four groups: the control group before the change, the control group after the change, the treatment group before the change, and the treatment group after the change.

Call C the control group and T the treatment group, letting dT equal unity for those in the treatment group T, and zero otherwise. Then, letting d2 denote a dummy variable for the second (post-policy-change) time period, the equation of interest is

$$y = \beta_0 + \delta_0 d2 + \beta_1 dT + \delta_1 d2 \cdot dT + \text{other factors}, \qquad (13.10)$$

where y is the outcome variable of interest. As in Example 13.3, δ1 measures the effect of the policy. Without other factors in the regression, $\hat{\delta}_1$ will be the difference-in-differences estimator:

$$\hat{\delta}_1 = (\bar{y}_{2,T} - \bar{y}_{2,C}) - (\bar{y}_{1,T} - \bar{y}_{1,C}), \qquad (13.11)$$

where the bar denotes average, the first subscript denotes the year, and the second subscript denotes the group. The general difference-in-differences setup is shown in Table 13.3.

Table 13.3: Illustration of the Difference-in-Differences Estimator

                        Before        After                     After − Before
Control                 β0            β0 + δ0                   δ0
Treatment               β0 + β1       β0 + δ0 + β1 + δ1         δ0 + δ1
Treatment − Control     β1            β1 + δ1                   δ1

Table 13.3 suggests that the parameter δ1, sometimes called the average treatment effect (because it measures the effect of the "treatment" or policy on the average outcome of y), can be estimated in two ways: (1) Compute the differences in averages between the treatment and control groups in each time period, and then difference the results over time; this is just as in equation (13.11). (2) Compute the change in averages over time for each of the treatment and control groups, and then difference these changes, which means we simply write

$$\hat{\delta}_1 = (\bar{y}_{2,T} - \bar{y}_{1,T}) - (\bar{y}_{2,C} - \bar{y}_{1,C}).$$

Naturally, the estimate $\hat{\delta}_1$ does not depend on how we do the differencing, as is seen by simple rearrangement.
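Equation (13.11) can also be computed directly from the four group averages. The fragment below is a small sketch assuming a DataFrame with an outcome column y, a period dummy d2, and a treatment dummy dT (all names illustrative); it makes the rearrangement argument concrete:

```python
import pandas as pd

def did_estimate(df: pd.DataFrame) -> float:
    """Difference-in-differences from the four group means, as in eq. (13.11)."""
    means = df.groupby(["d2", "dT"])["y"].mean()
    gap_after = means[(1, 1)] - means[(1, 0)]    # treatment-control gap, period 2
    gap_before = means[(0, 1)] - means[(0, 0)]   # treatment-control gap, period 1
    return gap_after - gap_before

# Differencing the other way, (change for T) - (change for C), rearranges to
# exactly the same number, so either ordering of the subtraction can be used.
```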
When explanatory variables are added to equation (13.10) to control for the fact that the populations sampled may differ systematically over the two periods, the OLS estimate of δ1 no longer has the simple form of (13.11), but its interpretation is similar.

Example 13.4: Effect of Worker Compensation Laws on Weeks out of Work

Meyer, Viscusi, and Durbin (1995) (hereafter, MVD) studied the length of time (in weeks) that an injured worker receives workers' compensation. On July 15, 1980, Kentucky raised the cap on weekly earnings that were covered by workers' compensation. An increase in the cap has no effect on the benefit for low-income workers, but it makes it less costly for a high-income worker to stay on workers' compensation. Therefore, the control group is low-income workers, and the treatment group is high-income workers; high-income workers are defined as those who were subject to the pre-policy-change cap. Using random samples both before and after the policy change, MVD were able to test whether more generous workers' compensation causes people to stay out of work longer (everything else fixed). They started with a difference-in-differences analysis, using log(durat) as the dependent variable. Let afchnge be the dummy variable for observations after the policy change and highearn the dummy variable for high earners. Using the data in INJURY, the estimated equation, with standard errors in parentheses, is

$$\widehat{\log(durat)} = \underset{(0.031)}{1.126} + \underset{(.0447)}{.0077}\, afchnge + \underset{(.047)}{.256}\, highearn + \underset{(.069)}{.191}\, afchnge \cdot highearn \qquad (13.12)$$
$$n = 5{,}626,\ R^2 = .021.$$

Therefore, $\hat{\delta}_1 = .191$ (t = 2.77), which implies that the average length of time on workers' compensation for high earners increased by about 19% due to the increased earnings cap. The coefficient on afchnge is small and statistically insignificant: as is expected, the increase in the earnings cap has no effect on duration for low-income workers.

This is a good example of how we can get a fairly precise estimate of the effect of a policy change even though we cannot explain much of the variation in the dependent variable. The dummy variables in (13.12) explain only 2.1% of the variation in log(durat). This makes sense: there are clearly many factors, including severity of the injury, that affect how long someone receives workers' compensation. Fortunately, we have a very large sample size, and this allows us to get a significant t statistic.

MVD also added a variety of controls for gender, marital status, age, industry, and type of injury. This allows for the fact that the kinds of people and types of injuries may differ systematically by earnings group across the two years. Controlling for these factors turns out to have little effect on the estimate of δ1. (See Computer Exercise C4.)

Sometimes, the two groups consist of people living in two neighboring states in the United States. For example, to assess the impact of changing cigarette taxes on cigarette consumption, we can obtain random samples from two states for two years. In State A, the control group, there was no change in the cigarette tax. In State B, the treatment group, the tax increased (or decreased) between the two years. The outcome variable would be a measure of cigarette consumption, and equation (13.10) can be estimated to determine the effect of the tax on cigarette consumption. For an interesting survey on natural experiment methodology and several additional examples, see Meyer (1995).

13-3 Two-Period Panel Data Analysis

We now turn to the analysis of the simplest kind of panel data: for a cross section of individuals, schools, firms, cities, or whatever, we have two years of data; call these t = 1 and t = 2. These years need not be adjacent, but t = 1 corresponds to the earlier year. For example, the file CRIME2 contains data on (among other things) crime and unemployment rates for 46 cities for 1982 and 1987. Therefore, t = 1 corresponds to 1982, and t = 2 corresponds to 1987.

What happens if we use the 1987 cross section and run a simple regression of crmrte on unem? We obtain

$$\widehat{crmrte} = \underset{(20.76)}{128.38} - \underset{(3.42)}{4.16}\, unem$$
$$n = 46,\ R^2 = .033.$$

If we interpret the estimated equation causally, it implies that an increase in the unemployment rate lowers the crime rate. This is certainly not what we expect. The coefficient on unem is not statistically significant at standard significance levels: at best, we have found no link between crime and unemployment rates. As we have emphasized throughout this text, this simple regression equation likely suffers from omitted variable problems. One possible solution is to try to control for more factors, such as age distribution, gender distribution, education levels, law enforcement efforts, and so on, in a multiple regression analysis. But many factors might be hard to control for. In Chapter 9, we showed how including the crmrte from a previous year (in this case, 1982) can help to control for the fact that different cities have historically
different crime rates. This is one way to use two years of data for estimating a causal effect.

An alternative way to use panel data is to view the unobserved factors affecting the dependent variable as consisting of two types: those that are constant and those that vary over time. Letting i denote the cross-sectional unit and t the time period, we can write a model with a single observed explanatory variable as

$$y_{it} = \beta_0 + \delta_0 d2_t + \beta_1 x_{it} + a_i + u_{it}, \qquad t = 1, 2. \qquad (13.13)$$

In the notation y_it, i denotes the person, firm, city, and so on, and t denotes the time period. The variable d2_t is a dummy variable that equals zero when t = 1 and one when t = 2; it does not change across i, which is why it has no i subscript. Therefore, the intercept for t = 1 is β0, and the intercept for t = 2 is β0 + δ0. Just as in using independently pooled cross sections, allowing the intercept to change over time is important in most applications. In the crime example, secular trends in the United States will cause crime rates in all U.S. cities to change, perhaps markedly, over a five-year period.

The variable a_i captures all unobserved, time-constant factors that affect y_it. (The fact that a_i has no t subscript tells us that it does not change over time.) Generically, a_i is called an unobserved effect. It is also common in applied work to find a_i referred to as a fixed effect, which helps us to remember that a_i is fixed over time. The model in (13.13) is called an unobserved effects model or a fixed effects model. In applications, you might see a_i referred to as unobserved heterogeneity as well (or individual heterogeneity, firm heterogeneity, city heterogeneity, and so on).

Exploring Further 13.2: What do you make of the coefficient and t statistic on highearn in equation (13.12)?

The error u_it is often called the idiosyncratic error or time-varying error, because it represents unobserved factors that change over time and affect y_it. These are very much like the errors in a straight time series regression equation.

A simple unobserved effects model for city crime rates for 1982 and 1987 is

$$crmrte_{it} = \beta_0 + \delta_0 d87_t + \beta_1 unem_{it} + a_i + u_{it}, \qquad (13.14)$$

where d87 is a dummy variable for 1987. Since i denotes different cities, we call a_i an unobserved city effect or a city fixed effect: it represents all factors affecting city crime rates that do not change over time. Geographical features, such as the city's location in the United States, are included in a_i. Many other factors may not be exactly constant, but they might be roughly constant over a five-year period. These might include certain demographic features of the population (age, race, and education). Different cities may have their own methods for reporting crimes, and the people living in the cities might have different attitudes toward crime; these are typically slow to change. For historical reasons, cities can have very different crime rates, and historical factors are effectively captured by the unobserved effect a_i.

How should we estimate the parameter of interest, β1, given two years of panel data? One possibility is just to pool the two
years and use OLS, essentially as in Section 13-1. This method has two drawbacks. The most important of these is that, in order for pooled OLS to produce a consistent estimator of β1, we would have to assume that the unobserved effect, a_i, is uncorrelated with x_it. We can easily see this by writing (13.13) as

$$y_{it} = \beta_0 + \delta_0 d2_t + \beta_1 x_{it} + v_{it}, \qquad t = 1, 2, \qquad (13.15)$$

where v_it = a_i + u_it is often called the composite error. From what we know about OLS, we must assume that v_it is uncorrelated with x_it, where t = 1 or 2, for OLS to estimate β1 (and the other parameters) consistently. This is true whether we use a single cross section or pool the two cross sections. Therefore, even if we assume that the idiosyncratic error u_it is uncorrelated with x_it, pooled OLS is biased and inconsistent if a_i and x_it are correlated. The resulting bias in pooled OLS is sometimes called heterogeneity bias, but it is really just bias caused from omitting a time-constant variable.

To illustrate what happens, we use the data in CRIME2 to estimate (13.14) by pooled OLS. Since there are 46 cities and two years for each city, there are 92 total observations:

$$\widehat{crmrte} = \underset{(12.74)}{93.42} + \underset{(7.98)}{7.94}\, d87 + \underset{(1.188)}{.427}\, unem \qquad (13.16)$$
$$n = 92,\ R^2 = .012.$$

(When reporting the estimated equation, we usually drop the i and t subscripts.) The coefficient on unem, though positive in (13.16), has a very small t statistic. Thus, using pooled OLS on the two years has not substantially changed anything from using a single cross section. This is not surprising, since using pooled OLS does not solve the omitted variables problem. (The standard errors in this equation are incorrect because of the serial correlation described in Question 13.3, but we ignore this since pooled OLS is not the focus here.)

Exploring Further 13.3: Suppose that a_i, u_i1, and u_i2 have zero means and are pairwise uncorrelated. Show that Cov(v_i1, v_i2) = Var(a_i), so that the composite errors are positively serially correlated across time, unless a_i = 0. What does this imply about the usual OLS standard errors from pooled OLS estimation?

In most applications, the main reason for collecting panel data is to allow for the unobserved effect, a_i, to be correlated with the explanatory variables. For example, in the crime equation, we want to allow the unmeasured city factors in a_i that affect the crime rate also to be correlated with the unemployment rate. It turns out that this is simple to allow: because a_i is constant over time, we can difference the data across the two years. More precisely, for a cross-sectional observation i, write the two years as

$$y_{i2} = (\beta_0 + \delta_0) + \beta_1 x_{i2} + a_i + u_{i2} \qquad (t = 2)$$
$$y_{i1} = \beta_0 + \beta_1 x_{i1} + a_i + u_{i1} \qquad (t = 1).$$

If we subtract the second equation from the first, we obtain

$$(y_{i2} - y_{i1}) = \delta_0 + \beta_1 (x_{i2} - x_{i1}) + (u_{i2} - u_{i1}),$$

or

$$\Delta y_i = \delta_0 + \beta_1 \Delta x_i + \Delta u_i, \qquad (13.17)$$

where Δ denotes the change from t = 1 to t = 2. The unobserved effect, a_i, does not appear in (13.17): it has been "differenced away." Also, the intercept in (13.17) is actually the change in the intercept from t = 1 to t = 2.

Equation (13.17), which we call the first-differenced equation, is very simple. It is just a single cross-sectional equation, but each
variable is differenced over time. We can analyze (13.17) using the methods we developed in Part 1, provided the key assumptions are satisfied. The most important of these is that Δu_i is uncorrelated with Δx_i. This assumption holds if the idiosyncratic error at each time t, u_it, is uncorrelated with the explanatory variable in both time periods. This is another version of the strict exogeneity assumption that we encountered in Chapter 10 for time series models. In particular, this assumption rules out the case where x_it is the lagged dependent variable, y_{i,t−1}. Unlike in Chapter 10, we allow x_it to be correlated with unobservables that are constant over time. When we obtain the OLS estimator of β1 from (13.17), we call the resulting estimator the first-differenced estimator.

In the crime example, assuming that Δu_i and Δunem_i are uncorrelated may be reasonable, but it can also fail. For example, suppose that law enforcement effort (which is in the idiosyncratic error) increases more in cities where the unemployment rate decreases. This can cause negative correlation between Δu_i and Δunem_i, which would then lead to bias in the OLS estimator. Naturally, this problem can be overcome to some extent by including more factors in the equation, something we will cover later. As usual, it is always possible that we have not accounted for enough time-varying factors.

Another crucial condition is that Δx_i must have some variation across i. This qualification fails if the explanatory variable does not change over time for any cross-sectional observation, or if it changes by the same amount for every observation. This is not an issue in the crime rate example because the unemployment rate changes across time for almost all cities. But, if i denotes an individual and x_it is a dummy variable for gender, Δx_i = 0 for all i; we clearly cannot estimate (13.17) by OLS in this case. This actually makes perfectly good sense: since we allow a_i to be correlated with x_it, we cannot hope to separate the effect of a_i on y_it from the effect of any variable that does not change over time.

The only other assumption we need to apply to the usual OLS statistics is that (13.17) satisfies the homoskedasticity assumption. This is reasonable in many cases, and, if it does not hold, we know how to test and correct for heteroskedasticity using the methods in Chapter 8. It is sometimes fair to assume that (13.17) fulfills all of the classical linear model assumptions. The OLS estimators are unbiased, and all statistical inference is exact in such cases.

When we estimate (13.17) for the crime rate example, we get

$$\widehat{\Delta crmrte} = \underset{(4.70)}{15.40} + \underset{(.88)}{2.22}\, \Delta unem \qquad (13.18)$$
$$n = 46,\ R^2 = .127,$$

which now gives a positive, statistically significant relationship between the crime and unemployment rates. Thus, differencing to eliminate time-constant effects makes a big difference in this example. The intercept in (13.18) also reveals something interesting. Even if Δunem = 0, we predict an increase in the crime rate (crimes per 1,000 people) of 15.40. This reflects a secular increase in crime rates throughout the United States from 1982 to 1987.
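The first-differenced estimator is easy to compute once the two records per city are lined up. The sketch below assumes a long-format DataFrame read from a hypothetical CRIME2-style file with columns city, year, crmrte, and unem (names are illustrative); it constructs the within-city changes and runs the cross-sectional regression (13.17):

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical long-format panel: two rows per city, sorted by year.
df = pd.read_csv("crime2.csv").sort_values(["city", "year"])

# Within-city changes; each city's first year becomes NaN.
df[["dcrmrte", "dunem"]] = df.groupby("city")[["crmrte", "unem"]].diff()

# One differenced observation per city remains after dropping the NaNs.
fd = df.dropna(subset=["dcrmrte", "dunem"])
res = smf.ols("dcrmrte ~ dunem", data=fd).fit()
print(res.summary())
```

Differencing inside groupby is the design choice that matters here: it guarantees that a change is never computed across two different cities.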
Even if we do not begin with the unobserved effects model (13.13), using differences across time makes intuitive sense. Rather than estimating a standard cross-sectional relationship (which may suffer from omitted variables, thereby making ceteris paribus conclusions difficult), equation (13.17) explicitly considers how changes in the explanatory variable over time affect the change in y over the same time period. Nevertheless, it is still very useful to have (13.13) in mind: it explicitly shows that we can estimate the effect of x_it on y_it, holding a_i fixed.

Although differencing two years of panel data is a powerful way to control for unobserved effects, it is not without cost. First, panel data sets are harder to collect than a single cross section, especially for individuals. We must use a survey and keep track of the individual for a follow-up survey. It is often difficult to locate some people for a second survey. For units such as firms, some will go bankrupt or merge with other firms. Panel data are much easier to obtain for schools, cities, counties, states, and countries.

Even if we have collected a panel data set, the differencing used to eliminate a_i can greatly reduce the variation in the explanatory variables. While x_it frequently has substantial variation in the cross section for each t, Δx_i may not have much variation. We know from Chapter 3 that a little variation in Δx_i can lead to a large standard error for β̂1 when estimating (13.17) by OLS. We can combat this by using a large cross section, but this is not always possible. Also, using longer differences over time is sometimes better than using year-to-year changes.

As an example, consider the problem of estimating the return to education, now using panel data on individuals for two years. The model for person i is

$$\log(wage_{it}) = \beta_0 + \delta_0 d2_t + \beta_1 educ_{it} + a_i + u_{it}, \qquad t = 1, 2,$$

where a_i contains unobserved ability, which is probably correlated with educ_it. Again, we allow different intercepts across time to account for aggregate productivity gains (and inflation, if wage_it is in nominal terms). Since, by definition, innate ability does not change over time, panel data methods seem ideally suited to estimate the return to education. The equation in first differences is

$$\Delta \log(wage_i) = \delta_0 + \beta_1 \Delta educ_i + \Delta u_i, \qquad (13.19)$$

and we can estimate this by OLS. The problem is that we are interested in working adults, and, for most employed individuals, education does not change over time. If only a small fraction of our sample has Δeduc_i different from zero, it will be difficult to get a precise estimator of β1 from (13.19), unless we have a rather large sample size. In theory, using a first-differenced equation to estimate the return to education is a good idea, but it does not work very well with most currently available panel data sets.

Adding several explanatory variables causes no difficulties. We begin with the unobserved effects model

$$y_{it} = \beta_0 + \delta_0 d2_t + \beta_1 x_{it1} + \beta_2 x_{it2} + \cdots + \beta_k x_{itk} + a_i + u_{it}, \qquad (13.20)$$

for t = 1 and 2. This equation looks more complicated than it is because each explanatory variable has three subscripts. The first denotes the cross-sectional observation number, the second denotes the time period, and the third is just a variable label.

Example 13.5: Sleeping versus Working

We use the two years of panel data in SLP7581, from Biddle and Hamermesh (1990), to estimate the tradeoff between sleeping and working. In Problem 3 in Chapter 3, we used just the 1975 cross section. The panel data set for 1975 and 1981 has 239 people, which is much smaller than the 1975 cross
section that includes over 700 people. An unobserved effects model for total minutes of sleeping per week is

$$slpnap_{it} = \beta_0 + \delta_0 d81_t + \beta_1 totwrk_{it} + \beta_2 educ_{it} + \beta_3 marr_{it} + \beta_4 yngkid_{it} + \beta_5 gdhlth_{it} + a_i + u_{it}, \qquad t = 1, 2.$$

The unobserved effect, a_i, would be called an unobserved individual effect or an individual fixed effect. It is potentially important to allow a_i to be correlated with totwrk_it: the same factors (some biological) that cause people to sleep more or less (captured in a_i) are likely correlated with the amount of time spent working. Some people just have more energy, and this causes them to sleep less and work more. The variable educ is years of education, marr is a marriage dummy variable, yngkid is a dummy variable indicating the presence of a small child, and gdhlth is a "good health" dummy variable. Notice that we do not include gender or race (as we did in the cross-sectional analysis), since these do not change over time; they are part of a_i. Our primary interest is in β1.

Differencing across the two years gives the estimable equation

$$\Delta slpnap_i = \delta_0 + \beta_1 \Delta totwrk_i + \beta_2 \Delta educ_i + \beta_3 \Delta marr_i + \beta_4 \Delta yngkid_i + \beta_5 \Delta gdhlth_i + \Delta u_i.$$

Assuming that the change in the idiosyncratic error, Δu_i, is uncorrelated with the changes in all explanatory variables, we can get consistent estimators using OLS. This gives

$$\widehat{\Delta slpnap} = -\underset{(45.87)}{92.63} - \underset{(.036)}{.227}\, \Delta totwrk - \underset{(48.759)}{.024}\, \Delta educ + \underset{(92.86)}{104.21}\, \Delta marr + \underset{(87.65)}{94.67}\, \Delta yngkid + \underset{(76.60)}{87.58}\, \Delta gdhlth \qquad (13.21)$$
$$n = 239,\ R^2 = .150.$$

The coefficient on totwrk indicates a tradeoff between sleeping and working: holding other factors fixed, one more hour of work is associated with .227(60) = 13.62 fewer minutes of sleeping. The t statistic (−6.31) is very significant. No other estimates, except the intercept, are statistically different from zero. The F test for joint significance of all variables except totwrk gives p-value = .49, which means they are jointly insignificant at any reasonable significance level and could be dropped from the equation.

The standard error on educ is especially large relative to the estimate. This is the phenomenon described earlier for the wage equation. In the sample of 239 people, 183 (76.6%) have no change in education over the six-year period; 90% of the people have a change in education of at most one year. As reflected by the extremely large standard error of β̂2, there is not nearly enough variation in education to estimate β2 with any precision. Anyway, β̂2 is practically very small.
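With several differenced regressors, the joint test reported in the example is an ordinary F test on the first-differenced equation. A minimal sketch, assuming the SLP7581 changes have been saved to a hypothetical file whose columns dslpnap, dtotwrk, deduc, dmarr, dyngkid, and dgdhlth hold the 1975-to-1981 differences (all names assumed):

```python
import pandas as pd
import statsmodels.formula.api as smf

fd = pd.read_csv("slp7581_diffs.csv")  # hypothetical file of person-level changes

res = smf.ols(
    "dslpnap ~ dtotwrk + deduc + dmarr + dyngkid + dgdhlth", data=fd
).fit()

# Joint F test that everything except dtotwrk (and the intercept) is zero.
print(res.f_test("deduc = 0, dmarr = 0, dyngkid = 0, dgdhlth = 0"))
```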
Panel data can also be used to estimate finite distributed lag models. Even if we specify the equation for only two years, we need to collect more years of data to obtain the lagged explanatory variables. The following is a simple example.

Example 13.6: Distributed Lag of Crime Rate on Clear-Up Rate

Eide (1994) uses panel data from police districts in Norway to estimate a distributed lag model for crime rates. The single explanatory variable is the "clear-up percentage" (clrprc), the percentage of crimes that led to a conviction. The crime rate data are from the years 1972 and 1978. Following Eide, we lag clrprc for one and two years: it is likely that past clear-up rates have a deterrent effect on current crime. This leads to the following unobserved effects model for the two years:

$$\log(crime_{it}) = \beta_0 + \delta_0 d78_t + \beta_1 clrprc_{i,t-1} + \beta_2 clrprc_{i,t-2} + a_i + u_{it}.$$

When we difference the equation and estimate it using the data in CRIME3, we get

$$\widehat{\Delta \log(crime)} = \underset{(.064)}{.086} - \underset{(.0047)}{.0040}\, \Delta clrprc_{-1} - \underset{(.0052)}{.0132}\, \Delta clrprc_{-2} \qquad (13.22)$$
$$n = 53,\ R^2 = .193,\ \bar{R}^2 = .161.$$

The second lag is negative and statistically significant, which implies that a higher clear-up percentage two years ago would deter crime this year. In particular, a 10 percentage point increase in clrprc two years ago would lead to an estimated 13.2% drop in the crime rate this year. This suggests that using more resources for solving crimes and obtaining convictions can reduce crime in the future.

13-3a Organizing Panel Data

In using panel data in an econometric study, it is important to know how the data should be stored. We must be careful to arrange the data so that the different time periods for the same cross-sectional unit (person, firm, city, and so on) are easily linked. For concreteness, suppose that the data set is on cities for two different years. For most purposes, the best way to enter the data is to have two records for each city, one for each year: the first record for each city corresponds to the early year, and the second record is for the later year. These two records should be adjacent. Therefore, a data set for 100 cities and two years will contain 200 records. The first two records are for the first city in the sample, the next two records are for the second city, and so on. (See Table 1.5 in Chapter 1 for an example.) This makes it easy to construct the differences, to store these in the second record for each city, and to do a pooled cross-sectional analysis, which can be compared with the differencing estimation.

Most of the two-period panel data sets accompanying this text are stored in this way (for example, CRIME2, CRIME3, GPA3, LOWBRTH, and RENTAL). We use a direct extension of this scheme for panel data sets with more than two time periods.

A second way of organizing two periods of panel data is to have only one record per cross-sectional unit. This requires two entries for each variable, one for each time period. The panel data in SLP7581 are organized in this way. Each individual has data on the variables slpnap75, slpnap81, totwrk75, totwrk81, and so on. Creating the differences from 1975 to 1981 is easy. Other panel data sets with this structure are TRAFFIC1 and VOTE2. Putting the data in one record, however, does not allow a pooled OLS analysis using the two time periods on the original data. Also, this organizational method does not work for panel data sets with more than two time periods, a case we will consider in Section 13-5.
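The two storage schemes are easy to move between with modern software. The sketch below, assuming a wide-format file with hypothetical columns slpnap75, slpnap81, totwrk75, and totwrk81, reshapes one record per person into two adjacent records per person and then builds the within-person differences, leaving the first year missing so no "bogus" observations enter a differenced regression:

```python
import pandas as pd

wide = pd.read_csv("slp7581.csv")            # hypothetical wide-format file
wide["person"] = range(len(wide))            # one identifier per record

# Wide (one record per person) to long (two adjacent records per person).
long = pd.wide_to_long(
    wide, stubnames=["slpnap", "totwrk"], i="person", j="year"
).reset_index().sort_values(["person", "year"])

# Within-person changes; the first year of each person is NaN by design.
long[["dslpnap", "dtotwrk"]] = long.groupby("person")[["slpnap", "totwrk"]].diff()
```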
13-4 Policy Analysis with Two-Period Panel Data

Panel data sets are very useful for policy analysis and, in particular, program evaluation. In the simplest program evaluation setup, a sample of individuals, firms, cities, and so on, is obtained in the first time period. Some of these units, those in the treatment group, then take part in a particular program in a later time period; the ones that do not are the control group. This is similar to the natural experiment literature discussed earlier, with one important difference: the same cross-sectional units appear in each time period.

As an example, suppose we wish to evaluate the effect of a Michigan job training program on worker productivity of manufacturing firms (see also Computer Exercise C3 in Chapter 9). Let scrap_it denote the scrap rate of firm i during year t (the number of items, per 100, that must be scrapped due to defects). Let grant_it be a binary indicator equal to one if firm i in year t received a job training grant. For the years 1987 and 1988, the model is

$$scrap_{it} = \beta_0 + \delta_0 y88_t + \beta_1 grant_{it} + a_i + u_{it}, \qquad t = 1, 2, \qquad (13.23)$$

where y88_t is a dummy variable for 1988 and a_i is the unobserved firm effect or the firm fixed effect. The unobserved effect contains such factors as average employee ability, capital, and managerial skill; these are roughly constant over a two-year period. We are concerned about a_i being systematically related to whether a firm receives a grant. For example, administrators of the program might give priority to firms whose workers have lower skills. Or, the opposite problem could occur: to make the job training program appear effective, administrators may give the grants to employers with more productive workers. Actually, in this particular program, grants were awarded on a first-come, first-served basis. But whether a firm applied early for a grant could be correlated with worker productivity. In that case, an analysis using a single cross section or just a pooling of the cross sections will produce biased and inconsistent estimators.

Differencing to remove a_i gives

$$\Delta scrap_i = \delta_0 + \beta_1 \Delta grant_i + \Delta u_i. \qquad (13.24)$$

Therefore, we simply regress the change in the scrap rate on the change in the grant indicator. Because no firms received grants in 1987, grant_i1 = 0 for all i, and so Δgrant_i = grant_i2 − grant_i1 = grant_i2, which simply indicates whether the firm received a grant in 1988. However, it is generally important to difference all variables (dummy variables included) because this is necessary for removing a_i in the unobserved effects model (13.23).

Estimating the first-differenced equation using the data in JTRAIN gives

$$\widehat{\Delta scrap} = -\underset{(.405)}{.564} - \underset{(.683)}{.739}\, \Delta grant$$
$$n = 54,\ R^2 = .022.$$

Therefore, we estimate that having a job training grant lowered the scrap rate, on average, by .739. But the estimate is not statistically different from zero.

We get stronger results by using log(scrap) and estimating the percentage effect:

$$\widehat{\Delta \log(scrap)} = -\underset{(.097)}{.057} - \underset{(.164)}{.317}\, \Delta grant$$
$$n = 54,\ R^2 = .067.$$

Having a job training grant is estimated to lower the scrap rate by about 27.2%. [We obtain this estimate from equation (7.10): exp(−.317) − 1 ≈ −.272.] The t statistic is about −1.93, which is marginally significant. By contrast, using pooled OLS of log(scrap) on y88 and grant gives β̂1 = .057 (standard error = .431). Thus, we find no significant relationship between the scrap rate and the job training grant. Since this differs so much from the first-difference estimates, it suggests that firms that have lower-ability workers are more likely to receive a grant.
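Because grant is a dummy variable, the percentage interpretation of the log equation uses the exact transformation from equation (7.10). A small sketch, assuming the 1987-to-1988 JTRAIN changes are available under the hypothetical column names dlscrap and dgrant:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

fd = pd.read_csv("jtrain_diffs.csv")       # hypothetical file of firm-level changes
res = smf.ols("dlscrap ~ dgrant", data=fd).fit()

# Exact percentage effect of a grant on the scrap rate, as in eq. (7.10).
pct_effect = 100 * (np.exp(res.params["dgrant"]) - 1)
print(f"estimated effect of a grant: {pct_effect:.1f}%")
```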
It is useful to study the program evaluation model more generally. Let y_it denote an outcome variable and let prog_it be a program participation dummy variable. The simplest unobserved effects model is

$$y_{it} = \beta_0 + \delta_0 d2_t + \beta_1 prog_{it} + a_i + u_{it}. \qquad (13.25)$$

If program participation only occurred in the second period, then the OLS estimator of β1 in the differenced equation has a very simple representation:

$$\hat{\beta}_1 = \overline{\Delta y}_{treat} - \overline{\Delta y}_{control}. \qquad (13.26)$$

That is, we compute the average change in y over the two time periods for the treatment and control groups. Then, β̂1 is the difference of these. This is the panel data version of the difference-in-differences estimator in equation (13.11) for two pooled cross sections. With panel data, we have a potentially important advantage: we can difference y across time for the same cross-sectional units. This allows us to control for person-, firm-, or city-specific effects, as the model in (13.25) makes clear.

If program participation takes place in both periods, β̂1 cannot be written as in (13.26), but we interpret it in the same way: it is the change in the average value of y due to program participation.

Controlling for time-varying factors does not change anything of significance. We simply difference those variables and include them along with Δprog. This allows us to control for time-varying variables that might be correlated with program designation.

The same differencing method works for analyzing the effects of any policy that varies across city or state. The following is a simple example.

Example 13.7: Effect of Drunk Driving Laws on Traffic Fatalities

Many states in the United States have adopted different policies in an attempt to curb drunk driving. Two types of laws that we will study here are open container laws, which make it illegal for passengers to have open containers of alcoholic beverages, and administrative per se laws, which allow courts to suspend licenses after a driver is arrested for drunk driving but before the driver is convicted. One possible analysis is to use a single cross section of states to regress driving fatalities (or those related to drunk driving) on dummy variable indicators for whether each law is present. This is unlikely to work well because states decide, through legislative processes, whether they need such laws. Therefore, the presence of laws is likely to be related to the average drunk driving fatalities in recent years. A more convincing analysis uses panel data over a time period where some states adopted new laws (and some states may have repealed existing laws). The file TRAFFIC1 contains data for 1985 and 1990 for all 50 states and the District of Columbia. The dependent variable is the number of traffic deaths per 100 million miles driven (dthrte). In 1985, 19 states had open container laws, while 22 states had such laws in 1990. In 1985, 21 states had per se laws; the number had grown to 29 by 1990.

Using OLS after first differencing gives

$$\widehat{\Delta dthrte} = -\underset{(.052)}{.497} - \underset{(.206)}{.420}\, \Delta open - \underset{(.117)}{.151}\, \Delta admn \qquad (13.27)$$
$$n = 51,\ R^2 = .119.$$
The estimates suggest that adopting an open container law lowered the traffic fatality rate by .42, a nontrivial effect given that the average death rate in 1985 was 2.7, with a standard deviation of about .6. The estimate is statistically significant at the 5% level against a two-sided alternative. The administrative per se law has a smaller effect, and its t statistic is only −1.29; but the estimate is the sign we expect. The intercept in this equation shows that traffic fatalities fell substantially for all states over the five-year period, whether or not there were any law changes. The states that adopted an open container law over this period saw a further drop, on average, in fatality rates.

Other laws might also affect traffic fatalities, such as seat belt laws, motorcycle helmet laws, and maximum speed limits. In addition, we might want to control for age and gender distributions, as well as measures of how influential an organization such as Mothers Against Drunk Driving is in each state.

Exploring Further 13.4: In Example 13.7, Δadmn = −1 for the state of Washington. Explain what this means.

13-5 Differencing with More Than Two Time Periods

We can also use differencing with more than two time periods. For illustration, suppose we have N individuals and T = 3 time periods for each individual. A general fixed effects model is

$$y_{it} = \delta_1 + \delta_2 d2_t + \delta_3 d3_t + \beta_1 x_{it1} + \cdots + \beta_k x_{itk} + a_i + u_{it}, \qquad (13.28)$$

for t = 1, 2, and 3. (The total number of observations is therefore 3N.) Notice that we now include two time-period dummies in addition to the intercept. It is a good idea to allow a separate intercept for each time period, especially when we have a small number of them. The base period, as always, is t = 1. The intercept for the second time period is δ1 + δ2, and so on. We are primarily interested in β1, β2, ..., βk. If the unobserved effect a_i is correlated with any of the explanatory variables, then using pooled OLS on the three years of data results in biased and inconsistent estimates.

The key assumption is that the idiosyncratic errors are uncorrelated with the explanatory variable in each time period:

$$\mathrm{Cov}(x_{itj}, u_{is}) = 0, \qquad \text{for all } t,\ s,\ \text{and } j. \qquad (13.29)$$

That is, the explanatory variables are strictly exogenous after we take out the unobserved effect, a_i. (The strict exogeneity assumption stated in terms of a zero conditional expectation is given in the chapter appendix.) Assumption (13.29) rules out cases where future explanatory variables react to current changes in the idiosyncratic errors, as must be the case if x_itj is a lagged dependent variable. If we have omitted an important time-varying variable, then (13.29) is generally violated. Measurement error in one or more explanatory variables can cause (13.29) to be false, just as in Chapter 9. In Chapters 15 and 16, we will discuss what can be done in such cases.

If a_i is correlated with x_itj, then x_itj will be correlated with the composite error, v_it = a_i + u_it, under (13.29). We can eliminate a_i by differencing adjacent periods. In the T = 3 case, we subtract time period one from time period two and time period two from time period three. This gives

$$\Delta y_{it} = \delta_2 \Delta d2_t + \delta_3 \Delta d3_t + \beta_1 \Delta x_{it1} + \cdots + \beta_k \Delta x_{itk} + \Delta u_{it}, \qquad (13.30)$$
for t = 2 and 3. We do not have a differenced equation for t = 1 because there is nothing to subtract from the t = 1 equation. Now, (13.30) represents two time periods for each individual in the sample. If this equation satisfies the classical linear model assumptions, then pooled OLS gives unbiased estimators, and the usual t and F statistics are valid for hypothesis testing. We can also appeal to asymptotic results. The important requirement for OLS to be consistent is that Δu_it is uncorrelated with Δx_itj for all j and t = 2 and 3. This is the natural extension from the two time period case.

Notice how (13.30) contains the differences in the year dummies, d2_t and d3_t. For t = 2, Δd2_t = 1 and Δd3_t = 0; for t = 3, Δd2_t = −1 and Δd3_t = 1. Therefore, (13.30) does not contain an intercept. This is inconvenient for certain purposes, including the computation of R-squared. Unless the time intercepts in the original model (13.28) are of direct interest (they rarely are), it is better to estimate the first-differenced equation with an intercept and a single time-period dummy, usually for the third period. In other words, the equation becomes

$$\Delta y_{it} = \alpha_0 + \alpha_3 d3_t + \beta_1 \Delta x_{it1} + \cdots + \beta_k \Delta x_{itk} + \Delta u_{it}, \qquad \text{for } t = 2 \text{ and } 3.$$

The estimates of the βj are identical in either formulation.

With more than three time periods, things are similar. If we have the same T time periods for each of N cross-sectional units, we say that the data set is a balanced panel: we have the same time periods for all individuals, firms, cities, and so on. When T is small relative to N, we should include a dummy variable for each time period to account for secular changes that are not being modeled. Therefore, after first differencing, the equation looks like

$$\Delta y_{it} = \alpha_0 + \alpha_3 d3_t + \alpha_4 d4_t + \cdots + \alpha_T dT_t + \beta_1 \Delta x_{it1} + \cdots + \beta_k \Delta x_{itk} + \Delta u_{it}, \qquad t = 2, 3, \ldots, T, \qquad (13.31)$$

where we have T − 1 time periods on each unit i for the first-differenced equation. The total number of observations is N(T − 1).

It is simple to estimate (13.31) by pooled OLS, provided the observations have been properly organized and the differencing carefully done. To facilitate first differencing, the data file should consist of NT records. The first T records are for the first cross-sectional observation, arranged chronologically; the second T records are for the second cross-sectional observation, arranged chronologically; and so on. Then, we compute the differences, with the change from t − 1 to t stored in the time t record. Therefore, the differences for t = 1 should be missing values for all N cross-sectional observations. Without doing this, you run the risk of using bogus observations in the regression analysis. An invalid observation is created when the last observation for, say, person i − 1 is subtracted from the first observation for person i. If you do the regression on the differenced data, and NT or NT − 1 observations are reported, then you forgot to set the t = 1 observations as missing.
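Provided the records are stacked as just described, the bookkeeping is mechanical. Here is a sketch under assumed names (a long DataFrame with columns id, year, y, and x; the file and columns are hypothetical) that differences within unit, leaves the first year missing, and estimates (13.31) with a dummy for each remaining period:

```python
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("panel.csv").sort_values(["id", "year"])   # hypothetical file

# Within-unit first differences; each unit's first year is NaN, not a
# bogus cross-unit difference, because we difference inside groupby.
df[["dy", "dx"]] = df.groupby("id")[["y", "x"]].diff()

fd = df.dropna(subset=["dy", "dx"])
# C(year) adds a dummy for each remaining time period, as in (13.31).
res = smf.ols("dy ~ dx + C(year)", data=fd).fit()
print(res.summary())
```

A useful sanity check, echoing the warning in the text: the reported number of observations should be N(T − 1), never NT or NT − 1.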
When using more than two time periods, we must assume that Δu_it is uncorrelated over time for the usual standard errors and test statistics to be valid. This assumption is sometimes reasonable, but it does not follow if we assume that the original idiosyncratic errors, u_it, are uncorrelated over time (an assumption we will use in Chapter 14). In fact, if we assume the u_it are serially uncorrelated with constant variance, then the correlation between Δu_it and Δu_{i,t+1} can be shown to be −.5. If u_it follows a stable AR(1) model, then Δu_it will be serially correlated. Only when u_it follows a random walk will Δu_it be serially uncorrelated.

It is easy to test for serial correlation in the first-differenced equation. Let r_it = Δu_it denote the first difference of the original error. If r_it follows the AR(1) model r_it = ρ r_{i,t−1} + e_it, then we can easily test H0: ρ = 0. First, we estimate (13.31) by pooled OLS and obtain the residuals, r̂_it. Then, we run a simple pooled OLS regression of r̂_it on r̂_{i,t−1}, t = 3, ..., T; i = 1, ..., N, and compute a standard t test for the coefficient on r̂_{i,t−1}. (Or, we can make the t statistic robust to heteroskedasticity.) The coefficient, ρ̂, on r̂_{i,t−1} is a consistent estimator of ρ. Because we are using the lagged residual, we lose another time period. For example, if we started with T = 3, the differenced equation has two time periods, and the test for serial correlation is just a cross-sectional regression of the residuals from the third time period on the residuals from the second time period. We will give an example later.

We can correct for the presence of AR(1) serial correlation in r_it by using feasible GLS. Essentially, within each cross-sectional observation, we would use the Prais-Winsten transformation based on ρ̂ described in the previous paragraph. (We clearly prefer Prais-Winsten to Cochrane-Orcutt here, as dropping the first time period would now mean losing N cross-sectional observations.) Unfortunately, standard packages that perform AR(1) corrections for time series regressions will not work. Standard Prais-Winsten methods will treat the observations as if they followed an AR(1) process across i and t; this makes no sense, as we are assuming the observations are independent across i. Corrections to the OLS standard errors that allow arbitrary forms of serial correlation (and heteroskedasticity) can be computed when N is large (and N should be notably larger than T). A detailed treatment of standard errors and test statistics that are robust to any forms of serial correlation and heteroskedasticity is beyond the scope of this text [see, for example, Wooldridge (2010, Chapter 10)]. Nevertheless, such statistics are easy to compute in many econometrics software packages, and the appendix contains an intuitive discussion.

If there is no serial correlation in the errors, the usual methods for dealing with heteroskedasticity are valid. We can use the Breusch-Pagan and White tests for heteroskedasticity from Chapter 8, and we can also compute robust standard errors.

Exploring Further 13.5: Does serial correlation in Δu_it cause the first-differenced estimator to be biased and inconsistent? Why is serial correlation a concern?
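The residual-based AR(1) test described above takes only a few lines. The sketch below reuses the hypothetical panel.csv setup from the previous fragment (columns id, year, y, x; all names assumed):

```python
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("panel.csv").sort_values(["id", "year"])   # hypothetical file
df[["dy", "dx"]] = df.groupby("id")[["y", "x"]].diff()
fd = df.dropna(subset=["dy", "dx"]).copy()

# Residuals from the first-differenced regression, then their within-unit lag.
fd["r"] = smf.ols("dy ~ dx + C(year)", data=fd).fit().resid
fd["r_lag"] = fd.groupby("id")["r"].shift(1)

# The t statistic on r_lag tests H0: rho = 0; one more time period is lost
# because the first lagged residual within each unit is missing.
ar1 = smf.ols("r ~ r_lag", data=fd.dropna(subset=["r_lag"])).fit()
print(ar1.params["r_lag"], ar1.tvalues["r_lag"])
```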
Differencing more than two years of panel data is very useful for policy analysis, as shown by the following example.

Example 13.8: Effect of Enterprise Zones on Unemployment Claims

Papke (1994) studied the effect of the Indiana enterprise zone (EZ) program on unemployment claims. She analyzed 22 cities in Indiana over the period from 1980 to 1988. Six enterprise zones were designated in 1984, and four more were assigned in 1985. Twelve of the cities in the sample did not receive an enterprise zone over this period; they served as the control group.

A simple policy evaluation model is

$$\log(uclms_{it}) = \theta_t + \beta_1 ez_{it} + a_i + u_{it},$$

where uclms_it is the number of unemployment claims filed during year t in city i. The parameter θ_t just denotes a different intercept for each time period. Generally, unemployment claims were falling statewide over this period, and this should be reflected in the different year intercepts. The binary variable ez_it is equal to one if city i at time t was an enterprise zone; we are interested in β1. The unobserved effect a_i represents fixed factors that affect the economic climate in city i. Because enterprise zone designation was not determined randomly (enterprise zones are usually economically depressed areas), it is likely that ez_it and a_i are positively correlated (high a_i means higher unemployment claims, which lead to a higher chance of being given an EZ). Thus, we should difference the equation to eliminate a_i:

$$\Delta \log(uclms_{it}) = \alpha_0 + \alpha_1 d82_t + \cdots + \alpha_7 d88_t + \beta_1 \Delta ez_{it} + \Delta u_{it}. \qquad (13.32)$$

The dependent variable in this equation, the change in log(uclms_it), is the approximate annual growth rate in unemployment claims from year t − 1 to t. We can estimate this equation for the years 1981 to 1988 using the data in EZUNEM; the total sample size is 22·8 = 176. The estimate of β1 is β̂1 = −.182 (standard error = .078). Therefore, it appears that the presence of an EZ causes about a 16.6% [exp(−.182) − 1 ≈ −.166] fall in unemployment claims. This is an economically large and statistically significant effect.

There is no evidence of heteroskedasticity in the equation: the Breusch-Pagan F test yields F = .85, p-value = .557. However, when we add the lagged OLS residuals to the differenced equation (and lose the year 1981), we get ρ̂ = −.197 (t = −2.44), so there is evidence of minimal negative serial correlation in the first-differenced errors. Unlike with positive serial correlation, the usual OLS standard errors may not greatly understate the correct standard errors when the errors are negatively correlated (see Section 12-1). Thus, the significance of the enterprise zone dummy variable will probably not be affected.

Example 13.9: County Crime Rates in North Carolina

Cornwell and Trumbull (1994) used data on 90 counties in North Carolina, for the years 1981 through 1987, to estimate an unobserved effects model of crime; the data are contained in CRIME4. Here, we estimate a simpler version of their model, and we difference the equation over time to eliminate a_i, the unobserved effect. (Cornwell and Trumbull use a different transformation, which we will cover in Chapter 14.) Various factors, including geographical location, attitudes toward crime, historical records, and reporting conventions, might be contained in a_i. The crime rate is number of crimes per person, prbarr is the estimated probability of arrest, prbconv is the estimated probability of conviction (given an arrest), prbpris is the probability of serving time in prison (given a conviction), avgsen is the average sentence length served, and polpc is the number of police officers per capita. As is standard in criminometric studies, we use the logs of all variables to estimate elasticities. We also include a full set of year dummies to control for state
trends in crime rates. We can use the years 1982 through 1987 to estimate the differenced equation. The quantities in parentheses are the usual OLS standard errors; the quantities in brackets are standard errors robust to both serial correlation and heteroskedasticity:

$$\widehat{\Delta \log(crmrte)} = \underset{(.017)\,[.014]}{.008} - \underset{(.024)\,[.022]}{.100}\, d83 - \underset{(.024)\,[.020]}{.048}\, d84 - \underset{(.023)\,[.025]}{.005}\, d85 + \underset{(.024)\,[.021]}{.028}\, d86 + \underset{(.024)\,[.024]}{.041}\, d87$$
$$-\ \underset{(.030)\,[.056]}{.327}\, \Delta\log(prbarr) - \underset{(.018)\,[.040]}{.238}\, \Delta\log(prbconv) - \underset{(.026)\,[.046]}{.165}\, \Delta\log(prbpris) \qquad (13.33)$$
$$-\ \underset{(.022)\,[.026]}{.022}\, \Delta\log(avgsen) + \underset{(.027)\,[.103]}{.398}\, \Delta\log(polpc)$$
$$n = 540,\ R^2 = .433,\ \bar{R}^2 = .422.$$

The three probability variables (of arrest, conviction, and serving prison time) all have the expected sign, and all are statistically significant. For example, a 1% increase in the probability of arrest is predicted to lower the crime rate by about .33%. The average sentence variable shows a modest deterrent effect, but it is not statistically significant.

The coefficient on the police per capita variable is somewhat surprising and is a feature of most studies that seek to explain crime rates. Interpreted causally, it says that a 1% increase in police per capita increases crime rates by about .4%. (The usual t statistic is very large, almost 15.) It is hard to believe that having more police officers causes more crime. What is going on here? There are at least two possibilities. First, the crime rate variable is calculated from reported crimes. It might be that, when there are additional police, more crimes are reported. Second, the police variable might be endogenous in the equation for other reasons: counties may enlarge the police force when they expect crime rates to increase. In this case, (13.33) cannot be interpreted in a causal fashion. In Chapters 15 and 16, we will cover models and estimation methods that can account for this additional form of endogeneity.

The special case of the White test for heteroskedasticity in Section 8-3 gives F = 75.48 and p-value = .0000, so there is strong evidence of heteroskedasticity. (Technically, this test is not valid if there is also serial correlation, but it is strongly suggestive.) Testing for AR(1) serial correlation yields ρ̂ = −.233 (t = −4.77), so negative serial correlation exists. The standard errors in brackets adjust for serial correlation and heteroskedasticity. [We will not give the details of this; the calculations are similar to those described in Section 12-5 and are carried out by many econometric packages. See Wooldridge (2010, Chapter 10) for more discussion.] No variables lose statistical significance, but the t statistics on the significant deterrent variables get notably smaller. For example, the t statistic on the probability of conviction variable goes from −13.22 using the usual OLS standard error to −6.10 using the fully robust standard error. Equivalently, the confidence intervals constructed using the robust standard errors will, appropriately, be much wider than those based on the usual OLS standard errors.
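Standard errors that are robust to arbitrary serial correlation and heteroskedasticity within each cross-sectional unit are usually obtained by clustering on that unit. In statsmodels this is a fit option; a self-contained sketch, again using the hypothetical long-format panel.csv with columns id, year, y, and x:

```python
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("panel.csv").sort_values(["id", "year"])   # hypothetical file
df[["dy", "dx"]] = df.groupby("id")[["y", "x"]].diff()
fd = df.dropna(subset=["dy", "dx"])

# Cluster-robust inference: arbitrary within-unit serial correlation and
# heteroskedasticity are allowed; units are assumed independent across i.
res = smf.ols("dy ~ dx + C(year)", data=fd).fit(
    cov_type="cluster", cov_kwds={"groups": fd["id"]}
)
print(res.summary())
```

As in Example 13.9, the point estimates are unchanged; only the standard errors (and hence the t statistics and confidence intervals) differ from the usual OLS output.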
Naturally, we can apply the Chow test to panel data models estimated by first differencing. As in the case of pooled cross sections, we rarely want to test whether the intercepts are constant over time; for many reasons, we expect the intercepts to be different. Much more interesting is to test whether slope coefficients have changed over time, and we can easily carry out such tests by interacting the explanatory variables of interest with time-period dummy variables. Interestingly, while we cannot estimate the slopes on variables that do not change over time, we can test whether the partial effects of time-constant variables have changed over time. As an illustration, suppose we observe three years of data on a random sample of people working in 2000, 2002, and 2004, and specify the model (for the log of wage, lwage)

lwage_it = β0 + δ1·d02_t + δ2·d04_t + β1·female_i + γ1·d02_t·female_i + γ2·d04_t·female_i + z_it·λ + a_i + u_it,

where z_it·λ is shorthand for other explanatory variables included in the model and their coefficients. When we first difference, we eliminate the intercept for 2000, β0, and also the gender wage gap for 2000, β1. However, the change in d02_t·female_i is (Δd02_t)·female_i, which does not drop out. Consequently, we can estimate how the wage gap has changed in 2002 and 2004 relative to 2000, and we can test whether γ1 = 0, or γ2 = 0, or both. We might also ask whether the union wage premium has changed over time, in which case we include in the model union_it, d02_t·union_it, and d04_t·union_it. The coefficients on all of these explanatory variables can be estimated because union_it would presumably have some time variation.

If one tries to estimate a model containing interactions by differencing by hand, it can be a bit tricky. For example, in the previous equation with union status, we must simply difference the interaction terms, d02_t·union_it and d04_t·union_it. We cannot compute the proper differences as, say, d02_t·Δunion_it and d04_t·Δunion_it, or even by replacing d02_t and d04_t with their first differences (a small numerical sketch of this pitfall follows). As a general comment, it is important to return to the original model and remember that the differencing is used to eliminate a_i. It is easiest to use a built-in command that allows first differencing as an option in panel data analysis. We will see some of the other options in Chapter 14.
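A tiny toy illustration (the two rows below are invented, not from WAGEPAN): union status does not change, so the correctly differenced interaction equals the change in the year dummy times union status, while the "difference first, interact after" shortcut gives zero.

import pandas as pd

# One person observed in 2000 and 2002; union status does not change
df = pd.DataFrame({"id": [1, 1], "year": [2000, 2002],
                   "d02": [0, 1], "union": [1, 1]})

# Correct: form the interaction first, then first-difference it
df["d02xunion"] = df["d02"] * df["union"]
right = df.groupby("id")["d02xunion"].diff()   # equals 1 in 2002

# Wrong: difference union first and multiply by the year dummy
df["dunion"] = df.groupby("id")["union"].diff()
wrong = df["d02"] * df["dunion"]               # equals 0 in 2002
print(right.iloc[-1], wrong.iloc[-1])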
13.5a  Potential Pitfalls in First Differencing Panel Data

In this and previous sections, we have argued that differencing panel data over time, in order to eliminate a time-constant unobserved effect, is a valuable method for obtaining causal effects. Nevertheless, differencing is not free of difficulties. We have already discussed potential problems with the method when the key explanatory variables do not vary much over time (and the method is useless for explanatory variables that never vary over time). Unfortunately, even when we do have sufficient time variation in the x_itj, first-differenced (FD) estimation can be subject to serious biases. We have already mentioned that strict exogeneity of the regressors is a critical assumption. Unfortunately, as discussed in Wooldridge (2010, Section 11.1), having more time periods generally does not reduce the inconsistency in the FD estimator when the regressors are not strictly exogenous (say, if y_{i,t−1} is included among the x_itj).

Another important drawback to the FD estimator is that it can be worse than pooled OLS if one or more of the explanatory variables is subject to measurement error, especially the classical errors-in-variables model discussed in Section 9.3. Differencing a poorly measured regressor reduces its variation relative to its correlation with the differenced error caused by classical measurement error, resulting in a potentially sizable bias. Solving such problems can be very difficult. [See Section 15.8 and Wooldridge (2010, Chapter 11).]

Summary

We have studied methods for analyzing independently pooled cross-sectional and panel data sets. Independent cross sections arise when different random samples are obtained in different time periods (usually years). OLS using pooled data is the leading method of estimation, and the usual inference procedures are available, including corrections for heteroskedasticity. (Serial correlation is not an issue because the samples are independent across time.) Because of the time series dimension, we often allow different time intercepts. We might also interact time dummies with certain key variables to see how they have changed over time. This is especially important in the policy evaluation literature for natural experiments.

Panel data sets are being used more and more in applied work, especially for policy analysis. These are data sets where the same cross-sectional units are followed over time. Panel data sets are most useful when controlling for time-constant unobserved features (of people, firms, cities, and so on) which we think might be correlated with the explanatory variables in our model. One way to remove the unobserved effect is to difference the data in adjacent time periods. Then, a standard OLS analysis on the differences can be used. Using two periods of data results in a cross-sectional regression of the differenced data. The usual inference procedures are asymptotically valid under homoskedasticity; exact inference is available under normality. For more than two time periods, we can use pooled OLS on the differenced data; we lose the first time period because of the differencing. In addition to homoskedasticity, we must assume that the differenced errors are serially uncorrelated in order to apply the usual t and F statistics. (The chapter appendix contains a careful listing of the assumptions.) Naturally, any variable that is constant over time drops out of the analysis.

Key Terms

Average Treatment Effect; Balanced Panel; Clustering; Composite Error; Difference-in-Differences Estimator; First-Differenced Equation; First-Differenced Estimator; Fixed Effect; Fixed Effects Model; Heterogeneity Bias; Idiosyncratic Error; Independently Pooled Cross Section; Longitudinal Data; Natural Experiment; Panel Data; Quasi-Experiment; Strict Exogeneity; Unobserved Effect; Unobserved Effects Model; Unobserved Heterogeneity; Year Dummy Variables
Problems

1  In Example 13.1, assume that the averages of all factors other than educ have remained constant over time and that the average level of education is 12.2 for the 1972 sample and 13.3 in the 1984 sample. Using the estimates in Table 13.1, find the estimated change in average fertility between 1972 and 1984. (Be sure to account for the intercept change and the change in average education.)

2  Using the data in KIELMC, the following equations were estimated using the years 1978 and 1981:

log(price) = 11.49 − .547 nearinc + .394 y81·nearinc
            (.26)   (.058)          (.080)
    n = 321, R² = .220

and

log(price) = 11.18 + .563 y81 − .403 y81·nearinc
            (.27)   (.044)      (.067)
    n = 321, R² = .337.

Compare the estimates on the interaction term y81·nearinc with those from equation (13.9). Why are the estimates so different?

3  Why can we not use first differences when we have independent cross sections in two years (as opposed to panel data)?

4  If we think that β1 is positive in (13.14) and that Δu_i and Δunem_i are negatively correlated, what is the bias in the OLS estimator of β1 in the first-differenced equation? [Hint: Review equation (5.4).]

5  Suppose that we want to estimate the effect of several variables on annual saving and that we have a panel data set on individuals collected on January 31, 1990, and January 31, 1992. If we include a year dummy for 1992 and use first differencing, can we also include age in the original model? Explain.

6  In 1985, neither Florida nor Georgia had laws banning open alcohol containers in vehicle passenger compartments. By 1990, Florida had passed such a law, but Georgia had not.
(i) Suppose you can collect random samples of the driving-age population in both states, for 1985 and 1990. Let arrest be a binary variable equal to unity if a person was arrested for drunk driving during the year. Without controlling for any other factors, write down a linear probability model that allows you to test whether the open container law reduced the probability of being arrested for drunk driving. Which coefficient in your model measures the effect of the law?
(ii) Why might you want to control for other factors in the model? What might some of these factors be?
(iii) Now, suppose that you can only collect data for 1985 and 1990 at the county level for the two states. The dependent variable would be the fraction of licensed drivers arrested for drunk driving during the year. How does this data structure differ from the individual-level data described in part (i)? What econometric method would you use?

7  (i) Using the data in INJURY for Kentucky, we find the estimated equation when afchnge is dropped from (13.12) is

log(durat) = 1.129 + .253 highearn + .198 afchnge·highearn
            (0.022)  (.042)          (.052)
    n = 5,626, R² = .021.

Is it surprising that the estimate on the interaction is fairly close to that in (13.12)? Explain.
(ii) When afchnge is included but highearn is dropped, the result is

log(durat) = 1.233 − .100 afchnge + .447 afchnge·highearn
            (0.023)  (.040)          (.050)
    n = 5,626, R² = .016.

Why is the coefficient on the interaction term now so much larger than in (13.12)? [Hint: In equation (13.10), what is the assumption being made about the treatment and control groups if β1 = 0?]

Computer Exercises

C1  Use the data in FERTIL1 for this exercise.
(i) In the equation estimated in Example 13.1, test whether living environment at age 16 has an effect on fertility. (The base group is large city.) Report the value of the F statistic and the p-value.
(ii) Test whether region of the country at age 16 (South is the base group) has an effect on fertility.
(iii) Let u be the error term in the population equation. Suppose you think that the variance of u changes over time (but not with educ, age, and so on). A model that captures this is

u² = γ0 + γ1·y74 + γ2·y76 + … + γ6·y84 + v.

Using this model, test for heteroskedasticity in u. (Hint: Your F test should have 6 and 1,122 degrees of freedom.)
(iv) Add the interaction terms y74·educ, y76·educ, …, y84·educ to the model estimated in Table 13.1. Explain what these terms represent. Are they jointly significant?

C2  Use the data in CPS7885 for this exercise.
(i) How do you interpret the coefficient on y85 in equation (13.2)? Does it have an interesting interpretation? (Be careful here; you must account for the interaction terms y85·educ and y85·female.)
(ii) Holding other factors fixed, what is the estimated percent increase in nominal wage for a male with 12 years of education? Propose a regression to obtain a confidence interval for this estimate. [Hint: To get the confidence interval, replace y85·educ with y85·(educ − 12); refer to Example 6.3.]
(iii) Reestimate equation (13.2), but let all wages be measured in 1978 dollars. In particular, define the real wage as rwage = wage for 1978 and as rwage = wage/1.65 for 1985. Now, use log(rwage) in place of log(wage) in estimating (13.2). Which coefficients differ from those in equation (13.2)?
(iv) Explain why the R-squared from your regression in part (iii) is not the same as in equation (13.2). (Hint: The residuals, and therefore the sum of squared residuals, from the two regressions are identical.)
(v) Describe how union participation changed from 1978 to 1985.
(vi) Starting with equation (13.2), test whether the union wage differential changed over time. (This should be a simple t test.)
(vii) Do your findings in parts (v) and (vi) conflict? Explain.

C3  Use the data in KIELMC for this exercise.
(i) The variable dist is the distance from each home to the incinerator site, in feet. Consider the model

log(price) = β0 + δ0·y81 + β1·log(dist) + δ1·y81·log(dist) + u.

If building the incinerator reduces the value of homes closer to the site, what is the sign of δ1? What does it mean if β1 > 0?
(ii) Estimate the model from part (i) and report the results in the usual form. Interpret the coefficient on y81·log(dist). What do you conclude?
(iii) Add age, age², rooms, baths, log(intst), log(land), and log(area) to the equation. Now, what do you conclude about the effect of the incinerator on housing values?
(iv) Why is the coefficient on log(dist) positive and statistically significant in part (ii) but not in part (iii)? What does this say about the controls used in part (iii)?

C4  Use the data in INJURY for this exercise.
(i) Using the data for Kentucky, reestimate equation (13.12), adding as explanatory variables male, married, and a full set of industry and injury type dummy variables. How does the estimate on afchnge·highearn change when these other factors are controlled for? Is the estimate still statistically significant?
(ii) What do you make of the small R-squared from part (i)? Does this mean the equation is useless?
(iii) Estimate equation (13.12) using the data for Michigan. Compare the estimates on the interaction term for Michigan and Kentucky. Is the Michigan estimate statistically significant? What do you make of this?

C5  Use the data in RENTAL for this exercise. The data for the years 1980 and 1990 include rental prices and other variables for college towns. The idea is to see whether a stronger presence of students affects rental rates. The unobserved effects model is

log(rent_it) = β0 + δ0·y90_t + β1·log(pop_it) + β2·log(avginc_it) + β3·pctstu_it + a_i + u_it,

where pop is city population, avginc is average income, and pctstu is student population as a percentage of city population (during the school year).
(i) Estimate the equation by pooled OLS and report the results in standard form. What do you make of the estimate on the 1990 dummy variable? What do you get for β̂_pctstu?
(ii) Are the standard errors you report in part (i) valid? Explain.
(iii) Now, difference the equation and estimate by OLS. Compare your estimate of β_pctstu with that from part (i). Does the relative size of the student population appear to affect rental prices?
(iv) Obtain the heteroskedasticity-robust standard errors for the first-differenced equation in part (iii). Does this change your conclusions?

C6  Use CRIME3 for this exercise.
(i) In the model of Example 13.6, test the hypothesis H0: β1 = β2. (Hint: Define θ1 = β1 − β2 and write β1 in terms of θ1 and β2. Substitute this into the equation and then rearrange. Do a t test on θ1.)
(ii) If β1 = β2, show that the differenced equation can be written as

Δlog(crime_i) = δ0 + δ1·Δavgclr_i + Δu_i,

where δ1 = 2β1 and avgclr_i = (clrprc_{i,−1} + clrprc_{i,−2})/2 is the average clear-up percentage over the previous two years.
(iii) Estimate the equation from part (ii). Compare the adjusted R-squared with that in (13.22). Which model would you finally use?

C7  Use GPA3 for this exercise. The data set is for 366 student-athletes from a large university for fall and spring semesters. [A similar analysis is in Maloney and McCormick (1993), but here we use a true panel data set.] Because you have two terms of data for each student, an unobserved effects model is appropriate. The primary question of interest is this: Do athletes perform more poorly in school during the semester their sport is in season?
(i) Use pooled OLS to estimate a model with term GPA (trmgpa) as the dependent variable. The explanatory variables are spring, sat, hsperc, female, black, white, frstsem, tothrs, crsgpa, and season. Interpret the coefficient on season. Is it statistically significant?
(ii) Most of the athletes who play their sport only in the fall are football players. Suppose the ability levels of football players differ systematically from those of other athletes. If ability is not adequately captured by SAT score and high school percentile, explain why the pooled OLS estimators will be biased.
(iii) Now, use the data differenced across the two terms. Which variables drop out? Now, test for an in-season effect.
(iv) Can you think of one or more potentially important time-varying variables that have been omitted from the analysis?

C8  VOTE2 includes panel data on House of Representatives elections in 1988 and 1990. Only winners from 1988 who are also running in 1990 appear in the sample; these are the incumbents. An unobserved effects model explaining the share of the incumbent's vote in terms of expenditures by both candidates is

vote_it = β0 + δ0·d90_t + β1·log(inexp_it) + β2·log(chexp_it) + β3·incshr_it + a_i + u_it,

where incshr_it is the incumbent's share of total campaign spending (in percentage form). The unobserved effect a_i contains characteristics of the incumbent, such as "quality," as well as things about the district that are constant. The incumbent's gender and party are constant over time, so these are subsumed in a_i. We are interested in the effect of campaign expenditures on election outcomes.
(i) Difference the given equation across the two years and estimate the differenced equation by OLS. Which variables are individually significant at the 5% level against a two-sided alternative?
(ii) In the equation from part (i), test for joint significance of Δlog(inexp) and Δlog(chexp). Report the p-value.
(iii) Reestimate the equation from part (i) using incshr as the only independent variable. Interpret the coefficient on incshr. For example, if the incumbent's share of spending increases by 10 percentage points, how is this predicted to affect the incumbent's share of the vote?
(iv) Redo part (iii), but now use only the pairs that have repeat challengers. [This allows us to control for characteristics of the challengers as well, which would be in a_i. Levitt (1994) conducts a much more extensive analysis.]

C9  Use CRIME4 for this exercise.
(i) Add the logs of each wage variable in the data set and estimate the model by first differencing. How does including these variables affect the coefficients on the criminal justice variables in Example 13.9?
(ii) Do the wage variables in (i) all have the expected sign? Are they jointly significant? Explain.

C10  For this exercise, we use JTRAIN to determine the effect of the job training grant on hours of job training per employee. The basic model for the three years is

hrsemp_it = β0 + δ1·d88_t + δ2·d89_t + β1·grant_it + β2·grant_{i,t−1} + β3·log(employ_it) + a_i + u_it.

(i) Estimate the equation using first differencing. How many firms are used in the estimation? How many total observations would be used if each firm had data on all variables (in particular, hrsemp) for all three time periods?
(ii) Interpret the coefficient on grant and comment on its significance.
(iii) Is it surprising that grant_{−1} is insignificant? Explain.
(iv) Do larger firms train their employees more or less, on average? How big are the differences in training?

C11  The file MATHPNL contains panel data on school districts in Michigan for the years 1992 through 1998. It is the district-level analogue of the school-level data used by Papke (2005). The response variable of interest in this question is math4, the percentage of fourth graders in a district receiving a passing score on a standardized math test. The key explanatory variable is rexpp, which is real expenditures per pupil in the district. The amounts are in 1997 dollars. The spending variable will appear in logarithmic form.
(i) Consider the static unobserved effects model

math4_it = δ1·y93_t + … + δ6·y98_t + β1·log(rexpp_it) + β2·log(enrol_it) + β3·lunch_it + a_i + u_it,

where enrol_it is total district enrollment and lunch_it is the percentage of students in the district eligible for the school lunch program. (So lunch_it is a pretty good measure of the district-wide poverty rate.) Argue that β1/10 is the percentage point change in math4_it when real per-student spending increases by roughly 10%.
(ii) Use first differencing to estimate the model in part (i). The simplest approach is to allow an intercept in the first-differenced equation and to include dummy variables for the years 1994 through 1998. Interpret the coefficient on the spending variable.
(iii) Now, add one lag of the spending variable to the model and reestimate using first differencing. Note that you lose another year of data, so you are only using changes starting in 1994. Discuss the coefficients and significance on the current and lagged spending variables.
(iv) Obtain heteroskedasticity-robust standard errors for the first-differenced regression in part (iii). How do these standard errors compare with those from part (iii) for the spending variables?
(v) Now, obtain standard errors robust to both heteroskedasticity and serial correlation. What does this do to the significance of the lagged spending variable?
(vi) Verify that the differenced errors, r_it = Δu_it, have negative serial correlation by carrying out a test of AR(1) serial correlation.
(vii) Based on a fully robust joint test, does it appear necessary to include the enrollment and lunch variables in the model?

C12  Use the data in MURDER for this exercise.
(i) Using the years 1990 and 1993, estimate the equation

mrdrte_it = δ0 + δ1·d93_t + β1·exec_it + β2·unem_it + a_i + u_it, t = 1, 2,

by pooled OLS and report the results in the usual form. (Do not worry that the usual OLS standard errors are inappropriate because of the presence of a_i.) Do you estimate a deterrent effect of capital punishment?
(ii) Compute the FD estimates (use only the differences from 1990 to 1993; you should have 51 observations in the FD regression). Now, what do you conclude about a deterrent effect?
(iii) In the FD regression from part (ii), obtain the residuals, say, ê_i. Run the Breusch-Pagan regression ê²_i on Δexec_i, Δunem_i and compute the F test for heteroskedasticity. Do the same for the special case of the White test [that is, regress ê²_i on ŷ_i, ŷ²_i, where the fitted values are from part (ii)]. What do you conclude about heteroskedasticity in the FD equation?
(iv) Run the same regression from part (ii), but obtain the heteroskedasticity-robust t statistics. What happens?
(v) Which t statistic on Δexec_i do you feel more comfortable relying on, the usual one or the heteroskedasticity-robust one? Why?

C13  Use the data in WAGEPAN for this exercise.
(i) Consider the unobserved effects model

lwage_it = β0 + δ1·d81_t + … + δ7·d87_t + β1·educ_i + γ1·d81_t·educ_i + … + γ7·d87_t·educ_i + β2·union_it + a_i + u_it,
where a_i is allowed to be correlated with educ_i and union_it. Which parameters can you estimate using first differencing?
(ii) Estimate the equation from part (i) by FD and test the null hypothesis that the return to education has not changed over time.
(iii) Test the hypothesis from part (ii) using a fully robust test, that is, one that allows arbitrary heteroskedasticity and serial correlation in the FD errors, Δu_it. Does your conclusion change?
(iv) Now, allow the union differential to change over time (along with education) and estimate the equation by FD. What is the estimated union differential in 1980? What about 1987? Is the difference statistically significant?
(v) Test the null hypothesis that the union differential has not changed over time, and discuss your results in light of your answer to part (iv).

C14  Use the data in JTRAIN3 for this question.
(i) Estimate the simple regression model re78 = β0 + β1·train + u, and report the results in the usual form. Based on this regression, does it appear that job training, which took place in 1976 and 1977, had a positive effect on real labor earnings in 1978?
(ii) Now, use the change in real labor earnings, cre = re78 − re75, as the dependent variable. (We need not difference train because we assume there was no job training prior to 1975. That is, if we define ctrain = train78 − train75, then ctrain = train78 because train75 = 0.) Now, what is the estimated effect of training? Discuss how it compares with the estimate in part (i).
(iii) Find the 95% confidence interval for the training effect using the usual OLS standard error and the heteroskedasticity-robust standard error, and describe your findings.

C15  The data set HAPPINESS contains independently pooled cross sections for the even years from 1994 through 2006, obtained from the General Social Survey. The dependent variable for this problem is a measure of "happiness," vhappy, which is a binary variable equal to one if the person reports being "very happy" (as opposed to just "pretty happy" or "not too happy").
(i) Which year has the largest number of observations? Which has the smallest? What is the percentage of people in the sample reporting they are "very happy"?
(ii) Regress vhappy on all of the year dummies, leaving out y94 so that 1994 is the base year. Compute a heteroskedasticity-robust statistic of the null hypothesis that the proportion of very happy people has not changed over time. What is the p-value of the test?
(iii) To the regression in part (ii), add the dummy variables occattend and regattend. Interpret their coefficients. (Remember, the coefficients are interpreted relative to a base group.) How would you summarize the effects of church attendance on happiness?
(iv) Define a variable, say highinc, equal to one if family income is above $25,000. (Unfortunately, the same threshold is used in each year, and so inflation is not accounted for. Also, $25,000 is hardly what one would consider "high income.") Include highinc, unem10, educ, and teens in the regression in part (iii). Is the coefficient on regattend affected much? What about its statistical significance?
(v) Discuss the signs, magnitudes, and statistical significance of the four new variables in part (iv). Do the estimates make sense?
(vi) Controlling for the factors in part (iv), do there appear to be differences in happiness by gender or race? Justify your answer.

C16  Use the data in COUNTYMURDERS to answer this question. The data set covers murders and executions (capital punishment) for 2,197 counties in the United States.
(i) Find the average value of murdrate across all counties and years. What is the standard deviation? For what percentage of the sample is murdrate equal to zero?
(ii) How many observations have execs equal to zero? What is the maximum value of execs? Why is the average of execs so small?
(iii) Consider the model

murdrate_it = θ_t + β1·execs_it + β2·execs_{i,t−1} + β3·percblack_it + β4·percmale_it + β5·perc1019_it + β6·perc2029_it + a_i + u_it,

where θ_t represents a different intercept for each time period, a_i is the county fixed effect, and u_it is the idiosyncratic error. What do we need to assume about a_i and the execution variables in order for pooled OLS to consistently estimate the parameters, in particular β1 and β2?
(iv) Apply pooled OLS to the equation from part (iii) and report the estimates of β1 and β2, along with the usual pooled OLS standard errors. Do you estimate that executions have a deterrent effect on murders? What do you think is happening?
(v) Even if the pooled OLS estimators are consistent, do you trust the standard errors obtained from part (iv)? Explain.
(vi) Now, estimate the equation in part (iii) using first differencing to remove a_i. What are the new estimates of β1 and β2? Are they very different from the estimates from part (iv)?
(vii) Using the estimates from part (vi), can you say there is evidence of a statistically significant deterrent effect of capital punishment on the murder rate? If possible, in addition to the usual OLS standard errors, use those that are robust to any kind of serial correlation or heteroskedasticity in the FD errors.

Appendix 13A

13A.1  Assumptions for Pooled OLS Using First Differences

In this appendix, we provide careful statements of the assumptions for the first-differencing estimator. Verification of these claims is somewhat involved, but it can be found in Wooldridge (2010, Chapter 10).

Assumption FD.1: For each i, the model is

y_it = β1·x_it1 + … + βk·x_itk + a_i + u_it, t = 1, …, T,

where the βj are the parameters to estimate and a_i is the unobserved effect.

Assumption FD.2: We have a random sample from the cross section.

Assumption FD.3: Each explanatory variable changes over time (for at least some i), and no perfect linear relationships exist among the explanatory variables.

For the next assumption, it is useful to let X_i denote the explanatory variables for all time periods for cross-sectional observation i; thus, X_i contains x_itj, t = 1, …, T, j = 1, …, k.

Assumption FD.4: For each t, the expected value of the idiosyncratic error given the explanatory variables in all time periods and the unobserved effect is zero: E(u_it | X_i, a_i) = 0.

When Assumption FD.4 holds, we sometimes say that the x_itj are strictly exogenous conditional on the unobserved effect.
The idea is that, once we control for a_i, there is no correlation between the x_isj and the remaining idiosyncratic error, u_it, for all s and t.

As stated, Assumption FD.4 is stronger than necessary. We use this form of the assumption because it emphasizes that we are interested in the equation

E(y_it | X_i, a_i) = E(y_it | x_it, a_i) = β1·x_it1 + … + βk·x_itk + a_i,

so that the βj measure partial effects of the observed explanatory variables, holding fixed, or "controlling for," the unobserved effect, a_i. Nevertheless, an important implication of FD.4, and one that is sufficient for the unbiasedness of the FD estimator, is E(Δu_it | X_i) = 0, t = 2, …, T. In fact, for consistency we can simply assume that Δx_itj is uncorrelated with Δu_it for all t = 2, …, T and j = 1, …, k. [See Wooldridge (2010, Chapter 10) for further discussion.]

Under these first four assumptions, the first-difference estimators are unbiased. The key assumption is FD.4, which is strict exogeneity of the explanatory variables. Under these same assumptions, we can also show that the FD estimator is consistent with a fixed T and as N → ∞ (and perhaps more generally).

The next two assumptions ensure that the standard errors and test statistics resulting from pooled OLS on the first differences are asymptotically valid.

Assumption FD.5: The variance of the differenced errors, conditional on all explanatory variables, is constant: Var(Δu_it | X_i) = σ², t = 2, …, T.

Assumption FD.6: For all t ≠ s, the differences in the idiosyncratic errors are uncorrelated (conditional on all explanatory variables): Cov(Δu_it, Δu_is | X_i) = 0, t ≠ s.

Assumption FD.5 ensures that the differenced errors, Δu_it, are homoskedastic. Assumption FD.6 states that the differenced errors are serially uncorrelated, which means that the u_it follow a random walk across time (see Chapter 11). Under Assumptions FD.1 through FD.6, the FD estimator of the βj is the best linear unbiased estimator (conditional on the explanatory variables).

Assumption FD.7: Conditional on X_i, the Δu_it are independent and identically distributed normal random variables.

When we add Assumption FD.7, the FD estimators are normally distributed, and the t and F statistics from pooled OLS on the differences have exact t and F distributions. Without FD.7, we can rely on the usual asymptotic approximations.
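The content of Assumptions FD.1 through FD.4 can be illustrated with a small simulation. In the sketch below, the unobserved effect is built to be correlated with x, so pooled OLS in levels is biased while the FD estimator centers on the true β1. Everything here (sample sizes, distributions, the bias of roughly .5) is an illustrative design choice, not part of the formal assumptions.

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
N, T, beta1 = 500, 4, 1.0

a = rng.normal(size=N)                      # unobserved effect a_i
x = a[:, None] + rng.normal(size=(N, T))    # x correlated with a_i
u = rng.normal(size=(N, T))                 # idiosyncratic errors
y = beta1 * x + a[:, None] + u

df = pd.DataFrame({"id": np.repeat(np.arange(N), T),
                   "y": y.ravel(), "x": x.ravel()})

pols = smf.ols("y ~ x", data=df).fit()      # biased: ignores a_i
df["dy"] = df.groupby("id")["y"].diff()
df["dx"] = df.groupby("id")["x"].diff()
fd = smf.ols("dy ~ dx", data=df.dropna()).fit()
print(pols.params["x"], fd.params["dx"])    # about 1.5 versus about 1.0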
13A.2  Computing Standard Errors Robust to Serial Correlation and Heteroskedasticity of Unknown Form

Because the FD estimator is consistent as N → ∞ under Assumptions FD.1 through FD.4, it would be very handy to have a simple method of obtaining proper standard errors and test statistics that allow for any kind of serial correlation or heteroskedasticity in the FD errors, e_it = Δu_it. Fortunately, provided N is moderately large, and T is not "too large," fully robust standard errors and test statistics are readily available. As mentioned in the text, a detailed treatment is above the level of this text. The technical arguments combine the insights described in Chapters 8 and 12, where statistics robust to heteroskedasticity and serial correlation are discussed. Actually, there is one important advantage with panel data: because we have a (large) cross section, we can allow unrestricted serial correlation in the errors {e_it}, provided T is not too large. We can contrast this situation with the Newey-West approach in Section 12.5, where the estimated covariances must be downweighted as the observations get farther apart in time.

The general approach to obtaining fully robust standard errors and test statistics in the context of panel data is known as clustering, and ideas have been borrowed from the cluster sampling literature. The idea is that each cross-sectional unit is defined as a cluster of observations over time, and arbitrary correlation (serial correlation) and changing variances are allowed within each cluster. Because of the relationship to cluster sampling, many econometric software packages have options for clustering standard errors and test statistics. Most commands look something like

regress cy cx1 cx2 … cxk, cluster(id)

where id is a variable containing unique identifiers for the cross-sectional units and the "c" before each variable denotes "change." The option cluster(id) at the end of the regress command tells the software to report all standard errors and test statistics, including t statistics and F-type statistics, so that they are valid in large cross sections with any kind of serial correlation or heteroskedasticity. Reporting such statistics is very common in modern empirical work with panel data. Often the corrected standard errors will be substantially larger than either the usual standard errors or those that only correct for heteroskedasticity. The larger standard errors better reflect the sampling error in the pooled OLS coefficients.
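A rough Python analogue of this generic command, assuming a DataFrame df of changes with columns cy, cx1, ..., cxk and an identifier column id (a sketch, not the only way to do this):

import statsmodels.formula.api as smf

# Pooled OLS on the changes, with standard errors clustered by id;
# all reported t and F-type statistics then use the clustered covariance
res = smf.ols("cy ~ cx1 + cx2 + cxk", data=df).fit(
    cov_type="cluster", cov_kwds={"groups": df["id"]})
print(res.summary())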
Chapter 14  Advanced Panel Data Methods

In this chapter, we focus on two methods for estimating unobserved effects panel data models that are at least as common as first differencing. Although these methods are somewhat harder to describe and implement, several econometrics packages support them.

In Section 14.1, we discuss the fixed effects estimator, which, like first differencing, uses a transformation to remove the unobserved effect, a_i, prior to estimation. Any time-constant explanatory variables are removed along with a_i. The random effects estimator in Section 14.2 is attractive when we think the unobserved effect is uncorrelated with all the explanatory variables. If we have good controls in our equation, we might believe that any leftover neglected heterogeneity only induces serial correlation in the composite error term, but it does not cause correlation between the composite errors and the explanatory variables. Estimation of random effects models by generalized least squares is fairly easy and is routinely done by many econometrics packages. Section 14.3 introduces the relatively new correlated random effects approach, which provides a synthesis of fixed effects and random effects methods, and has been shown to be practically very useful. In Section 14.4, we show how panel data methods can be applied to other data structures, including matched pairs and cluster samples.

14.1  Fixed Effects Estimation

First differencing is just one of the many ways to eliminate the fixed effect, a_i. An alternative method, which works better under certain assumptions, is called the fixed effects transformation. To see what this method involves, consider a model with a single explanatory variable: for each i,

y_it = β1·x_it + a_i + u_it, t = 1, 2, …, T.   [14.1]

Now, for each i, average this equation over time. We get

ȳ_i = β1·x̄_i + a_i + ū_i,   [14.2]

where ȳ_i = T⁻¹ Σ_{t=1}^{T} y_it, and so on. Because a_i is fixed over time, it appears in both (14.1) and (14.2). If we subtract (14.2) from (14.1) for each t, we wind up with

y_it − ȳ_i = β1·(x_it − x̄_i) + u_it − ū_i, t = 1, 2, …, T,

or

ÿ_it = β1·ẍ_it + ü_it, t = 1, 2, …, T,   [14.3]

where ÿ_it = y_it − ȳ_i is the time-demeaned data on y, and similarly for ẍ_it and ü_it. The fixed effects transformation is also called the within transformation. The important thing about equation (14.3) is that the unobserved effect, a_i, has disappeared. This suggests that we should estimate (14.3) by pooled OLS. A pooled OLS estimator that is based on the time-demeaned variables is called the fixed effects estimator or the within estimator. The latter name comes from the fact that OLS on (14.3) uses the time variation in y and x within each cross-sectional observation.

The between estimator is obtained as the OLS estimator on the cross-sectional equation (14.2) (where we include an intercept, β0): we use the time averages for both y and x and then run a cross-sectional regression. We will not study the between estimator in detail because it is biased when a_i is correlated with x̄_i (see Problem 2). If we think a_i is uncorrelated with x_it, it is better to use the random effects estimator, which we cover in Section 14.2. The between estimator ignores important information on how the variables change over time.

Adding more explanatory variables to the equation causes few changes. The original unobserved effects model is

y_it = β1·x_it1 + β2·x_it2 + … + βk·x_itk + a_i + u_it, t = 1, 2, …, T.   [14.4]

We simply use the time-demeaning on each explanatory variable, including things like time-period dummies, and then do a pooled OLS regression using all time-demeaned variables. The general time-demeaned equation for each i is

ÿ_it = β1·ẍ_it1 + β2·ẍ_it2 + … + βk·ẍ_itk + ü_it, t = 1, 2, …, T,   [14.5]

which we estimate by pooled OLS.

Under a strict exogeneity assumption on the explanatory variables, the fixed effects estimator is unbiased: roughly, the idiosyncratic error u_it should be uncorrelated with each explanatory variable across all time periods. (See the chapter appendix for precise statements of the assumptions.) The fixed effects estimator allows for arbitrary correlation between a_i and the explanatory variables in any time period, just as with first differencing. Because of this, any explanatory variable that is constant over time for all i gets swept away by the fixed effects transformation: ẍ_it = 0 for all i and t, if x_it is constant across t. Therefore, we cannot include variables such as gender or a city's distance from a river.

Exploring Further 14.1: Suppose that, in a family savings equation, for the years 1990, 1991, and 1992, we let kids_it denote the number of children in family i for year t. If the number of kids is constant over this three-year period for most families in the sample, what problems might this cause for estimating the effect that the number of kids has on savings?
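With grouped data, the within transformation in (14.5) is a one-liner. A minimal sketch, assuming a long panel df with identifier id and placeholder variables y, x1, and x2; it also applies the degrees-of-freedom correction discussed just below.

import numpy as np
import pandas as pd
import statsmodels.api as sm

cols = ["y", "x1", "x2"]
within = df[cols] - df.groupby("id")[cols].transform("mean")

# Pooled OLS on the time-demeaned data; no intercept, since the
# transformation sweeps it away along with everything time-constant
fe = sm.OLS(within["y"], within[["x1", "x2"]]).fit()

# Correct the standard errors for the N degrees of freedom lost to
# demeaning: the proper df is NT - N - k, not the NT - k that OLS uses
N, k = df["id"].nunique(), 2
nt = len(within)
se = fe.bse * np.sqrt((nt - k) / (nt - N - k))
print(fe.params, se)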
The other assumptions needed for a straight OLS analysis to be valid are that the errors u_it are homoskedastic and serially uncorrelated (across t); see the appendix to this chapter.

There is one subtle point in determining the degrees of freedom for the fixed effects estimator. When we estimate the time-demeaned equation (14.5) by pooled OLS, we have NT total observations and k independent variables. [Notice that there is no intercept in (14.5); it is eliminated by the fixed effects transformation.] Therefore, we should apparently have NT − k degrees of freedom. This calculation is incorrect. For each cross-sectional observation i, we lose one df because of the time-demeaning. In other words, for each i, the demeaned errors ü_it add up to zero when summed across t, so we lose one degree of freedom. (There is no such constraint on the original idiosyncratic errors, u_it.) Therefore, the appropriate degrees of freedom is df = NT − N − k = N(T − 1) − k. Fortunately, modern regression packages that have a fixed effects estimation feature properly compute the df. But if we have to do the time-demeaning and the estimation by pooled OLS ourselves, we need to correct the standard errors and test statistics.

Example 14.1  Effect of Job Training on Firm Scrap Rates

We use the data for three years, 1987, 1988, and 1989, on the 54 firms that reported scrap rates in each year. No firms received grants prior to 1988; in 1988, 19 firms received grants; in 1989, 10 different firms received grants. Therefore, we must also allow for the possibility that the additional job training in 1988 made workers more productive in 1989. This is easily done by including a lagged value of the grant indicator. We also include year dummies for 1988 and 1989. The results are given in Table 14.1.

Table 14.1  Fixed Effects Estimation of the Scrap Rate Equation
Dependent variable: log(scrap)

Independent variable    Coefficient    (Standard Error)
d88                     −.080          (.109)
d89                     −.247          (.133)
grant                   −.252          (.151)
grant_{−1}              −.422          (.210)

Observations: 162.  Degrees of freedom: 104.  R-squared: .201.

We have reported the results in a way that emphasizes the need to interpret the estimates in light of the unobserved effects model, (14.4). We are explicitly controlling for the unobserved, time-constant effects in a_i. The time-demeaning allows us to estimate the βj, but (14.5) is not the best equation for interpreting the estimates.

Interestingly, the estimated lagged effect of the training grant is substantially larger than the contemporaneous effect: job training has an effect at least one year later. Because the dependent variable is in logarithmic form, obtaining a grant in 1988 is predicted to lower the firm scrap rate in 1989 by about 34.4% [exp(−.422) − 1 ≈ −.344]; the coefficient on grant_{−1} is significant at the 5% level against a two-sided alternative. The coefficient on grant is significant at the 10% level, and the size of the coefficient is hardly trivial.
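A canned fixed effects routine reproduces estimates like those in Table 14.1 and handles the degrees of freedom automatically. A sketch using the third-party linearmodels package; the package choice and the variable names (fcode, year, lscrap, d88, d89, grant, grant_1, mirroring the JTRAIN layout) are assumptions, not guaranteed here.

import pandas as pd
from linearmodels.panel import PanelOLS

# df: firm-year rows with fcode, year, lscrap, d88, d89, grant, grant_1
panel = df.set_index(["fcode", "year"])
mod = PanelOLS.from_formula(
    "lscrap ~ 1 + d88 + d89 + grant + grant_1 + EntityEffects", data=panel)
res = mod.fit()                       # conventional standard errors
print(res.params, res.std_errors)     # compare with Table 14.1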
Notice the df is obtained as N(T − 1) − k = 54(3 − 1) − 4 = 104.

The coefficient on d89 indicates that the scrap rate was substantially lower in 1989 than in the base year, 1987, even in the absence of job training grants. Thus, it is important to allow for these aggregate effects. If we omitted the year dummies, the secular increase in worker productivity would be attributed to the job training grants. Table 14.1 shows that, even after controlling for aggregate trends in productivity, the job training grants had a large estimated effect.

Finally, it is crucial to allow for the lagged effect in the model. If we omit grant_{−1}, then we are assuming that the effect of job training does not last into the next year. The estimate on grant when we drop grant_{−1} is −.082 (t = −.65); this is much smaller and statistically insignificant.

When estimating an unobserved effects model by fixed effects, it is not clear how we should compute a goodness-of-fit measure. The R-squared given in Table 14.1 is based on the within transformation: it is the R-squared obtained from estimating (14.5). Thus, it is interpreted as the amount of time variation in the y_it that is explained by the time variation in the explanatory variables. Other ways of computing R-squared are possible, one of which we discuss later.

Although time-constant variables cannot be included by themselves in a fixed effects model, they can be interacted with variables that change over time and, in particular, with year dummy variables. For example, in a wage equation where education is constant over time for each individual in our sample, we can interact education with each year dummy to see how the return to education has changed over time. But we cannot use fixed effects to estimate the return to education in the base period, which means we cannot estimate the return to education in any period; we can only see how the return to education in each year differs from that in the base period. Section 14.3 describes an approach that allows coefficients on time-constant variables to be estimated while preserving the fixed effects nature of the analysis.

When we include a full set of year dummies (that is, year dummies for all years but the first), we cannot estimate the effect of any variable whose change across time is constant. An example is years of experience in a panel data set where each person works in every year, so that experience always increases by one in each year, for every person in the sample. The presence of a_i accounts for differences across people in their years of experience in the initial time period. But then the effect of a one-year increase in experience cannot be distinguished from the aggregate time effects, because experience increases by the same amount for everyone. This would also be true if, in place of separate year dummies, we used a linear time trend: for each person, experience cannot be distinguished from a linear trend.
Example 14.2  Has the Return to Education Changed over Time?

The data in WAGEPAN are from Vella and Verbeek (1998). Each of the 545 men in the sample worked in every year from 1980 through 1987. Some variables in the data set change over time: experience, marital status, and union status are the three important ones. Other variables do not change: race and education are the key examples. If we use fixed effects (or first differencing), we cannot include race, education, or experience in the equation. However, we can include interactions of educ with year dummies for 1981 through 1987 to test whether the return to education was constant over this time period. We use log(wage) as the dependent variable, dummy variables for marital and union status, a full set of year dummies, and the interaction terms d81·educ, d82·educ, …, d87·educ.

The estimates on these interaction terms are all positive, and they generally get larger for more recent years. The largest coefficient of .030 is on d87·educ, with t = 2.48. In other words, the return to education is estimated to be about 3 percentage points larger in 1987 than in the base year, 1980. (We do not have an estimate of the return to education in the base year for the reasons given earlier.) The other significant interaction term is d86·educ (coefficient = .027, t = 2.23). The estimates on the earlier years are smaller and insignificant at the 5% level against a two-sided alternative. If we do a joint F test for significance of all seven interaction terms, we get p-value = .28: this gives an example where a set of variables is "jointly insignificant" even though some variables are individually significant. [The df for the F test are 7 and 3,799; the second of these comes from N(T − 1) − k = 545(8 − 1) − 16 = 3,799.] Generally, the results are consistent with an increase in the return to education over this period.

Exploring Further 14.2: Under the Michigan program, if a firm received a grant in one year, it was not eligible for a grant the following year. What does this imply about the correlation between grant and grant_{−1}?

14.1a  The Dummy Variable Regression

A traditional view of the fixed effects approach is to assume that the unobserved effect, a_i, is a parameter to be estimated for each i. Thus, in equation (14.4), a_i is the intercept for person i (or firm i, city i, and so on) that is to be estimated along with the βj. (Clearly, we cannot do this with a single cross section: there would be N + k parameters to estimate with only N observations. We need at least two time periods.) The way we estimate an intercept for each i is to put in a dummy variable for each cross-sectional observation, along with the explanatory variables (and probably dummy variables for each time period). This method is usually called the dummy variable regression. Even when N is not very large (say, N = 54 as in Example 14.1), this results in many explanatory variables: in most cases, too many to explicitly carry out the regression. Thus, the dummy variable method is not very practical for panel data sets with many cross-sectional observations. Nevertheless, the dummy variable regression has some interesting features. Most importantly, it gives us exactly the same estimates of the βj that we would obtain from the regression on time-demeaned data, and the standard errors and other major statistics are identical. Therefore, the fixed effects estimator can be obtained by the dummy variable regression.
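The equivalence is easy to verify numerically; a sketch continuing the placeholder panel (id, y, x1, x2) used earlier:

import statsmodels.formula.api as smf

# Dummy variable regression: one intercept per cross-sectional unit
dv = smf.ols("y ~ C(id) + x1 + x2", data=df).fit()

# Within regression on time-demeaned data (no intercept)
cols = ["y", "x1", "x2"]
dm = df[cols] - df.groupby("id")[cols].transform("mean")
wi = smf.ols("y ~ x1 + x2 - 1", data=dm).fit()

# The slope estimates coincide; the dummy variable regression also
# reports standard errors with the correct degrees of freedom
print(dv.params[["x1", "x2"]].values, wi.params.values)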
One benefit of the dummy variable regression is that it properly computes the degrees of freedom directly. This is a minor advantage now that many econometrics packages have programmed fixed effects options.

The R-squared from the dummy variable regression is usually rather high. This occurs because we are including a dummy variable for each cross-sectional unit, which explains much of the variation in the data. For example, if we estimate the unobserved effects model in Example 13.8 by fixed effects using the dummy variable regression (which is possible with N = 22), then R² = .933. We should not get too excited about this large R-squared: it is not surprising that we can explain much of the variation in unemployment claims using both year and city dummies. Just as in Example 13.8, the estimate on the EZ dummy variable is more important than R².

The R-squared from the dummy variable regression can be used to compute F tests in the usual way, assuming, of course, that the classical linear model assumptions hold (see the chapter appendix). In particular, we can test the joint significance of all of the cross-sectional dummies (N − 1, since one unit is chosen as the base group). The unrestricted R-squared is obtained from the regression with all of the cross-sectional dummies; the restricted R-squared omits these. In the vast majority of applications, the dummy variables will be jointly significant.

Occasionally, the estimated intercepts, say â_i, are of interest. This is the case if we want to study the distribution of the â_i across i, or if we want to pick a particular firm or city to see whether its â_i is above or below the average value in the sample. These estimates are directly available from the dummy variable regression, but they are rarely reported by packages that have fixed effects routines (for the practical reason that there are so many â_i). After fixed effects estimation with N of any size, the â_i are pretty easy to compute:

â_i = ȳ_i − β̂1·x̄_i1 − … − β̂k·x̄_ik, i = 1, …, N,   [14.6]

where the overbar refers to the time averages and the β̂j are the fixed effects estimates. For example, if we have estimated a model of crime while controlling for various time-varying factors, we can obtain â_i for a city to see whether the unobserved fixed effects that contribute to crime are above or below average.
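Equation (14.6) makes the â_i cheap to recover after estimation; a sketch under the same placeholder names (note the connection to the reported "intercept" discussed next):

import statsmodels.formula.api as smf

# Fixed effects slopes (here via the dummy variable regression)
dv = smf.ols("y ~ C(id) + x1 + x2", data=df).fit()
b = dv.params[["x1", "x2"]]

# Equation (14.6): a_i-hat = ybar_i - b1*x1bar_i - b2*x2bar_i
means = df.groupby("id")[["y", "x1", "x2"]].mean()
a_hat = means["y"] - means[["x1", "x2"]].mul(b).sum(axis=1)

# The overall "intercept" some packages report is the average a_i-hat
print(a_hat.head(), a_hat.mean())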
best to view the ai as omitted variables that we control for through the within transformation The sense in which the ai can be estimated is generally weak In fact even though a i is unbiased under Assumptions FE1 through FE4 in the chapter appendix it is not consistent with a fixed T as N S The reason is that as we add each additional cross sectional observation we add a new ai No information accumulates on each ai when T is fixed With larger T we can get better estimates of the ai but most panel data sets are of the large N and small T variety 141b Fixed Effects or First Differencing So far setting aside pooled OLS we have seen two competing methods for estimating unobserved effects models One involves differencing the data and the other involves timedemeaning How do we know which one to use We can eliminate one case immediately when T 5 2 the FE and FD estimates as well as all test statistics are identical and so it does not matter which we use Of course the equivalence between the FE and FD estimates requires that we estimate the same model in each case In particular as we discussed in Chapter 13 it is natural to include an intercept in the FD equation this intercept is actu ally the intercept for the second time period in the original model written for the two time periods Therefore FE estimation must include a dummy variable for the second time period in order to be identical to the FD estimates that include an intercept With T 5 2 FD has the advantage of being straightforward to implement in any econometrics or statistical package that supports basic data manipulation and it is easy to compute heteroskedasticity robust statistics after FD estimation because when T 5 2 FD estimation is just a crosssectional regression When T 3 the FE and FD estimators are not the same Since both are unbiased under Assumptions FE1 through FE4 we cannot use unbiasedness as a criterion Further both are consist ent with T fixed as N S under FE1 through FE4 For large N and small T the choice between FE and FD hinges on the relative efficiency of the estimators and this is determined by the serial cor relation in the idiosyncratic errors uit We will assume homoskedasticity of the uit since efficiency comparisons require homoskedastic errors When the uit are serially uncorrelated fixed effects is more efficient than first differencing and the standard errors reported from fixed effects are valid Since the unobserved effects model is typically stated sometimes only implicitly with serially uncorrelated idiosyncratic errors the FE estimator is used more than the FD estimator But we should remember that this assumption can be false In many applications we can expect the unobserved factors that change over time to be serially correlated If uit follows a random walkwhich means that there is very substantial positive serial correlationthen the difference Duit is serially uncorrelated and first differencing is better In many cases the uit exhibit some positive serial correlation but perhaps not as much as a random walk Then we cannot easily compare the efficiency of the FE and FD estimators Copyright 2016 Cengage Learning All Rights Reserved May not be copied scanned or duplicated in whole or in part Due to electronic rights some third party content may be suppressed from the eBook andor eChapters Editorial review has deemed that any suppressed content does not materially affect the overall learning experience Cengage Learning reserves the right to remove additional content at any time if subsequent rights 
When $T \geq 3$, the FE and FD estimators are not the same. Since both are unbiased under Assumptions FE.1 through FE.4, we cannot use unbiasedness as a criterion. Further, both are consistent (with $T$ fixed as $N \to \infty$) under FE.1 through FE.4. For large $N$ and small $T$, the choice between FE and FD hinges on the relative efficiency of the estimators, and this is determined by the serial correlation in the idiosyncratic errors, $u_{it}$. (We will assume homoskedasticity of the $u_{it}$, since efficiency comparisons require homoskedastic errors.)

When the $u_{it}$ are serially uncorrelated, fixed effects is more efficient than first differencing (and the standard errors reported from fixed effects are valid). Since the unobserved effects model is typically stated (sometimes only implicitly) with serially uncorrelated idiosyncratic errors, the FE estimator is used more than the FD estimator. But we should remember that this assumption can be false. In many applications, we can expect the unobserved factors that change over time to be serially correlated. If $u_{it}$ follows a random walk (which means that there is very substantial, positive serial correlation), then the difference $\Delta u_{it}$ is serially uncorrelated, and first differencing is better. In many cases, the $u_{it}$ exhibit some positive serial correlation, but perhaps not as much as a random walk. Then, we cannot easily compare the efficiency of the FE and FD estimators.

It is difficult to test whether the $u_{it}$ are serially uncorrelated after FE estimation: we can estimate the time-demeaned errors, $\ddot{u}_{it}$, but not the $u_{it}$. However, in Section 13-3, we showed how to test whether the differenced errors, $\Delta u_{it}$, are serially uncorrelated. If this seems to be the case, FD can be used. If there is substantial negative serial correlation in the $\Delta u_{it}$, FE is probably better. It is often a good idea to try both: if the results are not sensitive, so much the better.

When $T$ is large, and especially when $N$ is not very large (for example, $N = 20$ and $T = 30$), we must exercise caution in using the fixed effects estimator. Although exact distributional results hold for any $N$ and $T$ under the classical fixed effects assumptions, inference can be very sensitive to violations of the assumptions when $N$ is small and $T$ is large. In particular, if we are using unit root processes (see Chapter 11), the spurious regression problem can arise. First differencing has the advantage of turning an integrated time series process into a weakly dependent process. Therefore, if we apply first differencing, we can appeal to the central limit theorem even in cases where $T$ is larger than $N$. Normality in the idiosyncratic errors is not needed, and heteroskedasticity and serial correlation can be dealt with as we touched on in Chapter 13. Inference with the fixed effects estimator is potentially more sensitive to nonnormality, heteroskedasticity, and serial correlation in the idiosyncratic errors.

Like the first difference estimator, the fixed effects estimator can be very sensitive to classical measurement error in one or more explanatory variables. However, if each $x_{itj}$ is uncorrelated with $u_{it}$, but the strict exogeneity assumption is otherwise violated (for example, a lagged dependent variable is included among the regressors, or there is feedback between $u_{it}$ and future outcomes of the explanatory variable), then the FE estimator likely has substantially less bias than the FD estimator (unless $T = 2$). The important theoretical fact is that the bias in the FD estimator does not depend on $T$, while that for the FE estimator tends to zero at the rate $1/T$. [See Wooldridge (2010, Section 10.7) for details.]

Generally, it is difficult to choose between FE and FD when they give substantively different results. It makes sense to report both sets of results and to try to determine why they differ.

14-1c Fixed Effects with Unbalanced Panels

Some panel data sets, especially on individuals or firms, have missing years for at least some cross-sectional units in the sample. In this case, we call the data set an unbalanced panel. The mechanics of fixed effects estimation with an unbalanced panel are not much more difficult than with a balanced panel. If $T_i$ is the number of time periods for cross-sectional unit $i$, we simply use these $T_i$ observations in doing the time-demeaning. The total number of observations is then $T_1 + T_2 + \cdots + T_N$. As in the balanced case, one degree of freedom is lost for every cross-sectional observation due to the time-demeaning. Any regression package that does fixed effects makes the appropriate adjustment for this loss. The dummy variable regression also goes through in exactly the same way as with a balanced panel, and the df is appropriately obtained.

It is easy to see that units for which we have only a single time period play no role in a fixed effects analysis. The time-demeaning for such observations yields all zeros, which are not used in the estimation. (If $T_i$ is at most two for all $i$, we can use first differencing: if $T_i = 1$ for any $i$, we do not have two periods to difference.)
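A minimal pandas sketch of the within transformation on an unbalanced panel follows; the toy data and the column names are assumptions for illustration only.

import pandas as pd

panel = pd.DataFrame({
    "id":   [1, 1, 1, 2, 2, 3],          # unit 3 has T_i = 1 and will drop out
    "year": [1, 2, 3, 1, 3, 2],
    "y":    [2.0, 2.5, 3.1, 1.0, 1.4, 5.0],
    "x":    [0.3, 0.7, 1.2, 0.1, 0.5, 2.0],
})
g = panel.groupby("id")[["y", "x"]]
demeaned = panel[["y", "x"]] - g.transform("mean")   # each unit demeaned with its own T_i
# Unit 3 contributes only zeros; pooled OLS of demeaned y on demeaned x gives the
# FE slope, with degrees of freedom reduced by N for the estimated time averages.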
The more difficult issue with an unbalanced panel is determining why the panel is unbalanced. With cities and states, for example, data on key variables are sometimes missing for certain years. Provided the reason we have missing data for some $i$ is not correlated with the idiosyncratic errors, $u_{it}$, the unbalanced panel causes no problems. When we have data on individuals, families, or firms, things are trickier. Imagine, for example, that we obtain a random sample of manufacturing firms in 1990, and we are interested in testing how unionization affects firm profitability. Ideally, we can use a panel data analysis to control for unobserved worker and management characteristics that affect profitability and might also be correlated with the fraction of the firm's workforce that is unionized. If we collect data again in subsequent years, some firms may be lost because they have gone out of business or have merged with other companies. If so, we probably have a nonrandom sample in subsequent time periods. The question is: If we apply fixed effects to the unbalanced panel, when will the estimators be unbiased (or at least consistent)?

If the reason a firm leaves the sample (called attrition) is correlated with the idiosyncratic error (those unobserved factors that change over time and affect profits), then the resulting sample selection problem (see Chapter 9) can cause biased estimators. This is a serious consideration in this example. Nevertheless, one useful thing about a fixed effects analysis is that it does allow attrition to be correlated with $a_i$, the unobserved effect. The idea is that, with the initial sampling, some units are more likely to drop out of the survey, and this is captured by $a_i$.

Example 14.3 Effect of Job Training on Firm Scrap Rates

We add two variables to the analysis in Table 14.1: $\log(sales_{it})$ and $\log(employ_{it})$, where sales is annual firm sales and employ is the number of employees. Three of the 54 firms drop out of the analysis entirely because they do not have sales or employment data. Five additional observations are lost due to missing data on one or both of these variables for some years, leaving us with $n = 148$. Using fixed effects on the unbalanced panel does not change the basic story, although the estimated grant effect gets larger: $\hat{\beta}_{grant} = -.297$, $t_{grant} = -1.89$; $\hat{\beta}_{grant_{-1}} = -.536$, $t_{grant_{-1}} = -2.389$.

Solving general attrition problems in panel data is complicated and beyond the scope of this text. [See, for example, Wooldridge (2010, Chapter 19).]

14-2 Random Effects Models

We begin with the same unobserved effects model as before,

$y_{it} = \beta_0 + \beta_1 x_{it1} + \cdots + \beta_k x_{itk} + a_i + u_{it},$  [14.7]

where we explicitly include an intercept so that we can make the assumption that the unobserved effect, $a_i$, has zero mean (without loss of generality). We would usually allow for time dummies among the explanatory variables as well. In using fixed effects or first differencing, the goal is to eliminate $a_i$ because it is thought to be correlated with one or more of the $x_{itj}$. But suppose we think $a_i$ is uncorrelated with each explanatory variable in all time periods. Then, using a transformation to eliminate $a_i$ results in inefficient estimators.
Equation (14.7) becomes a random effects model when we assume that the unobserved effect $a_i$ is uncorrelated with each explanatory variable:

$\mathrm{Cov}(x_{itj}, a_i) = 0, \quad t = 1, 2, \ldots, T; \; j = 1, 2, \ldots, k.$  [14.8]

In fact, the ideal random effects assumptions include all of the fixed effects assumptions plus the additional requirement that $a_i$ is independent of all explanatory variables in all time periods. (See the chapter appendix for the actual assumptions used.) If we think the unobserved effect $a_i$ is correlated with any explanatory variables, we should use first differencing or fixed effects.

Under (14.8), and along with the random effects assumptions, how should we estimate the $\beta_j$? It is important to see that, if we believe that $a_i$ is uncorrelated with the explanatory variables, the $\beta_j$ can be consistently estimated by using a single cross section: there is no need for panel data at all. But using a single cross section disregards much useful information in the other time periods. We can also use the data in a pooled OLS procedure: just run OLS of $y_{it}$ on the explanatory variables and probably the time dummies. This, too, produces consistent estimators of the $\beta_j$ under the random effects assumption. But it ignores a key feature of the model. If we define the composite error term as $v_{it} = a_i + u_{it}$, then (14.7) can be written as

$y_{it} = \beta_0 + \beta_1 x_{it1} + \cdots + \beta_k x_{itk} + v_{it}.$  [14.9]

Because $a_i$ is in the composite error in each time period, the $v_{it}$ are serially correlated across time. In fact, under the random effects assumptions,

$\mathrm{Corr}(v_{it}, v_{is}) = \sigma_a^2/(\sigma_a^2 + \sigma_u^2), \quad t \neq s,$

where $\sigma_a^2 = \mathrm{Var}(a_i)$ and $\sigma_u^2 = \mathrm{Var}(u_{it})$. This necessarily positive serial correlation in the error term can be substantial, and, because the usual pooled OLS standard errors ignore this correlation, they will be incorrect, as will the usual test statistics. In Chapter 12, we showed how generalized least squares can be used to estimate models with autoregressive serial correlation. We can also use GLS to solve the serial correlation problem here. For the procedure to have good properties, we should have large $N$ and relatively small $T$. We assume that we have a balanced panel, although the method can be extended to unbalanced panels.

Deriving the GLS transformation that eliminates serial correlation in the errors requires sophisticated matrix algebra. [See, for example, Wooldridge (2010, Chapter 10).] But the transformation itself is simple. Define

$\theta = 1 - [\sigma_u^2/(\sigma_u^2 + T\sigma_a^2)]^{1/2},$  [14.10]

which is between zero and one. Then, the transformed equation turns out to be

$y_{it} - \theta\bar{y}_i = \beta_0(1 - \theta) + \beta_1(x_{it1} - \theta\bar{x}_{i1}) + \cdots + \beta_k(x_{itk} - \theta\bar{x}_{ik}) + (v_{it} - \theta\bar{v}_i),$  [14.11]

where the overbar again denotes the time averages. This is a very interesting equation, as it involves quasi-demeaned data on each variable. The fixed effects estimator subtracts the time averages from the corresponding variable. The random effects transformation subtracts a fraction of that time average, where the fraction depends on $\sigma_u^2$, $\sigma_a^2$, and the number of time periods, $T$. The GLS estimator is simply the pooled OLS estimator of equation (14.11).
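For concreteness, here is a minimal Python sketch of the transformation in (14.10) and (14.11) for a balanced panel in long format, taking the variance components as given; the helper function and the column names are invented for illustration.

import numpy as np
import pandas as pd

def quasi_demean(df, cols, sigma2_u, sigma2_a, T, unit="id"):
    # theta = 1 - [sigma_u^2 / (sigma_u^2 + T*sigma_a^2)]^(1/2), eq. (14.10)
    theta = 1.0 - np.sqrt(sigma2_u / (sigma2_u + T * sigma2_a))
    means = df.groupby(unit)[cols].transform("mean")
    out = df[cols] - theta * means          # quasi-demeaned data, eq. (14.11)
    out["const"] = 1.0 - theta              # transformed intercept regressor
    return out

Pooled OLS of the transformed dependent variable on the transformed regressors, including the "const" column in place of the usual intercept, is then the GLS estimator.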
It is hardly obvious that the errors in (14.11) are serially uncorrelated, but they are. (See Problem 3.)

The transformation in (14.11) allows for explanatory variables that are constant over time, and this is one advantage of random effects (RE) over either fixed effects or first differencing. This is possible because RE assumes that the unobserved effect is uncorrelated with all explanatory variables, whether the explanatory variables are fixed over time or not. Thus, in a wage equation, we can include a variable such as education even if it does not change over time. But we are assuming that education is uncorrelated with $a_i$, which contains ability and family background. In many applications, the whole reason for using panel data is to allow the unobserved effect to be correlated with the explanatory variables.

The parameter $\theta$ is never known in practice, but it can always be estimated. There are different ways to do this, which may be based on pooled OLS or fixed effects, for example. Generally, $\hat{\theta}$ takes the form $\hat{\theta} = 1 - \{1/[1 + T(\hat{\sigma}_a^2/\hat{\sigma}_u^2)]\}^{1/2}$, where $\hat{\sigma}_a^2$ is a consistent estimator of $\sigma_a^2$ and $\hat{\sigma}_u^2$ is a consistent estimator of $\sigma_u^2$. These estimators can be based on the pooled OLS or fixed effects residuals. One possibility is

$\hat{\sigma}_a^2 = [NT(T-1)/2 - (k+1)]^{-1} \sum_{i=1}^{N} \sum_{t=1}^{T-1} \sum_{s=t+1}^{T} \hat{v}_{it}\hat{v}_{is},$

where the $\hat{v}_{it}$ are the residuals from estimating (14.9) by pooled OLS. Given this, we can estimate $\sigma_u^2$ by using $\hat{\sigma}_u^2 = \hat{\sigma}_v^2 - \hat{\sigma}_a^2$, where $\hat{\sigma}_v^2$ is the square of the usual standard error of the regression from pooled OLS. [See Wooldridge (2010, Chapter 10) for additional discussion of these estimators.]

Many econometrics packages support estimation of random effects models and automatically compute some version of $\hat{\theta}$. The feasible GLS estimator that uses $\hat{\theta}$ in place of $\theta$ is called the random effects estimator. Under the random effects assumptions in the chapter appendix, the estimator is consistent (not unbiased) and asymptotically normally distributed as $N$ gets large with fixed $T$. The properties of the random effects (RE) estimator with small $N$ and large $T$ are largely unknown, although it has certainly been used in such situations.
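These variance-component formulas translate directly into code. The sketch below assumes a balanced panel with the pooled OLS residuals arranged as an $N \times T$ array; the function name and the layout are assumptions for illustration.

import numpy as np

def estimate_theta(resid, k):
    # resid: (N, T) array of pooled OLS residuals v-hat; k = number of slopes
    N, T = resid.shape
    cross = 0.0
    for t in range(T - 1):                  # sum of v_it * v_is over all s > t
        cross += (resid[:, t:t+1] * resid[:, t+1:]).sum()
    sigma2_a = cross / (N * T * (T - 1) / 2 - (k + 1))
    sigma2_v = (resid ** 2).sum() / (N * T - (k + 1))   # square of the pooled OLS SER
    sigma2_u = sigma2_v - sigma2_a
    return 1.0 - np.sqrt(1.0 / (1.0 + T * sigma2_a / sigma2_u))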
Equation (14.11) allows us to relate the RE estimator to both pooled OLS and fixed effects. Pooled OLS is obtained when $\theta = 0$, and FE is obtained when $\theta = 1$. In practice, the estimate $\hat{\theta}$ is never zero or one. But if $\hat{\theta}$ is close to zero, the RE estimates will be close to the pooled OLS estimates. This is the case when the unobserved effect, $a_i$, is relatively unimportant (because it has small variance relative to $\sigma_u^2$). It is more common for $\sigma_a^2$ to be large relative to $\sigma_u^2$, in which case $\hat{\theta}$ will be closer to unity. As $T$ gets large, $\hat{\theta}$ tends to one, and this makes the RE and FE estimates very similar.

We can gain more insight on the relative merits of random effects versus fixed effects by writing the quasi-demeaned error in equation (14.11) as $v_{it} - \theta\bar{v}_i = (1 - \theta)a_i + u_{it} - \theta\bar{u}_i$. This simple expression makes it clear that, in the transformed equation, the unobserved effect is weighted by $(1 - \theta)$. Although correlation between $a_i$ and one or more $x_{itj}$ causes inconsistency in the random effects estimation, we see that the correlation is attenuated by the factor $(1 - \theta)$. As $\theta \to 1$, the bias term goes to zero, as it must, because the RE estimator tends to the FE estimator. If $\theta$ is close to zero, we are leaving a larger fraction of the unobserved effect in the error term, and, as a consequence, the asymptotic bias of the RE estimator will be larger.

In applications of FE and RE, it is usually informative also to compute the pooled OLS estimates. Comparing the three sets of estimates can help us determine the nature of the biases caused by leaving the unobserved effect, $a_i$, entirely in the error term (as does pooled OLS) or partially in the error term (as does the RE transformation). But we must remember that, even if $a_i$ is uncorrelated with all explanatory variables in all time periods, the pooled OLS standard errors and test statistics are generally invalid: they ignore the often substantial serial correlation in the composite errors, $v_{it} = a_i + u_{it}$. As we mentioned in Chapter 13 (see Example 13.9), it is possible to compute standard errors and test statistics that are robust to arbitrary serial correlation (and heteroskedasticity) in $v_{it}$, and popular statistics packages often allow this option. [See, for example, Wooldridge (2010, Chapter 10).]

Example 14.4 A Wage Equation Using Panel Data

We again use the data in WAGEPAN to estimate a wage equation for men. We use three methods: pooled OLS, random effects, and fixed effects. In the first two methods, we can include educ and race dummies (black and hispan), but these drop out of the fixed effects analysis. The time-varying variables are exper, exper², union, and married. As we discussed in Section 14-1, exper is dropped in the FE analysis (although exper² remains). Each regression also contains a full set of year dummies. The estimation results are in Table 14.2.

Table 14.2 Three Different Estimators of a Wage Equation
Dependent Variable: log(wage)

Independent Variables | Pooled OLS     | Random Effects | Fixed Effects
educ                  | .091 (.005)    | .092 (.011)    | —
black                 | −.139 (.024)   | −.139 (.048)   | —
hispan                | .016 (.021)    | .022 (.043)    | —
exper                 | .067 (.014)    | .106 (.015)    | —
exper²                | −.0024 (.0008) | −.0047 (.0007) | −.0052 (.0007)
married               | .108 (.016)    | .064 (.017)    | .047 (.018)
union                 | .182 (.017)    | .106 (.018)    | .080 (.019)

The coefficients on educ, black, and hispan are similar for the pooled OLS and random effects estimations. The pooled OLS standard errors are the usual OLS standard errors, and these underestimate the true standard errors because they ignore the positive serial correlation; we report them here for comparison only. The experience profile is somewhat different, and both the marriage and union premiums fall notably in the random effects estimation.

When we eliminate the unobserved effect entirely by using fixed effects, the marriage premium falls to about 4.7%, although it is still statistically significant. The drop in the marriage premium is consistent with the idea that men who are more able (as captured by a higher unobserved effect, $a_i$) are more likely to be married. Therefore, in the pooled OLS estimation, a large part of the marriage premium reflects the fact that men who are married would earn more even if they were not married.
The remaining 4.7% has at least two possible explanations: (1) marriage really makes men more productive or (2) employers pay married men a premium because marriage is a signal of stability. We cannot distinguish between these two hypotheses.

The estimate of $\theta$ for the random effects estimation is $\hat{\theta} = .643$, which helps explain why, on the time-varying variables, the RE estimates lie closer to the FE estimates than to the pooled OLS estimates.

Exploring Further 14.3: The union premium estimated by fixed effects is about 10 percentage points lower than the OLS estimate. What does this strongly suggest about the correlation between union and the unobserved effect?

14-2a Random Effects or Fixed Effects?

Because fixed effects allows arbitrary correlation between $a_i$ and the $x_{itj}$, while random effects does not, FE is widely thought to be a more convincing tool for estimating ceteris paribus effects. Still, random effects is applied in certain situations. Most obviously, if the key explanatory variable is constant over time, we cannot use FE to estimate its effect on $y$. For example, in Table 14.2, we must rely on the RE (or pooled OLS) estimate of the return to education. Of course, we can only use random effects because we are willing to assume the unobserved effect is uncorrelated with all explanatory variables. Typically, if one uses random effects, as many time-constant controls as possible are included among the explanatory variables. (With an FE analysis, it is not necessary to include such controls.) RE is preferred to pooled OLS because RE is generally more efficient.

If our interest is in a time-varying explanatory variable, is there ever a case to use RE rather than FE? Yes, but situations in which $\mathrm{Cov}(x_{itj}, a_i) = 0$ should be considered the exception rather than the rule. If the key policy variable is set experimentally (say, each year, children are randomly assigned to classes of different sizes), then random effects would be appropriate for estimating the effect of class size on performance. Unfortunately, in most cases the regressors are themselves outcomes of choice processes and likely to be correlated with individual preferences and abilities as captured by $a_i$.

It is still fairly common to see researchers apply both random effects and fixed effects, and then formally test for statistically significant differences in the coefficients on the time-varying explanatory variables. (So, in Table 14.2, these would be the coefficients on exper², married, and union.) Hausman (1978) first proposed such a test, and some econometrics packages routinely compute the Hausman test under the full set of random effects assumptions listed in the appendix to this chapter. The idea is that one uses the random effects estimates unless the Hausman test rejects (14.8). In practice, a failure to reject means either that the RE and FE estimates are sufficiently close that it does not matter which is used, or that the sampling variation is so large in the FE estimates that one cannot conclude practically significant differences are statistically significant. In the latter case, one is left to wonder whether there is enough information in the data to provide precise estimates of the coefficients. A rejection using the Hausman test is taken to mean that the key RE assumption, (14.8), is false, and then the FE estimates are used.
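As a sketch of the mechanics only (not of any particular package's implementation), the classic form of the statistic can be computed from the FE and RE results for the time-varying coefficients. The function below and its argument names are illustrative; it assumes the full set of RE assumptions, under which $V_{FE} - V_{RE}$ is a valid variance of the difference, and it uses scipy only for the chi-squared p-value.

import numpy as np
from scipy import stats

def hausman(b_fe, V_fe, b_re, V_re):
    # H = (b_FE - b_RE)' [V_FE - V_RE]^(-1) (b_FE - b_RE) ~ chi2(M) under H0
    d = np.asarray(b_fe) - np.asarray(b_re)
    Vd = np.asarray(V_fe) - np.asarray(V_re)   # valid under the full RE assumptions
    H = float(d @ np.linalg.solve(Vd, d))
    return H, stats.chi2.sf(H, df=d.size)      # statistic and p-value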
Naturally, as in all applications of statistical inference, one should distinguish between a practically significant difference and a statistically significant difference. Wooldridge (2010, Chapter 10) contains further discussion. In the next section, we discuss an alternative, computationally simpler approach to choosing between the RE and FE approaches.

A final word of caution. In reading empirical work, you may find that some authors decide on FE versus RE estimation based on whether the $a_i$ are properly viewed as parameters to estimate or as random variables. Such considerations are usually wrongheaded. In this chapter, we have treated the $a_i$ as random variables in the unobserved effects model (14.7), regardless of how we decide to estimate the $\beta_j$. As we have emphasized, the key issue that determines whether we use FE or RE is whether we can plausibly assume $a_i$ is uncorrelated with all $x_{itj}$. Nevertheless, in some applications of panel data methods, we cannot treat our sample as a random sample from a large population, especially when the unit of observation is a large geographical unit (say, states or provinces). Then, it often makes sense to think of each $a_i$ as a separate intercept to estimate for each cross-sectional unit. In this case, we use fixed effects. (Remember, using FE is mechanically the same as allowing a different intercept for each cross-sectional unit.) Fortunately, whether or not we engage in the philosophical debate about the nature of $a_i$, FE is almost always much more convincing than RE for policy analysis using aggregated data.

14-3 The Correlated Random Effects Approach

In applications where it makes sense to view the $a_i$ (unobserved effects) as being random variables, along with the observed variables we draw, there is an alternative to fixed effects that still allows $a_i$ to be correlated with the observed explanatory variables. To describe the approach, consider again the simple model in equation (14.1), with a single time-varying explanatory variable, $x_{it}$. Rather than assume $a_i$ is uncorrelated with $\{x_{it}: t = 1, 2, \ldots, T\}$ (which is the random effects approach), or take away time averages to remove $a_i$ (the fixed effects approach), we might instead model correlation between $a_i$ and $\{x_{it}: t = 1, 2, \ldots, T\}$. Because $a_i$ is, by definition, constant over time, allowing it to be correlated with the average level of the $x_{it}$ has a certain appeal. More specifically, let $\bar{x}_i = T^{-1}\sum_{t=1}^{T} x_{it}$ be the time average, as before. Suppose we assume the simple linear relationship

$a_i = \alpha + \gamma\bar{x}_i + r_i,$  [14.12]

where we assume $r_i$ is uncorrelated with each $x_{it}$. Because $\bar{x}_i$ is a linear function of the $x_{it}$,

$\mathrm{Cov}(\bar{x}_i, r_i) = 0.$  [14.13]

Equations (14.12) and (14.13) imply that $a_i$ and $\bar{x}_i$ are correlated whenever $\gamma \neq 0$.

The correlated random effects (CRE) approach uses (14.12) in conjunction with (14.1): substituting the former in the latter gives

$y_{it} = \beta x_{it} + \alpha + \gamma\bar{x}_i + r_i + u_{it} = \alpha + \beta x_{it} + \gamma\bar{x}_i + r_i + u_{it}.$  [14.14]

Equation (14.14) is interesting because it still has a composite error term, $r_i + u_{it}$, consisting of a time-constant unobservable, $r_i$, and the idiosyncratic shocks, $u_{it}$. Importantly, assumption (14.8) holds when we replace $a_i$ with $r_i$. Also, because $u_{it}$ is assumed to be uncorrelated with $x_{is}$, all $s$ and $t$, $u_{it}$ is also uncorrelated with $\bar{x}_i$. All of these assumptions add up to random effects estimation of the equation

$y_{it} = \alpha + \beta x_{it} + \gamma\bar{x}_i + r_i + u_{it},$  [14.15]

which is like the usual equation underlying RE estimation, with the important addition of the time-average variable, $\bar{x}_i$.
It is the addition of $\bar{x}_i$ that controls for the correlation between $a_i$ and the sequence $\{x_{it}: t = 1, 2, \ldots, T\}$. What is left over, $r_i$, is uncorrelated with the $x_{it}$.

In most econometrics packages, it is easy to compute the unit-specific time averages, $\bar{x}_i$. Assuming we have done that for each cross-sectional unit $i$, what can we expect to happen if we apply RE to equation (14.15)? Notice that estimation of (14.15) gives $\hat{\alpha}_{CRE}$, $\hat{\beta}_{CRE}$, and $\hat{\gamma}_{CRE}$, the CRE estimators. As far as $\hat{\beta}_{CRE}$ goes, the answer is a bit anticlimactic. It can be shown [see, for example, Wooldridge (2010, Chapter 10)] that

$\hat{\beta}_{CRE} = \hat{\beta}_{FE},$  [14.16]

where $\hat{\beta}_{FE}$ denotes the FE estimator from equation (14.3). In other words, adding the time average $\bar{x}_i$ and using random effects is the same as subtracting the time averages and using pooled OLS.

Even though (14.15) is not needed to obtain $\hat{\beta}_{FE}$, the equivalence of the CRE and FE estimates of $\beta$ provides a nice interpretation of FE: it controls for the average level, $\bar{x}_i$, when measuring the partial effect of $x_{it}$ on $y_{it}$. As an example, suppose that $x_{it}$ is a tax rate on firm profits in county $i$ in year $t$, and $y_{it}$ is some measure of county-level economic output. By including $\bar{x}_i$, the average tax rate in the county over the $T$ years, we are allowing for systematic differences between historically high-tax and low-tax counties, differences that may also affect economic output.

We can also use equation (14.15) to see why the FE estimators are often much less precise than the RE estimators. If we set $\gamma = 0$ in equation (14.15), then we obtain the usual RE estimator of $\beta$, $\hat{\beta}_{RE}$. This means that correlation between $x_{it}$ and $\bar{x}_i$ has no bearing on the variance of the RE estimator. By contrast, we know from multiple regression analysis in Chapter 3 that correlation between $x_{it}$ and $\bar{x}_i$ (that is, multicollinearity) can result in a higher variance for $\hat{\beta}_{FE}$. Sometimes, the variance is much higher, particularly when there is little variation in $x_{it}$ across $t$, in which case $x_{it}$ and $\bar{x}_i$ tend to be highly positively correlated. In the limiting case where there is no variation across time for any $i$, the correlation is perfect, and FE fails to provide an estimate of $\beta$.

Apart from providing a synthesis of the FE and RE approaches, are there other reasons to consider the CRE approach, even if it simply delivers the usual FE estimate of $\beta$? Yes, at least two. First, the CRE approach provides a simple, formal way of choosing between the FE and RE approaches. As we just discussed, the RE approach sets $\gamma = 0$, while FE estimates $\gamma$. Because we have $\hat{\gamma}_{CRE}$ and its standard error obtained from RE estimation of (14.15), we can construct a t test of $H_0\colon \gamma = 0$ against $H_1\colon \gamma \neq 0$. (The appendix discusses how to make this test robust to heteroskedasticity and serial correlation in $\{u_{it}\}$.) If we reject $H_0$ at a sufficiently small significance level, we reject RE in favor of FE. As usual, especially with a large cross section, it is important to distinguish between a statistical rejection and economically important differences.
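The equivalence in (14.16) can be checked in a few lines. The simulation below is a sketch with an invented data-generating process; for simplicity it uses the pooled OLS version of the augmented regression, which, for a balanced panel, delivers the same coefficient on $x_{it}$ as RE estimation of (14.15) and as the within estimator.

import numpy as np

rng = np.random.default_rng(1)
N, T = 400, 5
a = rng.normal(size=N)
x = 0.8 * a[:, None] + rng.normal(size=(N, T))      # x correlated with a_i
y = 2.0 + 1.5 * x + a[:, None] + rng.normal(size=(N, T))

# Augmented regression: pooled OLS of y on (1, x, xbar_i)
xbar = np.repeat(x.mean(axis=1), T)
X = np.column_stack([np.ones(N * T), x.ravel(), xbar])
b_cre = np.linalg.lstsq(X, y.ravel(), rcond=None)[0][1]

# Within (FE) estimator for comparison
xd = x - x.mean(axis=1, keepdims=True)
yd = y - y.mean(axis=1, keepdims=True)
b_fe = (xd * yd).sum() / (xd ** 2).sum()

print(b_cre, b_fe)      # equal up to floating-point error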
A second reason to study the CRE approach is that it provides a way to include time-constant explanatory variables in what is effectively a fixed effects analysis. For example, let $z_i$ be a variable that does not change over time; it could be gender, say, or an IQ test score determined in childhood. We can easily augment (14.15) to include $z_i$:

$y_{it} = \alpha + \beta x_{it} + \gamma\bar{x}_i + \delta z_i + r_i + u_{it},$  [14.17]

where we do not change the notation for the error term, which no longer includes $z_i$. If we estimate this expanded equation by RE, it can still be shown that the estimate of $\beta$ is the FE estimate from (14.1). In fact, once we include $\bar{x}_i$, we can include any other time-constant variables in the equation, estimate it by RE, and obtain $\hat{\beta}_{FE}$ as the coefficient on $x_{it}$. In addition, we obtain an estimate of $\delta$, although the estimate should be interpreted with caution because it does not necessarily estimate a causal effect of $z_i$ on $y_{it}$.

The same CRE strategy can be applied to models with many time-varying explanatory variables and many time-constant variables. When the equation augmented with the time averages is estimated by RE, the coefficients on the time-varying variables are identical to the FE estimates. As a practical note, when the panel is balanced, there is no need to include the time averages of variables that change over time only, the leading case being time period dummies. With $T$ time periods, the time average of a time period dummy is just $1/T$, a constant for all $i$ and $t$; clearly, it makes no sense to add a bunch of constants to an equation that already has an intercept. If the panel data set is unbalanced, then the average of variables such as time dummies can change across $i$: it will depend on how many periods we have for cross-sectional unit $i$. In such cases, the time averages of any variable that changes over time must be included.

Computer Exercise 14 in this chapter illustrates how the CRE approach can be applied to the balanced panel data set in AIRFARE, and how one can test RE versus FE in the CRE framework.

14-3a Unbalanced Panels

The correlated random effects approach also can be applied to unbalanced panels, but some care is required. In order to obtain an estimator that reproduces the fixed effects estimates on the time-varying explanatory variables, one must be careful in constructing the time averages. In particular, for $y$ or any $x_j$, a time period contributes to the time average, $\bar{y}_i$ or $\bar{x}_{ij}$, only if data on all of $(y_{it}, x_{it1}, \ldots, x_{itk})$ are observed. One way to depict the situation is to define a dummy variable, $s_{it}$, which equals one when a complete set of data on $(y_{it}, x_{it1}, \ldots, x_{itk})$ is observed. If any element is missing (including, of course, if the entire time period is missing), then $s_{it} = 0$. (The notion of a selection indicator is discussed in more detail in Chapter 17.) With this definition, the appropriate time average of $\{y_{it}\}$ can be written as

$\bar{y}_i = T_i^{-1}\sum_{t=1}^{T} s_{it}y_{it},$

where $T_i$ is the total number of complete time periods for cross-sectional observation $i$. In other words, we only average over the time periods that have a complete set of data.

Another subtle point is that, when time period dummies are included in the model (or any other variables that change only by $t$ and not $i$), we must now include their time averages, unlike in the balanced case, where the time averages are just constants.
For example, if $\{w_t: t = 1, \ldots, T\}$ is an aggregate time variable, such as a time dummy or a linear time trend, then

$\bar{w}_i = T_i^{-1}\sum_{t=1}^{T} s_{it}w_t.$

Because of the unbalanced nature of the panel, $\bar{w}_i$ almost always varies somewhat across $i$, unless the exact same time periods are missing for all cross-sectional units. As with variables that actually change across $i$ and $t$, the time averages of aggregate time effects are easy to obtain in many software packages.

The mechanics of the random effects estimator also change somewhat when we have an unbalanced panel, and this is true whether we use the traditional random effects estimator or the CRE version. Namely, the parameter $\theta$ in equation (14.10), used in equation (14.11) to obtain the quasi-demeaned data, depends on $i$ through the number of time periods observed for unit $i$: specifically, simply replace $T$ in equation (14.10) with $T_i$. Econometrics packages that support random effects estimation recognize this difference when using unbalanced panels, and so nothing special needs to be done from a user's perspective. The bottom line is that, once the time averages have been properly obtained, using an equation such as (14.17) is the same as in the balanced case. We can still use a test of statistical significance on the set of time averages to choose between fixed effects and pure random effects, and the CRE approach still allows us to include time-constant variables.

As with fixed effects estimation, a key issue is understanding why the panel data set is unbalanced. In the pure random effects case, the selection indicator, $s_{it}$, cannot be correlated with the composite error in equation (14.7), $a_i + u_{it}$, in any time period. Otherwise, as discussed in Wooldridge (2010, Chapter 19), the RE estimator is inconsistent. As discussed in Section 14-1, the FE estimator allows for arbitrary correlation between the selection indicator, $s_{it}$, and the fixed effect, $a_i$. Therefore, the FE estimator is more robust in the context of unbalanced panels. (And, as we already know, FE allows arbitrary correlation between time-varying explanatory variables and $a_i$.)
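A minimal pandas sketch of these complete-case time averages follows. The toy data, the column names, and the "bar_" prefix are all invented for illustration; missing values are assumed to be coded as NaN, so dropping them implements $s_{it} = 1$.

import numpy as np
import pandas as pd

# Toy unbalanced panel; NaN marks missing data (so s_it = 0 for that row).
panel = pd.DataFrame({
    "id":  [1, 1, 1, 2, 2, 2],
    "d96": [0, 1, 0, 0, 1, 0],
    "y":   [1.0, 1.2, np.nan, 0.5, 0.9, 1.1],
    "x1":  [0.2, 0.4, 0.6, 0.1, np.nan, 0.5],
})
model_vars = ["y", "x1", "d96"]
complete = panel.dropna(subset=model_vars)           # rows with s_it = 1 only
tavg = complete.groupby("id")[["x1", "d96"]].transform("mean")
aug = complete.join(tavg.add_prefix("bar_"))         # adds bar_x1 and bar_d96
# bar_d96 differs across units because different periods are missing; RE on `aug`
# (with theta_i based on each unit's T_i) reproduces the FE estimates on x1, and a
# joint test on the bar_ terms chooses between FE and pure RE.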
14-4 Applying Panel Data Methods to Other Data Structures

The various panel data methods can be applied to certain data structures that do not involve time. For example, it is common in demography to use siblings (sometimes twins) to account for unobserved family and background characteristics. Usually, we want to allow the unobserved family effect, which is common to all siblings within a family, to be correlated with observed explanatory variables. If those explanatory variables vary across siblings within a family, differencing across sibling pairs (or, more generally, using the within transformation within a family) is preferred as an estimation method. By removing the unobserved effect, we eliminate potential bias caused by confounding family background characteristics. Implementing fixed effects on such data structures is rather straightforward in regression packages that support FE estimation.

As an example, Geronimus and Korenman (1992) used pairs of sisters to study the effects of teen childbearing on future economic outcomes. When the outcome is income relative to needs (something that depends on the number of children), the model is

$\log(incneeds_{fs}) = \beta_0 + \delta_0 sister2_s + \beta_1 teenbrth_{fs} + \beta_2 age_{fs} + \text{other factors} + a_f + u_{fs},$  [14.18]

where $f$ indexes family and $s$ indexes a sister within the family. The intercept for the first sister is $\beta_0$, and the intercept for the second sister is $\beta_0 + \delta_0$. The variable of interest is $teenbrth_{fs}$, which is a binary variable equal to one if sister $s$ in family $f$ had a child while a teenager. The variable $age_{fs}$ is the current age of sister $s$ in family $f$; Geronimus and Korenman also used some other controls. The unobserved variable $a_f$, which changes only across family, is an unobserved family effect or a family fixed effect. The main concern in the analysis is that teenbrth is correlated with the family effect. If so, an OLS analysis that pools across families and sisters gives a biased estimator of the effect of teenage motherhood on economic outcomes.

Solving this problem is simple: within each family, difference (14.18) across sisters to get

$\Delta\log(incneeds) = \delta_0 + \beta_1\Delta teenbrth + \beta_2\Delta age + \cdots + \Delta u;$  [14.19]

this removes the family effect, $a_f$, and the resulting equation can be estimated by OLS. Notice that there is no time element here: the differencing is across sisters within a family. Also, we have allowed for differences in intercepts across sisters in (14.18), which leads to a nonzero intercept in the differenced equation, (14.19). If, in entering the data, the order of the sisters within each family is essentially random, the estimated intercept should be close to zero. But even in such cases it does not hurt to include an intercept in (14.19), and having the intercept allows for the fact that, say, the first sister listed might always be the neediest.

Using 129 sister pairs from the 1982 National Longitudinal Survey of Young Women, Geronimus and Korenman first estimated $\beta_1$ by pooled OLS to obtain $-.33$ or $-.26$, where the second estimate comes from controlling for family background variables such as parents' education; both estimates are very statistically significant [see Table 3 in Geronimus and Korenman (1992)]. Therefore, teenage motherhood has a rather large impact on future family income. However, when the differenced equation is estimated, the coefficient on teenbrth is $-.08$, which is small and statistically insignificant. This suggests that it is largely a woman's family background that affects her future income, rather than teenage childbearing itself. Geronimus and Korenman looked at several other outcomes and two other data sets; in some cases, the within-family estimates were economically large and statistically significant. They also showed how the effects disappear entirely when the sisters' education levels are controlled for.

Exploring Further 14.4: When using the differencing method, does it make sense to include dummy variables for the mother's and father's race in (14.18)? Explain.
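Mechanically, the within-family differencing in (14.19) is one line of pandas. The frame and column names below are invented for illustration, with two sisters per family sorted so that the second sister comes second.

import pandas as pd

sisters = pd.DataFrame({
    "famid":     [1, 1, 2, 2],
    "sister":    [1, 2, 1, 2],
    "lincneeds": [0.9, 0.4, 1.1, 1.0],
    "teenbrth":  [0, 1, 0, 0],
    "age":       [27, 25, 30, 28],
})
cols = ["lincneeds", "teenbrth", "age"]
d = (sisters.sort_values(["famid", "sister"])
            .groupby("famid")[cols].diff()
            .dropna())                      # one differenced row per family
# OLS of d["lincneeds"] on an intercept, d["teenbrth"], and d["age"] removes a_f.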
Ashenfelter and Krueger (1994) used the differencing methodology to estimate the return to education. They obtained a sample of 149 identical twins and collected information on earnings, education, and other variables. Identical twins were used because they should have the same underlying ability. This can be differenced away by using twin differences, rather than OLS on the pooled data. Because identical twins are the same in age, gender, and race, these factors all drop out of the differenced equation. Therefore, Ashenfelter and Krueger regressed the difference in log(earnings) on the difference in education and estimated the return to education to be about 9.2% ($t = 3.83$). Interestingly, this is actually larger than the pooled OLS estimate of 8.4% (which controls for gender, age, and race). Ashenfelter and Krueger also estimated the equation by random effects and obtained 8.7% as the return to education. (See Table 5 in their paper.) The random effects analysis is mechanically the same as the panel data case with two time periods.

The samples used by Geronimus and Korenman (1992) and Ashenfelter and Krueger (1994) are examples of matched pairs samples. More generally, fixed and random effects methods can be applied to a cluster sample. A cluster sample has the same appearance as a cross-sectional data set, but there is an important difference: clusters of units are sampled from a population of clusters, rather than sampling individuals from the population of individuals. In the previous examples, each family is sampled from the population of families, and then we obtain data on at least two family members. Therefore, each family is a cluster.

As another example, suppose we are interested in modeling individual pension plan participation decisions. One might obtain a random sample of working individuals (say, from the United States), but it is also common to sample firms from a population of firms. Once the firms are sampled, one might collect information on all workers or a subset of workers within each firm. In either case, the resulting data set is a cluster sample because sampling was first at the firm level. Unobserved firm-level characteristics (along with observed firm characteristics) are likely to be present in participation decisions, and this within-firm correlation must be accounted for.

Fixed effects estimation is preferred when we think the unobserved cluster effect (an example of which is $a_i$ in (14.12)) is correlated with one or more of the explanatory variables. Then, we can only include explanatory variables that vary, at least somewhat, within clusters. The cluster sizes are rarely the same, so we are effectively using fixed effects methods for unbalanced panels.

Educational data on student outcomes can also come in the form of a cluster sample, where a sample of schools is obtained from the population of schools, and then information on students within each school is obtained. Each school acts as a cluster, and allowing a school effect to be correlated with key explanatory variables (say, whether a student participates in a state-sponsored tutoring program) is likely to be important. Because the rate at which students are tutored likely varies by school, it is probably a good idea to use fixed effects estimation. (One often sees authors use, as shorthand, "I included school fixed effects in the analysis.")

The correlated random effects approach can be applied immediately to cluster samples because, for the purposes of estimation, a cluster sample acts like an unbalanced panel. Now, the averages that are added to the equation are within-cluster averages, for example, averages within schools. The only difference with panel data is that the notion of serial correlation in idiosyncratic errors is not relevant.
Nevertheless, as discussed in Wooldridge (2010, Chapter 20), there are still good reasons for using cluster-robust standard errors, whether one uses fixed effects or correlated random effects.

In some cases, the key explanatory variables (often policy variables) change only at the level of the cluster, not within the cluster. In such cases, the fixed effects approach is not applicable. For example, we may be interested in the effects of measured teacher quality on student performance, where each cluster is an elementary school classroom. Because all students within a cluster have the same teacher, eliminating a "class effect" also eliminates any observed measures of teacher quality. If we have good controls in the equation, we may be justified in applying random effects on the unbalanced cluster. As with panel data, the key requirement for RE to produce convincing estimates is that the explanatory variables are uncorrelated with the unobserved cluster effect. Most econometrics packages allow random effects estimation on unbalanced clusters without much effort.

Pooled OLS is also commonly applied to cluster samples when eliminating a cluster effect via fixed effects is infeasible or undesirable. However, as with panel data, the usual OLS standard errors are incorrect unless there is no cluster effect, and so robust standard errors that allow cluster correlation (and heteroskedasticity) should be used. Some regression packages have simple commands to correct standard errors and the usual test statistics for general within-cluster correlation (as well as heteroskedasticity). These are the same corrections that work for pooled OLS on panel data sets, which we reported in Example 13.9. As an example, Papke (1999) estimates linear probability models for the continuation of defined benefit pension plans, based on whether firms adopted defined contribution plans. Because there is likely to be a firm effect that induces correlation across different plans within the same firm, Papke corrects the usual OLS standard errors for cluster sampling, as well as for heteroskedasticity in the linear probability model.
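To make the correction concrete, here is a minimal numpy sketch of the standard cluster-robust ("sandwich") variance estimator for pooled OLS, without the small-sample degrees-of-freedom adjustments that most packages apply; the function and its argument names are invented for illustration.

import numpy as np

def cluster_robust_se(X, resid, clusters):
    # V = (X'X)^(-1) [ sum_g (X_g' u_g)(X_g' u_g)' ] (X'X)^(-1)
    X = np.asarray(X, dtype=float)
    u = np.asarray(resid, dtype=float)
    clusters = np.asarray(clusters)
    bread = np.linalg.inv(X.T @ X)
    k = X.shape[1]
    meat = np.zeros((k, k))
    for g in np.unique(clusters):
        score = X[clusters == g].T @ u[clusters == g]   # within-cluster score sum
        meat += np.outer(score, score)
    V = bread @ meat @ bread
    return np.sqrt(np.diag(V))      # cluster-robust standard errors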
Before ending this section, some final comments are in order. Given the readily available tools of fixed effects, random effects, and cluster-robust inference, it is tempting to find reasons to use clustering methods where none may exist. For example, if a set of data is obtained from a random sample from the population, then there is usually no reason to account for cluster effects in computing standard errors after OLS estimation. The fact that the units can be put into groups ex post (that is, after the random sample has been obtained) is not a reason to make inference robust to cluster correlation.

To illustrate this point, suppose that, out of the population of fourth-grade students in the United States, a random sample of 50,000 is obtained; these data are properly studied using standard methods for cross-sectional regression. It may be tempting to group the students by, say, the 50 states plus the District of Columbia (assuming a state identifier is included) and then treat the data as a cluster sample. But this would be wrong, and clustering the standard errors at the state level can produce standard errors that are systematically too large. (Or they might be too small, because the asymptotic theory underlying cluster sampling assumes that we have many clusters, with each cluster size being relatively small.) In any case, a simple thought experiment shows that clustering cannot be correct. For example, if we know the county of residence for each student, why not cluster at the county level? Or, at a coarser level, we can divide the United States into four census regions and treat those as the clusters, and this would give a different set of standard errors that do not have any theoretical justification. Taking this argument to its extreme, one could argue that we have one cluster, the entire United States, in which case the clustered standard errors would not be defined and inference would be impossible. The confusion comes about because the clusters are defined ex post, that is, after the random sample is obtained. In a true cluster sample, the clusters are first drawn from a population of clusters, and then individuals are drawn from the clusters.

One might use clustering methods if, say, a district-level variable is created after the random sample is collected and then used in the student-level equation. This can create unobserved cluster correlation within each district. Recall that the fixed effects estimator, in this case at the district level, is the same as putting in district-level averages. Thus, one might want to account for cluster correlation at the district level in addition to using fixed effects. As shown by Stock and Watson (2008) in the context of panel data, with large cluster sizes the resulting cluster correlation is generally unimportant, but with small cluster sizes one should use the cluster-robust standard errors.

Summary

In this chapter, we have continued our discussion of panel data methods, studying the fixed effects and random effects estimators, and we also described the correlated random effects approach as a unifying framework. Compared with first differencing, the fixed effects estimator is efficient when the idiosyncratic errors are serially uncorrelated (as well as homoskedastic), and we make no assumptions about correlation between the unobserved effect, $a_i$, and the explanatory variables. As with first differencing, any time-constant explanatory variables drop out of the analysis. Fixed effects methods apply immediately to unbalanced panels, but we must assume that the reasons some time periods are missing are not systematically related to the idiosyncratic errors.

The random effects estimator is appropriate when the unobserved effect is thought to be uncorrelated with all the explanatory variables. Then, $a_i$ can be left in the error term, and the resulting serial correlation over time can be handled by generalized least squares estimation. Conveniently, feasible GLS can be obtained by a pooled regression on quasi-demeaned data.
The value of the estimated transformation parameter, $\hat{\theta}$, indicates whether the estimates are likely to be closer to the pooled OLS or the fixed effects estimates. If the full set of random effects assumptions holds, the random effects estimator is asymptotically (as $N$ gets large with $T$ fixed) more efficient than pooled OLS, first differencing, or fixed effects, which are all unbiased, consistent, and asymptotically normal.

The correlated random effects approach to panel data models has become more popular in recent years, primarily because it allows a simple test for choosing between FE and RE, and it allows one to incorporate time-constant variables in an equation that delivers the FE estimates of the time-varying variables.

Finally, the panel data methods studied in Chapters 13 and 14 can be used when working with matched pairs or cluster samples. Differencing or the within transformation eliminates the cluster effect. If the cluster effect is uncorrelated with the explanatory variables, pooled OLS can be used, but the standard errors and test statistics should be adjusted for cluster correlation. Random effects estimation is also a possibility.

Key Terms

Cluster Effect; Cluster Sample; Clustering; Composite Error Term; Correlated Random Effects; Dummy Variable Regression; Fixed Effects Estimator; Fixed Effects Transformation; Matched Pairs Samples; Quasi-Demeaned Data; Random Effects Estimator; Random Effects Model; Time-Demeaned Data; Unbalanced Panel; Unobserved Effects Model; Within Estimator; Within Transformation

Problems

1 Suppose that the idiosyncratic errors in (14.4), $\{u_{it}: t = 1, 2, \ldots, T\}$, are serially uncorrelated with constant variance, $\sigma_u^2$. Show that the correlation between adjacent differences, $\Delta u_{it}$ and $\Delta u_{i,t+1}$, is $-.5$. Therefore, under the ideal FE assumptions, first differencing induces negative serial correlation of a known value.

2 With a single explanatory variable, the equation used to obtain the between estimator is

$\bar{y}_i = \beta_0 + \beta_1\bar{x}_i + a_i + \bar{u}_i,$

where the overbar represents the average over time. We can assume that $\mathrm{E}(a_i) = 0$ because we have included an intercept in the equation. Suppose that $\bar{u}_i$ is uncorrelated with $\bar{x}_i$, but $\mathrm{Cov}(x_{it}, a_i) = \sigma_{xa}$ for all $t$ (and $i$, because of random sampling in the cross section).

(i) Letting $\tilde{\beta}_1$ be the between estimator, that is, the OLS estimator using the time averages, show that

$\mathrm{plim}\,\tilde{\beta}_1 = \beta_1 + \sigma_{xa}/\mathrm{Var}(\bar{x}_i),$

where the probability limit is defined as $N \to \infty$. [Hint: See equations (5.5) and (5.6).]

(ii) Assume further that the $x_{it}$, for all $t = 1, 2, \ldots, T$, are uncorrelated with constant variance $\sigma_x^2$. Show that $\mathrm{plim}\,\tilde{\beta}_1 = \beta_1 + T(\sigma_{xa}/\sigma_x^2)$.

(iii) If the explanatory variables are not very highly correlated across time, what does part (ii) suggest about whether the inconsistency in the between estimator is smaller when there are more time periods?

3 In a random effects model, define the composite error $v_{it} = a_i + u_{it}$, where $a_i$ is uncorrelated with $u_{it}$ and the $u_{it}$ have constant variance $\sigma_u^2$ and are serially uncorrelated. Define $e_{it} = v_{it} - \theta\bar{v}_i$, where $\theta$ is given in (14.10).

(i) Show that $\mathrm{E}(e_{it}) = 0$.

(ii) Show that $\mathrm{Var}(e_{it}) = \sigma_u^2$, $t = 1, \ldots, T$.

(iii) Show that, for $t \neq s$, $\mathrm{Cov}(e_{it}, e_{is}) = 0$.
4 In order to determine the effects of collegiate athletic performance on applicants, you collect data on applications for a sample of Division I colleges for 1985, 1990, and 1995.

(i) What measures of athletic success would you include in an equation? What are some of the timing issues?

(ii) What other factors might you control for in the equation?

(iii) Write an equation that allows you to estimate the effects of athletic success on the percentage change in applications. How would you estimate this equation? Why would you choose this method?

5 Suppose that, for one semester, you can collect the following data on a random sample of college juniors and seniors for each class taken: a standardized final exam score, percentage of lectures attended, a dummy variable indicating whether the class is within the student's major, cumulative grade point average prior to the start of the semester, and SAT score.

(i) Why would you classify this data set as a cluster sample? Roughly, how many observations would you expect for the typical student?

(ii) Write a model, similar to equation (14.18), that explains final exam performance in terms of attendance and the other characteristics. Use $s$ to subscript student and $c$ to subscript class. Which variables do not change within a student?

(iii) If you pool all of the data and use OLS, what are you assuming about unobserved student characteristics that affect performance and attendance rate? What roles do SAT score and prior GPA play in this regard?

(iv) If you think SAT score and prior GPA do not adequately capture student ability, how would you estimate the effect of attendance on final exam performance?

6 Using the "cluster" option in the econometrics package Stata 11, the fully robust standard errors for the pooled OLS estimates in Table 14.2 (that is, robust to serial correlation and heteroskedasticity in the composite errors, $\{v_{it}: t = 1, \ldots, T\}$) are obtained as se($\hat{\beta}_{educ}$) = .011, se($\hat{\beta}_{black}$) = .051, se($\hat{\beta}_{hispan}$) = .039, se($\hat{\beta}_{exper}$) = .020, se($\hat{\beta}_{exper^2}$) = .0010, se($\hat{\beta}_{married}$) = .026, and se($\hat{\beta}_{union}$) = .027.

(i) How do these standard errors generally compare with the nonrobust ones, and why?

(ii) How do the robust standard errors for pooled OLS compare with the standard errors for RE? Does it seem to matter whether the explanatory variable is time-constant or time-varying?

(iii) When the fully robust standard errors for the RE estimates are computed, Stata 11 reports the following (where we look at only the coefficients on the time-varying variables): se($\hat{\beta}_{exper}$) = .016, se($\hat{\beta}_{exper^2}$) = .0008, se($\hat{\beta}_{married}$) = .019, and se($\hat{\beta}_{union}$) = .021. (These are robust to any kind of serial correlation or heteroskedasticity in the idiosyncratic errors, $\{u_{it}: t = 1, \ldots, T\}$, as well as heteroskedasticity in $a_i$.) How do the robust standard errors generally compare with the usual RE standard errors reported in Table 14.2? What conclusion might you draw?

(iv) Comparing the four standard errors in part (iii) with their pooled OLS counterparts, what do you make of the fact that the robust RE standard errors are all below the robust pooled OLS standard errors?

7 The data in CENSUS2000 is a random sample of individuals from the United States. Here, we are interested in estimating a simple regression model relating the log of weekly income, lweekinc, to schooling, educ. There are 29,501 observations. Associated with each individual is a state identifier (state) for the 50 states plus the District of Columbia. A less coarse geographic identifier is puma, which takes on 610 different values indicating geographic regions smaller than a state.
Running the simple regression of lweekinc on educ gives a slope coefficient equal to .1083 (to four decimal places). The heteroskedasticity-robust standard error is about .0024. The standard error clustered at the puma level is about .0027, and the standard error clustered at the state level is about .0033. For computing a confidence interval, which of these standard errors is the most reliable? Explain.

Computer Exercises

C1 Use the data in RENTAL for this exercise. The data on rental prices and other variables for college towns are for the years 1980 and 1990. The idea is to see whether a stronger presence of students affects rental rates. The unobserved effects model is

$\log(rent_{it}) = \beta_0 + \delta_0 y90_t + \beta_1\log(pop_{it}) + \beta_2\log(avginc_{it}) + \beta_3 pctstu_{it} + a_i + u_{it},$

where pop is city population, avginc is average income, and pctstu is student population as a percentage of city population (during the school year).

(i) Estimate the equation by pooled OLS and report the results in standard form. What do you make of the estimate on the 1990 dummy variable? What do you get for $\hat{\beta}_{pctstu}$?

(ii) Are the standard errors you report in part (i) valid? Explain.

(iii) Now, difference the equation and estimate by OLS. Compare your estimate of $\beta_{pctstu}$ with that from part (i). Does the relative size of the student population appear to affect rental prices?

(iv) Estimate the model by fixed effects to verify that you get identical estimates and standard errors to those in part (iii).

C2 Use CRIME4 for this exercise.

(i) Reestimate the unobserved effects model for crime in Example 13.9, but use fixed effects rather than differencing. Are there any notable sign or magnitude changes in the coefficients? What about statistical significance?

(ii) Add the logs of each wage variable in the data set and estimate the model by fixed effects. How does including these variables affect the coefficients on the criminal justice variables in part (i)?

(iii) Do the wage variables in part (ii) all have the expected sign? Explain. Are they jointly significant?

C3 For this exercise, we use JTRAIN to determine the effect of the job training grant on hours of job training per employee. The basic model for the three years is

$hrsemp_{it} = \beta_0 + \delta_1 d88_t + \delta_2 d89_t + \beta_1 grant_{it} + \beta_2 grant_{i,t-1} + \beta_3\log(employ_{it}) + a_i + u_{it}.$

(i) Estimate the equation using fixed effects. How many firms are used in the FE estimation? How many total observations would be used if each firm had data on all variables (in particular, hrsemp) for all three years?

(ii) Interpret the coefficient on grant and comment on its significance.

(iii) Is it surprising that $grant_{-1}$ is insignificant? Explain.

(iv) Do larger firms provide their employees with more or less training, on average? How big are the differences? (For example, if a firm has 10% more employees, what is the change in average hours of training?)

C4 In Example 13.8, we used the unemployment claims data from Papke (1994) to estimate the effect of enterprise zones on unemployment claims. Papke also uses a model that allows each city to have its own time trend:

$\log(uclms_{it}) = a_i + c_i t + \beta_1 ez_{it} + u_{it},$

where $a_i$ and $c_i$ are both unobserved effects. This allows for more heterogeneity across cities.
C4 In Example 13.8, we used the unemployment claims data from Papke (1994) to estimate the effect of enterprise zones on unemployment claims. Papke also uses a model that allows each city to have its own time trend:

$\log(uclms_{it}) = a_i + c_i t + \beta_1 ez_{it} + u_{it},$

where $a_i$ and $c_i$ are both unobserved effects. This allows for more heterogeneity across cities.
(i) Show that, when the previous equation is first differenced, we obtain

$\Delta\log(uclms_{it}) = c_i + \beta_1 \Delta ez_{it} + \Delta u_{it}, \quad t = 2, \ldots, T.$

Notice that the differenced equation contains a fixed effect, $c_i$.
(ii) Estimate the differenced equation by fixed effects. What is the estimate of $\beta_1$? Is it very different from the estimate obtained in Example 13.8? Is the effect of enterprise zones still statistically significant?
(iii) Add a full set of year dummies to the estimation in part (ii). What happens to the estimate of $\beta_1$?

C5 (i) In the wage equation in Example 14.4, explain why dummy variables for occupation might be important omitted variables for estimating the union wage premium.
(ii) If every man in the sample stayed in the same occupation from 1981 through 1987, would you need to include the occupation dummies in a fixed effects estimation? Explain.
(iii) Using the data in WAGEPAN, include eight of the occupation dummy variables in the equation and estimate the equation using fixed effects. Does the coefficient on union change by much? What about its statistical significance?

C6 Add the interaction term $union_{it}\cdot t$ to the equation estimated in Table 14.2 to see if wage growth depends on union status. Estimate the equation by random and fixed effects and compare the results.

C7 Use the state-level data on murder rates and executions in MURDER for the following exercise.
(i) Consider the unobserved effects model

$mrdrte_{it} = \eta_t + \beta_1 exec_{it} + \beta_2 unem_{it} + a_i + u_{it},$

where $\eta_t$ simply denotes different year intercepts and $a_i$ is the unobserved state effect. If past executions of convicted murderers have a deterrent effect, what should be the sign of $\beta_1$? What sign do you think $\beta_2$ should have? Explain.
(ii) Using just the years 1990 and 1993, estimate the equation from part (i) by pooled OLS. Ignore the serial correlation problem in the composite errors. Do you find any evidence for a deterrent effect?
(iii) Now, using 1990 and 1993, estimate the equation by fixed effects. (You may use first differencing since you are only using two years of data; a sketch of that approach appears below.) Is there evidence of a deterrent effect? How strong?
(iv) Compute the heteroskedasticity-robust standard error for the estimation in part (ii).
(v) Find the state that has the largest number for the execution variable in 1993. (The variable exec is total executions in 1991, 1992, and 1993.) How much bigger is this value than the next highest value?
(vi) Estimate the equation using first differencing, dropping Texas from the analysis. Compute the usual and heteroskedasticity-robust standard errors. Now what do you find? What is going on?
(vii) Use all three years of data and estimate the model by fixed effects. Include Texas in the analysis. Discuss the size and statistical significance of the deterrent effect, compared with only using 1990 and 1993.
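A minimal sketch of the first-differencing route in C7 part (iii): murder.csv is a hypothetical CSV export of MURDER, and the state identifier id and year codes 90 and 93 are assumptions about how the data set is laid out. With two time periods, FD and FE give the same slope estimates.

    import pandas as pd
    import statsmodels.formula.api as smf

    df = pd.read_csv("murder.csv")
    two = df[df["year"].isin([90, 93])].sort_values(["id", "year"])

    # Difference within state; the intercept picks up the common year effect
    d = two.groupby("id")[["mrdrte", "exec", "unem"]].diff().dropna()
    fd = smf.ols("mrdrte ~ exec + unem", data=d).fit()
    print(fd.summary())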
C8 Use the data in MATHPNL for this exercise. You will do a fixed effects version of the first differencing done in Computer Exercise 11 in Chapter 13. The model of interest is

$math4_{it} = \delta_1 y94_t + \cdots + \delta_5 y98_t + \gamma_1 \log(rexpp_{it}) + \gamma_2 \log(rexpp_{i,t-1}) + c_1 \log(enrol_{it}) + c_2 lunch_{it} + a_i + u_{it},$

where the first available year (the base year) is 1993 because of the lagged spending variable.
(i) Estimate the model by pooled OLS and report the usual standard errors. You should include an intercept, along with the year dummies, to allow $a_i$ to have a nonzero expected value. What are the estimated effects of the spending variables? Obtain the OLS residuals, $\hat{v}_{it}$.
(ii) Is the sign of the $lunch_{it}$ coefficient what you expected? Interpret the magnitude of the coefficient. Would you say that the district poverty rate has a big effect on test pass rates?
(iii) Compute a test for AR(1) serial correlation using the regression $\hat{v}_{it}$ on $\hat{v}_{i,t-1}$. (A sketch of this test appears after C9.) You should use the years 1994 through 1998 in the regression. Verify that there is strong positive serial correlation, and discuss why.
(iv) Now, estimate the equation by fixed effects. Is the lagged spending variable still significant?
(v) Why do you think, in the fixed effects estimation, the enrollment and lunch program variables are jointly insignificant?
(vi) Define the total, or long-run, effect of spending as $\theta_1 = \gamma_1 + \gamma_2$. Use the substitution $\gamma_1 = \theta_1 - \gamma_2$ to obtain a standard error for $\hat{\theta}_1$. [Hint: Standard fixed effects estimation using $\log(rexpp_{it})$ and $z_{it} = \log(rexpp_{i,t-1}) - \log(rexpp_{it})$ as explanatory variables should do it.]

C9 The file PENSION contains information on participant-directed pension plans for U.S. workers. Some of the observations are for couples within the same family, so this data set constitutes a small cluster sample (with cluster sizes of two).
(i) Ignoring the clustering by family, use OLS to estimate the model

$pctstck = \beta_0 + \beta_1 choice + \beta_2 prftshr + \beta_3 female + \beta_4 age + \beta_5 educ + \beta_6 finc25 + \beta_7 finc35 + \beta_8 finc50 + \beta_9 finc75 + \beta_{10} finc100 + \beta_{11} finc101 + \beta_{12} wealth89 + \beta_{13} stckin89 + \beta_{14} irain89 + u,$

where the variables are defined in the data set. The variable of most interest is choice, which is a dummy variable equal to one if the worker has a choice in how to allocate pension funds among different investments. What is the estimated effect of choice? Is it statistically significant?
(ii) Are the income, wealth, stock holding, and IRA holding control variables important? Explain.
(iii) Determine how many different families there are in the data set.
(iv) Now, obtain the standard errors for OLS that are robust to cluster correlation within a family. Do they differ much from the usual OLS standard errors? Are you surprised?
(v) Estimate the equation by differencing across only the spouses within a family. Why do the explanatory variables asked about in part (ii) drop out in the first-differenced estimation?
(vi) Are any of the remaining explanatory variables in part (v) significant? Are you surprised?
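A minimal sketch of the serial correlation test in C8 part (iii): mathpnl.csv is a hypothetical export of MATHPNL, and the names lrexpp, lrexpp_1, and lenrol stand for the logged spending variable, its lag, and logged enrollment (to be constructed first if the data set does not already contain them).

    import pandas as pd
    import statsmodels.formula.api as smf

    df = pd.read_csv("mathpnl.csv").sort_values(["distid", "year"])

    # Pooled OLS with year dummies, as in part (i)
    pols = smf.ols("math4 ~ C(year) + lrexpp + lrexpp_1 + lenrol + lunch",
                   data=df).fit()
    df.loc[pols.resid.index, "v"] = pols.resid        # residuals v-hat_{it}
    df["v_1"] = df.groupby("distid")["v"].shift(1)    # v-hat_{i,t-1}

    # Strong positive serial correlation shows up as a large t statistic on v_1
    ar1 = smf.ols("v ~ v_1", data=df).fit()
    print(ar1.summary())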
C10 Use the data in AIRFARE for this exercise. We are interested in estimating the model

$\log(fare_{it}) = \eta_t + \beta_1 concen_{it} + \beta_2 \log(dist_i) + \beta_3 [\log(dist_i)]^2 + a_i + u_{it}, \quad t = 1, \ldots, 4,$

where $\eta_t$ means that we allow for different year intercepts.
(i) Estimate the above equation by pooled OLS, being sure to include year dummies. If $\Delta concen = .10$, what is the estimated percentage increase in fare?
(ii) What is the usual OLS 95% confidence interval for $\beta_1$? Why is it probably not reliable? If you have access to a statistical package that computes fully robust standard errors, find the fully robust 95% CI for $\beta_1$. Compare it to the usual CI, and comment.
(iii) Describe what is happening with the quadratic in log(dist). In particular, for what value of dist does the relationship between log(fare) and dist become positive? [Hint: First figure out the turning point value for log(dist), and then exponentiate.] Is the turning point outside the range of the data?
(iv) Now estimate the equation using random effects. How does the estimate of $\beta_1$ change?
(v) Now estimate the equation using fixed effects. What is the FE estimate of $\beta_1$? Why is it fairly similar to the RE estimate? (Hint: What is $\hat{\theta}$ for RE estimation?)
(vi) Name two characteristics of a route (other than distance between stops) that are captured by $a_i$. Might these be correlated with $concen_{it}$?
(vii) Are you convinced that higher concentration on a route increases airfares? What is your best estimate?

C11 This question assumes that you have access to a statistical package that computes standard errors robust to arbitrary serial correlation and heteroskedasticity for panel data methods.
(i) For the pooled OLS estimates in Table 14.1, obtain the standard errors that allow for arbitrary serial correlation (in the composite errors, $v_{it} = a_i + u_{it}$) and heteroskedasticity. How do the robust standard errors for educ, married, and union compare with the nonrobust ones?
(ii) Now, obtain the robust standard errors for the fixed effects estimates that allow arbitrary serial correlation and heteroskedasticity in the idiosyncratic errors, $u_{it}$. How do these compare with the nonrobust FE standard errors?
(iii) For which method, pooled OLS or FE, is adjusting the standard errors for serial correlation more important? Why?

C12 Use the data in ELEM94_95 to answer this question. The data are on elementary schools in Michigan. In this exercise, we view the data as a cluster sample, where each school is part of a district "cluster."
(i) What are the smallest and largest number of schools in a district? What is the average number of schools per district?
(ii) Using pooled OLS (that is, pooling across all 1,848 schools), estimate a model relating lavgsal to bs, lenrol, lstaff, and lunch (see also Computer Exercise 11 from Chapter 9). What are the coefficient and standard error on bs?
(iii) Obtain the standard errors that are robust to cluster correlation within district (and also heteroskedasticity). (A sketch of this computation appears below.) What happens to the t statistic for bs?
(iv) Still using pooled OLS, drop the four observations with bs > .5 and obtain $\hat{\beta}_{bs}$ and its cluster-robust standard error. Now is there much evidence for a salary-benefits tradeoff?
(v) Estimate the equation by fixed effects, allowing for a common district effect for schools within a district. Again drop the observations with bs > .5. Now what do you conclude about the salary-benefits tradeoff?
(vi) In light of your estimates from parts (iv) and (v), discuss the importance of allowing teacher compensation to vary systematically across districts via a district fixed effect.
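A minimal sketch of the cluster-robust computation in C12 part (iii), treating elem94_95.csv as a hypothetical CSV export of ELEM94_95 with district identifier distid:

    import pandas as pd
    import statsmodels.formula.api as smf

    df = pd.read_csv("elem94_95.csv")
    model = smf.ols("lavgsal ~ bs + lenrol + lstaff + lunch", data=df)

    usual = model.fit()
    clustered = model.fit(cov_type="cluster",
                          cov_kwds={"groups": df["distid"]})
    # Compare the usual and cluster-robust t statistics for bs
    print(usual.tvalues["bs"], clustered.tvalues["bs"])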
C13 The data set DRIVING includes state-level panel data for the 48 continental U.S. states from 1980 through 2004, for a total of 25 years. Various driving laws are indicated in the data set, including the alcohol level at which drivers are considered legally intoxicated. There are also indicators for "per se" laws—where licenses can be revoked without a trial—and seat belt laws. Some economic and demographic variables are also included.
(i) How is the variable totfatrte defined? What is the average of this variable in the years 1980, 1992, and 2004? Run a regression of totfatrte on dummy variables for the years 1981 through 2004, and describe what you find. Did driving become safer over this period? Explain.
(ii) Add the variables bac08, bac10, perse, sbprim, sbsecon, sl70plus, gdl, perc1424, unem, and vehicmilespc to the regression from part (i). Interpret the coefficients on bac08 and bac10. Do per se laws have a negative effect on the fatality rate? What about having a primary seat belt law? (Note that if a law was enacted sometime within a year, the fraction of the year is recorded in place of the zero-one indicator.)
(iii) Reestimate the model from part (ii) using fixed effects at the state level. How do the coefficients on bac08, bac10, perse, and sbprim compare with the pooled OLS estimates? Which set of estimates do you think is more reliable?
(iv) Suppose that vehicmilespc, the number of miles driven per capita, increases by 1,000. Using the FE estimates, what is the estimated effect on totfatrte? Be sure to interpret the estimate as if explaining to a layperson.
(v) If there is serial correlation or heteroskedasticity in the idiosyncratic errors of the model, then the standard errors in part (iii) are invalid. If possible, use cluster-robust standard errors for the fixed effects estimates. What happens to the statistical significance of the policy variables in part (iii)?

C14 Use the data set in AIRFARE to answer this question. The estimates can be compared with those in Computer Exercise C10 in this chapter.
(i) Compute the time averages of the variable concen; call these $\overline{concen}_i$. How many different time averages can there be? Report the smallest and the largest.
(ii) Estimate the equation

$lfare_{it} = \beta_0 + \delta_1 y98_t + \delta_2 y99_t + \delta_3 y00_t + \beta_1 concen_{it} + \beta_2 ldist_i + \beta_3 ldistsq_i + \gamma_1 \overline{concen}_i + a_i + u_{it}$

by random effects. Verify that $\hat{\beta}_1$ is identical to the FE estimate computed in C10. (A sketch of this estimation appears after this exercise.)
(iii) If you drop ldist and ldistsq from the estimation in part (ii), but still include $\overline{concen}_i$, what happens to the estimate of $\beta_1$? What happens to the estimate of $\gamma_1$?
(iv) Using the equation in part (ii) and the usual RE standard error, test $H_0\colon \gamma_1 = 0$ against the two-sided alternative. Report the p-value. What do you conclude about RE versus FE for estimating $\beta_1$ in this application?
(v) If possible, for the test in part (iv), obtain a t statistic (and, therefore, p-value) that is robust to arbitrary serial correlation and heteroskedasticity. Does this change the conclusion reached in part (iv)?
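A minimal sketch of the estimation in C14 part (ii), with airfare.csv a hypothetical CSV export of AIRFARE; the names y98, y99, y00, lfare, concen, ldist, ldistsq, and id follow the exercise, and RandomEffects comes from the third-party linearmodels package. Adding the time average $\overline{concen}_i$ to the RE regression should reproduce the FE coefficient on concen.

    import pandas as pd
    from linearmodels.panel import RandomEffects

    df = pd.read_csv("airfare.csv")
    df["concenbar"] = df.groupby("id")["concen"].transform("mean")
    df = df.set_index(["id", "year"])

    # RE with the time average added: beta_1 should equal the FE estimate
    cre = RandomEffects.from_formula(
        "lfare ~ 1 + y98 + y99 + y00 + concen + concenbar + ldist + ldistsq",
        data=df,
    ).fit()
    print(cre.params["concen"])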
C15 Use the data in COUNTYMURDERS to answer this question. The data set covers murders and executions (capital punishment) for 2,197 counties in the United States. See also Computer Exercise C16 in Chapter 13.
(i) Consider the model

$murdrate_{it} = \theta_t + \delta_0 execs_{it} + \delta_1 execs_{i,t-1} + \delta_2 execs_{i,t-2} + \delta_3 execs_{i,t-3} + \beta_5 percblack_{it} + \beta_6 percmale_{it} + \beta_7 perc1019_{it} + \beta_8 perc2029_{it} + a_i + u_{it},$

where $\theta_t$ represents a different intercept for each time period, $a_i$ is the county fixed effect, and $u_{it}$ is the idiosyncratic error. Why does it make sense to include lags of the key variable, execs, in the equation?
(ii) Apply OLS to the equation from part (i) and report the estimates of $\delta_0$, $\delta_1$, $\delta_2$, and $\delta_3$, along with the usual pooled OLS standard errors. Do you estimate that executions have a deterrent effect on murders? Provide an explanation that involves $a_i$.
(iii) Now estimate the equation in part (i) using fixed effects to remove $a_i$. What are the new estimates of the $\delta_j$? Are they very different from the estimates from part (ii)?
(iv) Obtain the long-run propensity from the estimates in part (iii). Using the usual FE standard errors, is the LRP statistically different from zero?
(v) If possible, obtain the standard errors for the FE estimates that are robust to arbitrary heteroskedasticity and serial correlation in the $\{u_{it}\}$. What happens to the statistical significance of the $\hat{\delta}_j$? What about the estimated LRP?

Appendix 14A

14A.1 Assumptions for Fixed and Random Effects

In this appendix, we provide statements of the assumptions for fixed and random effects estimation. We also provide a discussion of the properties of the estimators under different sets of assumptions. Verification of these claims is somewhat involved, but can be found in Wooldridge (2010, Chapter 10).

Assumption FE.1: For each i, the model is

$y_{it} = \beta_1 x_{it1} + \cdots + \beta_k x_{itk} + a_i + u_{it}, \quad t = 1, \ldots, T,$

where the $\beta_j$ are the parameters to estimate and $a_i$ is the unobserved effect.

Assumption FE.2: We have a random sample from the cross section.

Assumption FE.3: Each explanatory variable changes over time (for at least some i), and no perfect linear relationships exist among the explanatory variables.

Assumption FE.4: For each t, the expected value of the idiosyncratic error given the explanatory variables in all time periods and the unobserved effect is zero: $E(u_{it}|\mathbf{X}_i, a_i) = 0$.

Under these first four assumptions—which are identical to the assumptions for the first-differencing estimator—the fixed effects estimator is unbiased. Again, the key is the strict exogeneity assumption, FE.4. Under these same assumptions, the FE estimator is consistent with a fixed T as $N \to \infty$.

Assumption FE.5: $\mathrm{Var}(u_{it}|\mathbf{X}_i, a_i) = \mathrm{Var}(u_{it}) = \sigma_u^2$, for all $t = 1, \ldots, T$.

Assumption FE.6: For all $t \neq s$, the idiosyncratic errors are uncorrelated (conditional on all explanatory variables and $a_i$): $\mathrm{Cov}(u_{it}, u_{is}|\mathbf{X}_i, a_i) = 0$.

Under Assumptions FE.1 through FE.6, the fixed effects estimator of the $\beta_j$ is the best linear unbiased estimator. Since the FD estimator is linear and unbiased, it is necessarily worse than the FE estimator. The assumption that makes FE better than FD is FE.6, which implies that the idiosyncratic errors are serially uncorrelated.

Assumption FE.7: Conditional on $\mathbf{X}_i$ and $a_i$, the $u_{it}$ are independent and identically distributed as Normal$(0, \sigma_u^2)$.

Assumption FE.7 implies FE.4, FE.5, and FE.6, but it is stronger because it assumes a normal distribution for the idiosyncratic errors. If we add FE.7, the FE estimator is normally distributed, and t and F statistics have exact t and F distributions.
Without FE.7, we can rely on asymptotic approximations. But, without making special assumptions, these approximations require large N and small T.

The ideal random effects assumptions include FE.1, FE.2, FE.4, FE.5, and FE.6. (FE.7 could be added, but it gains us little in practice because we have to estimate $\theta$.) Because we are only subtracting a fraction of the time averages, we can now allow time-constant explanatory variables. So FE.3 is replaced with the following assumption:

Assumption RE.3: There are no perfect linear relationships among the explanatory variables.

The cost of allowing time-constant regressors is that we must add assumptions about how the unobserved effect, $a_i$, is related to the explanatory variables.

Assumption RE.4: In addition to FE.4, the expected value of $a_i$ given all explanatory variables is constant: $E(a_i|\mathbf{X}_i) = \beta_0$.

This is the assumption that rules out correlation between the unobserved effect and the explanatory variables, and it is the key distinction between fixed effects and random effects. Because we are assuming $a_i$ is uncorrelated with all elements of $x_{it}$, we can include time-constant explanatory variables. (Technically, the quasi-time-demeaning only removes a fraction of the time average, and not the whole time average.) We allow for a nonzero expectation for $a_i$ in stating Assumption RE.4 so that the model under the random effects assumptions contains an intercept, $\beta_0$, as in equation (14.7). Remember, we would typically include a set of time-period intercepts, too, with the first year acting as the base year.
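For reference, the quasi-time-demeaning mentioned here is the GLS transformation behind equation (14.11) in Section 14.2. Writing $v_{it} = a_i + u_{it}$ for the composite error, random effects estimation applies pooled OLS to

$y_{it} - \theta\bar{y}_i = \beta_0(1-\theta) + \beta_1(x_{it1} - \theta\bar{x}_{i1}) + \cdots + \beta_k(x_{itk} - \theta\bar{x}_{ik}) + (v_{it} - \theta\bar{v}_i),$

where

$\theta = 1 - \left[\sigma_u^2 \big/ \left(\sigma_u^2 + T\sigma_a^2\right)\right]^{1/2}$

must be estimated in practice. Because $0 < \theta < 1$, a time-constant explanatory variable is only shrunk toward zero, not eliminated—which is why RE, unlike FE, can estimate its coefficient; $\theta \to 1$ recovers the within (FE) transformation, while $\theta = 0$ gives pooled OLS.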
We also need to impose homoskedasticity on $a_i$, as follows:

Assumption RE.5: In addition to FE.5, the variance of $a_i$ given all explanatory variables is constant: $\mathrm{Var}(a_i|\mathbf{X}_i) = \sigma_a^2$.

Under the six random effects assumptions (FE.1, FE.2, RE.3, RE.4, RE.5, and FE.6), the RE estimator is consistent and asymptotically normally distributed as N gets large for fixed T. (Actually, consistency and asymptotic normality follow under the first four assumptions, but without the last two assumptions the usual RE standard errors and test statistics would not be valid.) In addition, under the six RE assumptions, the RE estimators are asymptotically efficient. This means that, in large samples, the RE estimators will have smaller standard errors than the corresponding pooled OLS estimators (when the proper robust standard errors are used for pooled OLS). For coefficients on time-varying explanatory variables—the only ones estimable by FE—the RE estimator is more efficient than the FE estimator, often much more efficient. But FE is not meant to be efficient under the RE assumptions; FE is intended to be robust to correlation between $a_i$ and the $x_{itj}$. As often happens in econometrics, there is a tradeoff between robustness and efficiency. See Wooldridge (2010, Chapter 10) for verification of the claims made here.

14A.2 Inference Robust to Serial Correlation and Heteroskedasticity for Fixed Effects and Random Effects

One of the key assumptions for performing inference using the FE, RE, and even the CRE approach to panel data models is the assumption of no serial correlation in the idiosyncratic errors, $\{u_{it}\colon t = 1, \ldots, T\}$—see Assumption FE.6. Of course, heteroskedasticity can also be an issue, but this is also ruled out for standard inference (see Assumption FE.5). As discussed in the appendix to Chapter 13, the same issues can arise with first-differencing estimation when we have $T \geq 3$ time periods.

Fortunately, as with FD estimation, there are now simple solutions for fully robust inference—inference that is robust to arbitrary violations of Assumptions FE.5 and FE.6 and, when applying the RE or CRE approaches, to Assumption RE.5. As with FD estimation, the general approach to obtaining fully robust standard errors and test statistics is known as clustering. Now, however, the clustering is applied to a different equation. For example, for FE estimation, the clustering is applied to the time-demeaned equation (14.5). For RE estimation, the clustering gets applied to the quasi-time-demeaned equation (14.11). (A similar comment holds for CRE, but there the time averages are included as separate explanatory variables.) The details, which can be found in Wooldridge (2010, Chapter 10), are too advanced for this text. But understanding the purpose of clustering is not: if possible, we should compute standard errors, confidence intervals, and test statistics that are valid in large cross sections under the weakest set of assumptions. The FE estimator requires only Assumptions FE.1 to FE.4 for unbiasedness and consistency (as $N \to \infty$ with T fixed). Thus, a careful researcher at least checks whether making inference robust to serial correlation and heteroskedasticity in the errors changes any conclusions. Experience shows that it often does.

Applying cluster-robust inference to account for serial correlation within a panel data context is easily justified when N is substantially larger than T. Under certain restrictions on the time series dependence (of the sort discussed in Chapter 11), cluster-robust inference for the fixed effects estimator can be justified when T is of a similar magnitude as N, provided both are not small. This follows from the work by Hansen (2007). Generally, clustering is not theoretically justified when N is small and T is large.

Computing the cluster-robust statistics after FE or RE estimation is simple in many econometrics packages, often only requiring an option of the form cluster(id) appended to the end of the FE and RE estimation commands. As in the FD case, id refers to a cross-section identifier. Similar comments hold when applying FE or RE to cluster samples, with the cluster number serving as the identifier.
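As an illustration of how little is involved, here is a minimal sketch using the third-party linearmodels package in Python; the panel (the data frame, the variables y, x1, x2, and the true coefficients) is entirely synthetic and stands in for any application:

    import numpy as np
    import pandas as pd
    from linearmodels.panel import PanelOLS

    # Synthetic balanced panel purely for illustration
    rng = np.random.default_rng(0)
    n, t = 200, 6
    idx = pd.MultiIndex.from_product([range(n), range(t)],
                                     names=["id", "year"])
    df = pd.DataFrame({"x1": rng.normal(size=n * t),
                       "x2": rng.normal(size=n * t)}, index=idx)
    df["y"] = 1.0 + 0.5 * df["x1"] - 0.3 * df["x2"] + rng.normal(size=n * t)

    fe = PanelOLS.from_formula("y ~ x1 + x2 + EntityEffects", data=df)
    usual = fe.fit()                    # valid under FE.5 and FE.6
    robust = fe.fit(cov_type="clustered", cluster_entity=True)
    # robust.std_errors remain valid under arbitrary heteroskedasticity
    # and serial correlation within each cross-sectional unit
    print(usual.std_errors, robust.std_errors, sep="\n")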
Chapter 15
Instrumental Variables Estimation and Two Stage Least Squares

In this chapter, we further study the problem of endogenous explanatory variables in multiple regression models. In Chapter 3, we derived the bias in the OLS estimators when an important variable is omitted; in Chapter 5, we showed that OLS is generally inconsistent under omitted variables. Chapter 9 demonstrated that omitted variables bias can be eliminated (or at least mitigated) when a suitable proxy variable is given for an unobserved explanatory variable. Unfortunately, suitable proxy variables are not always available.

In the previous two chapters, we explained how fixed effects estimation or first differencing can be used with panel data to estimate the effects of time-varying independent variables in the presence of time-constant omitted variables. Although such methods are very useful, we do not always have access to panel data. Even if we can obtain panel data, it does us little good if we are interested in the effect of a variable that does not change over time: first differencing or fixed effects estimation eliminates time-constant explanatory variables. In addition, the panel data methods that we have studied so far do not solve the problem of time-varying omitted variables that are correlated with the explanatory variables.

In this chapter, we take a different approach to the endogeneity problem. You will see how the method of instrumental variables (IV) can be used to solve the problem of endogeneity of one or more explanatory variables. The method of two stage least squares (2SLS or TSLS) is second in popularity only to ordinary least squares for estimating linear equations in applied econometrics.

We begin by showing how IV methods can be used to obtain consistent estimators in the presence of omitted variables. IV can also be used to solve the errors-in-variables problem, at least under certain assumptions. The next chapter will demonstrate how to estimate simultaneous equations models using IV methods.

Our treatment of instrumental variables estimation closely follows our development of ordinary least squares in Part 1, where we assumed that we had a random sample from an underlying population. This is a desirable starting point because, in addition to simplifying the notation, it emphasizes that the important assumptions for IV estimation are stated in terms of the underlying population (just as with OLS). As we showed in Part 2, OLS can be applied to time series data, and the same is true of instrumental variables methods. Section 15.7 discusses some special issues that arise when IV methods are applied to time series data. In Section 15.8, we cover applications to pooled cross sections and panel data.

15.1 Motivation: Omitted Variables in a Simple Regression Model

When faced with the prospect of omitted variables bias (or unobserved heterogeneity), we have so far discussed three options: (1) we can ignore the problem and suffer the consequences of biased and inconsistent estimators; (2) we can try to find and use a suitable proxy variable for the unobserved variable; or (3) we can assume that the omitted variable does not change over time and use the fixed effects or first-differencing methods from Chapters 13 and 14.
The first response can be satisfactory if the estimates are coupled with the direction of the biases for the key parameters. For example, if we can say that the estimator of a positive parameter—say, the effect of job training on subsequent wages—is biased toward zero, and we have found a statistically significant positive estimate, we have still learned something: job training has a positive effect on wages, and it is likely that we have underestimated the effect. Unfortunately, the opposite case, where our estimates may be too large in magnitude, often occurs, which makes it very difficult for us to draw any useful conclusions.

The proxy variable solution discussed in Section 9.2 can also produce satisfying results, but it is not always possible to find a good proxy. This approach attempts to solve the omitted variable problem by replacing the unobservable with one or more proxy variables.

Another approach leaves the unobserved variable in the error term, but, rather than estimating the model by OLS, it uses an estimation method that recognizes the presence of the omitted variable. This is what the method of instrumental variables does. For illustration, consider the problem of unobserved ability in a wage equation for working adults. A simple model is

$\log(wage) = \beta_0 + \beta_1 educ + \beta_2 abil + e,$

where e is the error term. In Chapter 9, we showed how, under certain assumptions, a proxy variable such as IQ can be substituted for ability, and then a consistent estimator of $\beta_1$ is available from the regression of log(wage) on educ, IQ. Suppose, however, that a proxy variable is not available (or does not have the properties needed to produce a consistent estimator of $\beta_1$). Then, we put abil into the error term, and we are left with the simple regression model

$\log(wage) = \beta_0 + \beta_1 educ + u,$   (15.1)

where u contains abil. Of course, if equation (15.1) is estimated by OLS, a biased and inconsistent estimator of $\beta_1$ results if educ and abil are correlated.

It turns out that we can still use equation (15.1) as the basis for estimation, provided we can find an instrumental variable for educ. To describe this approach, the simple regression model is written as

$y = \beta_0 + \beta_1 x + u,$   (15.2)

where we think that x and u are correlated (have nonzero covariance):

$\mathrm{Cov}(x, u) \neq 0.$   (15.3)

The method of instrumental variables works whether or not x and u are correlated, but, for reasons we will see later, OLS should be used if x is uncorrelated with u.

In order to obtain consistent estimators of $\beta_0$ and $\beta_1$ when x and u are correlated, we need some additional information. The information comes by way of a new variable that satisfies certain properties. Suppose that we have an observable variable z that satisfies these two assumptions: (1) z is uncorrelated with u, that is,

$\mathrm{Cov}(z, u) = 0;$   (15.4)

(2) z is correlated with x, that is,

$\mathrm{Cov}(z, x) \neq 0.$   (15.5)

Then, we call z an instrumental variable for x, or sometimes simply an instrument for x.

The requirement that the instrument z satisfies (15.4) is summarized by saying "z is exogenous in equation (15.2)," and so we often refer to (15.4) as instrument exogeneity.
In the context of omitted variables, instrument exogeneity means that z should have no partial effect on y (after x and omitted variables have been controlled for), and z should be uncorrelated with the omitted variables. Equation (15.5) means that z must be related—either positively or negatively—to the endogenous explanatory variable x. This condition is sometimes referred to as instrument relevance (as in "z is relevant for explaining variation in x").

There is a very important difference between the two requirements for an instrumental variable. Because (15.4) involves the covariance between z and the unobserved error u, we cannot generally hope to test this assumption: in the vast majority of cases, we must maintain Cov(z, u) = 0 by appealing to economic behavior or introspection. (In unusual cases, we might have an observable proxy variable for some factor contained in u, in which case we can check to see if z and the proxy variable are roughly uncorrelated. Of course, if we have a good proxy for an important element of u, we might just add the proxy as an explanatory variable and estimate the expanded equation by ordinary least squares. See Section 9.2.)

By contrast, the condition that z is correlated with x (in the population) can be tested, given a random sample from the population. The easiest way to do this is to estimate a simple regression between x and z. In the population, we have

$x = \pi_0 + \pi_1 z + v.$   (15.6)

Then, because $\pi_1 = \mathrm{Cov}(z, x)/\mathrm{Var}(z)$, assumption (15.5) holds if, and only if, $\pi_1 \neq 0$. Thus, we should be able to reject the null hypothesis

$H_0\colon \pi_1 = 0$   (15.7)

against the two-sided alternative $H_1\colon \pi_1 \neq 0$, at a sufficiently small significance level (say, 5% or 1%). If this is the case, then we can be fairly confident that (15.5) holds.

For the log(wage) equation in (15.1), an instrumental variable z for educ must be (1) uncorrelated with ability (and any other unobserved factors affecting wage) and (2) correlated with education. Something such as the last digit of an individual's Social Security Number almost certainly satisfies the first requirement: it is uncorrelated with ability because it is determined randomly. However, it is precisely because of the randomness of the last digit of the SSN that it is not correlated with education, either; therefore, it makes a poor instrumental variable for educ because it violates the instrument relevance requirement in equation (15.5).
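Because instrument relevance is the one IV requirement that can always be tested, here is a minimal sketch of the first-stage check in (15.6)–(15.7), previewing Example 15.1; mroz.csv is a hypothetical CSV export of the MROZ data set used there, with inlf indicating labor force participation.

    import pandas as pd
    import statsmodels.formula.api as smf

    df = pd.read_csv("mroz.csv")
    df = df[df["inlf"] == 1]        # keep the working women

    # First-stage regression (15.6): reject H0: pi_1 = 0 when |t| is large
    first = smf.ols("educ ~ fatheduc", data=df).fit()
    print(first.tvalues["fatheduc"], first.rsquared)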
What we have called a proxy variable for the omitted variable makes a poor IV for the opposite reason. For example, in the log(wage) example with omitted ability, a proxy variable for abil should be as highly correlated as possible with abil. An instrumental variable must be uncorrelated with abil. Therefore, while IQ is a good candidate as a proxy variable for abil, it is not a good instrumental variable for educ, because it violates the instrument exogeneity requirement in equation (15.4).

Whether other possible instrumental variable candidates satisfy the exogeneity requirement in (15.4) is less clear-cut. In wage equations, labor economists have used family background variables as IVs for education. For example, mother's education (motheduc) is positively correlated with child's education, as can be seen by collecting a sample of data on working people and running a simple regression of educ on motheduc. Therefore, motheduc satisfies equation (15.5). The problem is that mother's education might also be correlated with child's ability (through mother's ability and perhaps quality of nurturing at an early age), in which case (15.4) fails.

Another IV choice for educ in (15.1) is number of siblings while growing up (sibs). Typically, having more siblings is associated with lower average levels of education. Thus, if number of siblings is uncorrelated with ability, it can act as an instrumental variable for educ.

As a second example, consider the problem of estimating the causal effect of skipping classes on final exam score. In a simple regression framework, we have

$score = \beta_0 + \beta_1 skipped + u,$   (15.8)

where score is the final exam score and skipped is the total number of lectures missed during the semester. We certainly might be worried that skipped is correlated with other factors in u: more able, highly motivated students might miss fewer classes. Thus, a simple regression of score on skipped may not give us a good estimate of the causal effect of missing classes.

What might be a good IV for skipped? We need something that has no direct effect on score and is not correlated with student ability and motivation. At the same time, the IV must be correlated with skipped. One option is to use distance between living quarters and campus. Some students at a large university will commute to campus, which may increase the likelihood of missing lectures (due to bad weather, oversleeping, and so on). Thus, skipped may be positively correlated with distance; this can be checked by regressing skipped on distance and doing a t test, as described earlier.

Is distance uncorrelated with u? In the simple regression model (15.8), some factors in u may be correlated with distance. For example, students from low-income families may live off campus; if income affects student performance, this could cause distance to be correlated with u. Section 15.2 shows how to use IV in the context of multiple regression, so that other factors affecting score can be included directly in the model. Then, distance might be a good IV for skipped. (An IV approach may not be necessary at all if a good proxy exists for student ability, such as cumulative GPA prior to the semester.)

There is a final point worth emphasizing before we turn to the mechanics of IV estimation: namely, in using the simple regression in equation (15.6) to test (15.7), it is important to take note of the sign (and even magnitude) of $\hat{\pi}_1$, and not just its statistical significance. Arguments for why a variable z makes a good IV candidate for an endogenous explanatory variable x should include a discussion about the nature of the relationship between x and z. For example, due to genetics and background influences, it makes sense that child's education (x) and mother's education (z) are positively correlated. If, in your sample of data, you find that they are actually negatively correlated—that is, $\hat{\pi}_1 < 0$—then your use of mother's education as an IV for child's education is likely to be unconvincing. (And this has nothing to do with whether condition (15.4) is likely to hold.) In the example of measuring whether skipping classes has an effect on test performance, one should find a positive, statistically significant relationship between skipped and distance in order to justify using distance as an IV for skipped; a negative relationship would be difficult to justify and would suggest that there are important omitted variables driving a negative correlation—variables that might themselves have to be included in the model (15.8).
We now demonstrate that the availability of an instrumental variable can be used to estimate consistently the parameters in equation (15.2). In particular, we show that assumptions (15.4) and (15.5) serve to identify the parameter $\beta_1$. Identification of a parameter in this context means that we can write $\beta_1$ in terms of population moments that can be estimated using a sample of data. To write $\beta_1$ in terms of population covariances, we use equation (15.2): the covariance between z and y is

$\mathrm{Cov}(z, y) = \beta_1 \mathrm{Cov}(z, x) + \mathrm{Cov}(z, u).$

Now, under assumption (15.4), Cov(z, u) = 0, and under assumption (15.5), Cov(z, x) ≠ 0. Thus, we can solve for $\beta_1$ as

$\beta_1 = \frac{\mathrm{Cov}(z, y)}{\mathrm{Cov}(z, x)}.$   (15.9)

(Notice how this simple algebra fails if z and x are uncorrelated, that is, if Cov(z, x) = 0.) Equation (15.9) shows that $\beta_1$ is the population covariance between z and y, divided by the population covariance between z and x, which shows that $\beta_1$ is identified. Given a random sample, we estimate the population quantities by the sample analogs. After canceling the sample sizes in the numerator and denominator, we get the instrumental variables (IV) estimator of $\beta_1$:

$\hat{\beta}_1 = \frac{\sum_{i=1}^n (z_i - \bar{z})(y_i - \bar{y})}{\sum_{i=1}^n (z_i - \bar{z})(x_i - \bar{x})}.$   (15.10)

Given a sample of data on x, y, and z, it is simple to obtain the IV estimator in (15.10). The IV estimator of $\beta_0$ is simply $\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$, which looks just like the OLS intercept estimator except that the slope estimator, $\hat{\beta}_1$, is now the IV estimator. It is no accident that when z = x, we obtain the OLS estimator of $\beta_1$. In other words, when x is exogenous, it can be used as its own IV, and the IV estimator is then identical to the OLS estimator.

A simple application of the law of large numbers shows that the IV estimator is consistent for $\beta_1$: $\mathrm{plim}(\hat{\beta}_1) = \beta_1$, provided assumptions (15.4) and (15.5) are satisfied. If either assumption fails, the IV estimators are not consistent (more on this later). One feature of the IV estimator is that, when x and u are in fact correlated—so that instrumental variables estimation is actually needed—it is essentially never unbiased. This means that, in small samples, the IV estimator can have a substantial bias, which is one reason why large samples are preferred.
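As a concrete illustration of (15.10), here is a minimal sketch that computes the IV slope and intercept directly from NumPy arrays; the arrays y, x, z (and the simulated values) are placeholders for any data satisfying the setup above.

    import numpy as np

    def iv_simple(y, x, z):
        """IV estimates for y = b0 + b1*x + u, using z as instrument for x."""
        zd = z - z.mean()
        b1 = np.sum(zd * (y - y.mean())) / np.sum(zd * (x - x.mean()))  # (15.10)
        b0 = y.mean() - b1 * x.mean()
        return b0, b1

    # When z = x, the formula collapses to the OLS estimates, as noted above
    rng = np.random.default_rng(0)
    x = rng.normal(size=500)
    y = 1.0 + 0.5 * x + rng.normal(size=500)
    print(iv_simple(y, x, x))   # close to (1.0, 0.5)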
When discussing the application of instrumental variables, it is important to be careful with language. Like OLS, IV is an estimation method. It makes little sense to refer to "an instrumental variables model"—just as the phrase "OLS model" makes little sense. As we know, a model is an equation, such as (15.8), which is a special case of the generic model in equation (15.2). When we have a model such as (15.2), we can choose to estimate the parameters of that model in many different ways. Prior to this chapter, we focused primarily on OLS, but, for example, we also know from Chapter 8 that one can use weighted least squares as an alternative estimation method (and there are unlimited possibilities for the weights). If we have an instrumental variable candidate z for x, then we can instead apply instrumental variables estimation. It is certainly true that the estimation method we apply is motivated by the model and assumptions we make about that model. But the estimators are well defined and exist apart from any underlying model or assumptions: remember, an estimator is simply a rule for combining data. The bottom line is that, while we probably know what a researcher means when using a phrase such as "I estimated an IV model," such language betrays a lack of understanding about the difference between a model and an estimation method.

15.1a Statistical Inference with the IV Estimator

Given the similar structure of the IV and OLS estimators, it is not surprising that the IV estimator has an approximate normal distribution in large sample sizes. To perform inference on $\beta_1$, we need a standard error that can be used to compute t statistics and confidence intervals. The usual approach is to impose a homoskedasticity assumption, just as in the case of OLS. Now, the homoskedasticity assumption is stated conditional on the instrumental variable, z, not the endogenous explanatory variable, x. Along with the previous assumptions on u, x, and z, we add

$E(u^2|z) = \sigma^2 = \mathrm{Var}(u).$   (15.11)

It can be shown that, under (15.4), (15.5), and (15.11), the asymptotic variance of $\hat{\beta}_1$ is

$\frac{\sigma^2}{n\sigma_x^2\rho_{x,z}^2},$   (15.12)

where $\sigma_x^2$ is the population variance of x, $\sigma^2$ is the population variance of u, and $\rho_{x,z}^2$ is the square of the population correlation between x and z. (This tells us how highly correlated x and z are in the population.) As with the OLS estimator, the asymptotic variance of the IV estimator decreases to zero at the rate of 1/n, where n is the sample size.

Equation (15.12) is interesting for two reasons. First, it provides a way to obtain a standard error for the IV estimator. (All quantities in (15.12) can be consistently estimated given a random sample.) To estimate $\sigma_x^2$, we simply compute the sample variance of $x_i$; to estimate $\rho_{x,z}^2$, we can run the regression of $x_i$ on $z_i$ to obtain the R-squared, say $R_{x,z}^2$. Finally, to estimate $\sigma^2$, we can use the IV residuals,

$\hat{u}_i = y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i, \quad i = 1, 2, \ldots, n,$

where $\hat{\beta}_0$ and $\hat{\beta}_1$ are the IV estimates. A consistent estimator of $\sigma^2$ looks just like the estimator of $\sigma^2$ from a simple OLS regression:

$\hat{\sigma}^2 = \frac{1}{n-2}\sum_{i=1}^n \hat{u}_i^2,$

where it is standard to use the degrees of freedom correction (even though this has little effect as the sample size grows).

The (asymptotic) standard error of $\hat{\beta}_1$ is the square root of the estimated asymptotic variance, the latter of which is given by

$\frac{\hat{\sigma}^2}{SST_x \cdot R_{x,z}^2},$   (15.13)

where $SST_x$ is the total sum of squares of the $x_i$. (Recall that the sample variance of $x_i$ is $SST_x/n$, and so the sample sizes cancel to give us (15.13).) The resulting standard error can be used to construct either t statistics for hypotheses involving $\beta_1$ or confidence intervals for $\beta_1$. ($\hat{\beta}_0$ also has a standard error that we do not present here.) Any modern econometrics package computes the standard error after any IV estimation; there is rarely any reason to perform the calculations by hand.
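Continuing the earlier sketch, the standard error in (15.13) takes only a few more lines; b0 and b1 are the estimates returned by the iv_simple helper defined above.

    import numpy as np

    def iv_se(y, x, z, b0, b1):
        """Asymptotic standard error of the IV slope, equation (15.13)."""
        n = len(y)
        u = y - b0 - b1 * x                   # IV residuals
        sigma2 = np.sum(u**2) / (n - 2)       # sigma^2-hat, with df correction
        sst_x = np.sum((x - x.mean())**2)
        r2_xz = np.corrcoef(x, z)[0, 1]**2    # R-squared from x on z
        return np.sqrt(sigma2 / (sst_x * r2_xz))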
A second reason (15.12) is interesting is that it allows us to compare the asymptotic variances of the IV and the OLS estimators (when x and u are uncorrelated). Under the Gauss-Markov assumptions, the variance of the OLS estimator is $\sigma^2/SST_x$, while the comparable formula for the IV estimator is $\sigma^2/(SST_x \cdot R_{x,z}^2)$; they differ only in that $R_{x,z}^2$ appears in the denominator of the IV variance. Because an R-squared is always less than one, the IV variance is always larger than the OLS variance (when OLS is valid). If $R_{x,z}^2$ is small, then the IV variance can be much larger than the OLS variance. Remember, $R_{x,z}^2$ measures the strength of the linear relationship between x and z in the sample. If x and z are only slightly correlated, $R_{x,z}^2$ can be small, and this can translate into a very large sampling variance for the IV estimator. The more highly correlated z is with x, the closer $R_{x,z}^2$ is to one, and the smaller is the variance of the IV estimator. In the case that z = x, $R_{x,z}^2 = 1$, and we get the OLS variance, as expected.

The previous discussion highlights an important cost of performing IV estimation when x and u are uncorrelated: the asymptotic variance of the IV estimator is always larger, and sometimes much larger, than the asymptotic variance of the OLS estimator.

Example 15.1 Estimating the Return to Education for Married Women

We use the data on married working women in MROZ to estimate the return to education in the simple regression model

$\log(wage) = \beta_0 + \beta_1 educ + u.$   (15.14)

For comparison, we first obtain the OLS estimates:

$\widehat{\log(wage)} = -.185 + .109\, educ$   (15.15)
                       (.185)   (.014)
$n = 428$, $R^2 = .118$.

The estimate for $\beta_1$ implies an almost 11% return for another year of education.

Next, we use father's education (fatheduc) as an instrumental variable for educ. We have to maintain that fatheduc is uncorrelated with u. The second requirement is that educ and fatheduc are correlated. We can check this very easily using a simple regression of educ on fatheduc (using only the working women in the sample):

$\widehat{educ} = 10.24 + .269\, fatheduc$   (15.16)
                 (.28)   (.029)
$n = 428$, $R^2 = .173$.

The t statistic on fatheduc is 9.28, which indicates that educ and fatheduc have a statistically significant positive correlation. In fact, fatheduc explains about 17% of the variation in educ in the sample. Using fatheduc as an IV for educ gives

$\widehat{\log(wage)} = .441 + .059\, educ$   (15.17)
                       (.446)  (.035)
$n = 428$, $R^2 = .093$.

The IV estimate of the return to education is 5.9%, which is barely more than one-half of the OLS estimate. This suggests that the OLS estimate is too high and is consistent with omitted ability bias. But we should remember that these are estimates from just one sample: we can never know whether .109 is above the true return to education, or whether .059 is closer to the true return to education. Further, the standard error of the IV estimate is two and one-half times as large as the OLS standard error (this is expected, for the reasons we gave earlier). The 95% confidence interval for $\beta_1$ using OLS is much tighter than that using the IV; in fact, the IV confidence interval actually contains the OLS estimate. Therefore, although the differences between (15.15) and (15.17) are practically large, we cannot say whether the difference is statistically significant. We will show how to test this in Section 15.5.
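For readers working in Python, a minimal sketch reproducing Example 15.1 with the IV2SLS estimator from the third-party linearmodels package; mroz.csv is a hypothetical CSV export of MROZ, and the bracket term in the formula marks educ as endogenous with fatheduc as its instrument. With one endogenous variable and one instrument, 2SLS reduces to the IV estimator in (15.10).

    import numpy as np
    import pandas as pd
    from linearmodels.iv import IV2SLS

    df = pd.read_csv("mroz.csv")
    df = df[df["inlf"] == 1].copy()      # the 428 working women
    df["lwage"] = np.log(df["wage"])

    iv = IV2SLS.from_formula("lwage ~ 1 + [educ ~ fatheduc]",
                             data=df).fit(cov_type="unadjusted")
    print(iv.params["educ"], iv.std_errors["educ"])   # about .059 and .035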
In the previous example, the estimated return to education using IV was less than that using OLS, which corresponds to our expectations. But this need not have been the case, as the following example demonstrates.

Example 15.2 Estimating the Return to Education for Men

We now use WAGE2 to estimate the return to education for men. We use the variable sibs (number of siblings) as an instrument for educ. These are negatively correlated, as we can verify from a simple regression:

$\widehat{educ} = 14.14 - .228\, sibs$
                 (.11)   (.030)
$n = 935$, $R^2 = .057$.

This equation implies that every sibling is associated with, on average, about .23 less of a year of education. (If we assume that sibs is uncorrelated with the error term in (15.14), then the IV estimator is consistent.) Estimating equation (15.14) using sibs as an IV for educ gives

$\widehat{\log(wage)} = 5.13 + .122\, educ$
                       (.36)  (.026)
$n = 935$.

(The R-squared is computed to be negative, so we do not report it. A discussion of R-squared in the context of IV estimation follows.) For comparison, the OLS estimate of $\beta_1$ is .059, with a standard error of .006. Unlike in the previous example, the IV estimate is now much higher than the OLS estimate. While we do not know whether the difference is statistically significant, this does not mesh with the omitted ability bias from OLS. It could be that sibs is also correlated with ability: more siblings means, on average, less parental attention, which could result in lower ability. Another interpretation is that the OLS estimator is biased toward zero because of measurement error in educ. This is not entirely convincing because, as we discussed in Section 9.3, educ is unlikely to satisfy the classical errors-in-variables model.

In the previous examples, the endogenous explanatory variable (educ) and the instrumental variables (fatheduc, sibs) have quantitative meaning. But nothing prevents the explanatory variable or IV from being binary variables. Angrist and Krueger (1991), in their simplest analysis, came up with a clever binary instrumental variable for educ, using census data on men in the United States. Let frstqrt be equal to one if the man was born in the first quarter of the year, and zero otherwise. It seems that the error term in (15.14)—and, in particular, ability—should be unrelated to quarter of birth. But frstqrt also needs to be correlated with educ. It turns out that years of education do differ systematically in the population based on quarter of birth. Angrist and Krueger argued persuasively that this is due to compulsory school attendance laws in effect in all states. Briefly, students born early in the year typically begin school at an older age. Therefore, they reach the compulsory schooling age (16 in most states) with somewhat less education than students who begin school at a younger age. For students who finish high school, Angrist and Krueger verified that there is no relationship between years of education and quarter of birth.

Because years of education varies only slightly across quarter of birth—which means $R_{x,z}^2$ in (15.13) is very small—Angrist and Krueger needed a very large sample size to get a reasonably precise IV estimate.
Using 247,199 men born between 1920 and 1929, the OLS estimate of the return to education was .0801 (standard error .0004), and the IV estimate was .0715 (.0219); these are reported in Table III of Angrist and Krueger's paper. Note how large the t statistic is for the OLS estimate (about 200), whereas the t statistic for the IV estimate is only 3.26. Thus, the IV estimate is statistically different from zero, but its confidence interval is much wider than that based on the OLS estimate.

An interesting finding by Angrist and Krueger is that the IV estimate does not differ much from the OLS estimate. In fact, using men born in the next decade, the IV estimate is somewhat higher than the OLS estimate. One could interpret this as showing that there is no omitted ability bias when wage equations are estimated by OLS. However, the Angrist and Krueger paper has been criticized on econometric grounds. As discussed by Bound, Jaeger, and Baker (1995), it is not obvious that season of birth is unrelated to unobserved factors that affect wage. As we will explain in the next subsection, even a small amount of correlation between z and u can cause serious problems for the IV estimator.

For policy analysis, the endogenous explanatory variable is often a binary variable. For example, Angrist (1990) studied the effect that being a veteran of the Vietnam War had on lifetime earnings. A simple model is

$\log(earns) = \beta_0 + \beta_1 veteran + u,$   (15.18)

where veteran is a binary variable. The problem with estimating this equation by OLS is that there may be a self-selection problem, as we mentioned in Chapter 7: perhaps people who get the most out of the military choose to join, or the decision to join is correlated with other characteristics that affect earnings. These will cause veteran and u to be correlated.

Angrist pointed out that the Vietnam draft lottery provided a natural experiment (see also Chapter 13) that created an instrumental variable for veteran. Young men were given lottery numbers that determined whether they would be called to serve in Vietnam. Because the numbers given were eventually randomly assigned, it seems plausible that draft lottery number is uncorrelated with the error term u. But those with a low enough number had to serve in Vietnam, so that the probability of being a veteran is correlated with lottery number. If both of these assertions are true, draft lottery number is a good IV candidate for veteran.

It is also possible to have a binary endogenous explanatory variable and a binary instrumental variable. (See Problem 1 for an example.)

15.1b Properties of IV with a Poor Instrumental Variable

We have already seen that, though IV is consistent when z and u are uncorrelated and z and x have any positive or negative correlation, IV estimates can have large standard errors, especially if z and x are only weakly correlated. Weak correlation between z and x can have even more serious consequences: the IV estimator can have a large asymptotic bias even if z and u are only moderately correlated.
We can see this by studying the probability limit of the IV estimator when z and u are possibly correlated. Letting $\hat{\beta}_{1,IV}$ denote the IV estimator, we can write

$\mathrm{plim}\ \hat{\beta}_{1,IV} = \beta_1 + \frac{\mathrm{Corr}(z, u)}{\mathrm{Corr}(z, x)}\cdot\frac{\sigma_u}{\sigma_x},$   (15.19)

where $\sigma_u$ and $\sigma_x$ are the standard deviations of u and x in the population, respectively. The interesting part of this equation involves the correlation terms. It shows that, even if Corr(z, u) is small, the inconsistency in the IV estimator can be very large if Corr(z, x) is also small. Thus, even if we focus only on consistency, it is not necessarily better to use IV than OLS if the correlation between z and u is smaller than that between x and u. Using the fact that $\mathrm{Corr}(x, u) = \mathrm{Cov}(x, u)/(\sigma_x\sigma_u)$, along with equation (5.3), we can write the plim of the OLS estimator—call it $\hat{\beta}_{1,OLS}$—as

$\mathrm{plim}\ \hat{\beta}_{1,OLS} = \beta_1 + \mathrm{Corr}(x, u)\cdot\frac{\sigma_u}{\sigma_x}.$   (15.20)

Comparing these formulas shows that it is possible for the directions of the asymptotic biases to be different for IV and OLS. For example, suppose Corr(x, u) > 0, Corr(z, x) > 0, and Corr(z, u) < 0. Then, the IV estimator has a downward bias, whereas the OLS estimator has an upward bias (asymptotically). In practice, this situation is probably rare.

(Exploring Further 15.1: If some men who were assigned low draft lottery numbers obtained additional schooling to reduce the probability of being drafted, is lottery number a good instrument for veteran in (15.18)?)

More problematic is when the direction of the bias is the same and the correlation between z and x is small. For concreteness, suppose x and z are both positively correlated with u and Corr(z, x) > 0. Then, the asymptotic bias in the IV estimator is less than that for OLS only if Corr(z, u)/Corr(z, x) < Corr(x, u). If Corr(z, x) is small, then a seemingly small correlation between z and u can be magnified and make IV worse than OLS, even if we restrict attention to bias. For example, if Corr(z, x) = .2, Corr(z, u) must be less than one-fifth of Corr(x, u) before IV has less asymptotic bias than OLS. In many applications, the correlation between the instrument and x is less than .2. Unfortunately, because we rarely have an idea about the relative magnitudes of Corr(z, u) and Corr(x, u), we can never know for sure which estimator has the largest asymptotic bias (unless, of course, we assume Corr(z, u) = 0).

In the Angrist and Krueger (1991) example mentioned earlier, where x is years of schooling and z is a binary variable indicating quarter of birth, the correlation between z and x is very small. Bound, Jaeger, and Baker (1995) discussed reasons why quarter of birth and u might be somewhat correlated. From equation (15.19), we see that this can lead to a substantial bias in the IV estimator.

When z and x are not correlated at all, things are especially bad, whether or not z is uncorrelated with u. The following example illustrates why we should always check to see if the endogenous explanatory variable is correlated with the IV candidate.
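Before turning to the example, a small numeric sketch of the bias comparison implied by (15.19) and (15.20); all three correlation values are made up purely for illustration.

    # Asymptotic bias of IV relative to OLS, from (15.19) and (15.20):
    #   bias_IV / bias_OLS = [Corr(z,u) / Corr(z,x)] / Corr(x,u)
    corr_xu, corr_zx, corr_zu = 0.20, 0.20, 0.05   # hypothetical values

    ratio = (corr_zu / corr_zx) / corr_xu
    print(ratio)   # 1.25 > 1: here IV is asymptotically *more* biased than OLS

Even though Corr(z, u) = .05 looks "small" next to Corr(x, u) = .20, the weak instrument (Corr(z, x) = .20) magnifies it enough to make IV worse than OLS in terms of asymptotic bias.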
When z and x are not correlated at all, things are especially bad, whether or not z is uncorrelated with u. The following example illustrates why we should always check to see if the endogenous explanatory variable is correlated with the IV candidate.

Example 15.3: Estimating the Effect of Smoking on Birth Weight

In Chapter 6, we estimated the effect of cigarette smoking on child birth weight. Without other explanatory variables, the model is

$$\log(bwght) = \beta_0 + \beta_1 packs + u, \qquad (15.21)$$

where packs is the number of packs smoked by the mother per day. We might worry that packs is correlated with other health factors or the availability of good prenatal care, so that packs and u might be correlated. A possible instrumental variable for packs is the average price of cigarettes in the state of residence, cigprice. We will assume that cigprice and u are uncorrelated (even though state support for health care could be correlated with cigarette taxes).

If cigarettes are a typical consumption good, basic economic theory suggests that packs and cigprice are negatively correlated, so that cigprice can be used as an IV for packs. To check this, we regress packs on cigprice, using the data in BWGHT:

$$\widehat{packs} = .067 + .0003\,cigprice$$
$$\qquad\ \ (.103)\quad (.0008)$$
$$n = 1{,}388,\ R^2 = .0000,\ \bar R^2 = -.0006.$$

This indicates no relationship between smoking during pregnancy and cigarette prices, which is perhaps not too surprising given the addictive nature of cigarette smoking.

Because packs and cigprice are not correlated, we should not use cigprice as an IV for packs in (15.21). But what happens if we do? The IV results would be

$$\widehat{\log(bwght)} = 4.45 + 2.99\,packs$$
$$\qquad\qquad\ (0.91)\quad (8.70)$$
$$n = 1{,}388$$

(the reported R-squared is negative). The coefficient on packs is huge and of an unexpected sign. The standard error is also very large, so packs is not significant. But the estimates are meaningless because cigprice fails the one requirement of an IV that we can always test: assumption (15.5).
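The kind of first-stage relevance check run in Example 15.3 takes one regression and one t statistic. A minimal sketch follows; the DataFrame is a simulated stand-in (with a zero true link, as in the example), since the BWGHT data are not loaded here.

```python
# Relevance check in the spirit of Example 15.3: regress the endogenous
# variable on the IV candidate and inspect the t statistic. Simulated data.
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(0)
df = pd.DataFrame({"cigprice": rng.uniform(100, 160, 1388)})
df["packs"] = 0.3 + 0.0 * df["cigprice"] + rng.exponential(0.3, 1388)  # no true link

first_stage = sm.OLS(df["packs"], sm.add_constant(df["cigprice"])).fit()
print("t on cigprice:", first_stage.tvalues["cigprice"])  # tiny |t|: irrelevant IV
```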
The previous example shows that IV estimation can produce strange results when the instrument relevance condition, Corr(z, x) ≠ 0, fails. Of practically greater interest is the so-called problem of weak instruments, which is loosely defined as the problem of low (but not zero) correlation between z and x. In a particular application, it is difficult to define how low is too low, but recent theoretical research, supplemented by simulation studies, has shed considerable light on the issue. Staiger and Stock (1997) formalized the problem of weak instruments by modeling the correlation between z and x as a function of the sample size; in particular, the correlation is assumed to shrink to zero at the rate $1/\sqrt{n}$. Not surprisingly, the asymptotic distribution of the instrumental variables estimator is different compared with the usual asymptotics, where the correlation is assumed to be fixed and nonzero. One of the implications of the Staiger and Stock work is that the usual statistical inference, based on t statistics and the standard normal distribution, can be seriously misleading. We discuss this further in Section 15.3.

15.1c Computing R-Squared after IV Estimation

Most regression packages compute an R-squared after IV estimation, using the standard formula $R^2 = 1 - \text{SSR}/\text{SST}$, where SSR is the sum of squared IV residuals and SST is the total sum of squares of y. Unlike in the case of OLS, the R-squared from IV estimation can be negative because SSR for IV can actually be larger than SST. Although it does not really hurt to report the R-squared for IV estimation, it is not very useful either. When x and u are correlated, we cannot decompose the variance of y into $\beta_1^2 \text{Var}(x) + \text{Var}(u)$, and so the R-squared has no natural interpretation. In addition, as we will discuss in Section 15.3, these R-squareds cannot be used in the usual way to compute F tests of joint restrictions.

If our goal was to produce the largest R-squared, we would always use OLS. IV methods are intended to provide better estimates of the ceteris paribus effect of x on y when x and u are correlated; goodness-of-fit is not a factor. A high R-squared resulting from OLS is of little comfort if we cannot consistently estimate $\beta_1$.

15.2 IV Estimation of the Multiple Regression Model

The IV estimator for the simple regression model is easily extended to the multiple regression case. We begin with the case where only one of the explanatory variables is correlated with the error. In fact, consider a standard linear model with two explanatory variables:

$$y_1 = \beta_0 + \beta_1 y_2 + \beta_2 z_1 + u_1. \qquad (15.22)$$

We call this a structural equation to emphasize that we are interested in the $\beta_j$, which simply means that the equation is supposed to measure a causal relationship. We use a new notation here to distinguish endogenous from exogenous variables. The dependent variable $y_1$ is clearly endogenous, as it is correlated with $u_1$. The variables $y_2$ and $z_1$ are the explanatory variables, and $u_1$ is the error. As usual, we assume that the expected value of $u_1$ is zero: $E(u_1) = 0$. We use $z_1$ to indicate that this variable is exogenous in (15.22) ($z_1$ is uncorrelated with $u_1$). We use $y_2$ to indicate that this variable is suspected of being correlated with $u_1$. We do not specify why $y_2$ and $u_1$ are correlated, but for now it is best to think of $u_1$ as containing an omitted variable correlated with $y_2$. The notation in equation (15.22) originates in simultaneous equations models (which we cover in Chapter 16), but we use it more generally to easily distinguish exogenous from endogenous explanatory variables in a multiple regression model.

An example of (15.22) is

$$\log(wage) = \beta_0 + \beta_1 educ + \beta_2 exper + u_1, \qquad (15.23)$$

where $y_1 = \log(wage)$, $y_2 = educ$, and $z_1 = exper$. In other words, we assume that exper is exogenous in (15.23), but we allow that educ, for the usual reasons, is correlated with $u_1$.

We know that if (15.22) is estimated by OLS, all of the estimators will be biased and inconsistent. Thus, we follow the strategy suggested in the previous section and seek an instrumental variable for $y_2$. Since $z_1$ is assumed to be uncorrelated with $u_1$, can we use $z_1$ as an instrument for $y_2$, assuming $y_2$ and $z_1$ are correlated? The answer is no. Since $z_1$ itself appears as an explanatory variable in (15.22), it cannot serve as an instrumental variable for $y_2$. We need another exogenous variable, call it $z_2$, that does not appear in (15.22). Therefore, key assumptions are that $z_1$ and $z_2$ are uncorrelated with $u_1$; we also assume that $u_1$ has zero expected value, which is without loss of generality when the equation contains an intercept:

$$E(u_1) = 0,\quad \text{Cov}(z_1, u_1) = 0,\quad \text{and}\quad \text{Cov}(z_2, u_1) = 0. \qquad (15.24)$$

Given the zero mean assumption, the latter two assumptions are equivalent to $E(z_1 u_1) = E(z_2 u_1) = 0$, and so the method of moments approach suggests obtaining estimators $\hat\beta_0$, $\hat\beta_1$, and $\hat\beta_2$ by solving the sample counterparts of (15.24):
$$\sum_{i=1}^n (y_{i1} - \hat\beta_0 - \hat\beta_1 y_{i2} - \hat\beta_2 z_{i1}) = 0$$
$$\sum_{i=1}^n z_{i1}(y_{i1} - \hat\beta_0 - \hat\beta_1 y_{i2} - \hat\beta_2 z_{i1}) = 0 \qquad (15.25)$$
$$\sum_{i=1}^n z_{i2}(y_{i1} - \hat\beta_0 - \hat\beta_1 y_{i2} - \hat\beta_2 z_{i1}) = 0.$$

This is a set of three linear equations in the three unknowns $\hat\beta_0$, $\hat\beta_1$, and $\hat\beta_2$, and it is easily solved given the data on $y_1$, $y_2$, $z_1$, and $z_2$. The estimators are called instrumental variables estimators. If we think $y_2$ is exogenous and we choose $z_2 = y_2$, equations (15.25) are exactly the first order conditions for the OLS estimators; see equations (3.13).

We still need the instrumental variable $z_2$ to be correlated with $y_2$, but the sense in which these two variables must be correlated is complicated by the presence of $z_1$ in equation (15.22). We now need to state the assumption in terms of partial correlation. The easiest way to state the condition is to write the endogenous explanatory variable as a linear function of the exogenous variables and an error term:

$$y_2 = \pi_0 + \pi_1 z_1 + \pi_2 z_2 + v_2, \qquad (15.26)$$

where, by construction, $E(v_2) = 0$, $\text{Cov}(z_1, v_2) = 0$, and $\text{Cov}(z_2, v_2) = 0$, and the $\pi_j$ are unknown parameters. The key identification condition [along with (15.24)] is that

$$\pi_2 \neq 0. \qquad (15.27)$$

In other words, after partialling out $z_1$, $y_2$ and $z_2$ are still correlated. This correlation can be positive or negative, but it cannot be zero. Testing (15.27) is easy: we estimate (15.26) by OLS and use a t test (possibly making it robust to heteroskedasticity). We should always test this assumption. Unfortunately, we cannot test that $z_1$ and $z_2$ are uncorrelated with $u_1$; hopefully, we can make the case based on economic reasoning or introspection.

Exploring Further 15.2: Suppose we wish to estimate the effect of marijuana usage on college grade point average. For the population of college seniors at a university, let daysused denote the number of days in the past month on which a student smoked marijuana, and consider the structural equation

$$colGPA = \beta_0 + \beta_1 daysused + \beta_2 SAT + u.$$

(i) Let percHS denote the percentage of a student's high school graduating class that reported regular use of marijuana. If this is an IV candidate for daysused, write the reduced form for daysused. Do you think (15.27) is likely to be true? (ii) Do you think percHS is truly exogenous in the structural equation? What problems might there be?

Equation (15.26) is an example of a reduced form equation, which means that we have written an endogenous variable in terms of exogenous variables. This name comes from simultaneous equations models, which we study in the next chapter, but it is a useful concept whenever we have an endogenous explanatory variable. The name helps distinguish it from the structural equation (15.22).

Adding more exogenous explanatory variables to the model is straightforward. Write the structural model as

$$y_1 = \beta_0 + \beta_1 y_2 + \beta_2 z_1 + \dots + \beta_k z_{k-1} + u_1, \qquad (15.28)$$

where $y_2$ is thought to be correlated with $u_1$. Let $z_k$ be a variable not in (15.28) that is also exogenous. Therefore, we assume that

$$E(u_1) = 0,\quad \text{Cov}(z_j, u_1) = 0,\quad j = 1, \dots, k. \qquad (15.29)$$

Under (15.29), $z_1, \dots, z_{k-1}$ are the exogenous variables appearing in (15.28). In effect, these act as their own instrumental variables in estimating the $\beta_j$ in (15.28).
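Because the moment conditions in (15.25) are linear in the three unknowns, they can be solved directly as a 3-by-3 linear system. The sketch below does this on simulated data; all parameter values and variable names are illustrative, not taken from the text.

```python
# Sketch of solving the sample moment conditions (15.25) as a linear system:
# with instruments Z = [1, z1, z2] and regressors X = [1, y2, z1], the three
# conditions are Z'(y1 - X @ beta) = 0. Simulated data, illustrative values.
import numpy as np

rng = np.random.default_rng(1)
n = 5_000
z1, z2, v2, u1 = rng.standard_normal((4, n))
y2 = 0.5 * z1 + 0.8 * z2 + v2 + 0.4 * u1   # endogenous: depends on u1
y1 = 1.0 + 2.0 * y2 + 1.5 * z1 + u1

X = np.column_stack([np.ones(n), y2, z1])  # included regressors
Z = np.column_stack([np.ones(n), z1, z2])  # instruments (z1 is its own IV)
beta_iv = np.linalg.solve(Z.T @ X, Z.T @ y1)
print("IV estimates (b0, b1, b2):", beta_iv.round(3))   # near (1, 2, 1.5)
```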
The special case of $k = 2$ is given in the equations in (15.25); along with $z_2$, $z_1$ appears in the set of moment conditions used to obtain the IV estimates. More generally, $z_1, \dots, z_{k-1}$ are used in the moment conditions along with the instrumental variable for $y_2$, $z_k$.

The reduced form for $y_2$ is

$$y_2 = \pi_0 + \pi_1 z_1 + \dots + \pi_{k-1} z_{k-1} + \pi_k z_k + v_2, \qquad (15.30)$$

and we need some partial correlation between $z_k$ and $y_2$:

$$\pi_k \neq 0. \qquad (15.31)$$

Under (15.29) and (15.31), $z_k$ is a valid IV for $y_2$. (We do not care about the remaining $\pi_j$ in (15.30); some or all of them could be zero.) A minor additional assumption is that there are no perfect linear relationships among the exogenous variables; this is analogous to the assumption of no perfect collinearity in the context of OLS.

For standard statistical inference, we need to assume homoskedasticity of $u_1$. We give a careful statement of these assumptions in a more general setting in Section 15.3.

Example 15.4: Using College Proximity as an IV for Education

Card (1995) used wage and education data for a sample of men in 1976 to estimate the return to education. He used a dummy variable for whether someone grew up near a four-year college (nearc4) as an instrumental variable for education. In a log(wage) equation, he included other standard controls: experience, a black dummy variable, dummy variables for living in an SMSA and living in the South, and a full set of regional dummy variables and an SMSA dummy for where the man was living in 1966. In order for nearc4 to be a valid instrument, it must be uncorrelated with the error term in the wage equation (we assume this) and it must be partially correlated with educ. To check the latter requirement, we regress educ on nearc4 and all of the exogenous variables appearing in the equation. That is, we estimate the reduced form for educ. Using the data in CARD, we obtain, in condensed form,

$$\widehat{educ} = 16.64 + .320\,nearc4 - .413\,exper + \dots$$
$$\qquad\ \ (.24)\quad (.088)\qquad\ (.034)$$
$$n = 3{,}010,\ R^2 = .477.$$

We are interested in the coefficient and t statistic on nearc4. The coefficient implies that in 1976, other things being fixed (experience, race, region, and so on), people who lived near a college in 1966 had, on average, about one-third of a year more education than those who did not grow up near a college. The t statistic on nearc4 is 3.64, which gives a p-value that is zero in the first three decimals.

As discussed earlier, we should not make anything of the smaller R-squared in the IV estimation; by definition, the OLS R-squared will always be larger because OLS minimizes the sum of squared residuals.

Therefore, if nearc4 is uncorrelated with unobserved factors in the error term, we can use nearc4 as an IV for educ. The OLS and IV estimates are given in Table 15.1. (Like the OLS standard errors, the reported IV standard errors employ a degrees-of-freedom adjustment in estimating the error variance. In some statistical packages the degrees-of-freedom adjustment is the default; in others, it is not.) Interestingly, the IV estimate of the return to education is almost twice as large as the OLS estimate, but the standard error of the IV estimate is over 18 times larger than the OLS standard error.
The 95% confidence interval for the IV estimate is between .024 and .239, which is a very wide range. The presence of larger confidence intervals is a price we must pay to get a consistent estimator of the return to education when we think educ is endogenous.

Table 15.1  Dependent Variable: log(wage)

Explanatory Variables    OLS               IV
educ                     .075 (.003)       .132 (.055)
exper                    .085 (.007)       .108 (.024)
exper^2                  -.0023 (.0003)    -.0023 (.0003)
black                    -.199 (.018)      -.147 (.054)
smsa                     .136 (.020)       .112 (.032)
south                    -.148 (.026)      -.145 (.027)
Observations             3,010             3,010
R-squared                .300              .238
Other controls: smsa66, reg662, ..., reg669

It is worth noting, especially for studying the effects of policy interventions, that a reduced form equation exists for $y_1$, too. In the context of equation (15.28) with $z_k$ an IV for $y_2$, the reduced form for $y_1$ always has the form

$$y_1 = \gamma_0 + \gamma_1 z_1 + \dots + \gamma_k z_k + e_1, \qquad (15.32)$$

where $\gamma_j = \beta_j + \beta_1 \pi_j$ for $j < k$, $\gamma_k = \beta_1 \pi_k$, and $e_1 = u_1 + \beta_1 v_2$, as can be verified by plugging (15.30) into (15.28) and rearranging. Because the $z_j$ are exogenous in (15.32), the $\gamma_j$ can be consistently estimated by OLS. In other words, we regress $y_1$ on all of the exogenous variables, including $z_k$, the IV for $y_2$. Only if we want to estimate $\beta_1$ in (15.28) do we need to apply IV.

When $y_2$ is a zero-one variable denoting participation, and $z_k$ is a zero-one variable representing eligibility for program participation (which is hopefully either randomized across individuals or, at most, a function of the other exogenous variables $z_1, \dots, z_{k-1}$, such as income), the coefficient $\gamma_k$ has an interesting interpretation. Rather than an estimate of the effect of the program itself, it is an estimate of the effect of offering the program. Unlike $\beta_1$ in (15.28), which measures the effect of the program itself, $\gamma_k$ accounts for the possibility that some units made eligible will choose not to participate. In the program evaluation literature, $\gamma_k$ is an example of an intention-to-treat parameter: it measures the effect of being made eligible, not the effect of actual participation. The intention-to-treat coefficient, $\gamma_k = \beta_1 \pi_k$, depends on the effect of participating, $\beta_1$, and the change (typically, increase) in the probability of participating due to being eligible, $\pi_k$. When $y_2$ is binary, equation (15.30) is a linear probability model, and therefore $\pi_k$ measures the ceteris paribus change in probability that $y_2 = 1$ as $z_k$ switches from zero to one.

15.3 Two Stage Least Squares

In the previous section, we assumed that we had a single endogenous explanatory variable ($y_2$), along with one instrumental variable for $y_2$. It often happens that we have more than one exogenous variable that is excluded from the structural model and might be correlated with $y_2$, which means they are valid IVs for $y_2$. In this section, we discuss how to use multiple instrumental variables.

15.3a A Single Endogenous Explanatory Variable

Consider again the structural model (15.22), which has one endogenous and one exogenous explanatory variable. Suppose now that we have two exogenous variables excluded from (15.22): $z_2$ and $z_3$. Our assumptions that $z_2$ and $z_3$ do not appear in (15.22) and are uncorrelated with the error $u_1$ are
known as exclusion restrictions.

If $z_2$ and $z_3$ are both correlated with $y_2$, we could just use each as an IV, as in the previous section. But then we would have two IV estimators, and neither of these would, in general, be efficient. Since each of $z_1$, $z_2$, and $z_3$ is uncorrelated with $u_1$, any linear combination is also uncorrelated with $u_1$, and therefore any linear combination of the exogenous variables is a valid IV. To find the best IV, we choose the linear combination that is most highly correlated with $y_2$. This turns out to be given by the reduced form equation for $y_2$. Write

$$y_2 = \pi_0 + \pi_1 z_1 + \pi_2 z_2 + \pi_3 z_3 + v_2, \qquad (15.33)$$

where $E(v_2) = 0$, $\text{Cov}(z_1, v_2) = 0$, $\text{Cov}(z_2, v_2) = 0$, and $\text{Cov}(z_3, v_2) = 0$.

Then, the best IV for $y_2$ (under the assumptions given in the chapter appendix) is the linear combination of the $z_j$ in (15.33), which we call $y_2^*$:

$$y_2^* = \pi_0 + \pi_1 z_1 + \pi_2 z_2 + \pi_3 z_3. \qquad (15.34)$$

For this IV not to be perfectly correlated with $z_1$, we need at least one of $\pi_2$ or $\pi_3$ to be different from zero:

$$\pi_2 \neq 0 \quad \text{or} \quad \pi_3 \neq 0. \qquad (15.35)$$

This is the key identification assumption, once we assume the $z_j$ are all exogenous. (The value of $\pi_1$ is irrelevant.) The structural equation (15.22) is not identified if $\pi_2 = 0$ and $\pi_3 = 0$. We can test $H_0\colon \pi_2 = 0$ and $\pi_3 = 0$ against (15.35) using an F statistic.

A useful way to think of (15.33) is that it breaks $y_2$ into two pieces. The first is $y_2^*$; this is the part of $y_2$ that is uncorrelated with the error term, $u_1$. The second piece is $v_2$, and this part is possibly correlated with $u_1$, which is why $y_2$ is possibly endogenous.

Given data on the $z_j$, we can compute $y_2^*$ for each observation, provided we know the population parameters $\pi_j$. This is never true in practice. Nevertheless, as we saw in the previous section, we can always estimate the reduced form by OLS. Thus, using the sample, we regress $y_2$ on $z_1$, $z_2$, and $z_3$ and obtain the fitted values:

$$\hat y_2 = \hat\pi_0 + \hat\pi_1 z_1 + \hat\pi_2 z_2 + \hat\pi_3 z_3; \qquad (15.36)$$

that is, we have $\hat y_{i2}$ for each $i$. At this point, we should verify that $z_2$ and $z_3$ are jointly significant in (15.33) at a reasonably small significance level (no larger than 5%). If $z_2$ and $z_3$ are not jointly significant in (15.33), then we are wasting our time with IV estimation.

Once we have $\hat y_2$, we can use it as the IV for $y_2$. The three equations for estimating $\beta_0$, $\beta_1$, and $\beta_2$ are the first two equations of (15.25), with the third replaced by

$$\sum_{i=1}^n \hat y_{i2}(y_{i1} - \hat\beta_0 - \hat\beta_1 y_{i2} - \hat\beta_2 z_{i1}) = 0. \qquad (15.37)$$

Solving the three equations in three unknowns gives us the IV estimators.

With multiple instruments, the IV estimator using $\hat y_{i2}$ as the instrument is also called the two stage least squares (2SLS) estimator. The reason is simple. Using the algebra of OLS, it can be shown that when we use $\hat y_2$ as the IV for $y_2$, the IV estimates $\hat\beta_0$, $\hat\beta_1$, and $\hat\beta_2$ are identical to the OLS estimates from the regression of

$$y_1 \text{ on } \hat y_2 \text{ and } z_1. \qquad (15.38)$$

In other words, we can obtain the 2SLS estimator in two stages. The first stage is to run the regression in (15.36), where we obtain the fitted values $\hat y_2$. The second stage is the OLS regression (15.38). Because we use $\hat y_2$ in place of $y_2$, the 2SLS estimates can differ substantially from the OLS estimates.
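The two stages can be run by hand to see the mechanics, as in the sketch below (simulated data, illustrative values). The point estimates are the 2SLS estimates, but, as discussed just after the sketch, the standard errors printed by the second-stage OLS are not the valid 2SLS ones.

```python
# Sketch of the two stages behind 2SLS on simulated data. The second-stage
# point estimates are the 2SLS estimates, but the OLS standard errors from
# this regression are NOT valid: they are based on the residuals from
# (15.39), which include b1*v2, not u1 alone.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 5_000
z1, z2, v2, u1 = rng.standard_normal((4, n))
y2 = 0.5 * z1 + 0.8 * z2 + v2 + 0.4 * u1             # endogenous regressor
y1 = 1.0 + 2.0 * y2 + 1.5 * z1 + u1

Zc = sm.add_constant(np.column_stack([z1, z2]))      # stage 1: y2 on 1, z1, z2
y2_hat = sm.OLS(y2, Zc).fit().fittedvalues

Xc = sm.add_constant(np.column_stack([y2_hat, z1]))  # stage 2: replace y2 by y2_hat
second = sm.OLS(y1, Xc).fit()
print("2SLS point estimates:", second.params.round(3))  # near (1, 2, 1.5)
print("(the standard errors from this regression are not the valid 2SLS ones)")
```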
Some economists like to interpret the regression in (15.38) as follows. The fitted value, $\hat y_2$, is the estimated version of $y_2^*$, and $y_2^*$ is uncorrelated with $u_1$. Therefore, 2SLS first purges $y_2$ of its correlation with $u_1$ before doing the OLS regression in (15.38). We can show this by plugging $y_2 = y_2^* + v_2$ into (15.22):

$$y_1 = \beta_0 + \beta_1 y_2^* + \beta_2 z_1 + u_1 + \beta_1 v_2. \qquad (15.39)$$

Now, the composite error $u_1 + \beta_1 v_2$ has zero mean and is uncorrelated with $y_2^*$ and $z_1$, which is why the OLS regression in (15.38) works.

Most econometrics packages have special commands for 2SLS, so there is no need to perform the two stages explicitly. In fact, in most cases you should avoid doing the second stage manually, as the standard errors and test statistics obtained in this way are not valid. The reason is that the error term in (15.39) includes $v_2$, but the standard errors involve the variance of $u_1$ only. Any regression software that supports 2SLS asks for the dependent variable, the list of explanatory variables (both exogenous and endogenous), and the entire list of instrumental variables (that is, all exogenous variables). The output is typically quite similar to that for OLS.

In model (15.28) with a single IV for $y_2$, the IV estimator from Section 15.2 is identical to the 2SLS estimator. Therefore, when we have one IV for each endogenous explanatory variable, we can call the estimation method IV or 2SLS.

Adding more exogenous variables changes very little. For example, suppose the wage equation is

$$\log(wage) = \beta_0 + \beta_1 educ + \beta_2 exper + \beta_3 exper^2 + u_1, \qquad (15.40)$$

where $u_1$ is uncorrelated with both exper and $exper^2$. Suppose that we also think mother's and father's educations are uncorrelated with $u_1$. Then, we can use both of these as IVs for educ. The reduced form equation for educ is

$$educ = \pi_0 + \pi_1 exper + \pi_2 exper^2 + \pi_3 motheduc + \pi_4 fatheduc + v_2, \qquad (15.41)$$

and identification requires that $\pi_3 \neq 0$ or $\pi_4 \neq 0$ (or both, of course).

Example 15.5: Return to Education for Working Women

We estimate equation (15.40) using the data in MROZ. First, we test $H_0\colon \pi_3 = 0, \pi_4 = 0$ in (15.41) using an F test. The result is $F = 124.76$ and p-value = .0000. As expected, educ is partially correlated with parents' education.

When we estimate (15.40) by 2SLS, we obtain, in equation form,

$$\widehat{\log(wage)} = .048 + .061\,educ + .044\,exper - .0009\,exper^2$$
$$\qquad\qquad\ (.400)\quad (.031)\qquad (.013)\qquad\ \ (.0004)$$
$$n = 428,\ R^2 = .136.$$

The estimated return to education is about 6.1%, compared with an OLS estimate of about 10.8%. Because of its relatively large standard error, the 2SLS estimate is barely statistically significant at the 5% level against a two-sided alternative.

The assumptions needed for 2SLS to have the desired large sample properties are given in the chapter appendix, but it is useful to briefly summarize them here. If we write the structural equation as in (15.28),

$$y_1 = \beta_0 + \beta_1 y_2 + \beta_2 z_1 + \dots + \beta_k z_{k-1} + u_1, \qquad (15.42)$$

then we assume each $z_j$ to be uncorrelated with $u_1$. In addition, we need at least one exogenous variable not in (15.42) that is partially correlated with $y_2$. This ensures consistency. For the usual 2SLS standard errors and t statistics to be asymptotically valid, we also need a homoskedasticity assumption: the variance of the structural error, $u_1$, cannot
depend on any of the exogenous variables. For time series applications, we need more assumptions, as we will see in Section 15.7.

15.3b Multicollinearity and 2SLS

In Chapter 3, we introduced the problem of multicollinearity and showed how correlation among regressors can lead to large standard errors for the OLS estimates. Multicollinearity can be even more serious with 2SLS. To see why, the (asymptotic) variance of the 2SLS estimator of $\beta_1$ can be approximated as

$$\frac{\sigma^2}{\widehat{\text{SST}}_2\,(1 - \hat R_2^2)}, \qquad (15.43)$$

where $\sigma^2 = \text{Var}(u_1)$, $\widehat{\text{SST}}_2$ is the total variation in $\hat y_2$, and $\hat R_2^2$ is the R-squared from a regression of $\hat y_2$ on all other exogenous variables appearing in the structural equation. There are two reasons why the variance of the 2SLS estimator is larger than that for OLS. First, $\hat y_2$, by construction, has less variation than $y_2$. (Remember: total sum of squares = explained sum of squares + residual sum of squares; the variation in $y_2$ is the total sum of squares, while the variation in $\hat y_2$ is the explained sum of squares from the first stage regression.) Second, the correlation between $\hat y_2$ and the exogenous variables in (15.42) is often much higher than the correlation between $y_2$ and these variables. This essentially defines the multicollinearity problem in 2SLS.

As an illustration, consider Example 15.4. When educ is regressed on the exogenous variables in Table 15.1 (not including nearc4), R-squared = .475; this is a moderate degree of multicollinearity, but the important thing is that the OLS standard error on $\hat\beta_{educ}$ is quite small. When we obtain the first stage fitted values, $\widehat{educ}$, and regress these on the exogenous variables in Table 15.1, R-squared = .995, which indicates a very high degree of multicollinearity between $\widehat{educ}$ and the remaining exogenous variables in the table. (This high R-squared is not too surprising because $\widehat{educ}$ is a function of all the exogenous variables in Table 15.1, plus nearc4.) Equation (15.43) shows that an $\hat R_2^2$ close to one can result in a very large standard error for the 2SLS estimator. But, as with OLS, a large sample size can help offset a large $\hat R_2^2$.

15.3c Detecting Weak Instruments

In Section 15.1, we briefly discussed the problem of weak instruments. We focused on equation (15.19), which demonstrates how a small correlation between the instrument and error can lead to very large inconsistency (and therefore bias) if the instrument, z, also has little correlation with the explanatory variable, x. The same problem can arise in the context of the multiple equation model in equation (15.42), whether we have one instrument for $y_2$ or more instruments than we need. We also mentioned the findings of Staiger and Stock (1997), and we now discuss the practical implications of this research in a bit more depth.

Importantly, Staiger and Stock study the case where all instrumental variables are exogenous. With the exogeneity requirement satisfied by the instruments, they focus on the case where the instruments are weakly correlated with $y_2$, and they study the validity of standard errors, confidence intervals, and t statistics involving the coefficient $\beta_1$ on $y_2$. The mechanism they used to model weak correlation
led to an important finding: even with very large sample sizes, the 2SLS estimator can be biased and can have a distribution that is very different from standard normal. Building on Staiger and Stock (1997), Stock and Yogo (2005) (SY for short) proposed methods for detecting situations where weak instruments will lead to substantial bias and distorted statistical inference. Conveniently, Stock and Yogo obtained rules concerning the size of the t statistic (with one instrument) or the F statistic (with more than one instrument) from the first-stage regression. The theory is much too involved to pursue here. Instead, we describe some simple rules of thumb proposed by Stock and Yogo that are easy to implement.

The key implication of the SY work is that one needs more than just a statistical rejection of the null hypothesis in the first stage regression at the usual significance levels. For example, in equation (15.6), it is not enough to reject the null hypothesis stated in (15.7) at the 5% significance level. Using bias calculations for the instrumental variables estimator, SY recommend that one can proceed with the usual IV inference if the first-stage t statistic has absolute value larger than $\sqrt{10} \approx 3.2$. Readers will recognize this value as being well above the 95th percentile of the standard normal distribution (1.96), which is what we would use for a standard 5% significance level. This same rule of thumb applies in the multiple regression model with a single endogenous explanatory variable, $y_2$, and a single instrumental variable, $z_k$. In particular, the t statistic in testing hypothesis (15.31) should be at least 3.2 in absolute value.

SY cover the case of 2SLS, too. In this case, we must focus on the first-stage F statistic for exclusion of the instrumental variables for $y_2$, and the SY rule is $F > 10$. (Notice this is the same rule based on the t statistic when there is only one instrument, as $t^2 = F$.) For example, consider equation (15.34), where we have two instruments for $y_2$, $z_2$ and $z_3$. Then, the F statistic for the null hypothesis $H_0\colon \pi_2 = 0, \pi_3 = 0$ should have $F > 10$. Remember, this is not the overall F statistic for all of the exogenous variables in (15.34); we test only the coefficients on the proposed IVs for $y_2$, that is, the exogenous variables that do not appear in (15.22). In Example 15.5, the relevant F statistic is 124.76, which is well above 10, implying that we do not have to worry about weak instruments. (Of course, the exogeneity of the parents' education variables is in doubt.)

The rule of thumb of requiring the F statistic to be larger than 10 works well in most models, and it is easy to remember. However, like all rules of thumb involving statistical inference, it makes no sense to use 10 as a knife-edge cutoff. For example, one can probably proceed if $F = 9.94$, as it is pretty close to 10. The rule of thumb should be used as a guideline. SY have more detailed suggestions for cases where there are many instruments for $y_2$, say, five or more. The interested reader is referred to the SY paper. Most empirical researchers adopt 10 as the target value.
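Computing the first-stage F statistic for the excluded instruments takes one auxiliary regression and one restriction test. The sketch below mimics the structure of Example 15.5 on simulated data; the variable names and coefficient values are illustrative stand-ins, not the MROZ data.

```python
# Sketch of the Stock-Yogo screening rule: first-stage F for the excluded
# instruments, compared with 10. Simulated data with illustrative names.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 428
df = pd.DataFrame({"exper": rng.integers(0, 30, n).astype(float),
                   "motheduc": rng.integers(4, 17, n).astype(float),
                   "fatheduc": rng.integers(4, 17, n).astype(float)})
df["educ"] = 6 + 0.3 * df["motheduc"] + 0.3 * df["fatheduc"] + rng.normal(0, 2, n)

first = smf.ols("educ ~ exper + I(exper**2) + motheduc + fatheduc", data=df).fit()
ftest = first.f_test("motheduc = 0, fatheduc = 0")   # excluded IVs only, not all regressors
print("first-stage F:", float(ftest.fvalue))
print("weak by the F > 10 rule of thumb?", float(ftest.fvalue) < 10)
```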
15.3d Multiple Endogenous Explanatory Variables

Two stage least squares can also be used in models with more than one endogenous explanatory variable. For example, consider the model

$$y_1 = \beta_0 + \beta_1 y_2 + \beta_2 y_3 + \beta_3 z_1 + \beta_4 z_2 + \beta_5 z_3 + u_1, \qquad (15.44)$$

where $E(u_1) = 0$ and $u_1$ is uncorrelated with $z_1$, $z_2$, and $z_3$. The variables $y_2$ and $y_3$ are endogenous explanatory variables: each may be correlated with $u_1$.

To estimate (15.44) by 2SLS, we need at least two exogenous variables that do not appear in (15.44) but that are correlated with $y_2$ and $y_3$. Suppose we have two excluded exogenous variables, say, $z_4$ and $z_5$. Then, from our analysis of a single endogenous explanatory variable, we need either $z_4$ or $z_5$ to appear in each reduced form for $y_2$ and $y_3$. (As before, we can use F statistics to test this.) Although this is necessary for identification, unfortunately, it is not sufficient. Suppose that $z_4$ appears in each reduced form, but $z_5$ appears in neither. Then, we do not really have two exogenous variables partially correlated with $y_2$ and $y_3$. Two stage least squares will not produce consistent estimators of the $\beta_j$.

Generally, when we have more than one endogenous explanatory variable in a regression model, identification can fail in several complicated ways. But we can easily state a necessary condition for identification, which is called the order condition.

Order Condition for Identification of an Equation. We need at least as many excluded exogenous variables as there are included endogenous explanatory variables in the structural equation.

The order condition is simple to check, as it only involves counting endogenous and exogenous variables. The sufficient condition for identification is called the rank condition. We have seen special cases of the rank condition before, for example, in the discussion surrounding equation (15.35). A general statement of the rank condition requires matrix algebra and is beyond the scope of this text. [See Wooldridge (2010, Chapter 5).] It is even more difficult to obtain diagnostics for weak instruments.

15.3e Testing Multiple Hypotheses after 2SLS Estimation

We must be careful when testing multiple hypotheses in a model estimated by 2SLS. It is tempting to use either the sum of squared residuals or the R-squared form of the F statistic, as we learned with OLS in Chapter 4. The fact that the R-squared in 2SLS can be negative suggests that the usual way of computing F statistics might not be appropriate; this is the case. In fact, if we use the 2SLS residuals to compute the SSRs for both the restricted and unrestricted models, there is no guarantee that $\text{SSR}_r \geq \text{SSR}_{ur}$; if the reverse is true, the F statistic would be negative. It is possible to combine the sum of squared residuals from the second stage regression [such as (15.38)] with $\text{SSR}_{ur}$ to obtain a statistic with an approximate F distribution in large samples. Because many econometrics packages have simple-to-use test commands that can be used to test multiple hypotheses after 2SLS estimation, we omit the details. Davidson and MacKinnon (1993) and Wooldridge (2010, Chapter 5) contain discussions of how to compute F-type statistics for 2SLS.

15.4 IV Solutions to Errors-in-Variables Problems

In the previous sections, we presented the use of instrumental variables as a way to solve the omitted variables problem, but they can also be used to deal with the measurement error problem. As an illustration, consider the model

$$y = \beta_0 + \beta_1 x_1^* + \beta_2 x_2 + u, \qquad (15.45)$$
where y and $x_2$ are observed but $x_1^*$ is not. Let $x_1$ be an observed measurement of $x_1^*$: $x_1 = x_1^* + e_1$, where $e_1$ is the measurement error. In Chapter 9, we showed that correlation between $x_1$ and $e_1$ causes OLS, where $x_1$ is used in place of $x_1^*$, to be biased and inconsistent. We can see this by writing

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + (u - \beta_1 e_1). \qquad (15.46)$$

If the classical errors-in-variables (CEV) assumptions hold, the bias in the OLS estimator of $\beta_1$ is toward zero. Without further assumptions, we can do nothing about this.

Exploring Further 15.3: The following model explains violent crime rates, at the city level, in terms of a binary variable for whether gun control laws exist and other controls:

$$violent = \beta_0 + \beta_1 guncontrol + \beta_2 unem + \beta_3 popul + \beta_4 percblck + \beta_5 age18\_21 + \dots$$

Some researchers have estimated similar equations using variables such as the number of National Rifle Association members in the city and the number of subscribers to gun magazines as instrumental variables for guncontrol [see, for example, Kleck and Patterson (1993)]. Are these convincing instruments?

In some cases, we can use an IV procedure to solve the measurement error problem. In (15.45), we assume that u is uncorrelated with $x_1^*$, $x_1$, and $x_2$; in the CEV case, we assume that $e_1$ is uncorrelated with $x_1^*$ and $x_2$. These imply that $x_2$ is exogenous in (15.46), but that $x_1$ is correlated with $e_1$. What we need is an IV for $x_1$. Such an IV must be correlated with $x_1$, uncorrelated with u (so that it can be excluded from (15.45)), and uncorrelated with the measurement error, $e_1$.

One possibility is to obtain a second measurement on $x_1^*$, say, $z_1$. Because it is $x_1^*$ that affects y, it is only natural to assume that $z_1$ is uncorrelated with u. If we write $z_1 = x_1^* + a_1$, where $a_1$ is the measurement error in $z_1$, then we must assume that $a_1$ and $e_1$ are uncorrelated. In other words, $x_1$ and $z_1$ both mismeasure $x_1^*$, but their measurement errors are uncorrelated. Certainly, $x_1$ and $z_1$ are correlated through their dependence on $x_1^*$, so we can use $z_1$ as an IV for $x_1$.

Where might we get two measurements on a variable? Sometimes, when a group of workers is asked for their annual salary, their employers can provide a second measure. For married couples, each spouse can independently report the level of savings or family income. In the Ashenfelter and Krueger (1994) study cited in Section 14.3, each twin was asked about his or her sibling's years of education; this gives a second measure that can be used as an IV for self-reported education in a wage equation. (Ashenfelter and Krueger combined differencing and IV to account for the omitted ability problem as well; more on this in Section 15.8.) Generally, though, having two measures of an explanatory variable is rare.

An alternative is to use other exogenous variables as IVs for a potentially mismeasured variable. For example, our use of motheduc and fatheduc as IVs for educ in Example 15.5 can serve this purpose. If we think that $educ = educ^* + e_1$, then the IV estimates in Example 15.5 do not suffer from measurement error if motheduc and fatheduc are uncorrelated with the measurement error, $e_1$. This is probably more reasonable than assuming motheduc and fatheduc are uncorrelated with ability, which is contained in u in (15.45).
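The second-measurement idea is easy to see in a simulation: OLS on the mismeasured variable is attenuated, while the second measurement works as an IV. The sketch below is illustrative only; all parameter values are chosen for the example.

```python
# Sketch of the second-measurement IV: x1 and z1 both measure x1* with
# uncorrelated errors, so OLS on x1 is attenuated while z1 is a valid IV.
import numpy as np

rng = np.random.default_rng(5)
n = 100_000
xstar, e1, a1, u = rng.standard_normal((4, n))
x1 = xstar + e1                    # first, error-ridden measurement
z1 = xstar + a1                    # second measurement; Corr(a1, e1) = 0
y = 1.0 + 2.0 * xstar + u

ols = np.cov(x1, y)[0, 1] / np.var(x1)             # plim = 2*Var(x*)/[Var(x*)+Var(e1)] = 1
iv = np.cov(z1, y)[0, 1] / np.cov(z1, x1)[0, 1]    # plim = 2
print(f"OLS (attenuated): {ols:.3f}   IV with z1: {iv:.3f}")
```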
IV methods can also be adopted when using things like test scores to control for unobserved characteristics. In Section 9.2, we showed that, under certain assumptions, proxy variables can be used to solve the omitted variables problem. In Example 9.3, we used IQ as a proxy variable for unobserved ability. This simply entails adding IQ to the model and performing an OLS regression. But there is an alternative that works when IQ does not fully satisfy the proxy variable assumptions. To illustrate, write a wage equation as

$$\log(wage) = \beta_0 + \beta_1 educ + \beta_2 exper + \beta_3 exper^2 + abil + u, \qquad (15.47)$$

where we again have the omitted ability problem. But we have two test scores that are indicators of ability. We assume that the scores can be written as

$$test_1 = \gamma_1 abil + e_1$$

and

$$test_2 = \delta_1 abil + e_2,$$

where $\gamma_1 > 0$ and $\delta_1 > 0$. Since it is ability that affects wage, we can assume that $test_1$ and $test_2$ are uncorrelated with u. If we write abil in terms of the first test score and plug the result into (15.47), we get

$$\log(wage) = \beta_0 + \beta_1 educ + \beta_2 exper + \beta_3 exper^2 + \alpha_1 test_1 + (u - \alpha_1 e_1), \qquad (15.48)$$

where $\alpha_1 = 1/\gamma_1$. Now, if we assume that $e_1$ is uncorrelated with all the explanatory variables in (15.47), including abil, then $e_1$ and $test_1$ must be correlated. (Notice that educ is not endogenous in (15.48); however, $test_1$ is.) This means that estimating (15.48) by OLS will produce inconsistent estimators of the $\beta_j$ (and $\alpha_1$). Under the assumptions we have made, $test_1$ does not satisfy the proxy variable assumptions.

If we assume that $e_2$ is also uncorrelated with all the explanatory variables in (15.47) and that $e_1$ and $e_2$ are uncorrelated, then $e_1$ is uncorrelated with the second test score, $test_2$. Therefore, $test_2$ can be used as an IV for $test_1$.

Example 15.6: Using Two Test Scores as Indicators of Ability

We use the data in WAGE2 to implement the preceding procedure, where IQ plays the role of the first test score and KWW (knowledge of the world of work) is the second test score. The explanatory variables are the same as in Example 9.3: educ, exper, tenure, married, south, urban, and black. Rather than adding IQ and doing OLS, as in column (2) of Table 9.2, we add IQ and use KWW as its instrument. The coefficient on educ is .025 (se = .017). This is a low estimate, and it is not statistically different from zero. This is a puzzling finding, and it suggests that one of our assumptions fails; perhaps $e_1$ and $e_2$ are correlated.
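A simulation of the two-indicator setup in (15.48) shows the mechanics: OLS with $test_1$ as an imperfect control is inconsistent, while instrumenting $test_1$ with $test_2$ recovers the return to education. This is a sketch under assumed loadings ($\gamma_1 = 1.5$, $\delta_1 = 0.8$) and a stripped-down design; nothing here reproduces the WAGE2 results.

```python
# Sketch of the two-test-scores idea: test2 instruments test1 in (15.48).
# Simulated data; all coefficients are illustrative.
import numpy as np

rng = np.random.default_rng(2)
n = 200_000
abil, e1, e2, u = rng.standard_normal((4, n))
educ = 12 + abil + rng.standard_normal(n)          # educ correlated with ability
test1, test2 = 1.5 * abil + e1, 0.8 * abil + e2    # two noisy indicators
lwage = 1.0 + 0.06 * educ + abil + u               # (15.47) with abil unobserved

X = np.column_stack([np.ones(n), educ, test1])     # estimate (15.48)
Z = np.column_stack([np.ones(n), educ, test2])     # educ is its own IV; test2 for test1
b_ols = np.linalg.lstsq(X, lwage, rcond=None)[0]
b_iv = np.linalg.solve(Z.T @ X, Z.T @ lwage)
print("OLS educ coef:", b_ols[1].round(3))         # biased away from .06
print("IV  educ coef:", b_iv[1].round(3))          # near .06; test1 coef near 1/1.5
```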
15.5 Testing for Endogeneity and Testing Overidentifying Restrictions

In this section, we describe two important tests in the context of instrumental variables estimation.

15.5a Testing for Endogeneity

The 2SLS estimator is less efficient than OLS when the explanatory variables are exogenous; as we have seen, the 2SLS estimates can have very large standard errors. Therefore, it is useful to have a test for endogeneity of an explanatory variable that shows whether 2SLS is even necessary. Obtaining such a test is rather simple.

To illustrate, suppose we have a single suspected endogenous variable:

$$y_1 = \beta_0 + \beta_1 y_2 + \beta_2 z_1 + \beta_3 z_2 + u_1, \qquad (15.49)$$

where $z_1$ and $z_2$ are exogenous. We have two additional exogenous variables, $z_3$ and $z_4$, which do not appear in (15.49). If $y_2$ is uncorrelated with $u_1$, we should estimate (15.49) by OLS. How can we test this? Hausman (1978) suggested directly comparing the OLS and 2SLS estimates and determining whether the differences are statistically significant. After all, both OLS and 2SLS are consistent if all variables are exogenous. If 2SLS and OLS differ significantly, we conclude that $y_2$ must be endogenous (maintaining that the $z_j$ are exogenous).

It is a good idea to compute OLS and 2SLS to see if the estimates are practically different. To determine whether the differences are statistically significant, it is easier to use a regression test. This is based on estimating the reduced form for $y_2$, which in this case is

$$y_2 = \pi_0 + \pi_1 z_1 + \pi_2 z_2 + \pi_3 z_3 + \pi_4 z_4 + v_2. \qquad (15.50)$$

Now, since each $z_j$ is uncorrelated with $u_1$, $y_2$ is uncorrelated with $u_1$ if, and only if, $v_2$ is uncorrelated with $u_1$; this is what we wish to test. Write $u_1 = \delta_1 v_2 + e_1$, where $e_1$ is uncorrelated with $v_2$ and has zero mean. Then, $u_1$ and $v_2$ are uncorrelated if, and only if, $\delta_1 = 0$. The easiest way to test this is to include $v_2$ as an additional regressor in (15.49) and to do a t test. There is only one problem with implementing this: $v_2$ is not observed because it is the error term in (15.50). Because we can estimate the reduced form for $y_2$ by OLS, we can obtain the reduced form residuals, $\hat v_2$. Therefore, we estimate

$$y_1 = \beta_0 + \beta_1 y_2 + \beta_2 z_1 + \beta_3 z_2 + \delta_1 \hat v_2 + error \qquad (15.51)$$

by OLS and test $H_0\colon \delta_1 = 0$ using a t statistic. If we reject $H_0$ at a small significance level, we conclude that $y_2$ is endogenous because $v_2$ and $u_1$ are correlated.

Testing for Endogeneity of a Single Explanatory Variable:
(i) Estimate the reduced form for $y_2$ by regressing it on all exogenous variables (including those in the structural equation and the additional IVs). Obtain the residuals, $\hat v_2$.
(ii) Add $\hat v_2$ to the structural equation (which includes $y_2$) and test for significance of $\hat v_2$ using an OLS regression. If the coefficient on $\hat v_2$ is statistically different from zero, we conclude that $y_2$ is indeed endogenous. We might want to use a heteroskedasticity-robust t test.

Example 15.7: Return to Education for Working Women

We can test for endogeneity of educ in (15.40) by obtaining the residuals $\hat v_2$ from estimating the reduced form (15.41), using only working women, and including these in (15.40). When we do this, the coefficient on $\hat v_2$ is $\hat\delta_1 = .058$ and $t = 1.67$. This is moderate evidence of positive correlation between $u_1$ and $v_2$. It is probably a good idea to report both estimates because the 2SLS estimate of the return to education (6.1%) is well below the OLS estimate (10.8%).
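Steps (i) and (ii) take two OLS regressions. The sketch below runs them on simulated data built to mimic the structure of Example 15.7; the variable names are illustrative and the data are not MROZ.

```python
# Sketch of the regression-based endogeneity test, steps (i) and (ii).
# Simulated data with an unobserved ability term driving the endogeneity.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 428
abil = rng.normal(0, 1, n)                       # unobserved; causes endogeneity
df = pd.DataFrame({"exper": rng.integers(0, 30, n).astype(float),
                   "motheduc": rng.integers(4, 17, n).astype(float)})
df["educ"] = 8 + 0.4 * df["motheduc"] + abil + rng.normal(0, 1, n)
df["lwage"] = 0.1 * df["educ"] + 0.02 * df["exper"] + 0.5 * abil + rng.normal(0, 0.3, n)

rf = smf.ols("educ ~ exper + I(exper**2) + motheduc", data=df).fit()   # step (i)
df["v2_hat"] = rf.resid
cf = smf.ols("lwage ~ educ + exper + I(exper**2) + v2_hat", data=df).fit()  # step (ii)
print("t on v2_hat:", cf.tvalues["v2_hat"])      # large |t|: educ looks endogenous
```

For a heteroskedasticity-robust version, one could refit step (ii) with a robust covariance (for example, `cf.get_robustcov_results(cov_type="HC0")`).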
An interesting feature of the regression from step (ii) of the test for endogeneity is that the coefficient estimates on all explanatory variables (except, of course, $\hat v_2$) are identical to the 2SLS estimates. For example, estimating (15.51) by OLS produces the same $\hat\beta_j$ as estimating (15.49) by 2SLS. One benefit of this equivalence is that it provides an easy check on whether you have done the proper regression in testing for endogeneity. But it also gives a different, useful interpretation of 2SLS: adding $\hat v_2$ to the original equation as an explanatory variable, and applying OLS, clears up the endogeneity of $y_2$. So, when we start by estimating (15.49) by OLS, we can quantify the importance of allowing $y_2$ to be endogenous by seeing how much $\hat\beta_1$ changes when $\hat v_2$ is added to the equation. Irrespective of the outcome of the statistical tests, we can see whether the change in $\hat\beta_1$ is expected and is practically significant.

We can also test for endogeneity of multiple explanatory variables. For each suspected endogenous variable, we obtain the reduced form residuals, as in part (i). Then, we test for joint significance of these residuals in the structural equation, using an F test. Joint significance indicates that at least one suspected explanatory variable is endogenous. The number of exclusion restrictions tested is the number of suspected endogenous explanatory variables.

15.5b Testing Overidentification Restrictions

When we introduced the simple instrumental variables estimator in Section 15.1, we emphasized that the instrument must satisfy two requirements: it must be uncorrelated with the error (exogeneity) and correlated with the endogenous explanatory variable (relevance). We have now seen that, even in models with additional explanatory variables, the second requirement can be tested using a t test (with just one instrument) or an F test (when there are multiple instruments). In the context of the simple IV estimator, we noted that the exogeneity requirement cannot be tested. However, if we have more instruments than we need, we can effectively test whether some of them are uncorrelated with the structural error.

As a specific example, again consider equation (15.49) with two instrumental variables for $y_2$, $z_3$ and $z_4$. (Remember, $z_1$ and $z_2$ essentially act as their own instruments.) Because we have two instruments for $y_2$, we can estimate (15.49) using, say, only $z_3$ as an IV for $y_2$; let $\check\beta_1$ be the resulting IV estimator of $\beta_1$. Then, we can estimate (15.49) using only $z_4$ as an IV for $y_2$; call this IV estimator $\tilde\beta_1$. If all $z_j$ are exogenous, and if $z_3$ and $z_4$ are each partially correlated with $y_2$, then $\check\beta_1$ and $\tilde\beta_1$ are both consistent for $\beta_1$. Therefore, if our logic for choosing the instruments is sound, $\check\beta_1$ and $\tilde\beta_1$ should differ only by sampling error. Hausman (1978) proposed basing a test of whether $z_3$ and $z_4$ are both exogenous on the difference $\check\beta_1 - \tilde\beta_1$. Shortly, we will provide a simpler way to obtain a valid test, but, before doing so, we should understand how to interpret the outcome of the test.

If we conclude that $\check\beta_1$ and $\tilde\beta_1$ are statistically different from one another, then we have no choice but to conclude that either $z_3$, $z_4$, or both fail the exogeneity requirement. Unfortunately, we cannot know which is the case (unless we simply assert, from the beginning, that, say, $z_3$ is exogenous). For example, if $y_2$ denotes years of schooling in a log wage equation, $z_3$ is mother's education, and $z_4$ is father's education, a statistically significant difference in the two IV estimators implies that one or both of the parents' education variables are correlated with $u_1$ in (15.49). Certainly, rejecting that one's instruments are exogenous is serious and requires a new approach.
But the more serious, and subtle, problem in comparing IV estimates is that they may be similar even though both instruments fail the exogeneity requirement. In the previous example, it seems likely that if mother's education is positively correlated with $u_1$, then so is father's education. Therefore, the two IV estimates may be similar even though each is inconsistent. In effect, because the IVs in this example are chosen using similar reasoning, their separate use in IV procedures may very well lead to similar estimates that are nevertheless both inconsistent. The point is that we should not feel especially comfortable if our IV procedures pass the Hausman test.

Another problem with comparing two IV estimates is that often they may seem practically different yet, statistically, we cannot reject the null hypothesis that they are consistent for the same population parameter. For example, in estimating (15.40) by IV using motheduc as the only instrument, the coefficient on educ is .049 (.037). If we use only fatheduc as the IV for educ, the coefficient on educ is .070 (.034). [Perhaps not surprisingly, the estimate using both parents' education as IVs is in between these two, .061 (.031).] For policy purposes, the difference between 5% and 7% for the estimated return to a year of schooling is substantial. Yet, as shown in Example 15.8, the difference is not statistically significant.

The procedure of comparing different IV estimates of the same parameter is an example of testing overidentifying restrictions. The general idea is that we have more instruments than we need to estimate the parameters consistently. In the previous example, we had one more instrument than we need, and this results in one overidentifying restriction that can be tested. In the general case, suppose that we have q more instruments than we need. For example, with one endogenous explanatory variable, $y_2$, and three proposed instruments for $y_2$, we have $q = 3 - 1 = 2$ overidentifying restrictions. When q is two or more, comparing several IV estimates is cumbersome. Instead, we can easily compute a test statistic based on the 2SLS residuals. The idea is that, if all instruments are exogenous, the 2SLS residuals should be uncorrelated with the instruments, up to sampling error. But if there are $k + 1$ parameters and $k + 1 + q$ instruments, the 2SLS residuals have a zero mean and are identically uncorrelated with k linear combinations of the instruments. (This algebraic fact contains, as a special case, the fact that the OLS residuals have a zero mean and are uncorrelated with the k explanatory variables.) Therefore, the test checks whether the 2SLS residuals are correlated with q linear functions of the instruments, and we need not decide on the functions; the test does that for us automatically.

The following regression-based test is valid when the homoskedasticity assumption, listed as Assumption 2SLS.5 in the chapter appendix, holds.

Testing Overidentifying Restrictions:
(i) Estimate the structural equation by 2SLS and obtain the 2SLS residuals, $\hat u_1$.
(ii) Regress $\hat u_1$ on all exogenous variables. Obtain the R-squared, say, $R_1^2$.
(iii) Under the null hypothesis that all IVs are uncorrelated with $u_1$, $nR_1^2 \overset{a}{\sim} \chi^2_q$, where q is the number of instrumental variables from outside the model minus the total number of endogenous explanatory variables. If $nR_1^2$ exceeds (say) the 5% critical value in the $\chi^2_q$ distribution, we reject $H_0$ and conclude that at least some of the IVs are not exogenous.
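Steps (i) through (iii) can be coded in a few lines. The sketch below uses simulated data with two instruments for one endogenous regressor, so $q = 2 - 1 = 1$; the instruments are valid by construction, so $nR_1^2$ should be small. All names and values are illustrative.

```python
# Sketch of the overidentification test: regress 2SLS residuals on all
# exogenous variables and refer n*R^2 to chi-square(q). Simulated data.
import numpy as np
import statsmodels.api as sm
from scipy import stats

rng = np.random.default_rng(3)
n = 1_000
z1, z3, z4, v2, u1 = rng.standard_normal((5, n))
y2 = 0.6 * z1 + 0.7 * z3 + 0.7 * z4 + v2 + 0.5 * u1
y1 = 1.0 + 1.0 * y2 + 0.5 * z1 + u1

X = np.column_stack([np.ones(n), y2, z1])
Z = np.column_stack([np.ones(n), z1, z3, z4])      # all exogenous variables
Xhat = Z @ np.linalg.lstsq(Z, X, rcond=None)[0]    # first-stage projections
b2sls = np.linalg.lstsq(Xhat, y1, rcond=None)[0]
u1_hat = y1 - X @ b2sls                            # step (i): 2SLS residuals

aux = sm.OLS(u1_hat, Z).fit()                      # step (ii): residuals on exogenous
nr2 = n * aux.rsquared                             # step (iii)
print("nR^2 =", round(nr2, 3), " p-value =", 1 - stats.chi2.cdf(nr2, df=1))
```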
Example 15.8: Return to Education for Working Women

When we use motheduc and fatheduc as IVs for educ in (15.40), we have a single overidentifying restriction. Regressing the 2SLS residuals, $\hat u_1$, on exper, $exper^2$, motheduc, and fatheduc produces $R_1^2 = .0009$. Therefore, $nR_1^2 = 428(.0009) = .3852$, which is a very small value in a $\chi^2_1$ distribution (p-value = .535). Therefore, the parents' education variables pass the overidentification test. When we add husband's education to the IV list, we get two overidentifying restrictions, and $nR_1^2 = 1.11$ (p-value = .574). Subject to the preceding cautions, it seems reasonable to add huseduc to the IV list, as this reduces the standard error of the 2SLS estimate: the 2SLS estimate on educ using all three instruments is .080 (se = .022), so this makes educ much more significant than when huseduc is not used as an IV ($\hat\beta_{educ} = .061$, se = .031).

When q = 1, a natural question is: how does the test obtained from the regression-based procedure compare with a test based on directly comparing the estimates? In fact, the two procedures are asymptotically the same. As a practical matter, it makes sense to compute the two IV estimates to see how they differ. More generally, when $q \geq 2$, one can compare the 2SLS estimates using all IVs to the IV estimates using single instruments. By doing so, one can see if the various IV estimates are practically different, whether or not the overidentification test rejects or fails to reject.

In the previous example, we alluded to a general fact about 2SLS: under the standard 2SLS assumptions, adding instruments to the list improves the asymptotic efficiency of the 2SLS. But this requires that any new instruments are in fact exogenous (otherwise, 2SLS will not even be consistent), and it is only an asymptotic result. With the typical sample sizes available, adding too many instruments, that is, increasing the number of overidentifying restrictions, can cause severe biases in 2SLS. A detailed discussion would take us too far afield. A nice illustration is given by Bound, Jaeger, and Baker (1995), who argue that the 2SLS estimates of the return to education obtained by Angrist and Krueger (1991), using many instrumental variables, are likely to be seriously biased (even with hundreds of thousands of observations).

The overidentification test can be used whenever we have more instruments than we need. If we have just enough instruments, the model is said to be just identified, and the R-squared in part (ii) will be identically zero. As we mentioned earlier, we cannot test exogeneity of the instruments in the just identified case. The test can be made robust to heteroskedasticity of arbitrary form; for details, see Wooldridge (2010, Chapter 5).

15.6 2SLS with Heteroskedasticity

Heteroskedasticity in the context of 2SLS raises essentially the same issues as with OLS. Most importantly, it is possible to obtain standard errors and test statistics that are asymptotically robust to heteroskedasticity of arbitrary and unknown form. In fact, expression (8.4) continues to be valid if the $\hat r_{ij}$ are obtained as the residuals from regressing $\hat x_{ij}$ on the other $\hat x_{ih}$, where the "^" denotes fitted values from the first stage regressions (for endogenous explanatory variables). Wooldridge (2010, Chapter 5) contains more details. Some software packages do this routinely.
We can also test for heteroskedasticity, using an analog of the Breusch-Pagan test that we covered in Chapter 8. Let $\hat u$ denote the 2SLS residuals and let $z_1, z_2, \dots, z_m$ denote all the exogenous variables (including those used as IVs for the endogenous explanatory variables). Then, under reasonable assumptions [spelled out, for example, in Wooldridge (2010, Chapter 5)], an asymptotically valid statistic is the usual F statistic for joint significance in a regression of $\hat u^2$ on $z_1, z_2, \dots, z_m$. The null hypothesis of homoskedasticity is rejected if the $z_j$ are jointly significant.

If we apply this test to Example 15.8, using motheduc, fatheduc, and huseduc as instruments for educ, we obtain $F_{5,422} = 2.53$ and p-value = .029. This is evidence of heteroskedasticity at the 5% level. We might want to compute heteroskedasticity-robust standard errors to account for this.

If we know how the error variance depends on the exogenous variables, we can use a weighted 2SLS procedure, essentially the same as in Section 8.4. After estimating a model for $\text{Var}(u \mid z_1, z_2, \dots, z_m)$, we divide the dependent variable, the explanatory variables, and all the instrumental variables for observation i by $\sqrt{\hat h_i}$, where $\hat h_i$ denotes the estimated variance. (The constant, which is both an explanatory variable and an IV, is divided by $\sqrt{\hat h_i}$; see Section 8.4.) Then, we apply 2SLS on the transformed equation using the transformed instruments.

15.7 Applying 2SLS to Time Series Equations

When we apply 2SLS to time series data, many of the considerations that arose for OLS in Chapters 10, 11, and 12 are relevant. Write the structural equation for each time period as

$$y_t = \beta_0 + \beta_1 x_{t1} + \dots + \beta_k x_{tk} + u_t, \qquad (15.52)$$

where one or more of the explanatory variables $x_{tj}$ might be correlated with $u_t$. Denote the set of exogenous variables by $z_{t1}, \dots, z_{tm}$:

$$E(u_t) = 0,\quad \text{Cov}(z_{tj}, u_t) = 0,\quad j = 1, \dots, m.$$

Any exogenous explanatory variable is also a $z_{tj}$. For identification, it is necessary that $m \geq k$ (we have at least as many exogenous variables as explanatory variables).

Exploring Further 15.4: A model to test the effect of growth in government spending on growth in output is

$$gGDP_t = \beta_0 + \beta_1 gGOV_t + \beta_2 INVRAT_t + \beta_3 gLAB_t + u_t,$$

where g indicates growth, GDP is real gross domestic product, GOV is real government spending, INVRAT is the ratio of gross domestic investment to GDP, and LAB is the size of the labor force. [See equation (6) in Ram (1986).] Under what assumptions would a dummy variable indicating whether the president in year t − 1 is a Republican be a suitable IV for $gGOV_t$?

The mechanics of 2SLS are identical for time series or cross-sectional data, but for time series data the statistical properties of 2SLS depend on the trending and correlation properties of the underlying sequences. In particular, we must be careful to include trends if we have trending dependent or explanatory variables. Since a time trend is exogenous, it can always serve as its own instrumental variable. The same is true of seasonal dummy variables, if monthly or quarterly data are used.

Series that have strong persistence (have unit roots) must be used with care, just as with OLS. Often, differencing the equation is warranted before estimation, and this applies to the instruments as well.

Under analogs of the assumptions in Chapter 11 for the asymptotic properties of OLS, 2SLS using time series data is consistent and asymptotically normally distributed. In fact, if we replace the explanatory variables with the instrumental variables in stating the assumptions, we only need to add the identification assumptions for 2SLS. For example, the homoskedasticity assumption is stated
If we know how the error variance depends on the exogenous variables, we can use a weighted 2SLS procedure, essentially the same as in Section 8.4. After estimating a model for $\mathrm{Var}(u|z_1, z_2, \dots, z_m)$, we divide the dependent variable, the explanatory variables, and all the instrumental variables for observation i by $\sqrt{\hat{h}_i}$, where $\hat{h}_i$ denotes the estimated variance. (The constant, which is both an explanatory variable and an IV, is divided by $\sqrt{\hat{h}_i}$; see Section 8.4.) Then, we apply 2SLS on the transformed equation using the transformed instruments.

15.7 Applying 2SLS to Time Series Equations

When we apply 2SLS to time series data, many of the considerations that arose for OLS in Chapters 10, 11, and 12 are relevant. Write the structural equation for each time period as

$y_t = \beta_0 + \beta_1 x_{t1} + \dots + \beta_k x_{tk} + u_t$,   (15.52)

where one or more of the explanatory variables $x_{tj}$ might be correlated with $u_t$. Denote the set of exogenous variables by $z_{t1}, \dots, z_{tm}$:

$E(u_t) = 0$, $\mathrm{Cov}(z_{tj}, u_t) = 0$, $j = 1, \dots, m$.

Any exogenous explanatory variable is also a $z_{tj}$. For identification, it is necessary that $m \ge k$ (we have as many exogenous variables as explanatory variables).

The mechanics of 2SLS are identical for time series or cross-sectional data, but for time series data the statistical properties of 2SLS depend on the trending and correlation properties of the underlying sequences. In particular, we must be careful to include trends if we have trending dependent or explanatory variables. Since a time trend is exogenous, it can always serve as its own instrumental variable. The same is true of seasonal dummy variables, if monthly or quarterly data are used.

Series that have strong persistence (have unit roots) must be used with care, just as with OLS. Often, differencing the equation is warranted before estimation, and this applies to the instruments as well.

Under analogs of the assumptions in Chapter 11 for the asymptotic properties of OLS, 2SLS using time series data is consistent and asymptotically normally distributed. In fact, if we replace the explanatory variables with the instrumental variables in stating the assumptions, we only need to add the identification assumptions for 2SLS. For example, the homoskedasticity assumption is stated as

$E(u_t^2 | z_{t1}, \dots, z_{tm}) = \sigma^2$,   (15.53)

and the no serial correlation assumption is stated as

$E(u_t u_s | z_t, z_s) = 0$ for all $t \ne s$,   (15.54)

where $z_t$ denotes all exogenous variables at time t. A full statement of the assumptions is given in the chapter appendix. We will provide examples of 2SLS for time series problems in Chapter 16; see also Computer Exercise C4.

Exploring Further 15.4
A model to test the effect of growth in government spending on growth in output is
gGDP$_t$ = $\beta_0$ + $\beta_1$ gGOV$_t$ + $\beta_2$ INVRAT$_t$ + $\beta_3$ gLAB$_t$ + $u_t$,
where g indicates growth, GDP is real gross domestic product, GOV is real government spending, INVRAT is the ratio of gross domestic investment to GDP, and LAB is the size of the labor force. [See equation (6) in Ram (1986).] Under what assumptions would a dummy variable indicating whether the president in year t − 1 is a Republican be a suitable IV for gGOV$_t$?

As in the case of OLS, the no serial correlation assumption can often be violated with time series data. Fortunately, it is very easy to test for AR(1) serial correlation. If we write $u_t = \rho u_{t-1} + e_t$ and plug this into equation (15.52), we get

$y_t = \beta_0 + \beta_1 x_{t1} + \dots + \beta_k x_{tk} + \rho u_{t-1} + e_t, \quad t \ge 2.$   (15.55)

To test $H_0: \rho = 0$, we must replace $u_{t-1}$ with the 2SLS residuals, $\hat{u}_{t-1}$. Further, if $x_{tj}$ is endogenous in (15.52), then it is endogenous in (15.55), so we still need to use an IV. Because $e_t$ is uncorrelated with all past values of $u_t$, $\hat{u}_{t-1}$ can be used as its own instrument.

Testing for AR(1) Serial Correlation after 2SLS:
(i) Estimate (15.52) by 2SLS and obtain the 2SLS residuals, $\hat{u}_t$.
(ii) Estimate
$y_t = \beta_0 + \beta_1 x_{t1} + \dots + \beta_k x_{tk} + \rho \hat{u}_{t-1} + \text{error}_t, \quad t = 2, \dots, n,$
by 2SLS, using the same instruments from part (i), in addition to $\hat{u}_{t-1}$. Use the t statistic on $\hat{\rho}$ to test $H_0: \rho = 0$.

As with the OLS version of this test from Chapter 12, the t statistic only has asymptotic justification, but it tends to work well in practice. A heteroskedasticity-robust version can be used to guard against heteroskedasticity. Further, lagged residuals can be added to the equation to test for higher forms of serial correlation, using a joint F test.
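The test is easy to carry out with any 2SLS routine. Below is a minimal numpy sketch under the homoskedasticity assumption; y, X, and Z are hypothetical arrays (X and Z each contain a constant first column, Z collects all exogenous variables including the instruments, and rows are ordered by time).

```python
import numpy as np

def tsls(y, X, Z):
    """2SLS estimates, homoskedasticity-based SEs, and residuals."""
    Xhat = Z @ np.linalg.lstsq(Z, X, rcond=None)[0]   # first-stage fits
    beta = np.linalg.solve(Xhat.T @ Xhat, Xhat.T @ y)
    u = y - X @ beta                                  # residuals use original X
    sigma2 = u @ u / (len(y) - X.shape[1])
    se = np.sqrt(sigma2 * np.diag(np.linalg.inv(Xhat.T @ Xhat)))
    return beta, se, u

def ar1_test_after_tsls(y, X, Z):
    """Steps (i) and (ii): add u_{t-1} to the regressors and to the
    instrument list (it acts as its own IV); return the t statistic
    on rho-hat for H0: rho = 0."""
    _, _, u = tsls(y, X, Z)
    Xa = np.column_stack([X[1:], u[:-1]])   # drop t = 1
    Za = np.column_stack([Z[1:], u[:-1]])
    beta, se, _ = tsls(y[1:], Xa, Za)
    return beta[-1] / se[-1]
```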
What happens if we detect serial correlation? Some econometrics packages will compute standard errors that are robust to fairly general forms of serial correlation and heteroskedasticity. This is a nice, simple way to go if your econometrics package does this. The computations are very similar to those in Section 12.5 for OLS; see Wooldridge (1995) for formulas and other computational methods.

An alternative is to use the AR(1) model and correct for serial correlation. The procedure is similar to that for OLS and places additional restrictions on the instrumental variables. The quasi-differenced equation is the same as in equation (12.32):

$\tilde{y}_t = \beta_0(1 - \rho) + \beta_1 \tilde{x}_{t1} + \dots + \beta_k \tilde{x}_{tk} + e_t, \quad t \ge 2,$   (15.56)

where $\tilde{x}_{tj} = x_{tj} - \rho x_{t-1,j}$ (and $\tilde{y}_t = y_t - \rho y_{t-1}$). (We can use the t = 1 observation just as in Section 12.3, but we omit that for simplicity here.) The question is: what can we use as instrumental variables? It seems natural to use the quasi-differenced instruments, $\tilde{z}_{tj} = z_{tj} - \rho z_{t-1,j}$. This only works, however, if in (15.52) the original error $u_t$ is uncorrelated with the instruments at times t, t − 1, and t + 1. That is, the instrumental variables must be strictly exogenous in (15.52). This rules out lagged dependent variables as IVs, for example. It also eliminates cases where future movements in the IVs react to current and past changes in the error, $u_t$.

2SLS with AR(1) Errors:
(i) Estimate (15.52) by 2SLS and obtain the 2SLS residuals, $\hat{u}_t$, t = 1, 2, ..., n.
(ii) Obtain $\hat{\rho}$ from the regression of $\hat{u}_t$ on $\hat{u}_{t-1}$, t = 2, ..., n, and construct the quasi-differenced variables $\tilde{y}_t = y_t - \hat{\rho} y_{t-1}$, $\tilde{x}_{tj} = x_{tj} - \hat{\rho} x_{t-1,j}$, and $\tilde{z}_{tj} = z_{tj} - \hat{\rho} z_{t-1,j}$ for $t \ge 2$. (Remember, in most cases, some of the IVs will also be explanatory variables.)
(iii) Estimate (15.56) (where $\rho$ is replaced with $\hat{\rho}$) by 2SLS, using the $\tilde{z}_{tj}$ as the instruments.

Assuming that (15.56) satisfies the 2SLS assumptions in the chapter appendix, the usual 2SLS test statistics are asymptotically valid.

We can also use the first time period, as in Prais-Winsten estimation of the model with exogenous explanatory variables. The transformed variables in the first time period (the dependent variable, explanatory variables, and instrumental variables) are obtained simply by multiplying all first-period values by $(1 - \hat{\rho}^2)^{1/2}$. (See also Section 12.3.)
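A minimal sketch of this procedure, reusing the tsls() helper from the sketch above (and omitting the Prais-Winsten treatment of the first observation for simplicity):

```python
import numpy as np

def tsls_ar1(y, X, Z):
    """2SLS with the AR(1) correction of Section 15.7. X and Z each
    include a constant first column, and the instruments in Z must be
    strictly exogenous for the quasi-differencing to be valid."""
    # (i) 2SLS residuals from the levels equation
    _, _, u = tsls(y, X, Z)
    # (ii) rho-hat from regressing u_t on u_{t-1} (no intercept)
    rho = (u[:-1] @ u[1:]) / (u[:-1] @ u[:-1])
    # quasi-difference everything for t >= 2; the constant columns
    # become (1 - rho), matching the intercept in (15.56)
    yt = y[1:] - rho * y[:-1]
    Xt = X[1:] - rho * X[:-1]
    Zt = Z[1:] - rho * Z[:-1]
    # (iii) 2SLS on the transformed equation with transformed IVs
    beta, se, _ = tsls(yt, Xt, Zt)
    return beta, se, rho
```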
15.8 Applying 2SLS to Pooled Cross Sections and Panel Data

Applying instrumental variables methods to independently pooled cross sections raises no new difficulties. As with models estimated by OLS, we should often include time period dummy variables to allow for aggregate time effects. These dummy variables are exogenous (because the passage of time is exogenous) and so they act as their own instruments.

Example 15.9 Effect of Education on Fertility

In Example 13.1, we used the pooled cross section in FERTIL1 to estimate the effect of education on women's fertility, controlling for various other factors. As in Sander (1992), we allow for the possibility that educ is endogenous in the equation. As instrumental variables for educ, we use mother's and father's education levels (meduc, feduc). The 2SLS estimate of $\beta_{educ}$ is −.153 (se = .039), compared with the OLS estimate −.128 (se = .018). The 2SLS estimate shows a somewhat larger effect of education on fertility, but the 2SLS standard error is over twice as large as the OLS standard error. (In fact, the 95% confidence interval based on 2SLS easily contains the OLS estimate.) The OLS and 2SLS estimates of $\beta_{educ}$ are not statistically different, as can be seen by testing for endogeneity of educ as in Section 15.5: when the reduced form residual, $\hat{v}_2$, is included with the other regressors in Table 13.1 (including educ), its t statistic is .702, which is not significant at any reasonable level. Therefore, in this case, we conclude that the difference between 2SLS and OLS could be entirely due to sampling error.

Instrumental variables estimation can be combined with panel data methods, particularly first differencing, to estimate parameters consistently in the presence of unobserved effects and endogeneity in one or more time-varying explanatory variables. The following simple example illustrates this combination of methods.

Example 15.10 Job Training and Worker Productivity

Suppose we want to estimate the effect of another hour of job training on worker productivity. For the two years 1987 and 1988, consider the simple panel data model

$\log(scrap_{it}) = \beta_0 + \delta_0 d88_t + \beta_1 hrsemp_{it} + a_i + u_{it}, \quad t = 1, 2,$

where $scrap_{it}$ is firm i's scrap rate in year t and $hrsemp_{it}$ is hours of job training per employee. As usual, we allow different year intercepts and a constant, unobserved firm effect, $a_i$.

For the reasons discussed in Section 13.2, we might be concerned that $hrsemp_{it}$ is correlated with $a_i$, the latter of which contains unmeasured worker ability. As before, we difference to remove $a_i$:

$\Delta \log(scrap_i) = \delta_0 + \beta_1 \Delta hrsemp_i + \Delta u_i.$   (15.57)

Normally, we would estimate this equation by OLS. But what if $\Delta u_i$ is correlated with $\Delta hrsemp_i$? For example, a firm might hire more skilled workers, while at the same time reducing the level of job training. In this case, we need an instrumental variable for $\Delta hrsemp_i$. Generally, such an IV would be hard to find, but we can exploit the fact that some firms received job training grants in 1988. If we assume that grant designation is uncorrelated with $\Delta u_i$ (something that is reasonable, because the grants were given at the beginning of 1988), then $\Delta grant_i$ is valid as an IV, provided $\Delta hrsemp$ and $\Delta grant$ are correlated. Using the data in JTRAIN differenced between 1987 and 1988, the first stage regression is

$\widehat{\Delta hrsemp} = .51 + 27.88\ \Delta grant$
(1.56)  (3.13)
$n = 45,\ R^2 = .392.$

This confirms that the change in hours of job training per employee is strongly positively related to receiving a job training grant in 1988. In fact, receiving a job training grant increased per-employee training by almost 28 hours, and grant designation accounted for almost 40% of the variation in $\Delta hrsemp$. Two stage least squares estimation of (15.57) gives

$\widehat{\Delta \log(scrap)} = -.033 - .014\ \Delta hrsemp$
(.127)  (.008)
$n = 45,\ R^2 = .016.$

This means that 10 more hours of job training per worker are estimated to reduce the scrap rate by about 14%. (For the firms in the sample, the average amount of job training in 1988 was about 17 hours per worker, with a minimum of zero and a maximum of 88.)

For comparison, OLS estimation of (15.57) gives $\hat{\beta}_1 = -.0076$ (se = .0045), so the 2SLS estimate of $\beta_1$ is almost twice as large in magnitude and is slightly more statistically significant.

When $T \ge 3$, the differenced equation may contain serial correlation. The same test and correction for AR(1) serial correlation from Section 15.7 can be used, where all regressions are pooled across i as well as t. Because we do not want to lose an entire time period, the Prais-Winsten transformation should be used for the initial time period.

Unobserved effects models containing lagged dependent variables also require IV methods for consistent estimation. The reason is that, after differencing, $\Delta y_{i,t-1}$ is correlated with $\Delta u_{it}$ because $y_{i,t-1}$ and $u_{i,t-1}$ are correlated. We can use two or more lags of y as IVs for $\Delta y_{i,t-1}$. [See Wooldridge (2010, Chapter 11) for details.]
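The mechanics of Example 15.10 amount to first differencing and then running IV in the just identified case. A minimal sketch, with hypothetical per-firm arrays named after the JTRAIN variables:

```python
import numpy as np

def fd_iv(lscrap87, lscrap88, hrsemp87, hrsemp88, grant87, grant88):
    """Difference out the firm effect a_i, then use the change in
    grant receipt as an IV for the change in training hours."""
    dy = lscrap88 - lscrap87      # D log(scrap)
    dx = hrsemp88 - hrsemp87      # D hrsemp
    dz = grant88 - grant87        # D grant
    n = len(dy)
    X = np.column_stack([np.ones(n), dx])
    Z = np.column_stack([np.ones(n), dz])
    # Just identified IV estimator: (Z'X)^{-1} Z'y
    return np.linalg.solve(Z.T @ X, Z.T @ dy)   # [delta0, beta1]
```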
Instrumental variables after differencing can be used on matched pairs samples as well. Ashenfelter and Krueger (1994) differenced the wage equation across twins to eliminate unobserved ability:

$\log(wage_2) - \log(wage_1) = \delta_0 + \beta_1(educ_{2,2} - educ_{1,1}) + (u_2 - u_1),$

where $educ_{1,1}$ is years of schooling for the first twin as reported by the first twin and $educ_{2,2}$ is years of schooling for the second twin as reported by the second twin. To account for possible measurement error in the self-reported schooling measures, Ashenfelter and Krueger used $(educ_{2,1} - educ_{1,2})$ as an IV for $(educ_{2,2} - educ_{1,1})$, where $educ_{2,1}$ is years of schooling for the second twin as reported by the first twin and $educ_{1,2}$ is years of schooling for the first twin as reported by the second twin. The IV estimate of $\beta_1$ is .167 (t = 3.88), compared with the OLS estimate on the first differences of .092 (t = 3.83) [see Ashenfelter and Krueger (1994, Table 3)].

Summary

In Chapter 15, we have introduced the method of instrumental variables as a way to estimate the parameters in a linear model consistently when one or more explanatory variables are endogenous. An instrumental variable must have two properties: (1) it must be exogenous, that is, uncorrelated with the error term of the structural equation; (2) it must be partially correlated with the endogenous explanatory variable. Finding a variable with these two properties is usually challenging.

The method of two stage least squares, which allows for more instrumental variables than we have explanatory variables, is used routinely in the empirical social sciences. When used properly, it can allow us to estimate ceteris paribus effects in the presence of endogenous explanatory variables. This is true in cross-sectional, time series, and panel data applications. But when instruments are poor, which means they are correlated with the error term, only weakly correlated with the endogenous explanatory variable, or both, then 2SLS can be worse than OLS.

When we have valid instrumental variables, we can test whether an explanatory variable is endogenous, using the test in Section 15.5. In addition, though we can never test whether all IVs are exogenous, we can test that at least some of them are, assuming that we have more instruments than we need for consistent estimation (that is, the model is overidentified). Heteroskedasticity and serial correlation can be tested for and dealt with using methods similar to the case of models with exogenous explanatory variables.

In this chapter, we used omitted variables and measurement error to illustrate the method of instrumental variables. IV methods are also indispensable for simultaneous equations models, which we will cover in Chapter 16.

Key Terms

Endogenous Explanatory Variables; Errors-in-Variables; Exclusion Restrictions; Exogenous Explanatory Variables; Exogenous Variables; Identification; Instrument; Instrumental Variable; Instrumental Variables (IV) Estimator; Instrument Exogeneity; Instrument Relevance; Natural Experiment; Omitted Variables; Order Condition; Overidentifying Restrictions; Rank Condition; Reduced Form Equation; Structural Equation; Two Stage Least Squares (2SLS) Estimator; Weak Instruments
Problems

1. Consider a simple model to estimate the effect of personal computer (PC) ownership on college grade point average for graduating seniors at a large public university:
GPA = $\beta_0$ + $\beta_1$ PC + u,
where PC is a binary variable indicating PC ownership.
(i) Why might PC ownership be correlated with u?
(ii) Explain why PC is likely to be related to parents' annual income. Does this mean parental income is a good IV for PC? Why or why not?
(iii) Suppose that, four years ago, the university gave grants to buy computers to roughly one-half of the incoming students, and the students who received grants were randomly chosen. Carefully explain how you would use this information to construct an instrumental variable for PC.

2. Suppose that you wish to estimate the effect of class attendance on student performance, as in Example 6.3. A basic model is
stndfnl = $\beta_0$ + $\beta_1$ atndrte + $\beta_2$ priGPA + $\beta_3$ ACT + u,
where the variables are defined as in Chapter 6.
(i) Let dist be the distance from the students' living quarters to the lecture hall. Do you think dist is uncorrelated with u?
(ii) Assuming that dist and u are uncorrelated, what other assumption must dist satisfy to be a valid IV for atndrte?
(iii) Suppose, as in equation (6.18), we add the interaction term priGPA·atndrte:
stndfnl = $\beta_0$ + $\beta_1$ atndrte + $\beta_2$ priGPA + $\beta_3$ ACT + $\beta_4$ priGPA·atndrte + u.
If atndrte is correlated with u, then, in general, so is priGPA·atndrte. What might be a good IV for priGPA·atndrte? [Hint: If E(u|priGPA, ACT, dist) = 0, as happens when priGPA, ACT, and dist are all exogenous, then any function of priGPA and dist is uncorrelated with u.]

3. Consider the simple regression model y = $\beta_0$ + $\beta_1$ x + u, and let z be a binary instrumental variable for x. Use (15.10) to show that the IV estimator $\hat{\beta}_1$ can be written as
$\hat{\beta}_1 = (\bar{y}_1 - \bar{y}_0)/(\bar{x}_1 - \bar{x}_0),$
where $\bar{y}_0$ and $\bar{x}_0$ are the sample averages of $y_i$ and $x_i$ over the part of the sample with $z_i = 0$, and where $\bar{y}_1$ and $\bar{x}_1$ are the sample averages of $y_i$ and $x_i$ over the part of the sample with $z_i = 1$. This estimator, known as a grouping estimator, was first suggested by Wald (1940).

4. Suppose that, for a given state in the United States, you wish to use annual time series data to estimate the effect of the state-level minimum wage on the employment of those 18 to 25 years old (EMP). A simple model is
gEMP$_t$ = $\beta_0$ + $\beta_1$ gMIN$_t$ + $\beta_2$ gPOP$_t$ + $\beta_3$ gGSP$_t$ + $\beta_4$ gGDP$_t$ + $u_t$,
where MIN$_t$ is the minimum wage in real dollars, POP$_t$ is the population from 18 to 25 years old, GSP$_t$ is gross state product, and GDP$_t$ is U.S. gross domestic product. The g prefix indicates the growth rate from year t − 1 to year t, which would typically be approximated by the difference in the logs.
(i) If we are worried that the state chooses its minimum wage partly based on unobserved (to us) factors that affect youth employment, what is the problem with OLS estimation?
(ii) Let USMIN$_t$ be the U.S. minimum wage, which is also measured in real terms. Do you think gUSMIN$_t$ is uncorrelated with $u_t$?
(iii) By law, any state's minimum wage must be at least as large as the U.S. minimum. Explain why this makes gUSMIN$_t$ a potential IV candidate for gMIN$_t$.
5. Refer to equations (15.19) and (15.20). Assume that $\sigma_u = \sigma_x$, so that the population variation in the error term is the same as it is in x. Suppose that the instrumental variable, z, is slightly correlated with u: Corr(z, u) = .1. Suppose also that z and x have a somewhat stronger correlation: Corr(z, x) = .2.
(i) What is the asymptotic bias in the IV estimator?
(ii) How much correlation would have to exist between x and u before OLS has more asymptotic bias than 2SLS?

6. (i) In the model with one endogenous explanatory variable, one exogenous explanatory variable, and one extra exogenous variable, take the reduced form for $y_2$, (15.26), and plug it into the structural equation (15.22). This gives the reduced form for $y_1$:
$y_1 = \alpha_0 + \alpha_1 z_1 + \alpha_2 z_2 + v_1.$
Find the $\alpha_j$ in terms of the $\beta_j$ and the $\pi_j$.
(ii) Find the reduced form error, $v_1$, in terms of $u_1$, $v_2$, and the parameters.
(iii) How would you consistently estimate the $\alpha_j$?

7. The following is a simple model to measure the effect of a school choice program on standardized test performance [see Rouse (1998) for motivation and Computer Exercise C11 for an analysis of a subset of Rouse's data]:
score = $\beta_0$ + $\beta_1$ choice + $\beta_2$ faminc + $u_1$,
where score is the score on a statewide test, choice is a binary variable indicating whether a student attended a choice school in the last year, and faminc is family income. The IV for choice is grant, the dollar amount granted to students to use for tuition at choice schools. The grant amount differed by family income level, which is why we control for faminc in the equation.
(i) Even with faminc in the equation, why might choice be correlated with $u_1$?
(ii) If, within each income class, the grant amounts were assigned randomly, is grant uncorrelated with $u_1$?
(iii) Write the reduced form equation for choice. What is needed for grant to be partially correlated with choice?
(iv) Write the reduced form equation for score. Explain why this is useful. (Hint: How do you interpret the coefficient on grant?)

8. Suppose you want to test whether girls who attend a girls' high school do better in math than girls who attend coed schools. You have a random sample of senior high school girls from a state in the United States, and score is the score on a standardized math test. Let girlhs be a dummy variable indicating whether a student attends a girls' high school.
(i) What other factors would you control for in the equation? (You should be able to reasonably collect data on these factors.)
(ii) Write an equation relating score to girlhs and the other factors you listed in part (i).
(iii) Suppose that parental support and motivation are unmeasured factors in the error term in part (ii). Are these likely to be correlated with girlhs? Explain.
(iv) Discuss the assumptions needed for the number of girls' high schools within a 20-mile radius of a girl's home to be a valid IV for girlhs.
(v) Suppose that, when you estimate the reduced form for girlhs, you find that the coefficient on numghs (the number of girls' high schools within a 20-mile radius) is negative and statistically significant. Would you feel comfortable proceeding with IV estimation where numghs is used as an IV for girlhs? Explain.
9. Suppose that, in equation (15.8), you do not have a good instrumental variable candidate for skipped. But you have two other pieces of information on students: combined SAT score and cumulative GPA prior to the semester. What would you do instead of IV estimation?

10. In a recent article, Evans and Schwab (1995) studied the effects of attending a Catholic high school on the probability of attending college. For concreteness, let college be a binary variable equal to unity if a student attends college, and zero otherwise. Let CathHS be a binary variable equal to one if the student attends a Catholic high school. A linear probability model is
college = $\beta_0$ + $\beta_1$ CathHS + other factors + u,
where the other factors include gender, race, family income, and parental education.
(i) Why might CathHS be correlated with u?
(ii) Evans and Schwab have data on a standardized test score taken when each student was a sophomore. What can be done with this variable to improve the ceteris paribus estimate of attending a Catholic high school?
(iii) Let CathRel be a binary variable equal to one if the student is Catholic. Discuss the two requirements needed for this to be a valid IV for CathHS in the preceding equation. Which of these can be tested?
(iv) Not surprisingly, being Catholic has a significant positive effect on attending a Catholic high school. Do you think CathRel is a convincing instrument for CathHS?

11. Consider a simple time series model where the explanatory variable has classical measurement error:
$y_t = \beta_0 + \beta_1 x_t^* + u_t$   (15.58)
$x_t = x_t^* + e_t,$
where $u_t$ has zero mean and is uncorrelated with $x_t^*$ and $e_t$. We observe $y_t$ and $x_t$ only. Assume that $e_t$ has zero mean and is uncorrelated with $x_t^*$, and that $x_t^*$ also has a zero mean (this last assumption is only to simplify the algebra).
(i) Write $x_t^* = x_t - e_t$ and plug this into (15.58). Show that the error term in the new equation, say, $v_t$, is negatively correlated with $x_t$ if $\beta_1 > 0$. What does this imply about the OLS estimator of $\beta_1$ from the regression of $y_t$ on $x_t$?
(ii) In addition to the previous assumptions, assume that $u_t$ and $e_t$ are uncorrelated with all past values of $x_t^*$ and $e_t$; in particular, with $x_{t-1}^*$ and $e_{t-1}$. Show that $E(x_{t-1} v_t) = 0$, where $v_t$ is the error term in the model from part (i).
(iii) Are $x_t$ and $x_{t-1}$ likely to be correlated? Explain.
(iv) What do parts (ii) and (iii) suggest as a useful strategy for consistently estimating $\beta_0$ and $\beta_1$?

Computer Exercises

C1 Use the data in WAGE2 for this exercise.
(i) In Example 15.2, if sibs is used as an instrument for educ, the IV estimate of the return to education is .122. To convince yourself that using sibs as an IV for educ is not the same as just plugging sibs in for educ and running an OLS regression, run the regression of log(wage) on sibs and explain your findings.
(ii) The variable brthord is birth order (brthord is one for a first-born child, two for a second-born child, and so on). Explain why educ and brthord might be negatively correlated. Regress educ on brthord to determine whether there is a statistically significant negative correlation.
(iii) Use brthord as an IV for educ in equation (15.1). Report and interpret the results.
(iv) Now, suppose that we include number of siblings as an explanatory variable in the wage equation; this controls for family background, to some extent:
log(wage) = $\beta_0$ + $\beta_1$ educ + $\beta_2$ sibs + u.
Suppose that we want to use brthord as an IV for educ, assuming that sibs is exogenous. The reduced form for educ is
educ = $\pi_0$ + $\pi_1$ sibs + $\pi_2$ brthord + v.
State and test the identification assumption.
(v) Estimate the equation from part (iv), using brthord as an IV for educ (and sibs as its own IV). Comment on the standard errors for $\hat{\beta}_{educ}$ and $\hat{\beta}_{sibs}$.
(vi) Using the fitted values from part (iv), $\widehat{educ}$, compute the correlation between $\widehat{educ}$ and sibs. Use this result to explain your findings from part (v).

C2 The data in FERTIL2 include, for women in Botswana during 1988, information on number of children, years of education, age, and religious and economic status variables.
(i) Estimate the model
children = $\beta_0$ + $\beta_1$ educ + $\beta_2$ age + $\beta_3$ age$^2$ + u
by OLS and interpret the estimates. In particular, holding age fixed, what is the estimated effect of another year of education on fertility? If 100 women receive another year of education, how many fewer children are they expected to have?
(ii) The variable frsthalf is a dummy variable equal to one if the woman was born during the first six months of the year. Assuming that frsthalf is uncorrelated with the error term from part (i), show that frsthalf is a reasonable IV candidate for educ. (Hint: You need to do a regression.)
(iii) Estimate the model from part (i) by using frsthalf as an IV for educ. Compare the estimated effect of education with the OLS estimate from part (i).
(iv) Add the binary variables electric, tv, and bicycle to the model and assume these are exogenous. Estimate the equation by OLS and 2SLS and compare the estimated coefficients on educ. Interpret the coefficient on tv and explain why television ownership has a negative effect on fertility.
C3 Use the data in CARD for this exercise.
(i) The equation we estimated in Example 15.4 can be written as
log(wage) = $\beta_0$ + $\beta_1$ educ + $\beta_2$ exper + ... + u,
where the other explanatory variables are listed in Table 15.1. In order for IV to be consistent, the IV for educ, nearc4, must be uncorrelated with u. Could nearc4 be correlated with things in the error term, such as unobserved ability? Explain.
(ii) For a subsample of the men in the data set, an IQ score is available. Regress IQ on nearc4 to check whether average IQ scores vary by whether the man grew up near a four-year college. What do you conclude?
(iii) Now, regress IQ on nearc4, smsa66, and the 1966 regional dummy variables reg662, ..., reg669. Are IQ and nearc4 related after the geographic dummy variables have been partialled out? Reconcile this with your findings from part (ii).
(iv) From parts (ii) and (iii), what do you conclude about the importance of controlling for smsa66 and the 1966 regional dummies in the log(wage) equation?

C4 Use the data in INTDEF for this exercise. A simple equation relating the three-month T-bill rate to the inflation rate (constructed from the Consumer Price Index) is
i3$_t$ = $\beta_0$ + $\beta_1$ inf$_t$ + $u_t$.
(i) Estimate this equation by OLS, omitting the first time period for later comparisons. Report the results in the usual form.
(ii) Some economists feel that the Consumer Price Index mismeasures the true rate of inflation, so that the OLS from part (i) suffers from measurement error bias. Reestimate the equation from part (i), using inf$_{t-1}$ as an IV for inf$_t$. How does the IV estimate of $\beta_1$ compare with the OLS estimate?
(iii) Now, first difference the equation:
$\Delta$i3$_t$ = $\beta_0$ + $\beta_1$ $\Delta$inf$_t$ + $\Delta u_t$.
Estimate this by OLS and compare the estimate of $\beta_1$ with the previous estimates.
(iv) Can you use $\Delta$inf$_{t-1}$ as an IV for $\Delta$inf$_t$ in the differenced equation in part (iii)? Explain. (Hint: Are $\Delta$inf$_t$ and $\Delta$inf$_{t-1}$ sufficiently correlated?)

C5 Use the data in CARD for this exercise.
(i) In Table 15.1, the difference between the IV and OLS estimates of the return to education is economically important. Obtain the reduced form residuals, $\hat{v}_2$, from the reduced form regression educ on nearc4, exper, exper$^2$, black, smsa, south, smsa66, reg662, ..., reg669 (see Table 15.1). Use these to test whether educ is exogenous; that is, determine if the difference between OLS and IV is statistically significant.
(ii) Estimate the equation by 2SLS, adding nearc2 as an instrument. Does the coefficient on educ change much?
(iii) Test the single overidentifying restriction from part (ii).
C6 Use the data in MURDER for this exercise. The variable mrdrte is the murder rate, that is, the number of murders per 100,000 people. The variable exec is the total number of prisoners executed for the current and prior two years; unem is the state unemployment rate.
(i) How many states executed at least one prisoner in 1991, 1992, or 1993? Which state had the most executions?
(ii) Using the two years 1990 and 1993, do a pooled regression of mrdrte on d93, exec, and unem. What do you make of the coefficient on exec?
(iii) Using the changes from 1990 to 1993 only (for a total of 51 observations), estimate the equation
$\Delta$mrdrte = $\delta_0$ + $\beta_1$ $\Delta$exec + $\beta_2$ $\Delta$unem + $\Delta$u
by OLS and report the results in the usual form. Now, does capital punishment appear to have a deterrent effect?
(iv) The change in executions may be at least partly related to changes in the expected murder rate, so that $\Delta$exec is correlated with $\Delta$u in part (iii). It might be reasonable to assume that $\Delta$exec$_{-1}$ is uncorrelated with $\Delta$u. (After all, $\Delta$exec$_{-1}$ depends on executions that occurred three or more years ago.) Regress $\Delta$exec on $\Delta$exec$_{-1}$ to see if they are sufficiently correlated; interpret the coefficient on $\Delta$exec$_{-1}$.
(v) Reestimate the equation from part (iii), using $\Delta$exec$_{-1}$ as an IV for $\Delta$exec. Assume that $\Delta$unem is exogenous. How do your conclusions change from part (iii)?

C7 Use the data in PHILLIPS for this exercise.
(i) In Example 11.5, we estimated an expectations augmented Phillips curve of the form
$\Delta$inf$_t$ = $\beta_0$ + $\beta_1$ unem$_t$ + $e_t$,
where $\Delta$inf$_t$ = inf$_t$ − inf$_{t-1}$. In estimating this equation by OLS, we assumed that the supply shock, $e_t$, was uncorrelated with unem$_t$. If this is false, what can be said about the OLS estimator of $\beta_1$?
(ii) Suppose that $e_t$ is unpredictable given all past information: E($e_t$|inf$_{t-1}$, unem$_{t-1}$, ...) = 0. Explain why this makes unem$_{t-1}$ a good IV candidate for unem$_t$.
(iii) Regress unem$_t$ on unem$_{t-1}$. Are unem$_t$ and unem$_{t-1}$ significantly correlated?
(iv) Estimate the expectations augmented Phillips curve by IV. Report the results in the usual form and compare them with the OLS estimates from Example 11.5.

C8 Use the data in 401KSUBS for this exercise. The equation of interest is a linear probability model:
pira = $\beta_0$ + $\beta_1$ p401k + $\beta_2$ inc + $\beta_3$ inc$^2$ + $\beta_4$ age + $\beta_5$ age$^2$ + u.
The goal is to test whether there is a tradeoff between participating in a 401(k) plan and having an individual retirement account (IRA). Therefore, we want to estimate $\beta_1$.
(i) Estimate the equation by OLS and discuss the estimated effect of p401k.
(ii) For the purposes of estimating the ceteris paribus tradeoff between participation in two different types of retirement savings plans, what might be a problem with ordinary least squares?
(iii) The variable e401k is a binary variable equal to one if a worker is eligible to participate in a 401(k) plan. Explain what is required for e401k to be a valid IV for p401k. Do these assumptions seem reasonable?
(iv) Estimate the reduced form for p401k and verify that e401k has significant partial correlation with p401k. Since the reduced form is also a linear probability model, use a heteroskedasticity-robust standard error.
(v) Now, estimate the structural equation by IV and compare the estimate of $\beta_1$ with the OLS estimate. Again, you should obtain heteroskedasticity-robust standard errors.
(vi) Test the null hypothesis that p401k is, in fact, exogenous, using a heteroskedasticity-robust test.

C9 The purpose of this exercise is to compare the estimates and standard errors obtained by correctly using 2SLS with those obtained using inappropriate procedures. Use the data file WAGE2.
(i) Use a 2SLS routine to estimate the equation
log(wage) = $\beta_0$ + $\beta_1$ educ + $\beta_2$ exper + $\beta_3$ tenure + $\beta_4$ black + u,
where sibs is the IV for educ. Report the results in the usual form.
(ii) Now, manually carry out 2SLS. That is, first regress educ$_i$ on sibs$_i$, exper$_i$, tenure$_i$, and black$_i$ and obtain the fitted values, $\widehat{educ}_i$, i = 1, ..., n. Then, run the second stage regression log(wage$_i$) on $\widehat{educ}_i$, exper$_i$, tenure$_i$, and black$_i$, i = 1, ..., n. Verify that the $\hat{\beta}_j$ are identical to those obtained from part (i), but that the standard errors are somewhat different. The standard errors obtained from the second stage regression when manually carrying out 2SLS are generally inappropriate.
(iii) Now, use the following two-step procedure, which generally yields inconsistent parameter estimates of the $\beta_j$, and not just inconsistent standard errors. In step one, regress educ$_i$ on sibs$_i$ only and obtain the fitted values, say $\widetilde{educ}_i$. (Note that this is an incorrect first stage regression.) Then, in the second step, run the regression of log(wage$_i$) on $\widetilde{educ}_i$, exper$_i$, tenure$_i$, and black$_i$, i = 1, ..., n. How does the estimate from this incorrect, two-step procedure compare with the correct 2SLS estimate of the return to education?
C10 Use the data in HTV for this exercise.
(i) Run a simple OLS regression of log(wage) on educ. Without controlling for other factors, what is the 95% confidence interval for the return to another year of education?
(ii) The variable ctuit, in thousands of dollars, is the change in college tuition facing students from age 17 to age 18. Show that educ and ctuit are essentially uncorrelated. What does this say about ctuit as a possible IV for educ in a simple regression analysis?
(iii) Now, add to the simple regression model in part (i) a quadratic in experience and a full set of regional dummy variables for current residence and residence at age 18. Also include the urban indicators for current and age 18 residences. What is the estimated return to a year of education?
(iv) Again using ctuit as a potential IV for educ, estimate the reduced form for educ. (Naturally, the reduced form for educ now includes the explanatory variables in part (iii).) Show that ctuit is now statistically significant in the reduced form for educ.
(v) Estimate the model from part (iii) by IV, using ctuit as an IV for educ. How does the confidence interval for the return to education compare with the OLS CI from part (iii)?
(vi) Do you think the IV procedure from part (v) is convincing?

C11 The data set in VOUCHER, which is a subset of the data used in Rouse (1998), can be used to estimate the effect of school choice on academic achievement. Attendance at a choice school was paid for by a voucher, which was determined by a lottery among those who applied. The data subset was chosen so that any student in the sample has a valid 1994 math test score (the last year available in Rouse's sample). Unfortunately, as pointed out by Rouse, many students have missing test scores, possibly due to attrition (that is, leaving the Milwaukee public school district). These data include students who applied to the voucher program and were accepted, students who applied and were not accepted, and students who did not apply. Therefore, even though the vouchers were chosen by lottery among those who applied, we do not necessarily have a random sample from a population where being selected for a voucher has been randomly determined. (An important consideration is that students who never applied to the program may be systematically different from those who did, and in ways that we cannot know based on the data.)

Rouse (1998) uses panel data methods of the kind we discussed in Chapter 14 to allow student fixed effects; she also uses instrumental variables methods. This problem asks you to do a cross-sectional analysis where winning the lottery for a voucher acts as an instrumental variable for attending a choice school. Actually, because we have multiple years of data on each student, we construct two variables. The first, choiceyrs, is the number of years from 1991 to 1994 that a student attended a choice school; this variable ranges from zero to four. The variable selectyrs indicates the number of years a student was selected for a voucher. If the student applied for the program in 1990 and received a voucher, then selectyrs = 4; if he or she applied in 1991 and received a voucher, then selectyrs = 3; and so on. The outcome of interest is mnce, the student's percentile score on a math test administered in 1994.
(i) Of the 990 students in the sample, how many were never awarded a voucher? How many had a voucher available for four years? How many students actually attended a choice school for four years?
(ii) Run a simple regression of choiceyrs on selectyrs. Are these variables related in the direction you expected? How strong is the relationship? Is selectyrs a sensible IV candidate for choiceyrs?
(iii) Run a simple regression of mnce on choiceyrs. What do you find? Is this what you expected? What happens if you add the variables black, hispanic, and female?
(iv) Why might choiceyrs be endogenous in an equation such as
mnce = $\beta_0$ + $\beta_1$ choiceyrs + $\beta_2$ black + $\beta_3$ hispanic + $\beta_4$ female + $u_1$?
(v) Estimate the equation in part (iv) by instrumental variables, using selectyrs as the IV for choiceyrs. Does using IV produce a positive effect of attending a choice school? What do you make of the coefficients on the other explanatory variables?
(vi) To control for the possibility that prior achievement affects participating in the lottery (as well as predicting attrition), add mnce90 (the math score in 1990) to the equation in part (iv). Estimate the equation by OLS and IV, and compare the results for $\beta_1$. For the IV estimate, how much is each year in a choice school worth on the math percentile score? Is this a practically large effect?
(vii) Why is the analysis from part (vi) not entirely convincing? [Hint: Compared with part (v), what happens to the number of observations, and why?]
(viii) The variables choiceyrs1, choiceyrs2, and so on are dummy variables indicating the different number of years a student could have been in a choice school (from 1991 to 1994). The dummy variables selectyrs1, selectyrs2, and so on have a similar definition, but for being selected from the lottery. Estimate the equation
mnce = $\beta_0$ + $\beta_1$ choiceyrs1 + $\beta_2$ choiceyrs2 + $\beta_3$ choiceyrs3 + $\beta_4$ choiceyrs4 + $\beta_5$ black + $\beta_6$ hispanic + $\beta_7$ female + $\beta_8$ mnce90 + $u_1$
by IV, using as instruments the four selectyrs dummy variables. (As before, the variables black, hispanic, and female act as their own IVs.) Describe your findings. Do they make sense?

C12 Use the data in CATHOLIC to answer this question. The model of interest is
math12 = $\beta_0$ + $\beta_1$ cathhs + $\beta_2$ lfaminc + $\beta_3$ motheduc + $\beta_4$ fatheduc + u,
where cathhs is a binary indicator for whether a student attends a Catholic high school.
(i) How many students are in the sample? What percentage of these students attend a Catholic high school?
(ii) Estimate the above equation by OLS. What is the estimate of $\beta_1$? What is its 95% confidence interval?
(iii) Using parcath as an instrument for cathhs, estimate the reduced form for cathhs. What is the t statistic for parcath? Is there evidence of a weak instrument problem?
(iv) Estimate the above equation by IV, using parcath as an IV for cathhs. How do the estimate and 95% CI compare with the OLS quantities?
(v) Test the null hypothesis that cathhs is exogenous. What is the p-value of the test?
(vi) Suppose you add the interaction cathhs·motheduc to the above model. Why is it generally endogenous? Why is parcath·motheduc a good IV candidate for cathhs·motheduc?
(vii) Before you create the interactions in part (vi), first find the sample average of motheduc and create cathhs·(motheduc − $\overline{motheduc}$) and parcath·(motheduc − $\overline{motheduc}$). Add the first interaction to the model and use the second as an IV. (Of course, cathhs is also instrumented.) Is the interaction term statistically significant?
(viii) Compare the coefficient on cathhs in part (vii) to that in part (iv). Is including the interaction important for estimating the average partial effect?

Appendix 15A

15A.1 Assumptions for Two Stage Least Squares

This appendix covers the assumptions under which 2SLS has desirable large sample properties. We first state the assumptions for cross-sectional applications under random sampling.
Then, we discuss what needs to be added for them to apply to time series and panel data.

15A.2 Assumption 2SLS.1 (Linear in Parameters)

The model in the population can be written as
y = $\beta_0$ + $\beta_1 x_1$ + $\beta_2 x_2$ + ... + $\beta_k x_k$ + u,
where $\beta_0$, $\beta_1$, ..., $\beta_k$ are the unknown parameters (constants) of interest and u is an unobserved random error or random disturbance term. The instrumental variables are denoted as $z_j$.

It is worth emphasizing that Assumption 2SLS.1 is virtually identical to MLR.1 (with the minor exception that 2SLS.1 mentions the notation for the instrumental variables, $z_j$). In other words, the model we are interested in is the same as that for OLS estimation of the $\beta_j$. Sometimes it is easy to lose sight of the fact that we can apply different estimation methods to the same model. Unfortunately, it is not uncommon to hear researchers say "I estimated an OLS model" or "I used a 2SLS model." Such statements are meaningless. OLS and 2SLS are different estimation methods that are applied to the same model. It is true that they have desirable statistical properties under different sets of assumptions on the model, but the relationship they are estimating is given by the equation in 2SLS.1 (or MLR.1). The point is similar to that made for the unobserved effects panel data model covered in Chapters 13 and 14: pooled OLS, first differencing, fixed effects, and random effects are different estimation methods for the same model.

15A.3 Assumption 2SLS.2 (Random Sampling)

We have a random sample on y, the $x_j$, and the $z_j$.

15A.4 Assumption 2SLS.3 (Rank Condition)

(i) There are no perfect linear relationships among the instrumental variables. (ii) The rank condition for identification holds.

With a single endogenous explanatory variable, as in equation (15.42), the rank condition is easily described. Let $z_1, \dots, z_m$ denote the exogenous variables, where $z_k, \dots, z_m$ do not appear in the structural model (15.42). The reduced form of $y_2$ is
$y_2 = \pi_0 + \pi_1 z_1 + \pi_2 z_2 + \dots + \pi_{k-1} z_{k-1} + \pi_k z_k + \dots + \pi_m z_m + v_2.$
Then, we need at least one of $\pi_k, \dots, \pi_m$ to be nonzero. This requires at least one exogenous variable that does not appear in (15.42) (the order condition). Stating the rank condition with two or more endogenous explanatory variables requires matrix algebra. [See Wooldridge (2010, Chapter 5).]
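In the one-endogenous-variable case, the rank condition can be checked by estimating the reduced form and testing joint significance of the excluded exogenous variables. A minimal statsmodels sketch, with hypothetical arrays y2 (the endogenous regressor), z_incl (exogenous variables that appear in the structural equation), and z_excl (the excluded instruments):

```python
import numpy as np
import statsmodels.api as sm

def first_stage_F(y2, z_incl, z_excl):
    """F statistic for H0: the coefficients on all excluded exogenous
    variables are zero in the reduced form for y2 (if H0 is true, the
    rank condition of Assumption 2SLS.3 fails)."""
    Zr = sm.add_constant(z_incl)                              # restricted
    Zu = sm.add_constant(np.column_stack([z_incl, z_excl]))   # unrestricted
    ssr_r = sm.OLS(y2, Zr).fit().ssr
    fit_u = sm.OLS(y2, Zu).fit()
    q = z_excl.shape[1]
    return (ssr_r - fit_u.ssr) / q / (fit_u.ssr / fit_u.df_resid)
```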
15A.5 Assumption 2SLS.4 (Exogenous Instrumental Variables)

The error term u has zero mean, and each IV is uncorrelated with u. (Remember, any $x_j$ that is uncorrelated with u also acts as an IV.)

15A.6 Theorem 15A.1

Under Assumptions 2SLS.1 through 2SLS.4, the 2SLS estimator is consistent.

15A.7 Assumption 2SLS.5 (Homoskedasticity)

Let z denote the collection of all instrumental variables. Then, $E(u^2|z) = \sigma^2$.

15A.8 Theorem 15A.2

Under Assumptions 2SLS.1 through 2SLS.5, the 2SLS estimators are asymptotically normally distributed. Consistent estimators of the asymptotic variance are given as in equation (15.43), where $\sigma^2$ is replaced with $\hat{\sigma}^2 = (n - k - 1)^{-1} \sum_{i=1}^{n} \hat{u}_i^2$, and the $\hat{u}_i$ are the 2SLS residuals.

The 2SLS estimator is also the best IV estimator under the five assumptions given. We state the result here; a proof can be found in Wooldridge (2010, Chapter 5).

15A.9 Theorem 15A.3

Under Assumptions 2SLS.1 through 2SLS.5, the 2SLS estimator is asymptotically efficient in the class of IV estimators that uses linear combinations of the exogenous variables as instruments.

If the homoskedasticity assumption does not hold, the 2SLS estimators are still asymptotically normal, but the standard errors (and t and F statistics) need to be adjusted; many econometrics packages do this routinely. Moreover, the 2SLS estimator is no longer the asymptotically efficient IV estimator, in general. We will not study more efficient estimators here [see Wooldridge (2010, Chapter 8)].

For time series applications, we must add some assumptions. First, as with OLS, we must assume that all series (including the IVs) are weakly dependent: this ensures that the law of large numbers and the central limit theorem hold. For the usual standard errors and test statistics to be valid, as well as for asymptotic efficiency, we must add a no serial correlation assumption.

15A.10 Assumption 2SLS.6 (No Serial Correlation)

Equation (15.54) holds.

A similar no serial correlation assumption is needed in panel data applications. Tests and corrections for serial correlation were discussed in Section 15.7.

Chapter 16 Simultaneous Equations Models

In the previous chapter, we showed how the method of instrumental variables can solve two kinds of endogeneity problems: omitted variables and measurement error. Conceptually, these problems are straightforward. In the omitted variables case, there is a variable (or more than one) that we would like to hold fixed when estimating the ceteris paribus effect of one or more of the observed explanatory variables. In the measurement error case, we would like to estimate the effect of certain explanatory variables on y, but we have mismeasured one or more variables. In both cases, we could estimate the parameters of interest by OLS if we could collect better data.

Another important form of endogeneity of explanatory variables is simultaneity. This arises when one or more of the explanatory variables is jointly determined with the dependent variable, typically through an equilibrium mechanism (as we will see later). In this chapter, we study methods for estimating simple simultaneous equations models (SEMs). Although a complete treatment of SEMs is beyond the scope of this text, we are able to cover models that are widely used.

The leading method for estimating simultaneous equations models is the method of instrumental variables. Therefore, the solution to the simultaneity problem is essentially the same as the IV solutions to the omitted variables and measurement error problems.
However, crafting and interpreting SEMs is challenging. Therefore, we begin by discussing the nature and scope of simultaneous equations models in Section 16.1. In Section 16.2, we confirm that OLS applied to an equation in a simultaneous system is generally biased and inconsistent. Section 16.3 provides a general description of identification and estimation in a two-equation system, while Section 16.4 briefly covers models with more than two equations. Simultaneous equations models are used to model aggregate time series, and in Section 16.5 we include a discussion of some special issues that arise in such models. Section 16.6 touches on simultaneous equations models with panel data.

16.1 The Nature of Simultaneous Equations Models

The most important point to remember in using simultaneous equations models is that each equation in the system should have a ceteris paribus, causal interpretation. Because we only observe the outcomes in equilibrium, we are required to use counterfactual reasoning in constructing the equations of a simultaneous equations model. We must think in terms of potential as well as actual outcomes.

The classic example of an SEM is a supply and demand equation for some commodity or input to production (such as labor). For concreteness, let $h_s$ denote the annual labor hours supplied by workers in agriculture, measured at the county level, and let w denote the average hourly wage offered to such workers. A simple labor supply function is

$h_s = \alpha_1 w + \beta_1 z_1 + u_1,$   (16.1)

where $z_1$ is some observed variable affecting labor supply, say, the average manufacturing wage in the county. The error term, $u_1$, contains other factors that affect labor supply. [Many of these factors are observed and could be included in equation (16.1); to illustrate the basic concepts, we include only one such factor, $z_1$.] Equation (16.1) is an example of a structural equation. This name comes from the fact that the labor supply function is derivable from economic theory and has a causal interpretation. The coefficient $\alpha_1$ measures how labor supply changes when the wage changes; if $h_s$ and w are in logarithmic form, $\alpha_1$ is the labor supply elasticity. Typically, we expect $\alpha_1$ to be positive (although economic theory does not rule out $\alpha_1 \le 0$). Labor supply elasticities are important for determining how workers will change the number of hours they desire to work when tax rates on wage income change. If $z_1$ is the manufacturing wage, we expect $\beta_1 \le 0$: other factors equal, if the manufacturing wage increases, more workers will go into manufacturing than into agriculture.

When we graph labor supply, we sketch hours as a function of wage, with $z_1$ and $u_1$ held fixed. A change in $z_1$ shifts the labor supply function, as does a change in $u_1$. The difference is that $z_1$ is observed while $u_1$ is not. Sometimes $z_1$ is called an observed supply shifter, and $u_1$ is called an unobserved supply shifter.

How does equation (16.1) differ from those we have studied previously? The difference is subtle. Although equation (16.1) is supposed to hold for all possible values of wage, we cannot generally view wage as varying exogenously for a cross section of counties.
If we could run an experiment where we vary the level of agricultural and manufacturing wages across a sample of counties and survey workers to obtain the labor supply $h_s$ for each county, then we could estimate (16.1) by OLS. Unfortunately, this is not a manageable experiment. Instead, we must collect data on average wages in these two sectors, along with how many person hours were spent in agricultural production. In deciding how to analyze these data, we must understand that they are best described by the interaction of labor supply and demand. Under the assumption that labor markets clear, we actually observe equilibrium values of wages and hours worked.

To describe how equilibrium wages and hours are determined, we need to bring in the demand for labor, which we suppose is given by

$h_d = \alpha_2 w + \beta_2 z_2 + u_2,$   (16.2)

where $h_d$ is hours demanded. As with the supply function, we graph hours demanded as a function of wage, w, keeping $z_2$ and $u_2$ fixed. The variable $z_2$, say, agricultural land area, is an observable demand shifter, while $u_2$ is an unobservable demand shifter.

Just as with the labor supply equation, the labor demand equation is a structural equation: it can be obtained from the profit maximization considerations of farmers. If $h_d$ and w are in logarithmic form, $\alpha_2$ is the labor demand elasticity. Economic theory tells us that $\alpha_2 < 0$. Because labor and land are complements in production, we expect $\beta_2 > 0$.

Notice how equations (16.1) and (16.2) describe entirely different relationships. Labor supply is a behavioral equation for workers, and labor demand is a behavioral relationship for farmers. Each equation has a ceteris paribus interpretation and stands on its own. They become linked in an econometric analysis only because observed wage and hours are determined by the intersection of supply and demand. In other words, for each county i, observed hours $h_i$ and observed wage $w_i$ are determined by the equilibrium condition

$h_{is} = h_{id}.$   (16.3)

Because we observe only equilibrium hours for each county i, we denote observed hours by $h_i$.

When we combine the equilibrium condition in (16.3) with the labor supply and demand equations, we get

$h_i = \alpha_1 w_i + \beta_1 z_{i1} + u_{i1}$   (16.4)

and

$h_i = \alpha_2 w_i + \beta_2 z_{i2} + u_{i2},$   (16.5)

where we explicitly include the i subscript to emphasize that $h_i$ and $w_i$ are the equilibrium observed values for county i. These two equations constitute a simultaneous equations model (SEM), which has several important features. First, given $z_{i1}$, $z_{i2}$, $u_{i1}$, and $u_{i2}$, these two equations determine $h_i$ and $w_i$. (Actually, we must assume that $\alpha_1 \ne \alpha_2$, which means that the slopes of the supply and demand functions differ; see Problem 1.) For this reason, $h_i$ and $w_i$ are the endogenous variables in this SEM. What about $z_{i1}$ and $z_{i2}$? Because they are determined outside of the model, we view them as exogenous variables. From a statistical standpoint, the key assumption concerning $z_{i1}$ and $z_{i2}$ is that they are both uncorrelated with the supply and demand errors, $u_{i1}$ and $u_{i2}$, respectively. These are examples of structural errors because they appear in the structural equations.
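A small simulation, with made-up parameter values, makes the equilibrium mechanism concrete: setting quantity supplied equal to quantity demanded and solving (16.4) and (16.5) gives the observed wage and hours, and it shows why the observed wage is correlated with the structural errors.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000

# Hypothetical structural parameters (alpha1 != alpha2 is required
# for the system to determine w and h):
alpha1, beta1 = 0.6, -0.4     # labor supply
alpha2, beta2 = -0.5, 0.3     # labor demand

z1 = rng.normal(size=n)       # observed supply shifter
z2 = rng.normal(size=n)       # observed demand shifter
u1 = rng.normal(size=n)       # unobserved supply shifter
u2 = rng.normal(size=n)       # unobserved demand shifter

# Equate (16.4) and (16.5) and solve for the equilibrium wage,
# then plug back into the supply equation for equilibrium hours:
w = (beta2 * z2 - beta1 * z1 + u2 - u1) / (alpha1 - alpha2)
h = alpha1 * w + beta1 * z1 + u1   # the demand equation gives the same h

# The equilibrium wage depends on BOTH structural errors, so w is
# correlated with u1 (and u2): the source of simultaneity bias.
print(np.corrcoef(w, u1)[0, 1])
```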
A second important point is that, without including $z_1$ and $z_2$ in the model, there is no way to tell which equation is the supply function and which is the demand function. When $z_1$ represents manufacturing wage, economic reasoning tells us that it is a factor in agricultural labor supply, because it is a measure of the opportunity cost of working in agriculture; when $z_2$ stands for agricultural land area, production theory implies that it appears in the labor demand function. Therefore, we know that (16.4) represents labor supply and (16.5) represents labor demand. If $z_1$ and $z_2$ are the same (for example, average education level of adults in the county, which can affect both supply and demand), then the equations look identical, and there is no hope of estimating either one. In a nutshell, this illustrates the identification problem in simultaneous equations models, which we will discuss more generally in Section 16.3.

The most convincing examples of SEMs have the same flavor as supply and demand examples. Each equation should have a behavioral, ceteris paribus interpretation on its own. Because we only observe equilibrium outcomes, specifying an SEM requires us to ask such counterfactual questions as: how much labor would workers provide if the wage were different from its equilibrium value? Example 16.1 provides another illustration of an SEM where each equation has a ceteris paribus interpretation.

Example 16.1 Murder Rates and Size of the Police Force

Cities often want to determine how much additional law enforcement will decrease their murder rates. A simple cross-sectional model to address this question is

$murdpc = \alpha_1 polpc + \beta_{10} + \beta_{11} incpc + u_1,$   (16.6)

where murdpc is murders per capita, polpc is number of police officers per capita, and incpc is income per capita. (Henceforth, we do not include an i subscript.) We take income per capita as exogenous in this equation. In practice, we would include other factors, such as age and gender distributions, education levels, perhaps geographic variables, and variables that measure severity of punishment. To fix ideas, we consider equation (16.6).

The question we hope to answer is: if a city exogenously increases its police force, will that increase, on average, lower the murder rate? If we could exogenously choose police force sizes for a random sample of cities, we could estimate (16.6) by OLS. Certainly we cannot run such an experiment. But can we think of police force size as being exogenously determined, anyway? Probably not. A city's spending on law enforcement is at least partly determined by its expected murder rate. To reflect this, we postulate a second relationship:

$polpc = \alpha_2 murdpc + \beta_{20} + \text{other factors}.$   (16.7)

We expect that $\alpha_2 > 0$: other factors being equal, cities with higher (expected) murder rates will have more police officers per capita. Once we specify the other factors in (16.7), we have a two-equation simultaneous equations model. We are really only interested in equation (16.6), but, as we will see in Section 16.3, we need to know precisely how the second equation is specified in order to estimate the first.
An important point is that (16.7) describes behavior by city officials, while (16.6) describes the actions of potential murderers. This gives each equation a clear ceteris paribus interpretation, which makes equations (16.6) and (16.7) an appropriate simultaneous equations model.

We next give an example of an inappropriate use of SEMs.

Example 16.2 Housing Expenditures and Saving

Suppose that, for a random household in the population, we assume that annual housing expenditures and saving are jointly determined by

$$housing = \alpha_1 saving + \beta_{10} + \beta_{11} inc + \beta_{12} educ + \beta_{13} age + u_1 \quad (16.8)$$

and

$$saving = \alpha_2 housing + \beta_{20} + \beta_{21} inc + \beta_{22} educ + \beta_{23} age + u_2, \quad (16.9)$$

where $inc$ is annual income and $educ$ and $age$ are measured in years.

Initially, it may seem that these equations are a sensible way to view how housing and saving expenditures are determined. But we have to ask: What value would one of these equations be without the other? Neither has a ceteris paribus interpretation because housing and saving are chosen by the same household. For example, it makes no sense to ask this question: If annual income increases by $10,000, how would housing expenditures change, holding saving fixed? If family income increases, a household will generally change the optimal mix of housing expenditures and saving. But equation (16.8) makes it seem as if we want to know the effect of changing $inc$, $educ$, or $age$ while keeping $saving$ fixed. Such a thought experiment is not interesting. Any model based on economic principles, particularly utility maximization, would have households optimally choosing housing and saving as functions of $inc$ and the relative prices of housing and saving. The variables $educ$ and $age$ would affect preferences for consumption, saving, and risk. Therefore, housing and saving would each be functions of income, education, age, and other variables that affect the utility maximization problem (such as different rates of return on housing and other saving).

Even if we decided that the SEM in (16.8) and (16.9) made sense, there is no way to estimate the parameters. (We discuss this problem more generally in Section 16.3.) The two equations are indistinguishable, unless we assume that income, education, or age appears in one equation but not the other, which would make no sense.

Though this makes a poor SEM example, we might be interested in testing whether, other factors being fixed, there is a tradeoff between housing expenditures and saving. But then we would just estimate, say, (16.8) by OLS, unless there is an omitted variable or measurement error problem.

Example 16.2 has the characteristics of all too many SEM applications. The problem is that the two endogenous variables are chosen by the same economic agent. Therefore, neither equation can stand on its own. Another example of an inappropriate use of an SEM would be to model weekly hours spent studying and weekly hours working. Each student will choose these variables simultaneously, presumably as a function of the wage that can be earned working, ability as a student, enthusiasm for college, and so on. Just as in Example 16.2, it makes no sense to specify two equations where each is a function of the other.
The important lesson is this: just because two variables are determined simultaneously does not mean that a simultaneous equations model is suitable. For an SEM to make sense, each equation in the SEM should have a ceteris paribus interpretation in isolation from the other equation. As we discussed earlier, supply and demand examples, and Example 16.1, have this feature. Usually, basic economic reasoning, supported in some cases by simple economic models, can help us use SEMs intelligently (including knowing when not to use an SEM).

Exploring Further 16.1
Pindyck and Rubinfeld (1992, Section 11.6) describe a model of advertising where monopolistic firms choose profit-maximizing levels of price and advertising expenditures. Does this mean we should use an SEM to model these variables at the firm level?

16.2 Simultaneity Bias in OLS

It is useful to see, in a simple model, that an explanatory variable that is determined simultaneously with the dependent variable is generally correlated with the error term, which leads to bias and inconsistency in OLS. We consider the two-equation structural model

$$y_1 = \alpha_1 y_2 + \beta_1 z_1 + u_1 \quad (16.10)$$
$$y_2 = \alpha_2 y_1 + \beta_2 z_2 + u_2 \quad (16.11)$$

and focus on estimating the first equation. The variables $z_1$ and $z_2$ are exogenous, so that each is uncorrelated with $u_1$ and $u_2$. For simplicity, we suppress the intercept in each equation.

To show that $y_2$ is generally correlated with $u_1$, we solve the two equations for $y_2$ in terms of the exogenous variables and the error terms. If we plug the right-hand side of (16.10) in for $y_1$ in (16.11), we get

$$y_2 = \alpha_2(\alpha_1 y_2 + \beta_1 z_1 + u_1) + \beta_2 z_2 + u_2,$$

or

$$(1 - \alpha_2\alpha_1)y_2 = \alpha_2\beta_1 z_1 + \beta_2 z_2 + \alpha_2 u_1 + u_2. \quad (16.12)$$

Now, we must make an assumption about the parameters in order to solve for $y_2$:

$$\alpha_2\alpha_1 \neq 1. \quad (16.13)$$

Whether this assumption is restrictive depends on the application. In Example 16.1, we think that $\alpha_1 \leq 0$ and $\alpha_2 \geq 0$, which implies $\alpha_1\alpha_2 \leq 0$; therefore, (16.13) is very reasonable for Example 16.1.

Provided condition (16.13) holds, we can divide (16.12) by $(1 - \alpha_2\alpha_1)$ and write $y_2$ as

$$y_2 = \pi_{21} z_1 + \pi_{22} z_2 + v_2, \quad (16.14)$$

where $\pi_{21} = \alpha_2\beta_1/(1 - \alpha_2\alpha_1)$, $\pi_{22} = \beta_2/(1 - \alpha_2\alpha_1)$, and $v_2 = (\alpha_2 u_1 + u_2)/(1 - \alpha_2\alpha_1)$. Equation (16.14), which expresses $y_2$ in terms of the exogenous variables and the error terms, is the reduced form equation for $y_2$, a concept we introduced in Chapter 15 in the context of instrumental variables estimation.
We can use equation 1614 to show that except under special assumptions OLS estimation of equation 1610 will produce biased and inconsistent estimators of a1 and b1 in equation 1610 Because z1 and u1 are uncorrelated by assumption the issue is whether y2 and u1 are uncorrelated From the reduced form in 1614 we see that y2 and u1 are correlated if and only if v2 and u1 are correlated because z1 and z2 are assumed exogenous But v2 is a linear function of u1 and u2 so it is generally correlated with u1 In fact if we assume that u1 and u2 are uncorrelated then v2 and u1 must be correlated whenever a2 2 0 Even if a2 equals zerowhich means that y1 does not appear in equa tion 1611 v2 and u1 will be correlated if u1 and u2 are correlated When a2 5 0 and u1 and u2 are uncorrelated y2 and u1 are also uncorrelated These are fairly strong requirements if a2 5 0 y2 is not simultaneously determined with y1 If we add zero correla tion between u1 and u2 this rules out omitted variables or measurement errors in u1 that are correlated with y2 We should not be surprised that OLS estimation of equation 1610 works in this case When y2 is correlated with u1 because of simultaneity we say that OLS suffers from simultaneity bias Obtaining the direction of the bias in the coefficients is generally complicated as we saw with omitted variables bias in Chapters 3 and 5 But in simple models we can determine the direction of the bias For example suppose that we simplify equation 1610 by dropping z1 from the equation and we assume that u1 and u2 are uncorrelated Then the covariance between y2 and u1 is Cov1y2u12 5 Cov1v2u12 5 3a211 2 a2a12 4E1u2 12 5 3a211 2 a2a12 4s2 1 where s2 1 5 Var1u12 0 Therefore the asymptotic bias or inconsistency in the OLS estimator of a1 has the same sign as a211 2 a2a12 If a2 0 and a2a1 1 the asymptotic bias is positive Unfortunately just as in our calculation of omitted variables bias from Section 33 the conclusions do not carry over to more general models But they do serve as a useful guide For example in Example 161 we think a2 0 and a2a1 0 which means that the OLS estimator of a1 would have a positive bias If a1 5 0 OLS would on average estimate a positive impact of more police on the murder rate generally the estimator of a1 is biased upward Because we expect an increase in the size of the police force to reduce murder rates ceteris paribus the upward bias means that OLS will underestimate the effectiveness of a larger police force 163 Identifying and Estimating a Structural Equation As we saw in the previous section OLS is biased and inconsistent when applied to a structural equa tion in a simultaneous equations system In Chapter 15 we learned that the method of two stage least squares can be used to solve the problem of endogenous explanatory variables We now show how 2SLS can be applied to SEMs The mechanics of 2SLS are similar to those in Chapter 15 The difference is that because we specify a structural equation for each endogenous variable we can immediately see whether sufficient IVs are available to estimate either equation We begin by discussing the identification problem Copyright 2016 Cengage Learning All Rights Reserved May not be copied scanned or duplicated in whole or in part Due to electronic rights some third party content may be suppressed from the eBook andor eChapters Editorial review has deemed that any suppressed content does not materially affect the overall learning experience Cengage Learning reserves the right to remove additional content at any time if 
16.3 Identifying and Estimating a Structural Equation

As we saw in the previous section, OLS is biased and inconsistent when applied to a structural equation in a simultaneous equations system. In Chapter 15, we learned that the method of two stage least squares can be used to solve the problem of endogenous explanatory variables. We now show how 2SLS can be applied to SEMs.

The mechanics of 2SLS are similar to those in Chapter 15. The difference is that, because we specify a structural equation for each endogenous variable, we can immediately see whether sufficient IVs are available to estimate either equation. We begin by discussing the identification problem.

16.3a Identification in a Two-Equation System

We mentioned the notion of identification in Chapter 15. When we estimate a model by OLS, the key identification condition is that each explanatory variable is uncorrelated with the error term. As we demonstrated in Section 16.2, this fundamental condition no longer holds, in general, for SEMs. However, if we have some instrumental variables, we can still identify (or consistently estimate) the parameters in an SEM equation, just as with omitted variables or measurement error.

Before we consider a general two-equation SEM, it is useful to gain intuition by considering a simple supply and demand example. Write the system in equilibrium form (that is, with $q_s = q_d = q$ imposed) as

$$q = \alpha_1 p + \beta_1 z_1 + u_1 \quad (16.15)$$

and

$$q = \alpha_2 p + u_2. \quad (16.16)$$

For concreteness, let $q$ be per capita milk consumption at the county level, let $p$ be the average price per gallon of milk in the county, and let $z_1$ be the price of cattle feed, which we assume is exogenous to the supply and demand equations for milk. This means that (16.15) must be the supply function, as the price of cattle feed would shift supply ($\beta_1 < 0$) but not demand. The demand function contains no observed demand shifters.

Given a random sample on $(q, p, z_1)$, which of these equations can be estimated? That is, which is an identified equation? It turns out that the demand equation, (16.16), is identified, but the supply equation is not. This is easy to see by using our rules for IV estimation from Chapter 15: we can use $z_1$ as an IV for price in equation (16.16). However, because $z_1$ appears in equation (16.15), we have no IV for price in the supply equation.

Intuitively, the fact that the demand equation is identified follows because we have an observed variable, $z_1$, that shifts the supply equation while not affecting the demand equation. Given variation in $z_1$ and no errors, we could trace out the demand curve, as shown in Figure 16.1.

[Figure 16.1 Shifting supply equations trace out the demand equation. Each supply equation is drawn for a different value of the exogenous variable, $z_1$; the axes are price and quantity.]

The presence of the unobserved demand shifter $u_2$ causes us to estimate the demand equation with error, but the estimators will be consistent, provided $z_1$ is uncorrelated with $u_2$.

The supply equation cannot be traced out because there are no exogenous observed factors shifting the demand curve. It does not help that there are unobserved factors shifting the demand function; we need something observed. If, as in the labor demand function (16.2), we have an observed exogenous demand shifter, such as income in the milk demand function, then the supply function would also be identified.

To summarize: In the system of (16.15) and (16.16), it is the presence of an exogenous variable in the supply equation that allows us to estimate the demand equation.
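The "tracing out" argument can also be seen numerically. In the following sketch (all numbers invented), variation in the supply shifter $z_1$ identifies the demand slope: the simple IV estimator using $z_1$ recovers $\alpha_2$, while OLS does not:

```python
# Simulate the milk market (16.15)-(16.16) and compare OLS and IV
# estimates of the demand slope a2. Parameter values are made up.
import numpy as np

a1, b1 = 0.8, -0.5       # supply: q = a1*p + b1*z1 + u1
a2 = -1.0                # demand: q = a2*p + u2

rng = np.random.default_rng(2)
n = 10_000
z1 = rng.normal(size=n)                      # price of cattle feed
u1, u2 = rng.normal(size=(2, n)) * 0.3
p = (u2 - b1 * z1 - u1) / (a1 - a2)          # equilibrium price
q = a2 * p + u2                              # equilibrium quantity

c_pq = np.cov(p, q)
ols = c_pq[0, 1] / c_pq[0, 0]                        # biased for a2
iv = np.cov(z1, q)[0, 1] / np.cov(z1, p)[0, 1]       # simple IV estimator
print(f"OLS: {ols:.3f}   IV: {iv:.3f}   true a2: {a2}")
```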
Extending the identification discussion to a general two-equation model is not difficult. Write the two equations as

$$y_1 = \beta_{10} + \alpha_1 y_2 + \mathbf{z}_1\boldsymbol{\beta}_1 + u_1 \quad (16.17)$$

and

$$y_2 = \beta_{20} + \alpha_2 y_1 + \mathbf{z}_2\boldsymbol{\beta}_2 + u_2, \quad (16.18)$$

where $y_1$ and $y_2$ are the endogenous variables, and $u_1$ and $u_2$ are the structural error terms. The intercept in the first equation is $\beta_{10}$, and the intercept in the second equation is $\beta_{20}$. The variable $\mathbf{z}_1$ denotes a set of $k_1$ exogenous variables appearing in the first equation: $\mathbf{z}_1 = (z_{11}, z_{12}, \ldots, z_{1k_1})$. Similarly, $\mathbf{z}_2$ is the set of $k_2$ exogenous variables in the second equation: $\mathbf{z}_2 = (z_{21}, z_{22}, \ldots, z_{2k_2})$. In many cases, $\mathbf{z}_1$ and $\mathbf{z}_2$ will overlap. As a shorthand form, we use the notation

$$\mathbf{z}_1\boldsymbol{\beta}_1 = \beta_{11}z_{11} + \beta_{12}z_{12} + \cdots + \beta_{1k_1}z_{1k_1}$$

and

$$\mathbf{z}_2\boldsymbol{\beta}_2 = \beta_{21}z_{21} + \beta_{22}z_{22} + \cdots + \beta_{2k_2}z_{2k_2};$$

that is, $\mathbf{z}_1\boldsymbol{\beta}_1$ stands for all exogenous variables in the first equation, with each multiplied by a coefficient, and similarly for $\mathbf{z}_2\boldsymbol{\beta}_2$. (Some authors use the notation $\mathbf{z}_1'\boldsymbol{\beta}_1$ and $\mathbf{z}_2'\boldsymbol{\beta}_2$ instead. If you have an interest in the matrix algebra approach to econometrics, see Appendix E.)

The fact that $\mathbf{z}_1$ and $\mathbf{z}_2$ generally contain different exogenous variables means that we have imposed exclusion restrictions on the model. In other words, we assume that certain exogenous variables do not appear in the first equation and others are absent from the second equation. As we saw with the previous supply and demand examples, this allows us to distinguish between the two structural equations.

When can we solve equations (16.17) and (16.18) for $y_1$ and $y_2$ (as linear functions of all exogenous variables and the structural errors, $u_1$ and $u_2$)? The condition is the same as that in (16.13), namely, $\alpha_2\alpha_1 \neq 1$. The proof is virtually identical to the simple model in Section 16.2. Under this assumption, reduced forms exist for $y_1$ and $y_2$.

The key question is: Under what assumptions can we estimate the parameters in, say, (16.17)? This is the identification issue. The rank condition for identification of equation (16.17) is easy to state.

Rank Condition for Identification of a Structural Equation. The first equation in a two-equation simultaneous equations model is identified if, and only if, the second equation contains at least one exogenous variable (with a nonzero coefficient) that is excluded from the first equation.

This is the necessary and sufficient condition for equation (16.17) to be identified. The order condition, which we discussed in Chapter 15, is necessary for the rank condition. The order condition for identifying the first equation states that at least one exogenous variable is excluded from this equation. The order condition is trivial to check once both equations have been specified. The rank condition requires more: at least one of the exogenous variables excluded from the first equation must have a nonzero population coefficient in the second equation. This ensures that at least one of the exogenous variables omitted from the first equation actually appears in the reduced form of $y_2$, so that we can use these variables as instruments for $y_2$. We can test this using a t or an F test, as in Chapter 15; some examples follow.

Identification of the second equation is, naturally, just the mirror image of the statement for the first equation. Also, if we write the equations as in the labor supply and demand example in Section 16.1, so that $y_1$ appears on the left-hand side in both equations, with $y_2$ on the right-hand side, the identification condition is identical.
Example 16.3 Labor Supply of Married, Working Women

To illustrate the identification issue, consider labor supply for married women already in the workforce. In place of the demand function, we write the wage offer as a function of hours and the usual productivity variables. With the equilibrium condition imposed, the two structural equations are

$$hours = \alpha_1\log(wage) + \beta_{10} + \beta_{11}educ + \beta_{12}age + \beta_{13}kidslt6 + \beta_{14}nwifeinc + u_1 \quad (16.19)$$

and

$$\log(wage) = \alpha_2 hours + \beta_{20} + \beta_{21}educ + \beta_{22}exper + \beta_{23}exper^2 + u_2. \quad (16.20)$$

The variable $age$ is the woman's age, in years, $kidslt6$ is the number of children less than six years old, $nwifeinc$ is the woman's nonwage income (which includes husband's earnings), and $educ$ and $exper$ are years of education and prior experience, respectively. All variables except $hours$ and $\log(wage)$ are assumed to be exogenous. (This is a tenuous assumption, as $educ$ might be correlated with omitted ability in either equation. But for illustration purposes, we ignore the omitted ability problem.) The functional form in this system, where $hours$ appears in level form but $wage$ is in logarithmic form, is popular in labor economics. We can write this system as in equations (16.17) and (16.18) by defining $y_1 = hours$ and $y_2 = \log(wage)$.

The first equation is the supply function. It satisfies the order condition because two exogenous variables, $exper$ and $exper^2$, are omitted from the labor supply equation. These exclusion restrictions are crucial assumptions: we are assuming that, once wage, education, age, number of small children, and other income are controlled for, past experience has no effect on current labor supply. (One could certainly question this assumption, but we use it for illustration.)

Given equations (16.19) and (16.20), the rank condition for identifying the first equation is that at least one of $exper$ and $exper^2$ has a nonzero coefficient in equation (16.20). If $\beta_{22} = 0$ and $\beta_{23} = 0$, there are no exogenous variables appearing in the second equation that do not also appear in the first ($educ$ appears in both). We can state the rank condition for identification of (16.19) equivalently in terms of the reduced form for $\log(wage)$, which is

$$\log(wage) = \pi_{20} + \pi_{21}educ + \pi_{22}age + \pi_{23}kidslt6 + \pi_{24}nwifeinc + \pi_{25}exper + \pi_{26}exper^2 + v_2. \quad (16.21)$$

For identification, we need $\pi_{25} \neq 0$ or $\pi_{26} \neq 0$, something we can test using a standard F statistic, as we discussed in Chapter 15.

The wage offer equation, (16.20), is identified if at least one of $age$, $kidslt6$, or $nwifeinc$ has a nonzero coefficient in (16.19). This is identical to assuming that the reduced form for $hours$, which has the same form as the right-hand side of (16.21), depends on at least one of $age$, $kidslt6$, or $nwifeinc$. In specifying the wage offer equation, we are assuming that $age$, $kidslt6$, and $nwifeinc$ have no effect on the offered wage, once hours, education, and experience are accounted for. These would be poor assumptions if these variables somehow have direct effects on productivity, or if women are discriminated against based on their age or number of small children.
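In practice, the testable part of the rank condition in Example 16.3 is checked by estimating the reduced form (16.21) and jointly testing the excluded instruments. A minimal sketch, assuming the MROZ data sit in a hypothetical mroz.csv with the usual column names (lwage, educ, age, kidslt6, nwifeinc, exper, expersq, hours):

```python
# Test pi25 = pi26 = 0 in the reduced form (16.21). The file path and
# column names are assumptions about how the data are stored.
import pandas as pd
import statsmodels.formula.api as smf

mroz = pd.read_csv("mroz.csv")          # hypothetical path
mroz = mroz[mroz["hours"] > 0]          # working women only

rf = smf.ols("lwage ~ educ + age + kidslt6 + nwifeinc + exper + expersq",
             data=mroz).fit()
# A large F statistic (small p-value) supports identification of (16.19).
print(rf.f_test("exper = 0, expersq = 0"))
```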
In Example 16.3, we take the population of interest to be married women who are in the workforce, so that equilibrium hours are positive. This excludes the group of married women who choose not to work outside the home. Including such women in the model raises some difficult problems. For instance, if a woman does not work, we cannot observe her wage offer. We touch on these issues in Chapter 17; but for now, we must think of equations (16.19) and (16.20) as holding only for women who have $hours > 0$.

Example 16.4 Inflation and Openness

Romer (1993) proposes theoretical models of inflation that imply that more "open" countries should have lower inflation rates. His empirical analysis explains average annual inflation rates (since 1973) in terms of the average share of imports in gross domestic (or national) product since 1973, which is his measure of openness. In addition to estimating the key equation by OLS, he uses instrumental variables. While Romer does not specify both equations in a simultaneous system, he has in mind a two-equation system:

$$inf = \beta_{10} + \alpha_1 open + \beta_{11}\log(pcinc) + u_1 \quad (16.22)$$

$$open = \beta_{20} + \alpha_2 inf + \beta_{21}\log(pcinc) + \beta_{22}\log(land) + u_2, \quad (16.23)$$

where $pcinc$ is 1980 per capita income, in U.S. dollars (assumed to be exogenous), and $land$ is the land area of the country, in square miles (also assumed to be exogenous). Equation (16.22) is the one of interest, with the hypothesis that $\alpha_1 < 0$. (More open economies have lower inflation rates.) The second equation reflects the fact that the degree of openness might depend on the average inflation rate, as well as other factors. The variable $\log(pcinc)$ appears in both equations, but $\log(land)$ is assumed to appear only in the second equation. The idea is that, ceteris paribus, a smaller country is likely to be more open (so $\beta_{22} < 0$).

Using the identification rule that was stated earlier, equation (16.22) is identified, provided $\beta_{22} \neq 0$. Equation (16.23) is not identified because it contains both exogenous variables. But we are interested in (16.22).

Exploring Further 16.2
If we have money supply growth since 1973 for each country, which we assume is exogenous, does this help identify equation (16.23)?

16.3b Estimation by 2SLS

Once we have determined that an equation is identified, we can estimate it by two stage least squares. The instrumental variables consist of the exogenous variables appearing in either equation.

Example 16.5 Labor Supply of Married, Working Women

We use the data on working, married women in MROZ to estimate the labor supply equation (16.19) by 2SLS. The full set of instruments includes $educ$, $age$, $kidslt6$, $nwifeinc$, $exper$, and $exper^2$. The estimated labor supply curve is

$$\widehat{hours} = 2{,}225.66 + 1{,}639.56\log(wage) - 183.75\,educ - 7.81\,age - 198.15\,kidslt6 - 10.17\,nwifeinc \quad (16.24)$$

with standard errors (574.56), (470.58), (59.10), (9.38), (182.93), and (6.61), respectively; $n = 428$. The reported standard errors are computed using a degrees-of-freedom adjustment. This equation shows that the labor supply curve slopes upward. The estimated coefficient on $\log(wage)$ has the following interpretation: holding other factors fixed, $\Delta\widehat{hours} \approx 16.4\,(\%\Delta wage)$.
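Because 2SLS is literally two OLS regressions, the point estimates in (16.24) can be reproduced by hand. A sketch under the same assumptions as above (hypothetical file path and column names; note that the naive second-stage OLS standard errors are not the correct 2SLS standard errors):

```python
# Hand-rolled 2SLS for (16.19): first stage regresses log(wage) on all
# exogenous variables; second stage replaces log(wage) by its fitted value.
import numpy as np
import pandas as pd

mroz = pd.read_csv("mroz.csv")          # hypothetical path
mroz = mroz[mroz["hours"] > 0]

def design(df, cols):
    return np.column_stack([np.ones(len(df)), df[cols].to_numpy(float)])

X1 = design(mroz, ["educ", "age", "kidslt6", "nwifeinc"])       # included exog
Z = design(mroz, ["educ", "age", "kidslt6", "nwifeinc",
                  "exper", "expersq"])                           # instruments
lwage = mroz["lwage"].to_numpy(float)
hours = mroz["hours"].to_numpy(float)

def ols(X, y):
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    return b

lwage_hat = Z @ ols(Z, lwage)                       # first stage fitted values
b2sls = ols(np.column_stack([X1, lwage_hat]), hours)
print(b2sls[-1])   # coefficient on log(wage); should match (16.24)
```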
We can calculate labor supply elasticities by multiplying both sides of this last equation by $100/hours$:

$$100\,(\Delta\widehat{hours}/hours) \approx (1{,}640/hours)(\%\Delta wage),$$

or

$$\%\Delta\widehat{hours} \approx (1{,}640/hours)(\%\Delta wage),$$

which implies that the labor supply elasticity (with respect to wage) is simply $1{,}640/hours$. (The elasticity is not constant in this model because $hours$, not $\log(hours)$, is the dependent variable in (16.24).) At the average hours worked, 1,303, the estimated elasticity is $1{,}640/1{,}303 \approx 1.26$, which implies a greater than 1% increase in hours worked given a 1% increase in wage. This is a large estimated elasticity. At higher hours, the elasticity will be smaller; at lower hours, such as $hours = 800$, the elasticity is over two.

For comparison, when (16.19) is estimated by OLS, the coefficient on $\log(wage)$ is $-2.05$ ($se = 54.88$), which implies no wage effect on hours worked. To confirm that $\log(wage)$ is in fact endogenous in (16.19), we can carry out the test from Section 15.5. When we add the reduced form residuals $\hat{v}_2$ to the equation and estimate by OLS, the t statistic on $\hat{v}_2$ is $-6.61$, which is very significant, and so $\log(wage)$ appears to be endogenous.

The wage offer equation (16.20) can also be estimated by 2SLS. The result is

$$\widehat{\log(wage)} = -.656 + .00013\,hours + .110\,educ + .035\,exper - .00071\,exper^2 \quad (16.25)$$

with standard errors (.338), (.00025), (.016), (.019), and (.00045), respectively; $n = 428$.

This differs from previous wage equations in that $hours$ is included as an explanatory variable and 2SLS is used to account for endogeneity of $hours$ (and we assume that $educ$ and $exper$ are exogenous). The coefficient on $hours$ is statistically insignificant, which means that there is no evidence that the wage offer increases with hours worked. The other coefficients are similar to what we get by dropping $hours$ and estimating the equation by OLS.

Estimating the effect of openness on inflation by instrumental variables is also straightforward.

Example 16.6 Inflation and Openness

Before we estimate (16.22) using the data in OPENNESS, we check to see whether $open$ has sufficient partial correlation with the proposed IV, $\log(land)$. The reduced form regression is

$$\widehat{open} = 117.08 + .546\log(pcinc) - 7.57\log(land)$$

with standard errors (15.85), (1.493), and (.81), respectively; $n = 114$, $R^2 = .449$.

The t statistic on $\log(land)$ is over nine in absolute value, which verifies Romer's assertion that smaller countries are more open. (The fact that $\log(pcinc)$ is so insignificant in this regression is irrelevant.)

Estimating (16.22) using $\log(land)$ as an IV for $open$ gives

$$\widehat{inf} = 26.90 - .337\,open + .376\log(pcinc) \quad (16.26)$$

with standard errors (15.40), (.144), and (2.015), respectively; $n = 114$.

The coefficient on $open$ is statistically significant at about the 1% level against a one-sided alternative ($\alpha_1 < 0$). The effect is economically important as well: for every percentage point increase in the import share of GDP, annual inflation is about one-third of a percentage point lower. For comparison, the OLS estimate is $-.215$ ($se = .095$).

Exploring Further 16.3
How would you test whether the difference between the OLS and IV estimates on $open$ is statistically significant?
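One way to approach Exploring Further 16.3 is the regression-based test from Section 15.5: add the reduced form residuals for $open$ to (16.22) and examine their t statistic, since a systematic OLS/IV difference is evidence that $open$ is endogenous. A sketch, with a hypothetical file path and assumed column names (infl, open, lpcinc, lland):

```python
# Regression-based endogeneity test for `open` in (16.22).
import pandas as pd
import statsmodels.formula.api as smf

opns = pd.read_csv("openness.csv")                        # hypothetical path
rf = smf.ols("open ~ lpcinc + lland", data=opns).fit()    # reduced form
opns["v2hat"] = rf.resid
aug = smf.ols("infl ~ open + lpcinc + v2hat", data=opns).fit()
# A significant coefficient on v2hat says open is endogenous, so the
# OLS and IV estimates of alpha1 differ systematically.
print(aug.tvalues["v2hat"])
```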
16.4 Systems with More Than Two Equations

Simultaneous equations models can consist of more than two equations. Studying general identification of these models is difficult and requires matrix algebra. Once an equation in a general system has been shown to be identified, it can be estimated by 2SLS.

16.4a Identification in Systems with Three or More Equations

We will use a three-equation system to illustrate the issues that arise in the identification of complicated SEMs. With intercepts suppressed, write the model as

$$y_1 = \alpha_{12}y_2 + \alpha_{13}y_3 + \beta_{11}z_1 + u_1 \quad (16.27)$$
$$y_2 = \alpha_{21}y_1 + \beta_{21}z_1 + \beta_{22}z_2 + \beta_{23}z_3 + u_2 \quad (16.28)$$
$$y_3 = \alpha_{32}y_2 + \beta_{31}z_1 + \beta_{32}z_2 + \beta_{33}z_3 + \beta_{34}z_4 + u_3, \quad (16.29)$$

where the $y_g$ are the endogenous variables and the $z_j$ are exogenous. The first subscript on the parameters indicates the equation number, and the second indicates the variable number; we use $\alpha$ for parameters on endogenous variables and $\beta$ for parameters on exogenous variables.

Which of these equations can be estimated? It is generally difficult to show that an equation in an SEM with more than two equations is identified, but it is easy to see when certain equations are not identified. In system (16.27) through (16.29), we can easily see that (16.29) falls into this category. Because every exogenous variable appears in this equation, we have no IVs for $y_2$. Therefore, we cannot consistently estimate the parameters of this equation. For the reasons we discussed in Section 16.2, OLS estimation will not usually be consistent.

What about equation (16.27)? Things look promising because $z_2$, $z_3$, and $z_4$ are all excluded from the equation; this is another example of exclusion restrictions. Although there are two endogenous variables in this equation, we have three potential IVs for $y_2$ and $y_3$. Therefore, equation (16.27) passes the order condition. For completeness, we state the order condition for general SEMs.

Order Condition for Identification. An equation in any SEM satisfies the order condition for identification if the number of excluded exogenous variables from the equation is at least as large as the number of right-hand side endogenous variables.

The second equation, (16.28), also passes the order condition because there is one excluded exogenous variable, $z_4$, and one right-hand side endogenous variable, $y_1$.

As we discussed in Chapter 15 and in the previous section, the order condition is only necessary, not sufficient, for identification. For example, if $\beta_{34} = 0$, $z_4$ appears nowhere in the system, which means it is not correlated with $y_1$, $y_2$, or $y_3$. If $\beta_{34} = 0$, then the second equation is not identified, because $z_4$ is useless as an IV for $y_1$. This again illustrates that identification of an equation depends on the values of the parameters (which we can never know for sure) in the other equations.

There are many subtle ways that identification can fail in complicated SEMs. To obtain sufficient conditions, we need to extend the rank condition for identification in two-equation systems. This is possible, but it requires matrix algebra [see, for example, Wooldridge (2010, Chapter 9)]. In many applications, one assumes that, unless there is obviously failure of identification, an equation that satisfies the order condition is identified.
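Because the order condition is purely a counting rule, it can be checked mechanically. A toy sketch for system (16.27) through (16.29), where each equation simply lists its right-hand-side endogenous variables and its included exogenous variables:

```python
# Count excluded exogenous variables against RHS endogenous variables
# for each equation of (16.27)-(16.29).
all_exog = {"z1", "z2", "z3", "z4"}
system = {
    "(16.27)": {"endog": {"y2", "y3"}, "exog": {"z1"}},
    "(16.28)": {"endog": {"y1"}, "exog": {"z1", "z2", "z3"}},
    "(16.29)": {"endog": {"y2"}, "exog": {"z1", "z2", "z3", "z4"}},
}
for name, eq in system.items():
    excluded = all_exog - eq["exog"]
    ok = len(excluded) >= len(eq["endog"])
    print(f"{name}: {len(excluded)} excluded exog vs "
          f"{len(eq['endog'])} RHS endog -> order condition "
          f"{'passes' if ok else 'FAILS'}")
```

Running this prints that (16.27) and (16.28) pass while (16.29) fails, matching the discussion above. Remember that passing is only necessary, not sufficient: the checker cannot rule out, say, $\beta_{34} = 0$.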
The nomenclature on overidentified and just identified equations from Chapter 15 originated with SEMs. In terms of the order condition, (16.27) is an overidentified equation because we need only two IVs (for $y_2$ and $y_3$) but we have three available ($z_2$, $z_3$, and $z_4$); there is one overidentifying restriction in this equation. In general, the number of overidentifying restrictions equals the total number of exogenous variables in the system minus the total number of explanatory variables in the equation. These can be tested using the overidentification test from Section 15.5. Equation (16.28) is a just identified equation, and the third equation is an unidentified equation.

16.4b Estimation

Regardless of the number of equations in an SEM, each identified equation can be estimated by 2SLS. The instruments for a particular equation consist of the exogenous variables appearing anywhere in the system. Tests for endogeneity, heteroskedasticity, serial correlation, and overidentifying restrictions can be obtained, just as in Chapter 15.

It turns out that, when any system with two or more equations is correctly specified and certain additional assumptions hold, system estimation methods are generally more efficient than estimating each equation by 2SLS. The most common system estimation method in the context of SEMs is three stage least squares. These methods, with or without endogenous explanatory variables, are beyond the scope of this text. [See, for example, Wooldridge (2010, Chapters 7 and 8).]

16.5 Simultaneous Equations Models with Time Series

Among the earliest applications of SEMs was estimation of large systems of simultaneous equations that were used to describe a country's economy. A simple Keynesian model of aggregate demand (that ignores exports and imports) is

$$C_t = \beta_0 + \beta_1(Y_t - T_t) + \beta_2 r_t + u_{t1} \quad (16.30)$$
$$I_t = \gamma_0 + \gamma_1 r_t + u_{t2} \quad (16.31)$$
$$Y_t \equiv C_t + I_t + G_t, \quad (16.32)$$

where $C_t$ = consumption, $Y_t$ = income, $T_t$ = tax receipts, $r_t$ = the interest rate, $I_t$ = investment, and $G_t$ = government spending. [See, for example, Mankiw (1994, Chapter 9).] For concreteness, assume $t$ represents year.

The first equation is an aggregate consumption function, where consumption depends on disposable income, the interest rate, and the unobserved structural error $u_{t1}$. The second equation is a very simple investment function. Equation (16.32) is an identity that is a result of national income accounting: it holds by definition, without error. Thus, there is no sense in which we estimate (16.32), but we need this equation to round out the model.

Because there are three equations in the system, there must also be three endogenous variables. Given the first two equations, it is clear that we intend for $C_t$ and $I_t$ to be endogenous. In addition, because of the accounting identity, $Y_t$ is endogenous. We would assume, at least in this model, that $T_t$, $r_t$, and $G_t$ are exogenous, so that they are uncorrelated with $u_{t1}$ and $u_{t2}$. (We will discuss problems with this kind of assumption later.)

If $r_t$ is exogenous, then OLS estimation of equation (16.31) is natural. The consumption function, however, depends on disposable income, which is endogenous because $Y_t$ is. We have two instruments available under the maintained exogeneity assumptions: $T_t$ and $G_t$. Therefore, if we follow our prescription for estimating cross-sectional equations, we would estimate (16.30) by 2SLS using instruments $(T_t, G_t, r_t)$.
Models such as (16.30) through (16.32) are seldom estimated now, for several good reasons. First, it is very difficult to justify, at an aggregate level, the assumption that taxes, interest rates, and government spending are exogenous. Taxes clearly depend directly on income; for example, with a single marginal income tax rate, $\tau_t$, in year $t$, $T_t = \tau_t Y_t$. We can easily allow this by replacing $(Y_t - T_t)$ with $(1 - \tau_t)Y_t$ in (16.30), and we can still estimate the equation by 2SLS if we assume that government spending is exogenous. We could also add the tax rate to the instrument list, if it is exogenous. But are government spending and tax rates really exogenous? They certainly could be, in principle, if the government sets spending and tax rates independently of what is happening in the economy. But it is a difficult case to make in reality: government spending generally depends on the level of income, and at high levels of income, the same tax receipts are collected for lower marginal tax rates. In addition, assuming that interest rates are exogenous is extremely questionable. We could specify a more realistic model that includes money demand and supply, and then interest rates could be jointly determined with $C_t$, $I_t$, and $Y_t$. But then finding enough exogenous variables to identify the equations becomes quite difficult (and the following problems with these models still pertain).

Some have argued that certain components of government spending, such as defense spending [see, for example, Hall (1988) and Ramey (1991)], are exogenous in a variety of simultaneous equations applications. But this is not universally agreed upon, and, in any case, defense spending is not always appropriately correlated with the endogenous explanatory variables [see Shea (1993) for discussion and Computer Exercise C6 for an example].

A second problem with a model such as (16.30) through (16.32) is that it is completely static. Especially with monthly or quarterly data, but even with annual data, we often expect adjustment lags. (One argument in favor of static Keynesian-type models is that they are intended to describe the long run without worrying about short-run dynamics.) Allowing dynamics is not very difficult. For example, we could add lagged income to equation (16.31):

$$I_t = \gamma_0 + \gamma_1 r_t + \gamma_2 Y_{t-1} + u_{t2}. \quad (16.33)$$

In other words, we add a lagged endogenous variable (but not $I_{t-1}$) to the investment equation. Can we treat $Y_{t-1}$ as exogenous in this equation? Under certain assumptions on $u_{t2}$, the answer is yes. But we typically call a lagged endogenous variable in an SEM a predetermined variable. Lags of exogenous variables are also predetermined. If we assume that $u_{t2}$ is uncorrelated with current exogenous variables (which is standard) and all past endogenous and exogenous variables, then $Y_{t-1}$ is uncorrelated with $u_{t2}$. Given exogeneity of $r_t$, we can estimate (16.33) by OLS.

If we add lagged consumption to (16.30), we can treat $C_{t-1}$ as exogenous in this equation under the same assumptions on $u_{t1}$ that we made for $u_{t2}$ in the previous paragraph. Current disposable income is still endogenous in

$$C_t = \beta_0 + \beta_1(Y_t - T_t) + \beta_2 r_t + \beta_3 C_{t-1} + u_{t1}, \quad (16.34)$$
so we could estimate this equation by 2SLS using instruments $(T_t, G_t, r_t, C_{t-1})$; if investment is determined by (16.33), $Y_{t-1}$ should be added to the instrument list. [To see why, use (16.32), (16.33), and (16.34) to find the reduced form for $Y_t$ in terms of the exogenous and predetermined variables: $T_t$, $r_t$, $G_t$, $C_{t-1}$, and $Y_{t-1}$. Because $Y_{t-1}$ shows up in this reduced form, it should be used as an IV.]

The presence of dynamics in aggregate SEMs is, at least for the purposes of forecasting, a clear improvement over static SEMs. But there are still some important problems with estimating SEMs using aggregate time series data, some of which we discussed in Chapters 11 and 15. Recall that the validity of the usual OLS or 2SLS inference procedures in time series applications hinges on the notion of weak dependence. Unfortunately, series such as aggregate consumption, income, investment, and even interest rates seem to violate the weak dependence requirements. (In the terminology of Chapter 11, they have unit roots.) These series also tend to have exponential trends, although this can be partly overcome by using the logarithmic transformation and assuming different functional forms. Generally, even the large sample, let alone the small sample, properties of OLS and 2SLS are complicated and dependent on various assumptions when they are applied to equations with I(1) variables. We will briefly touch on these issues in Chapter 18. [An advanced, general treatment is given by Hamilton (1994).]

Does the previous discussion mean that SEMs are not usefully applied to time series data? Not at all. The problems with trends and high persistence can be avoided by specifying systems in first differences or growth rates. But one should recognize that this is a different SEM than one specified in levels. [For example, if we specify consumption growth as a function of disposable income growth and interest rate changes, this is different from (16.30).] Also, as we discussed earlier, incorporating dynamics is not especially difficult. Finally, the problem of finding truly exogenous variables to include in SEMs is often easier with disaggregated data. For example, for manufacturing industries, Shea (1993) describes how output (or, more precisely, growth in output) in other industries can be used as an instrument in estimating supply functions. Ramey (1991) also has a convincing analysis of estimating industry cost functions by instrumental variables using time series data.

The next example shows how aggregate data can be used to test an important economic theory: the permanent income theory of consumption, usually called the permanent income hypothesis (PIH). The approach used in this example is not, strictly speaking, based on a simultaneous equations model, but we can think of consumption and income growth (as well as interest rates) as being jointly determined.

Example 16.7 Testing the Permanent Income Hypothesis

Campbell and Mankiw (1990) used instrumental variables methods to test various versions of the PIH. We will use the annual data from 1959 through 1995 in CONSUMP to mimic one of their analyses. (Campbell and Mankiw used quarterly data running through 1985.) One equation estimated by Campbell and Mankiw (using our notation) is

$$gc_t = \beta_0 + \beta_1 gy_t + \beta_2 r3_t + u_t, \quad (16.35)$$

where $gc_t = \Delta\log(c_t)$ = annual growth in real per capita consumption (excluding durables), $gy_t$ = growth in real disposable income, and $r3_t$ = the (ex post) real interest rate as measured by the return on three-month T-bill rates: $r3_t = i3_t - inf_t$, where the inflation rate is based on the Consumer Price Index. The growth rates of consumption and disposable income are not trending, and they are weakly dependent;
we will assume this is the case for $r3_t$ as well, so that we can apply standard asymptotic theory.

The key feature of equation (16.35) is that the PIH implies that the error term $u_t$ has a zero mean, conditional on all information observed at time $t-1$ or earlier: $\mathrm{E}(u_t|I_{t-1}) = 0$. However, $u_t$ is not necessarily uncorrelated with $gy_t$ or $r3_t$; a traditional way to think about this is that these variables are jointly determined, but we are not writing down a full three-equation system.

Because $u_t$ is uncorrelated with all variables dated $t-1$ or earlier, valid instruments for estimating (16.35) are lagged values of $gc$, $gy$, and $r3$ (and lags of other observable variables, but we will not use those here). What are the hypotheses of interest? The pure form of the PIH has $\beta_1 = \beta_2 = 0$. Campbell and Mankiw argue that $\beta_1$ is positive if some fraction of the population consumes current income, rather than permanent income. The PIH with a nonconstant real interest rate implies that $\beta_2 > 0$.

When we estimate (16.35) by 2SLS, using instruments $gc_{-1}$, $gy_{-1}$, and $r3_{-1}$ for the endogenous variables $gy_t$ and $r3_t$, we obtain

$$\widehat{gc}_t = .0081 + .586\,gy_t - .00027\,r3_t \quad (16.36)$$

with standard errors (.0032), (.135), and (.00076), respectively; $n = 35$, $R^2 = .678$.

Therefore, the pure form of the PIH is strongly rejected because the coefficient on $gy$ is economically large (a 1% increase in disposable income increases consumption by over .5%) and statistically significant ($t = 4.34$). By contrast, the real interest rate coefficient is very small and statistically insignificant. These findings are qualitatively the same as Campbell and Mankiw's.

The PIH also implies that the errors $\{u_t\}$ are serially uncorrelated. After 2SLS estimation, we obtain the residuals, $\hat{u}_t$, and include $\hat{u}_{t-1}$ as an additional explanatory variable in (16.36); we still use instruments $gc_{t-1}$, $gy_{t-1}$, $r3_{t-1}$, and $\hat{u}_{t-1}$ acts as its own instrument (see Section 15.7). The coefficient on $\hat{u}_{t-1}$ is $\hat{\rho} = .187$ ($se = .133$), so there is some evidence of positive serial correlation, although not at the 5% significance level. Campbell and Mankiw discuss why, with the available quarterly data, positive serial correlation might be found in the errors even if the PIH holds; some of those concerns carry over to annual data.

Using growth rates of trending or I(1) variables in SEMs is fairly common in time series applications. For example, Shea (1993) estimates industry supply curves specified in terms of growth rates. If a structural model contains a time trend, which may capture exogenous, trending factors that are not directly modeled, then the trend acts as its own IV.
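A sketch of the 2SLS estimation in Example 16.7, using first lags as instruments. The file path and the column names (gc, gy, r3) are assumptions about how the annual data are stored:

```python
# 2SLS for (16.35) with lagged instruments gc_{t-1}, gy_{t-1}, r3_{t-1}.
import numpy as np
import pandas as pd

cons = pd.read_csv("consump.csv")            # hypothetical path
for v in ["gc", "gy", "r3"]:
    cons[v + "_1"] = cons[v].shift(1)        # lagging loses the first year
cons = cons.dropna()

y = cons["gc"].to_numpy(float)
X = np.column_stack([np.ones(len(cons)),
                     cons[["gy", "r3"]].to_numpy(float)])
Z = np.column_stack([np.ones(len(cons)),
                     cons[["gc_1", "gy_1", "r3_1"]].to_numpy(float)])

# Generic 2SLS: beta = (X'Pz X)^{-1} X'Pz y, Pz the projection onto Z.
# With three instruments for two endogenous regressors, the equation is
# overidentified by one restriction.
PZ = Z @ np.linalg.solve(Z.T @ Z, Z.T)
beta = np.linalg.solve(X.T @ PZ @ X, X.T @ PZ @ y)
print(beta)    # compare with the estimates reported in (16.36)
```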
Exploring Further 16.4
Suppose that, for a particular city, you have monthly data on per capita consumption of fish, per capita income, the price of fish, and the prices of chicken and beef; income and chicken and beef prices are exogenous. Assume that there is no seasonality in the demand function for fish, but there is in the supply of fish. How can you use this information to estimate a constant elasticity demand-for-fish equation? Specify an equation and discuss identification. (Hint: You should have 11 instrumental variables for the price of fish.)

16.6 Simultaneous Equations Models with Panel Data

Simultaneous equations models also arise in panel data contexts. For example, we can imagine estimating labor supply and wage offer equations, as in Example 16.3, for a group of people working over a given period of time. In addition to allowing for simultaneous determination of variables within each time period, we can allow for unobserved effects in each equation. (In a labor supply function, it would be useful to allow an unobserved taste for leisure that does not change over time.)

The basic approach to estimating SEMs with panel data involves two steps: (1) eliminate the unobserved effects from the equations of interest using the fixed effects transformation or first differencing and (2) find instrumental variables for the endogenous variables in the transformed equation. This can be very challenging, because for a convincing analysis we need to find instruments that change over time. To see why, write an SEM for panel data as

$$y_{it1} = \alpha_1 y_{it2} + \mathbf{z}_{it1}\boldsymbol{\beta}_1 + a_{i1} + u_{it1} \quad (16.37)$$
$$y_{it2} = \alpha_2 y_{it1} + \mathbf{z}_{it2}\boldsymbol{\beta}_2 + a_{i2} + u_{it2}, \quad (16.38)$$

where $i$ denotes cross section, $t$ denotes time period, and $\mathbf{z}_{it1}\boldsymbol{\beta}_1$ or $\mathbf{z}_{it2}\boldsymbol{\beta}_2$ denotes linear functions of a set of exogenous explanatory variables in each equation. The most general analysis allows the unobserved effects, $a_{i1}$ and $a_{i2}$, to be correlated with all explanatory variables, even the elements in $\mathbf{z}$. However, we assume that the idiosyncratic structural errors, $u_{it1}$ and $u_{it2}$, are uncorrelated with the $\mathbf{z}$ in both equations and across all time periods; this is the sense in which the $\mathbf{z}$ are exogenous. Except under special circumstances, $y_{it2}$ is correlated with $u_{it1}$, and $y_{it1}$ is correlated with $u_{it2}$.

Suppose we are interested in equation (16.37). We cannot estimate it by OLS, as the composite error $a_{i1} + u_{it1}$ is potentially correlated with all explanatory variables. Suppose we difference over time to remove the unobserved effect, $a_{i1}$:

$$\Delta y_{it1} = \alpha_1\Delta y_{it2} + \Delta\mathbf{z}_{it1}\boldsymbol{\beta}_1 + \Delta u_{it1}. \quad (16.39)$$

(As usual with differencing or time-demeaning, we can only estimate the effects of variables that change over time for at least some cross-sectional units.) Now, the error term in this equation is uncorrelated with $\Delta\mathbf{z}_{it1}$ by assumption. But $\Delta y_{it2}$ and $\Delta u_{it1}$ are possibly correlated. Therefore, we need an IV for $\Delta y_{it2}$. As with the case of pure cross-sectional or pure time series data, possible IVs come from the other equation: elements in $\mathbf{z}_{it2}$ that are not also in $\mathbf{z}_{it1}$. In practice, we need time-varying elements in $\mathbf{z}_{it2}$ that are not also in $\mathbf{z}_{it1}$. This is because we need an instrument for $\Delta y_{it2}$, and a change in a variable from one period to the next is unlikely to be highly correlated with the level of exogenous variables. In fact, if we difference (16.38), we see that the natural IVs for $\Delta y_{it2}$ are those elements in $\Delta\mathbf{z}_{it2}$ that are not also in $\Delta\mathbf{z}_{it1}$.

As an example of the problems that can arise, consider a panel data version of the labor supply function in Example 16.3. After differencing, suppose we have the equation

$$\Delta hours_{it} = \beta_0 + \alpha_1\Delta\log(wage_{it}) + \Delta(\text{other factors}_{it}),$$

and we wish to use $\Delta exper_{it}$ as an instrument for $\Delta\log(wage_{it})$.
The problem is that, because we are looking at people who work in every time period, $\Delta exper_{it} = 1$ for all $i$ and $t$. (Each person gets another year of experience after a year passes.) We cannot use an IV that is the same value for all $i$ and $t$, and so we must look elsewhere.

Often, participation in an experimental program can be used to obtain IVs in panel data contexts. In Example 15.10, we used receipt of job training grants as an IV for the change in hours of training in determining the effects of job training on worker productivity. In fact, we could view that in an SEM context: job training and worker productivity are jointly determined, but receiving a job training grant is exogenous in equation (15.57).

We can sometimes come up with clever, convincing instrumental variables in panel data applications, as the following example illustrates.

Example 16.8 Effect of Prison Population on Violent Crime Rates

In order to estimate the causal effect of prison population increases on crime rates at the state level, Levitt (1996) used instances of prison overcrowding litigation as instruments for the growth in prison population. The equation Levitt estimated is in first differences; we can write an underlying fixed effects model as

$$\log(crime_{it}) = \eta_t + \alpha_1\log(prison_{it}) + \mathbf{z}_{it1}\boldsymbol{\beta}_1 + a_{i1} + u_{it1}, \quad (16.40)$$

where $\eta_t$ denotes different time intercepts and $crime$ and $prison$ are measured per 100,000 people. (The prison population variable is measured on the last day of the previous year.) The vector $\mathbf{z}_{it1}$ contains log of police per capita, log of income per capita, the unemployment rate, proportions of black and those living in metropolitan areas, and age distribution proportions.

Differencing (16.40) gives the equation estimated by Levitt:

$$\Delta\log(crime_{it}) = \xi_t + \alpha_1\Delta\log(prison_{it}) + \Delta\mathbf{z}_{it1}\boldsymbol{\beta}_1 + \Delta u_{it1}. \quad (16.41)$$

Simultaneity between crime rates and prison population, or, more precisely, in the growth rates, makes OLS estimation of (16.41) generally inconsistent. Using the violent crime rate and a subset of the data from Levitt (in PRISON, for the years 1980 through 1993, for $51 \times 14 = 714$ total observations), we obtain the pooled OLS estimate of $\alpha_1$, which is $-.181$ ($se = .048$). We also estimate (16.41) by pooled 2SLS, where the instruments for $\Delta\log(prison)$ are two binary variables, one each for whether a final decision was reached on overcrowding litigation in the current year or in the previous two years. The pooled 2SLS estimate of $\alpha_1$ is $-1.032$ ($se = .370$). Therefore, the 2SLS estimated effect is much larger; not surprisingly, it is much less precise, too. Levitt found similar results when using a longer time period (but with early observations missing for some states) and more instruments.

Testing for AR(1) serial correlation in $r_{it1} = \Delta u_{it1}$ is easy. After the pooled 2SLS estimation, obtain the residuals, $\hat{r}_{it1}$. Then, include one lag of these residuals in the original equation, and estimate the equation by 2SLS, where $\hat{r}_{i,t-1,1}$ acts as its own instrument. (The first year is lost because of the lagging.) Then, the usual 2SLS t statistic on the lagged residual is a valid test for serial correlation. In Example 16.8, the coefficient on $\hat{r}_{i,t-1,1}$ is only about $.076$, with $t = 1.67$. With such a small coefficient and modest t statistic, we can safely assume serial independence.
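A minimal sketch of the two-step recipe (first-difference, then pooled 2SLS), loosely patterned on Example 16.8. The file, the column names, and the use of a single litigation dummy (Levitt used two) are all assumptions:

```python
# First-difference a state-year panel, then estimate by pooled 2SLS.
import numpy as np
import pandas as pd

df = pd.read_csv("prison_panel.csv").sort_values(["state", "year"])
for v in ["lcrime", "lprison"]:
    df["d_" + v] = df.groupby("state")[v].diff()    # removes a_i1
df = df.dropna(subset=["d_lcrime", "d_lprison"])

# Year dummies play the role of the time intercepts xi_t in (16.41);
# other differenced controls are omitted to keep the sketch short.
D = pd.get_dummies(df["year"]).to_numpy(float)
y = df["d_lcrime"].to_numpy(float)
X = np.column_stack([df["d_lprison"].to_numpy(float), D])
Z = np.column_stack([df["litigation"].to_numpy(float), D])

PZ = Z @ np.linalg.solve(Z.T @ Z, Z.T)
beta = np.linalg.solve(X.T @ PZ @ X, X.T @ PZ @ y)
print(beta[0])    # pooled 2SLS estimate of alpha_1
```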
An alternative approach to estimating SEMs with panel data is to use the fixed effects transformation and then to apply an IV technique such as pooled 2SLS. A simple procedure is to estimate the time-demeaned equation by pooled 2SLS, which would look like

$$\ddot{y}_{it1} = \alpha_1\ddot{y}_{it2} + \ddot{\mathbf{z}}_{it1}\boldsymbol{\beta}_1 + \ddot{u}_{it1}, \quad t = 1, 2, \ldots, T, \quad (16.42)$$

where the $\ddot{\mathbf{z}}_{it1}$ and $\ddot{\mathbf{z}}_{it2}$ are IVs. This is equivalent to using 2SLS in the dummy variable formulation, where the unit-specific dummy variables act as their own instruments. Ayres and Levitt (1998) applied 2SLS to a time-demeaned equation to estimate the effect of LoJack electronic theft prevention devices on car theft rates in cities. If (16.42) is estimated directly, then the df needs to be corrected to $N(T-1) - k_1$, where $k_1$ is the total number of elements in $\alpha_1$ and $\boldsymbol{\beta}_1$. Including unit-specific dummy variables and applying pooled 2SLS to the original data produces the correct df. A detailed treatment of 2SLS with panel data is given in Wooldridge (2010, Chapter 11).
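A sketch of the time-demeaning alternative in (16.42): within-transform everything, including the instrument, then run pooled 2SLS on the demeaned data. The same hypothetical panel and column names as in the previous sketch are assumed:

```python
# Time-demean all variables by state, then pooled 2SLS on the result.
import numpy as np
import pandas as pd

df = pd.read_csv("prison_panel.csv")     # hypothetical state-year panel
cols = ["lcrime", "lprison", "litigation"]
dm = df[cols] - df.groupby("state")[cols].transform("mean")

y = dm["lcrime"].to_numpy(float)
X = dm[["lprison"]].to_numpy(float)
Z = dm[["litigation"]].to_numpy(float)

PZ = Z @ np.linalg.solve(Z.T @ Z, Z.T)
beta = np.linalg.solve(X.T @ PZ @ X, X.T @ PZ @ y)
# If estimated this way, degrees of freedom must be corrected to
# N*(T-1) - k1; equivalently, include unit dummies (acting as their own
# instruments) and apply pooled 2SLS to the original data.
print(beta[0])
```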
Summary

Simultaneous equations models are appropriate when each equation in the system has a ceteris paribus interpretation. Good examples are when separate equations describe different sides of a market or the behavioral relationships of different economic agents. Supply and demand examples are leading cases, but there are many other applications of SEMs in economics and the social sciences.

An important feature of SEMs is that, by fully specifying the system, it is clear which variables are assumed to be exogenous and which ones appear in each equation. Given a full system, we are able to determine which equations can be identified (that is, can be estimated). In the important case of a two-equation system, identification of, say, the first equation is easy to state: at least one exogenous variable must be excluded from the first equation that appears with a nonzero coefficient in the second equation.

As we know from previous chapters, OLS estimation of an equation that contains an endogenous explanatory variable generally produces biased and inconsistent estimators. Instead, 2SLS can be used to estimate any identified equation in a system. More advanced system methods are available, but they are beyond the scope of our treatment.

The distinction between omitted variables and simultaneity in applications is not always sharp. Both problems, not to mention measurement error, can appear in the same equation. A good example is the labor supply of married women. Years of education ($educ$) appears in both the labor supply and the wage offer functions [see equations (16.19) and (16.20)]. If omitted ability is in the error term of the labor supply function, then wage and education are both endogenous. The important thing is that an equation estimated by 2SLS can stand on its own.

SEMs can be applied to time series data as well. As with OLS estimation, we must be aware of trending, integrated processes in applying 2SLS. Problems such as serial correlation can be handled as in Section 15.7. We also gave an example of how to estimate an SEM using panel data, where the equation is first differenced to remove the unobserved effect. Then, we can estimate the differenced equation by pooled 2SLS, just as in Chapter 15. Alternatively, in some cases, we can use time-demeaning of all variables, including the IVs, and then apply pooled 2SLS; this is identical to putting in dummies for each cross-sectional observation and using 2SLS, where the dummies act as their own instruments. SEM applications with panel data are very powerful, as they allow us to control for unobserved heterogeneity while dealing with simultaneity. They are becoming more and more common and are not especially difficult to estimate.

Key Terms

Endogenous Variables; Exclusion Restrictions; Exogenous Variables; Identified Equation; Just Identified Equation; Lagged Endogenous Variable; Order Condition; Overidentified Equation; Predetermined Variable; Rank Condition; Reduced Form Equation; Reduced Form Error; Reduced Form Parameters; Simultaneity; Simultaneity Bias; Simultaneous Equations Model (SEM); Structural Equation; Structural Errors; Structural Parameters; Unidentified Equation

Problems

1. Write a two-equation system in "supply and demand form," that is, with the same variable $y_1$ (typically, "quantity") appearing on the left-hand side:

$$y_1 = \alpha_1 y_2 + \beta_1 z_1 + u_1$$
$$y_1 = \alpha_2 y_2 + \beta_2 z_2 + u_2.$$

(i) If $\alpha_1 = 0$ or $\alpha_2 = 0$, explain why a reduced form exists for $y_1$. (Remember, a reduced form expresses $y_1$ as a linear function of the exogenous variables and the structural errors.) If $\alpha_1 \neq 0$ and $\alpha_2 = 0$, find the reduced form for $y_2$.
(ii) If $\alpha_1 \neq 0$, $\alpha_2 \neq 0$, and $\alpha_1 \neq \alpha_2$, find the reduced form for $y_1$. Does $y_2$ have a reduced form in this case?
(iii) Is the condition $\alpha_1 \neq \alpha_2$ likely to be met in supply and demand examples? Explain.

2. Let $corn$ denote per capita consumption of corn in bushels, at the county level, let $price$ be the price per bushel of corn, let $income$ denote per capita county income, and let $rainfall$ be inches of rainfall during the last corn-growing season. The following simultaneous equations model imposes the equilibrium condition that supply equals demand:

$$corn = \alpha_1 price + \beta_1 income + u_1$$
$$corn = \alpha_2 price + \beta_2 rainfall + \gamma_2 rainfall^2 + u_2.$$

Which is the supply equation, and which is the demand equation? Explain.

3. In Problem 3 of Chapter 3, we estimated an equation to test for a tradeoff between minutes per week spent sleeping ($sleep$) and minutes per week spent working ($totwrk$) for a random sample of individuals. We also included education and age in the equation. Because $sleep$ and $totwrk$ are jointly chosen by each individual, is the estimated tradeoff between sleeping and working subject to a "simultaneity bias" criticism? Explain.

4. Suppose that annual earnings and alcohol consumption are determined by the SEM

$$\log(earnings) = \beta_0 + \beta_1 alcohol + \beta_2 educ + u_1$$
$$alcohol = \gamma_0 + \gamma_1\log(earnings) + \gamma_2 educ + \gamma_3\log(price) + u_2,$$

where $price$ is a local price index for alcohol, which includes state and local taxes. Assume that $educ$ and $price$ are exogenous. If $\beta_1$, $\beta_2$, $\gamma_1$, $\gamma_2$, and $\gamma_3$ are all different from zero, which equation is identified? How would you estimate that equation?
5. A simple model to determine the effectiveness of condom usage on reducing sexually transmitted diseases among sexually active high school students is

$$infrate = \beta_0 + \beta_1 conuse + \beta_2 percmale + \beta_3 avginc + \beta_4 city + u_1,$$

where infrate = the percentage of sexually active students who have contracted venereal disease, conuse = the percentage of boys who claim to use condoms regularly, avginc = average family income, and city = a dummy variable indicating whether a school is in a city. The model is at the school level.
(i) Interpreting the preceding equation in a causal, ceteris paribus fashion, what should be the sign of β₁?
(ii) Why might infrate and conuse be jointly determined?
(iii) If condom usage increases with the rate of venereal disease, so that γ₁ > 0 in the equation

conuse = γ₀ + γ₁infrate + other factors,

what is the likely bias in estimating β₁ by OLS?
(iv) Let condis be a binary variable equal to unity if a school has a program to distribute condoms. Explain how this can be used to estimate β₁ (and the other betas) by IV. What do we have to assume about condis in each equation?

6. Consider a linear probability model for whether employers offer a pension plan based on the percentage of workers belonging to a union, as well as other factors:

$$pension = \beta_0 + \beta_1 percunion + \beta_2 avgage + \beta_3 avgeduc + \beta_4 percmale + \beta_5 percmarr + u_1.$$

(i) Why might percunion be jointly determined with pension?
(ii) Suppose that you can survey workers at firms and collect information on workers' families. Can you think of information that can be used to construct an IV for percunion?
(iii) How would you test whether your variable is at least a reasonable IV candidate for percunion?

7. For a large university, you are asked to estimate the demand for tickets to women's basketball games. You can collect time series data over 10 seasons, for a total of about 150 observations. One possible model is

$$lATTEND_t = \beta_0 + \beta_1 lPRICE_t + \beta_2 WINPERC_t + \beta_3 RIVAL_t + \beta_4 WEEKEND_t + \beta_5 t + u_t,$$

where PRICE_t = the price of admission, probably measured in real terms (say, deflating by a regional consumer price index); WINPERC_t = the team's current winning percentage; RIVAL_t = a dummy variable indicating a game against a rival; and WEEKEND_t = a dummy variable indicating whether the game is on a weekend. The l denotes natural logarithm, so that the demand function has a constant price elasticity.
(i) Why is it a good idea to have a time trend in the equation?
(ii) The supply of tickets is fixed by the stadium capacity; assume this has not changed over the 10 years. This means that quantity supplied does not vary with price. Does this mean that price is necessarily exogenous in the demand equation? (Hint: The answer is no.)
(iii) Suppose that the nominal price of admission changes slowly, say, at the beginning of each season. The athletic office chooses price based partly on last season's average attendance, as well as last season's team success. Under what assumptions is last season's winning percentage (SEASPERC_{t−1}) a valid instrumental variable for lPRICE_t?
(iv) Does it seem reasonable to include the log of the real price of men's basketball games in the equation? Explain.
What sign does economic theory predict for its coefficient? Can you think of another variable related to men's basketball that might belong in the women's attendance equation?
(v) If you are worried that some of the series, particularly lATTEND and lPRICE, have unit roots, how might you change the estimated equation?
(vi) If some games are sold out, what problems does this cause for estimating the demand function? (Hint: If a game is sold out, do you necessarily observe the true demand?)

8. How big is the effect of per-student school expenditures on local housing values? Let HPRICE be the median housing price in a school district and let EXPEND be per-student expenditures. Using panel data for the years 1992, 1994, and 1996, we postulate the model

$$lHPRICE_{it} = \theta_t + \beta_1 lEXPEND_{it} + \beta_2 lPOLICE_{it} + \beta_3 lMEDINC_{it} + \beta_4 PROPTAX_{it} + a_{i1} + u_{it1},$$

where POLICE_it is per capita police expenditures, MEDINC_it is median income, and PROPTAX_it is the property tax rate; l denotes natural logarithm. Expenditures and housing price are simultaneously determined because the value of homes directly affects the revenues available for funding schools.

Suppose that, in 1994, the way schools were funded was drastically changed: rather than being raised by local property taxes, school funding was largely determined at the state level. Let lSTATEALL_it denote the log of the state allocation for district i in year t, which is exogenous in the preceding equation once we control for expenditures and a district fixed effect. How would you estimate the βj?

Computer Exercises

C1. Use SMOKE for this exercise.
(i) A model to estimate the effects of smoking on annual income (perhaps through lost work days due to illness, or productivity effects) is

$$\log(income) = \beta_0 + \beta_1 cigs + \beta_2 educ + \beta_3 age + \beta_4 age^2 + u_1,$$

where cigs is number of cigarettes smoked per day, on average. How do you interpret β₁?
(ii) To reflect the fact that cigarette consumption might be jointly determined with income, a demand for cigarettes equation is

$$cigs = \gamma_0 + \gamma_1 \log(income) + \gamma_2 educ + \gamma_3 age + \gamma_4 age^2 + \gamma_5 \log(cigpric) + \gamma_6 restaurn + u_2,$$

where cigpric is the price of a pack of cigarettes (in cents) and restaurn is a binary variable equal to unity if the person lives in a state with restaurant smoking restrictions. Assuming these are exogenous to the individual, what signs would you expect for γ₅ and γ₆?
(iii) Under what assumption is the income equation from part (i) identified?
(iv) Estimate the income equation by OLS and discuss the estimate of β₁.
(v) Estimate the reduced form for cigs. (Recall that this entails regressing cigs on all exogenous variables.) Are log(cigpric) and restaurn significant in the reduced form?
(vi) Now, estimate the income equation by 2SLS. Discuss how the estimate of β₁ compares with the OLS estimate.
(vii) Do you think that cigarette prices and restaurant smoking restrictions are exogenous in the income equation?

C2. Use MROZ for this exercise.
(i) Reestimate the labor supply function in Example 16.5, using log(hours) as the dependent variable. Compare the estimated elasticity (which is now constant) to the estimate obtained from equation (16.24) at the average hours worked.
(ii) In the labor supply equation from part (i), allow educ to be endogenous because of omitted ability. Use motheduc and fatheduc as IVs for educ. Remember, you now have two endogenous variables in the equation.
(iii) Test the overidentifying restrictions in the 2SLS estimation from part (ii). Do the IVs pass the test?

C3. Use the data in OPENNESS for this exercise.
(i) Because log(pcinc) is insignificant in both (16.22) and the reduced form for open, drop it from the analysis. Estimate (16.22) by OLS and IV without log(pcinc). Do any important conclusions change?
(ii) Still leaving log(pcinc) out of the analysis, is land or log(land) a better instrument for open? (Hint: Regress open on each of these separately and jointly.)
(iii) Now, return to (16.22). Add the dummy variable oil to the equation and treat it as exogenous. Estimate the equation by IV. Does being an oil producer have a ceteris paribus effect on inflation?

C4. Use the data in CONSUMP for this exercise.
(i) In Example 16.7, use the method from Section 15.5 to test the single overidentifying restriction in estimating (16.35). What do you conclude?
(ii) Campbell and Mankiw (1990) use second lags of all variables as IVs because of potential data measurement problems and informational lags. Reestimate (16.35), using only gc_{t−2}, gy_{t−2}, and r3_{t−2} as IVs. How do the estimates compare with those in (16.36)?
(iii) Regress gy_t on the IVs from part (ii) and test whether gy_t is sufficiently correlated with them. Why is this important?

C5. Use the Economic Report of the President (2005 or later) to update the data in CONSUMP, at least through 2003. Reestimate equation (16.35). Do any important conclusions change?

C6. Use the data in CEMENT for this exercise.
(i) A static (inverse) supply function for the monthly growth in cement price (gprc) as a function of growth in quantity (gcem) is

$$gprc_t = \alpha_1 gcem_t + \beta_0 + \beta_1 gprcpet_t + \beta_2 feb_t + \cdots + \beta_{12} dec_t + u_t^s,$$

where gprcpet_t (growth in the price of petroleum) is assumed to be exogenous and feb, …, dec are monthly dummy variables. What signs do you expect for α₁ and β₁? Estimate the equation by OLS. Does the supply function slope upward?
(ii) The variable gdefs_t is the monthly growth in real defense spending in the United States. What do you need to assume about gdefs for it to be a good IV for gcem? Test whether gcem is partially correlated with gdefs. (Do not worry about possible serial correlation in the reduced form.) Can you use gdefs as an IV in estimating the supply function?
(iii) Shea (1993) argues that the growth in output of residential (gres) and nonresidential (gnon) construction are valid instruments for gcem. The idea is that these are demand shifters that should be roughly uncorrelated with the supply error u_t^s. Test whether gcem is partially correlated with gres and gnon; again, do not worry about serial correlation in the reduced form.
(iv) Estimate the supply function, using gres and gnon as IVs for gcem. What do you conclude about the static supply function for cement? (The dynamic supply function is apparently upward sloping; see Shea 1993.)

C7. Refer to Example 13.9 and the data in CRIME4.
(i) Suppose that, after differencing to remove the unobserved effect, you think Δlog(polpc) is simultaneously determined with Δlog(crmrte); in particular, increases in crime are associated with increases in police officers. How does this help to explain the positive coefficient on Δlog(polpc) in equation (13.33)?
(ii) The variable taxpc is the taxes collected per person in the county. Does it seem reasonable to exclude this from the crime equation?
(iii) Estimate the reduced form for Δlog(polpc) using pooled OLS, including the potential IV, Δlog(taxpc). Does it look like Δlog(taxpc) is a good IV candidate? Explain.
(iv) Suppose that, in several of the years, the state of North Carolina awarded grants to some counties to increase the size of their county police force. How could you use this information to estimate the effect of additional police officers on the crime rate?

C8. Use the data set in FISH, which comes from Graddy (1995), to do this exercise. (The data set is also used in Computer Exercise C9 in Chapter 12.) Now, we will use it to estimate a demand function for fish.
(i) Assume that the demand equation can be written, in equilibrium for each time period, as

$$\log(totqty_t) = \alpha_1 \log(avgprc_t) + \beta_{10} + \beta_{11} mon_t + \beta_{12} tues_t + \beta_{13} wed_t + \beta_{14} thurs_t + u_{t1},$$

so that demand is allowed to differ across days of the week. Treating the price variable as endogenous, what additional information do we need to estimate the demand-equation parameters consistently?
(ii) The variables wave2_t and wave3_t are measures of ocean wave heights over the past several days. What two assumptions do we need to make in order to use wave2_t and wave3_t as IVs for log(avgprc_t) in estimating the demand equation?
(iii) Regress log(avgprc_t) on the day-of-the-week dummies and the two wave measures. Are wave2_t and wave3_t jointly significant? What is the p-value of the test?
(iv) Now, estimate the demand equation by 2SLS. What is the 95% confidence interval for the price elasticity of demand? Is the estimated elasticity reasonable?
(v) Obtain the 2SLS residuals, û_{t1}. Add a single lag, û_{t−1,1}, in estimating the demand equation by 2SLS. (Remember, use û_{t−1,1} as its own instrument.) Is there evidence of AR(1) serial correlation in the demand equation errors?
(vi) Given that the supply equation evidently depends on the wave variables, what two assumptions would we need to make in order to estimate the price elasticity of supply?
(vii) In the reduced form equation for log(avgprc_t), are the day-of-the-week dummies jointly significant? What do you conclude about being able to estimate the supply elasticity?

C9. For this exercise, use the data in AIRFARE, but only for the year 1997.
(i) A simple demand function for airline seats on routes in the United States is

$$\log(passen) = \beta_{10} + \alpha_1 \log(fare) + \beta_{11} \log(dist) + \beta_{12} [\log(dist)]^2 + u_1,$$

where passen = average passengers per day, fare = average airfare, and dist = the route distance (in miles). If this is truly a demand function, what should be the sign of α₁?
(ii) Estimate the equation from part (i) by OLS. What is the estimated price elasticity?
(iii) Consider the variable concen, which is a measure of market concentration; specifically, it is the share of business accounted for by the largest carrier. Explain in words what we must assume to treat concen as exogenous in the demand equation.
(iv) Now assume concen is exogenous to the demand equation. Estimate the reduced form for log(fare) and confirm that concen has a positive (partial) effect on log(fare).
(v) Estimate the demand function using IV. Now what is the estimated price elasticity of demand? How does it compare with the OLS estimate?
(vi) Using the IV estimates, describe how demand for seats depends on route distance.

C10. Use the entire panel data set in AIRFARE for this exercise. The demand equation in a simultaneous equations unobserved effects model is

$$\log(passen_{it}) = \theta_{t1} + \alpha_1 \log(fare_{it}) + a_{i1} + u_{it1},$$

where we absorb the distance variables into a_{i1}.
(i) Estimate the demand function using fixed effects, being sure to include year dummies to account for the different intercepts. What is the estimated elasticity?
(ii) Use fixed effects to estimate the reduced form

$$\log(fare_{it}) = \theta_{t2} + \pi_{21} concen_{it} + a_{i2} + v_{it2}.$$

Perform the appropriate test to ensure that concen_it can be used as an IV for log(fare_it).
(iii) Now, estimate the demand function using the fixed effects transformation along with IV, as in equation (16.42). What is the estimated elasticity? Is it statistically significant?

C11. A common method for estimating Engel curves is to model expenditure shares as a function of total expenditure, and possibly demographic variables. A common specification has the form

$$sgood = \beta_0 + \beta_1 ltotexpend + \text{demographics} + u,$$

where sgood is the fraction of spending on a particular good out of total expenditure and ltotexpend is the log of total expenditure. The sign and magnitude of β₁ are of interest across various expenditure categories. To account for the potential endogeneity of ltotexpend (which can be viewed as an omitted variables or simultaneous equations problem, or both), the log of family income is often used as an instrumental variable. Let lincome denote the log of family income. For the remainder of this question, use the data in EXPENDSHARES, which comes from Blundell, Duncan, and Pendakur (1998).
(i) Use sfood, the share of spending on food, as the dependent variable. What is the range of values of sfood? Are you surprised there are no zeros?
(ii) Estimate the equation

$$sfood = \beta_0 + \beta_1 ltotexpend + \beta_2 age + \beta_3 kids + u \tag{16.43}$$

by OLS and report the coefficient on ltotexpend, β̂₁,OLS, along with its heteroskedasticity-robust standard error. Interpret the result.
(iii) Using lincome as an IV for ltotexpend, estimate the reduced form equation for ltotexpend. (Be sure to include age and kids.) Assuming lincome is exogenous in (16.43), is lincome a valid IV for ltotexpend?
(iv) Now estimate (16.43) by instrumental variables. How does β̂₁,IV compare with β̂₁,OLS? What about the robust 95% confidence intervals?
(v) Use the test in Section 15.5 to test the null hypothesis that ltotexpend is exogenous in (16.43). Be sure to report and interpret the p-value. Are there any overidentifying restrictions to test?
(vi) Substitute salcohol for sfood in (16.43) and estimate the equation by OLS and 2SLS. Now what do you find for the coefficients on ltotexpend?
Chapter 17. Limited Dependent Variable Models and Sample Selection Corrections

In Chapter 7, we studied the linear probability model, which is simply an application of the multiple regression model to a binary dependent variable. A binary dependent variable is an example of a limited dependent variable (LDV). An LDV is broadly defined as a dependent variable whose range of values is substantively restricted. A binary variable takes on only two values, zero and one. In Section 7.7, we discussed the interpretation of multiple regression estimates for generally discrete response variables, focusing on the case where y takes on a small number of integer values, for example, the number of times a young man is arrested during a year or the number of children born to a woman. Elsewhere, we have encountered several other limited dependent variables, including the percentage of people participating in a pension plan (which must be between zero and 100) and college grade point average (which is between zero and 4.0 at most colleges).

Most economic variables we would like to explain are limited in some way, often because they must be positive. For example, hourly wage, housing price, and nominal interest rates must be greater than zero. But not all such variables need special treatment. If a strictly positive variable takes on many different values, a special econometric model is rarely necessary. When y is discrete and takes on a small number of values, it makes no sense to treat it as an approximately continuous variable. Discreteness of y does not in itself mean that linear models are inappropriate. However, as we saw in Chapter 7 for binary response, the linear probability model has certain drawbacks. In Section 17.1, we discuss logit and probit models, which overcome the shortcomings of the LPM; the disadvantage is that they are more difficult to interpret.

Other kinds of limited dependent variables arise in econometric analysis, especially when the behavior of individuals, families, or firms is being modeled. Optimizing behavior often leads to a corner solution response for some nontrivial fraction of the population. That is, it is optimal to choose a zero quantity or dollar value, for example. During any given year, a significant number of families will make zero charitable contributions. Therefore, annual family charitable contributions has a population distribution that is spread out over a large range of positive values, but with a pileup at the value zero. Although a linear model could be appropriate for capturing the expected value of charitable contributions, a linear model will likely lead to negative predictions for some families. Taking the natural log is not possible because many observations are zero. The Tobit model, which we cover in Section 17.2, is explicitly designed to model corner solution dependent variables.
Another important kind of LDV is a count variable, which takes on nonnegative integer values. Section 17.3 illustrates how Poisson regression models are well suited for modeling count variables. In some cases, we encounter limited dependent variables due to data censoring, a topic we introduce in Section 17.4. The general problem of sample selection, where we observe a nonrandom sample from the underlying population, is treated in Section 17.5.

Limited dependent variable models can be used for time series and panel data, but they are most often applied to cross-sectional data. Sample selection problems are usually confined to cross-sectional or panel data. We focus on cross-sectional applications in this chapter. Wooldridge (2010) analyzes these problems in the context of panel data models and provides many more details for cross-sectional and panel data applications.

17.1 Logit and Probit Models for Binary Response

The linear probability model is simple to estimate and use, but it has some drawbacks that we discussed in Section 7.5. The two most important disadvantages are that the fitted probabilities can be less than zero or greater than one and that the partial effect of any explanatory variable (appearing in level form) is constant. These limitations of the LPM can be overcome by using more sophisticated binary response models.

In a binary response model, interest lies primarily in the response probability

$$P(y = 1|\mathbf{x}) = P(y = 1|x_1, x_2, \ldots, x_k), \tag{17.1}$$

where we use x to denote the full set of explanatory variables. For example, when y is an employment indicator, x might contain various individual characteristics such as education, age, marital status, and other factors that affect employment status, including a binary indicator variable for participation in a recent job training program.

17.1a Specifying Logit and Probit Models

In the LPM, we assume that the response probability is linear in a set of parameters, βj; see equation (7.27). To avoid the LPM limitations, consider a class of binary response models of the form

$$P(y = 1|\mathbf{x}) = G(\beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k) = G(\beta_0 + \mathbf{x}\boldsymbol{\beta}), \tag{17.2}$$

where G is a function taking on values strictly between zero and one: 0 < G(z) < 1, for all real numbers z. This ensures that the estimated response probabilities are strictly between zero and one. As in earlier chapters, we write xβ = β₁x₁ + ⋯ + βₖxₖ.

Various nonlinear functions have been suggested for the function G to make sure that the probabilities are between zero and one. The two we will cover here are used in the vast majority of applications (along with the LPM). In the logit model, G is the logistic function:

$$G(z) = \frac{\exp(z)}{1 + \exp(z)} = \Lambda(z), \tag{17.3}$$

which is between zero and one for all real numbers z. This is the cumulative distribution function (cdf) for a standard logistic random variable.
In the probit model, G is the standard normal cdf, which is expressed as an integral:

$$G(z) = \Phi(z) \equiv \int_{-\infty}^{z} \phi(v)\,dv, \tag{17.4}$$

where φ(z) is the standard normal density:

$$\phi(z) = (2\pi)^{-1/2}\exp(-z^2/2). \tag{17.5}$$

This choice of G again ensures that (17.2) is strictly between zero and one for all values of the parameters and the xj.

The G functions in (17.3) and (17.4) are both increasing functions. Each increases most quickly at z = 0, G(z) → 0 as z → −∞, and G(z) → 1 as z → ∞. The logistic function is plotted in Figure 17.1. The standard normal cdf has a shape very similar to that of the logistic cdf.

[Figure 17.1: Graph of the logistic function G(z) = exp(z)/[1 + exp(z)], increasing from near zero for z below −3 through .5 at z = 0 toward one for large z.]
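To make the two response functions concrete, the following minimal sketch evaluates the logit and probit G(z) at a few points with scipy; both are strictly between zero and one and equal .5 at z = 0:

```python
# Evaluating the logit and probit G(z) from equations (17.3) and (17.4)
import numpy as np
from scipy.special import expit   # numerically stable exp(z)/[1 + exp(z)]
from scipy.stats import norm

z = np.array([-3.0, -1.0, 0.0, 1.0, 3.0])
print(expit(z))     # logistic cdf: [0.047 0.269 0.5 0.731 0.953]
print(norm.cdf(z))  # standard normal cdf: [0.001 0.159 0.5 0.841 0.999]
```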
Logit and probit models can be derived from an underlying latent variable model. Let y* be an unobserved, or latent, variable, and suppose that

$$y^* = \beta_0 + \mathbf{x}\boldsymbol{\beta} + e, \quad y = 1[y^* > 0], \tag{17.6}$$

where we introduce the notation 1[·] to define a binary outcome. The function 1[·] is called the indicator function, which takes on the value one if the event in brackets is true, and zero otherwise. Therefore, y is one if y* > 0, and y is zero if y* ≤ 0. We assume that e is independent of x and that e either has the standard logistic distribution or the standard normal distribution. In either case, e is symmetrically distributed about zero, which means that 1 − G(−z) = G(z) for all real numbers z. Economists tend to favor the normality assumption for e, which is why the probit model is more popular than logit in econometrics. In addition, several specification problems, which we touch on later, are most easily analyzed using probit because of properties of the normal distribution.

From (17.6) and the assumptions given, we can derive the response probability for y:

$$P(y = 1|\mathbf{x}) = P(y^* > 0|\mathbf{x}) = P[e > -(\beta_0 + \mathbf{x}\boldsymbol{\beta})|\mathbf{x}] = 1 - G[-(\beta_0 + \mathbf{x}\boldsymbol{\beta})] = G(\beta_0 + \mathbf{x}\boldsymbol{\beta}),$$

which is exactly the same as (17.2).

In most applications of binary response models, the primary goal is to explain the effects of the xj on the response probability P(y = 1|x). The latent variable formulation tends to give the impression that we are primarily interested in the effects of each xj on y*. As we will see, for logit and probit, the direction of the effect of xj on E(y*|x) = β₀ + xβ and on E(y|x) = P(y = 1|x) = G(β₀ + xβ) is always the same. But the latent variable y* rarely has a well-defined unit of measurement. (For example, y* might be the difference in utility levels from two different actions.) Thus, the magnitudes of each βj are not, by themselves, especially useful (in contrast to the linear probability model).

For most purposes, we want to estimate the effect of xj on the probability of success, P(y = 1|x), but this is complicated by the nonlinear nature of G(·). To find the partial effect of roughly continuous variables on the response probability, we must rely on calculus. If xj is a roughly continuous variable, its partial effect on p(x) = P(y = 1|x) is obtained from the partial derivative:

$$\frac{\partial p(\mathbf{x})}{\partial x_j} = g(\beta_0 + \mathbf{x}\boldsymbol{\beta})\beta_j, \quad \text{where } g(z) \equiv \frac{dG}{dz}(z). \tag{17.7}$$

Because G is the cdf of a continuous random variable, g is a probability density function (pdf). In the logit and probit cases, G(·) is a strictly increasing cdf, and so g(z) > 0 for all z. Therefore, the partial effect of xj on p(x) depends on x through the positive quantity g(β₀ + xβ), which means that the partial effect always has the same sign as βj.

Equation (17.7) shows that the relative effects of any two continuous explanatory variables do not depend on x: the ratio of the partial effects for xj and xh is βj/βh. In the typical case that g is a symmetric density about zero, with a unique mode at zero, the largest effect occurs when β₀ + xβ = 0. For example, in the probit case with g(z) = φ(z), g(0) = φ(0) = 1/√(2π) ≈ .40. In the logit case, g(z) = exp(z)/[1 + exp(z)]², and so g(0) = .25.

If, say, x₁ is a binary explanatory variable, then the partial effect from changing x₁ from zero to one, holding all other variables fixed, is simply

$$G(\beta_0 + \beta_1 + \beta_2 x_2 + \cdots + \beta_k x_k) - G(\beta_0 + \beta_2 x_2 + \cdots + \beta_k x_k). \tag{17.8}$$

Again, this depends on all the values of the other xj. For example, if y is an employment indicator and x₁ is a dummy variable indicating participation in a job training program, then (17.8) is the change in the probability of employment due to the job training program; this depends on other characteristics that affect employability, such as education and experience. Note that knowing the sign of β₁ is sufficient for determining whether the program had a positive or negative effect. But to find the magnitude of the effect, we have to estimate the quantity in (17.8).

We can also use the difference in (17.8) for other kinds of discrete variables (such as number of children). If xₖ denotes this variable, then the effect on the probability of xₖ going from cₖ to cₖ + 1 is simply

$$G[\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k(c_k + 1)] - G(\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k c_k). \tag{17.9}$$

It is straightforward to include standard functional forms among the explanatory variables. For example, in the model

$$P(y = 1|\mathbf{z}) = G(\beta_0 + \beta_1 z_1 + \beta_2 z_1^2 + \beta_3 \log(z_2) + \beta_4 z_3),$$

the partial effect of z₁ on P(y = 1|z) is ∂P(y = 1|z)/∂z₁ = g(β₀ + xβ)(β₁ + 2β₂z₁), and the partial effect of z₂ on the response probability is ∂P(y = 1|z)/∂z₂ = g(β₀ + xβ)(β₃/z₂), where xβ = β₁z₁ + β₂z₁² + β₃log(z₂) + β₄z₃. Therefore, g(β₀ + xβ)(β₃/100) is the approximate change in the response probability when z₂ increases by 1%.

Sometimes we want to compute the elasticity of the response probability with respect to an explanatory variable, although we must be careful in interpreting percentage changes in probabilities. For example, a change in a probability from .04 to .06 represents a 2-percentage-point increase in the probability, but a 50% increase relative to the initial value. Using calculus, in the preceding model the elasticity of P(y = 1|z) with respect to z₂ can be shown to be β₃[g(β₀ + xβ)/G(β₀ + xβ)]. The elasticity with respect to z₃ is (β₄z₃)[g(β₀ + xβ)/G(β₀ + xβ)]. In the first case, the elasticity is always the same sign as β₃, but it generally depends on all parameters and all values of the explanatory variables. If z₃ > 0, the second elasticity always has the same sign as the parameter β₄.

Models with interactions among the explanatory variables can be a bit tricky, but one should compute the partial derivatives and then evaluate the resulting partial effects at interesting values. When measuring the effects of discrete variables, no matter how complicated the model, we should use (17.9). We discuss this further in Section 17.1d on interpreting the estimates.
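As an illustration of equation (17.7), the short sketch below computes the partial effect of one regressor at a chosen point for both models; the coefficient values and the evaluation point are made-up numbers, used only to show the mechanics:

```python
# Partial effect g(b0 + x*b)*b_j at a chosen point, as in (17.7);
# the betas and x below are hypothetical illustration values
import numpy as np
from scipy.stats import norm

beta = np.array([0.5, 0.8, -0.3])   # b0, b1, b2 (made up)
x = np.array([1.0, 0.2, 1.5])       # 1 for the intercept, then x1, x2
index = x @ beta                    # b0 + x*beta

pe_probit = norm.pdf(index) * beta[1]               # g = phi for probit
g_logit = np.exp(index) / (1 + np.exp(index))**2    # logistic density
pe_logit = g_logit * beta[1]
print(pe_probit, pe_logit)   # both carry the sign of b1
```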
17.1b Maximum Likelihood Estimation of Logit and Probit Models

How should we estimate nonlinear binary response models? To estimate the LPM, we can use ordinary least squares (see Section 7.5) or, in some cases, weighted least squares (see Section 8.5). Because of the nonlinear nature of E(y|x), OLS and WLS are not applicable. We could use nonlinear versions of these methods, but it is no more difficult to use maximum likelihood estimation (MLE); see Appendix 17A for a brief discussion. Up until now, we have had little need for MLE, although we did note that, under the classical linear model assumptions, the OLS estimator is the maximum likelihood estimator (conditional on the explanatory variables). For estimating limited dependent variable models, maximum likelihood methods are indispensable. Because MLE is based on the distribution of y given x, the heteroskedasticity in Var(y|x) is automatically accounted for.

Assume that we have a random sample of size n. To obtain the maximum likelihood estimator, conditional on the explanatory variables, we need the density of y given xᵢ. We can write this as

$$f(y|\mathbf{x}_i; \boldsymbol{\beta}) = [G(\mathbf{x}_i\boldsymbol{\beta})]^y[1 - G(\mathbf{x}_i\boldsymbol{\beta})]^{1-y}, \quad y = 0, 1, \tag{17.10}$$

where, for simplicity, we absorb the intercept into the vector xᵢ. We can easily see that when y = 1, we get G(xᵢβ), and when y = 0, we get 1 − G(xᵢβ). The log-likelihood function for observation i is a function of the parameters and the data (xᵢ, yᵢ) and is obtained by taking the log of (17.10):

$$\ell_i(\boldsymbol{\beta}) = y_i \log[G(\mathbf{x}_i\boldsymbol{\beta})] + (1 - y_i)\log[1 - G(\mathbf{x}_i\boldsymbol{\beta})]. \tag{17.11}$$

Because G(·) is strictly between zero and one for logit and probit, ℓᵢ(β) is well defined for all values of β.

The log-likelihood for a sample size of n is obtained by summing (17.11) across all observations: $\mathcal{L}(\boldsymbol{\beta}) = \sum_{i=1}^n \ell_i(\boldsymbol{\beta})$. The MLE of β, denoted by β̂, maximizes this log-likelihood. If G(·) is the standard logit cdf, then β̂ is the logit estimator; if G(·) is the standard normal cdf, then β̂ is the probit estimator.

Because of the nonlinear nature of the maximization problem, we cannot write formulas for the logit or probit maximum likelihood estimates. In addition to raising computational issues, this makes the statistical theory for logit and probit much more difficult than OLS or even 2SLS. Nevertheless, the general theory of MLE for random samples implies that, under very general conditions, the MLE is consistent, asymptotically normal, and asymptotically efficient. [See Wooldridge (2010, Chapter 13) for a general discussion.] We will just use the results here; applying logit and probit models is fairly easy, provided we understand what the statistics mean.

Each β̂ⱼ comes with an asymptotic standard error, the formula for which is complicated and presented in the chapter appendix. Once we have the standard errors (and these are reported along with the coefficient estimates by any package that supports logit and probit), we can construct asymptotic t tests and confidence intervals, just as with OLS, 2SLS, and the other estimators we have encountered. In particular, to test H₀: βⱼ = 0, we form the t statistic β̂ⱼ/se(β̂ⱼ) and carry out the test in the usual way, once we have decided on a one- or two-sided alternative.
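In practice, the maximization is handled by software. The sketch below writes out the probit sample log-likelihood built from (17.11) and checks it against the value reported by statsmodels; the arrays y and X are assumed to hold the binary outcome and the regressors (including a constant column):

```python
# Probit log-likelihood from (17.11) and the MLE via statsmodels.
# Assumes y is a 0/1 NumPy array and X includes a constant column.
import numpy as np
import statsmodels.api as sm
from scipy.stats import norm

def probit_loglik(beta, y, X):
    G = norm.cdf(X @ beta)          # response probabilities P(y=1|x)
    return np.sum(y * np.log(G) + (1 - y) * np.log(1 - G))

res = sm.Probit(y, X).fit()         # numerical maximization of the log-likelihood
print(res.llf)                      # log-likelihood at the MLE
print(probit_loglik(res.params, y, X))  # same value, computed by hand
print(res.bse)                      # asymptotic standard errors for t tests
```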
17.1c Testing Multiple Hypotheses

We can also test multiple restrictions in logit and probit models. In most cases, these are tests of multiple exclusion restrictions, as in Section 4.5. We will focus on exclusion restrictions here.

There are three ways to test exclusion restrictions for logit and probit models. The Lagrange multiplier or score test only requires estimating the model under the null hypothesis, just as in the linear case in Section 5.2. We will not cover the score test here, since it is rarely needed to test exclusion restrictions. [See Wooldridge (2010, Chapter 15) for other uses of the score test in binary response models.]

The Wald test requires estimation of only the unrestricted model. In the linear model case, the Wald statistic, after a simple transformation, is essentially the F statistic, so there is no need to cover the Wald statistic separately. [The formula for the Wald statistic is given in Wooldridge (2010, Chapter 15).] This statistic is computed by econometrics packages that allow exclusion restrictions to be tested after the unrestricted model has been estimated. It has an asymptotic chi-square distribution, with df equal to the number of restrictions being tested.

If both the restricted and unrestricted models are easy to estimate (as is usually the case with exclusion restrictions), then the likelihood ratio (LR) test becomes very attractive. The LR test is based on the same concept as the F test in a linear model. The F test measures the increase in the sum of squared residuals when variables are dropped from the model. The LR test is based on the difference in the log-likelihood functions for the unrestricted and restricted models. The idea is this. Because the MLE maximizes the log-likelihood function, dropping variables generally leads to a smaller (or at least no larger) log-likelihood. (This is similar to the fact that the R-squared never increases when variables are dropped from a regression.) The question is whether the fall in the log-likelihood is large enough to conclude that the dropped variables are important. We can make this decision once we have a test statistic and a set of critical values. The likelihood ratio statistic is twice the difference in the log-likelihoods:

$$LR = 2(\mathcal{L}_{ur} - \mathcal{L}_r), \tag{17.12}$$
where ℒ_ur is the log-likelihood value for the unrestricted model and ℒ_r is the log-likelihood value for the restricted model. Because ℒ_ur ≥ ℒ_r, LR is nonnegative and usually strictly positive. In computing the LR statistic for binary response models, it is important to know that the log-likelihood function is always a negative number. This fact follows from equation (17.11), because yᵢ is either zero or one and both variables inside the log function are strictly between zero and one, which means their natural logs are negative. That the log-likelihood functions are both negative does not change the way we compute the LR statistic; we simply preserve the negative signs in equation (17.12).

The multiplication by two in (17.12) is needed so that LR has an approximate chi-square distribution under H₀. If we are testing q exclusion restrictions, LR ~ᵃ χ²_q. This means that, to test H₀ at the 5% level, we use as our critical value the 95th percentile in the χ²_q distribution. Computing p-values is easy with most software packages.

Exploring Further 17.1. A probit model to explain whether a firm is taken over by another firm during a given year is

P(takeover = 1|x) = Φ(β₀ + β₁avgprof + β₂mktval + β₃debtearn + β₄ceoten + β₅ceosal + β₆ceoage),

where takeover is a binary response variable; avgprof is the firm's average profit margin over several prior years; mktval is market value of the firm; debtearn is the debt-to-earnings ratio; and ceoten, ceosal, and ceoage are the tenure, annual salary, and age of the chief executive officer, respectively. State the null hypothesis that, other factors being equal, variables related to the CEO have no effect on the probability of takeover. How many df are in the chi-square distribution for the LR or Wald test?
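Computing the LR statistic by hand requires only the two maximized log-likelihoods. A sketch, assuming y, X (the unrestricted regressors), and X_r (with q columns dropped) are already defined:

```python
# LR test of q exclusion restrictions, as in (17.12)
import statsmodels.api as sm
from scipy.stats import chi2

res_ur = sm.Logit(y, X).fit()     # unrestricted model
res_r = sm.Logit(y, X_r).fit()    # restricted model (q variables dropped)

q = X.shape[1] - X_r.shape[1]     # number of exclusion restrictions
LR = 2 * (res_ur.llf - res_r.llf) # nonnegative, since llf_ur >= llf_r
pval = chi2.sf(LR, df=q)          # asymptotic chi-square p-value
print(LR, pval)
```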
17.1d Interpreting the Logit and Probit Estimates

Given modern computers, from a practical perspective, the most difficult aspect of logit or probit models is presenting and interpreting the results. The coefficient estimates, their standard errors, and the value of the log-likelihood function are reported by all software packages that do logit and probit, and these should be reported in any application. The coefficients give the signs of the partial effects of each xj on the response probability, and the statistical significance of xj is determined by whether we can reject H₀: βⱼ = 0 at a sufficiently small significance level.

As we briefly discussed in Section 7.5 for the linear probability model, we can compute a goodness-of-fit measure called the percent correctly predicted. As before, we define a binary predictor of yᵢ to be one if the predicted probability is at least .5, and zero otherwise. Mathematically, ỹᵢ = 1 if G(β̂₀ + xᵢβ̂) ≥ .5 and ỹᵢ = 0 if G(β̂₀ + xᵢβ̂) < .5. Given {ỹᵢ: i = 1, 2, …, n}, we can see how well ỹᵢ predicts yᵢ across all observations. There are four possible outcomes on each pair (yᵢ, ỹᵢ); when both are zero or both are one, we make the correct prediction. In the two cases where one of the pair is zero and the other is one, we make the incorrect prediction. The percentage correctly predicted is the percentage of times that ỹᵢ = yᵢ.

Although the percentage correctly predicted is useful as a goodness-of-fit measure, it can be misleading. In particular, it is possible to get rather high percentages correctly predicted even when the least likely outcome is very poorly predicted. For example, suppose that n = 200, 160 observations have yᵢ = 0, and, out of these 160 observations, 140 of the ỹᵢ are also zero (so we correctly predict 87.5% of the zero outcomes). Even if none of the predictions is correct when yᵢ = 1, we still correctly predict 70% of all outcomes (140/200 = .70). Often, we hope to have some ability to predict the least likely outcome (such as whether someone is arrested for committing a crime), and so we should be up front about how well we do in predicting each outcome. Therefore, it makes sense to also compute the percentage correctly predicted for each of the outcomes. Problem 1 asks you to show that the overall percentage correctly predicted is a weighted average of q̂₀ (the percentage correctly predicted for yᵢ = 0) and q̂₁ (the percentage correctly predicted for yᵢ = 1), where the weights are the fractions of zeros and ones in the sample, respectively.

Some have criticized the prediction rule just described for using a threshold value of .5, especially when one of the outcomes is unlikely. For example, if ȳ = .08 (only 8% "successes" in the sample), it could be that we never predict yᵢ = 1 because the estimated probability of success is never greater than .5. One alternative is to use the fraction of successes in the sample as the threshold (.08 in the previous example). In other words, define ỹᵢ = 1 when G(β̂₀ + xᵢβ̂) ≥ .08 and zero otherwise. Using this rule will certainly increase the number of predicted successes, but not without cost: we will necessarily make more mistakes (perhaps many more) in predicting zeros ("failures"). In terms of the overall percentage correctly predicted, we may do worse than using the .5 threshold.

A third possibility is to choose the threshold such that the fraction of ỹᵢ = 1 in the sample is the same as (or very close to) ȳ. In other words, search over threshold values τ, 0 < τ < 1, such that, if we define ỹᵢ = 1 when G(β̂₀ + xᵢβ̂) ≥ τ, then $\sum_{i=1}^n \tilde{y}_i \approx \sum_{i=1}^n y_i$. (The trial and error required to find the desired value of τ can be tedious, but it is feasible. In some cases, it will not be possible to make the number of predicted successes exactly the same as the number of successes in the sample.) Now, given this set of ỹᵢ, we can compute the percentage correctly predicted for each of the two outcomes as well as the overall percentage correctly predicted.
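These calculations are easy to code directly. The sketch below reports the overall percentage correctly predicted and the percentage for each outcome, for any threshold (the .5 rule by default, or the sample fraction of successes):

```python
# Percent correctly predicted, overall and by outcome, for threshold tau.
# yhat_prob holds fitted probabilities G(b0 + x_i*b); y is a 0/1 array.
import numpy as np

def percent_correct(y, yhat_prob, tau=0.5):
    ytilde = (yhat_prob >= tau).astype(int)   # binary predictor
    overall = np.mean(ytilde == y)
    q0 = np.mean(ytilde[y == 0] == 0)         # correct among y = 0
    q1 = np.mean(ytilde[y == 1] == 1)         # correct among y = 1
    return overall, q0, q1

print(percent_correct(y, yhat_prob))                # .5 threshold
print(percent_correct(y, yhat_prob, tau=y.mean()))  # ybar threshold
```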
There are also various pseudo R-squared measures for binary response. McFadden (1974) suggests the measure 1 − ℒ_ur/ℒ_o, where ℒ_ur is the log-likelihood function for the estimated model and ℒ_o is the log-likelihood function in the model with only an intercept. Why does this measure make sense? Recall that the log-likelihoods are negative, and so ℒ_ur/ℒ_o = |ℒ_ur|/|ℒ_o|. Further, |ℒ_ur| ≤ |ℒ_o|. If the covariates have no explanatory power, then ℒ_ur/ℒ_o = 1, and the pseudo R-squared is zero, just as the usual R-squared is zero in a linear regression when the covariates have no explanatory power. Usually, |ℒ_ur| < |ℒ_o|, in which case 1 − ℒ_ur/ℒ_o > 0. If ℒ_ur were zero, the pseudo R-squared would equal unity. (In fact, ℒ_ur cannot reach zero in a probit or logit model, as that would require the estimated probabilities when yᵢ = 1 all to be unity and the estimated probabilities when yᵢ = 0 all to be zero.)

Alternative pseudo R-squareds for probit and logit are more directly related to the usual R-squared from OLS estimation of a linear probability model. For either probit or logit, let ŷᵢ = G(β̂₀ + xᵢβ̂) be the fitted probabilities. Since these probabilities are also estimates of E(yᵢ|xᵢ), we can base an R-squared on how close the ŷᵢ are to the yᵢ. One possibility that suggests itself from standard regression analysis is to compute the squared correlation between yᵢ and ŷᵢ. (Remember, in a linear regression framework, this is an algebraically equivalent way to obtain the usual R-squared; see equation (3.29).) Therefore, we can compute a pseudo R-squared for probit and logit that is directly comparable to the usual R-squared from estimation of a linear probability model.
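Both pseudo R-squareds are one-liners once a model has been fit; a sketch using a statsmodels result object res (for either logit or probit) and the outcome array y:

```python
# Two pseudo R-squareds for a fitted binary response model `res`
import numpy as np

mcfadden = 1 - res.llf / res.llnull       # McFadden (1974): 1 - l_ur/l_o
yhat_prob = res.predict()                 # fitted probabilities G(b0 + x*b)
corr_sq = np.corrcoef(y, yhat_prob)[0, 1] ** 2  # squared-correlation measure
print(mcfadden, corr_sq)
```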
In any case, goodness-of-fit is usually less important than trying to obtain convincing estimates of the ceteris paribus effects of the explanatory variables.

Often, we want to estimate the effects of the xj on the response probabilities, P(y = 1|x). If xj is (roughly) continuous, then

$$\Delta \hat{P}(y = 1|\mathbf{x}) \approx [g(\hat\beta_0 + \mathbf{x}\hat{\boldsymbol\beta})\hat\beta_j]\Delta x_j \tag{17.13}$$

for "small" changes in xj. So, for Δxj = 1, the change in the estimated success probability is roughly g(β̂₀ + xβ̂)β̂ⱼ. Compared with the linear probability model, the cost of using probit and logit models is that the partial effects in equation (17.13) are harder to summarize because the scale factor, g(β̂₀ + xβ̂), depends on x (that is, on all of the explanatory variables). One possibility is to plug in interesting values for the xj, such as means, medians, minimums, maximums, and lower and upper quartiles, and then see how g(β̂₀ + xβ̂) changes. Although attractive, this can be tedious and result in too much information, even if the number of explanatory variables is moderate.

As a quick summary for getting at the magnitudes of the partial effects, it is handy to have a single scale factor that can be used to multiply each β̂ⱼ (or at least those coefficients on roughly continuous variables). One method commonly used in econometrics packages that routinely estimate probit and logit models is to replace each explanatory variable with its sample average. In other words, the adjustment factor is

$$g(\hat\beta_0 + \bar{\mathbf{x}}\hat{\boldsymbol\beta}) = g(\hat\beta_0 + \hat\beta_1\bar{x}_1 + \hat\beta_2\bar{x}_2 + \cdots + \hat\beta_k\bar{x}_k), \tag{17.14}$$

where g(·) is the standard normal density in the probit case and g(z) = exp(z)/[1 + exp(z)]² in the logit case. The idea behind (17.14) is that, when it is multiplied by β̂ⱼ, we obtain the partial effect of xj for the "average" person in the sample. Thus, if we multiply a coefficient by (17.14), we generally obtain the partial effect at the average (PEA).

There are at least two potential problems with using PEAs to summarize the partial effects of the explanatory variables. First, if some of the explanatory variables are discrete, the averages of them represent no one in the sample (or population, for that matter). For example, if x₁ = female and 47.5% of the sample is female, what sense does it make to plug in x̄₁ = .475 to represent the "average" person? Second, if a continuous explanatory variable appears as a nonlinear function (say, as a natural log or in a quadratic), it is not clear whether we want to average the nonlinear function or plug the average into the nonlinear function. For example, should we use the average of log(sales) or the log of average sales to represent "average" firm size? Econometrics packages that compute the scale factor in (17.14) default to the former: the software is written to compute the averages of the regressors included in the probit or logit estimation.

A different approach to computing a scale factor circumvents the issue of which values to plug in for the explanatory variables. Instead, the second scale factor results from averaging the individual partial effects across the sample, leading to what is called the average partial effect (APE) or, sometimes, the average marginal effect (AME). For a continuous explanatory variable xj, the average partial effect is $n^{-1}\sum_{i=1}^n [g(\hat\beta_0 + \mathbf{x}_i\hat{\boldsymbol\beta})\hat\beta_j] = [n^{-1}\sum_{i=1}^n g(\hat\beta_0 + \mathbf{x}_i\hat{\boldsymbol\beta})]\hat\beta_j$. The term multiplying β̂ⱼ acts as a scale factor:

$$n^{-1}\sum_{i=1}^n g(\hat\beta_0 + \mathbf{x}_i\hat{\boldsymbol\beta}). \tag{17.15}$$

Equation (17.15) is easily computed after probit or logit estimation, where g(β̂₀ + xᵢβ̂) = φ(β̂₀ + xᵢβ̂) in the probit case and g(β̂₀ + xᵢβ̂) = exp(β̂₀ + xᵢβ̂)/[1 + exp(β̂₀ + xᵢβ̂)]² in the logit case. The two scale factors differ (and are possibly quite different) because in (17.15) we are using the average of the nonlinear function rather than the nonlinear function of the average, as in (17.14).

Because both of the scale factors just described depend on the calculus approximation in (17.13), neither makes much sense for discrete explanatory variables. Instead, it is better to use equation (17.9) to directly estimate the change in the probability. For a change in xₖ from cₖ to cₖ + 1, the discrete analog of the partial effect based on (17.14) is

$$G[\hat\beta_0 + \hat\beta_1\bar x_1 + \cdots + \hat\beta_{k-1}\bar x_{k-1} + \hat\beta_k(c_k + 1)] - G(\hat\beta_0 + \hat\beta_1\bar x_1 + \cdots + \hat\beta_{k-1}\bar x_{k-1} + \hat\beta_k c_k), \tag{17.16}$$

where G is the standard normal cdf in the probit case and G(z) = exp(z)/[1 + exp(z)] in the logit case. The average partial effect, which usually is more comparable to LPM estimates, is

$$n^{-1}\sum_{i=1}^n \{G[\hat\beta_0 + \hat\beta_1 x_{i1} + \cdots + \hat\beta_{k-1}x_{i,k-1} + \hat\beta_k(c_k + 1)] - G(\hat\beta_0 + \hat\beta_1 x_{i1} + \cdots + \hat\beta_{k-1}x_{i,k-1} + \hat\beta_k c_k)\}. \tag{17.17}$$

The quantity in equation (17.17) is a partial effect because all explanatory variables other than xₖ are being held fixed at their observed values. It is not necessarily a marginal effect because the change in xₖ from cₖ to cₖ + 1 may not be a "marginal" (or small) increase; whether it is depends on the definition of xₖ. Obtaining expression (17.17), for either probit or logit, is actually rather simple. First, for each observation, we estimate the probability of success for the two chosen values of xₖ, plugging in the actual outcomes for the other explanatory variables (so we would have n estimated differences). Then, we average the differences in estimated probabilities across all observations. For binary xₖ, both (17.16) and (17.17) are easily computed using certain econometrics packages, such as Stata.
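Both scale factors and the discrete APE in (17.17) are simple averages. A sketch for probit, assuming X is a NumPy regressor matrix (with a constant column) whose last column is the binary xₖ and res is a fitted statsmodels Probit result:

```python
# Scale factors (17.14)-(17.15) and the discrete APE (17.17) for probit.
# Assumes X is a NumPy array with a constant column, the last column is
# a binary x_k, and res is a fitted statsmodels Probit result.
import numpy as np
from scipy.stats import norm

b = np.asarray(res.params)
pea_scale = norm.pdf(X.mean(axis=0) @ b)   # (17.14): g at the sample averages
ape_scale = norm.pdf(X @ b).mean()         # (17.15): average of g over the sample

X1, X0 = X.copy(), X.copy()
X1[:, -1], X0[:, -1] = 1, 0                # counterfactual x_k = 1 versus x_k = 0
ape_xk = np.mean(norm.cdf(X1 @ b) - norm.cdf(X0 @ b))   # (17.17)
print(pea_scale, ape_scale, ape_xk)
```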
The expression in (17.17) has a particularly useful interpretation when xₖ is a binary variable. For each unit i, we estimate the predicted difference in the probability that yᵢ = 1 when xₖ = 1 and xₖ = 0, namely,

$$G(\hat\beta_0 + \hat\beta_1 x_{i1} + \cdots + \hat\beta_{k-1}x_{i,k-1} + \hat\beta_k) - G(\hat\beta_0 + \hat\beta_1 x_{i1} + \cdots + \hat\beta_{k-1}x_{i,k-1}).$$

For each i, this difference is the estimated effect of switching xₖ from zero to one, whether unit i had xᵢₖ = 1 or xᵢₖ = 0. For example, if y is an employment indicator (equal to one if the person is employed) after participation in a job training program, indicated by xₖ, then we can estimate the difference in employment probabilities for each person in both states of the world. This counterfactual reasoning is similar to that in Chapter 16, which we used to motivate simultaneous equations models. The estimated effect of the job training program on the employment probability is the average of the estimated differences in probabilities.

As another example, suppose that y indicates whether a family was approved for a mortgage, and xₖ is a binary race indicator (say, equal to one for nonwhites). Then, for each family, we can estimate the predicted difference in having the mortgage approved, as a function of income, wealth, credit rating, and so on (which would be elements of (x_{i1}, x_{i2}, …, x_{i,k−1})), under the two scenarios that the household head is nonwhite versus white. Hopefully, we have controlled for enough factors so that averaging the differences in probabilities results in a convincing estimate of the race effect.

In applications where one applies probit, logit, and the LPM, it makes sense to compute the scale factors described above for probit and logit in making comparisons of partial effects. Still, sometimes one wants a quicker way to compare magnitudes of the different estimates. As mentioned earlier, for probit g(0) ≈ .4 and for logit g(0) = .25. Thus, to make the magnitudes of probit and logit roughly comparable, we can multiply the probit coefficients by .4/.25 = 1.6, or we can multiply the logit estimates by .625. In the LPM, g(0) is effectively one, so the logit slope estimates can be divided by four to make them comparable to the LPM estimates, and the probit slope estimates can be divided by 2.5 to make them comparable to the LPM estimates. Still, in most cases, we want the more accurate comparisons obtained by using the scale factors in (17.15) for logit and probit.

Example 17.1. Married Women's Labor Force Participation

We now use the data on 753 married women in MROZ to estimate the labor force participation model from Example 8.8 (see also Section 7.5) by logit and probit. We also report the linear probability model estimates from Example 8.8, using the heteroskedasticity-robust standard errors. The results, with standard errors in parentheses, are given in Table 17.1.

Table 17.1. LPM, Logit, and Probit Estimates of Labor Force Participation. Dependent variable: inlf.

| Independent Variables | LPM (OLS) | Logit (MLE) | Probit (MLE) |
|---|---|---|---|
| nwifeinc | −.0034 (.0015) | −.021 (.008) | −.012 (.005) |
| educ | .038 (.007) | .221 (.043) | .131 (.025) |
| exper | .039 (.006) | .206 (.032) | .123 (.019) |
| exper² | −.00060 (.00019) | −.0032 (.0010) | −.0019 (.0006) |
| age | −.016 (.002) | −.088 (.015) | −.053 (.008) |
| kidslt6 | −.262 (.032) | −1.443 (.204) | −.868 (.119) |
| kidsge6 | .013 (.014) | .060 (.075) | .036 (.043) |
| constant | .586 (.152) | .425 (.860) | .270 (.509) |
| Percentage correctly predicted | 73.4 | 73.6 | 73.4 |
| Log-likelihood value | n/a | −401.77 | −401.30 |
| Pseudo R-squared | .264 | .220 | .221 |

The estimates from the three models tell a consistent story. The signs of the coefficients are the same across models, and the same variables are statistically significant in each model. The pseudo R-squared for the LPM is just the usual R-squared reported for OLS; for logit and probit, the pseudo R-squared is the measure based on the log-likelihoods described earlier.
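A sketch of how estimates like those in Table 17.1 (and the APEs reported in Table 17.2 below) can be reproduced, assuming the MROZ data are available in a file mroz.csv with the column names used in the text:

```python
# Sketch of Example 17.1, assuming mroz.csv holds the MROZ data
import pandas as pd
import statsmodels.api as sm

df = pd.read_csv("mroz.csv")
df["expersq"] = df["exper"] ** 2
X = sm.add_constant(df[["nwifeinc", "educ", "exper", "expersq",
                        "age", "kidslt6", "kidsge6"]])
y = df["inlf"]

lpm = sm.OLS(y, X).fit(cov_type="HC0")   # LPM with robust standard errors
logit = sm.Logit(y, X).fit()
probit = sm.Probit(y, X).fit()
print(probit.params)                               # compare with Table 17.1
print(probit.get_margeff(at="overall").summary())  # APEs, cf. Table 17.2
# Note: get_margeff treats expersq as a separate regressor, so the APE
# for exper must be adjusted by hand to account for the quadratic.
```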
As we have already emphasized, the magnitudes of the coefficient estimates across models are not directly comparable. Instead, we compute the scale factors in equations (17.14) and (17.15). If we evaluate the standard normal pdf, φ(β̂₀ + β̂₁x̄₁ + β̂₂x̄₂ + ⋯ + β̂ₖx̄ₖ), at the sample averages of the explanatory variables (including the average of exper², kidslt6, and kidsge6), the result is approximately .391. When we compute (17.14) for the logit case, we obtain about .243. The ratio of these, .391/.243 ≈ 1.61, is very close to the simple rule of thumb for scaling up the probit estimates to make them comparable to the logit estimates: multiply the probit estimates by 1.6. Nevertheless, for comparing probit and logit to the LPM estimates, it is better to use (17.15). These scale factors are about .301 (probit) and .179 (logit). For example, the scaled logit coefficient on educ is about .179(.221) ≈ .040, and the scaled probit coefficient on educ is about .301(.131) ≈ .039; both are remarkably close to the LPM estimate of .038. Even on the discrete variable kidslt6, the scaled logit and probit coefficients are similar to the LPM coefficient of −.262: these are .179(−1.443) ≈ −.258 (logit) and .301(−.868) ≈ −.261 (probit).

Table 17.2 reports the average partial effects for all explanatory variables and for each of the three estimated models. We obtained the estimates and standard errors from the statistical package Stata 13. These APEs treat all explanatory variables as continuous, even the variables for the number of children. Obtaining the APE for exper requires some care, as it must account for the quadratic functional form in exper. Even for the linear model, we must compute the derivative and then find the average. In the LPM column, the APE of exper is the average of the derivative with respect to exper, so .039 − .0012·experᵢ averaged across all i. The remaining APE entries for the LPM column are simply the OLS coefficients in Table 17.1. The APEs for exper for the logit and probit models also account for the quadratic in exper. As is clear from the table, the APEs and their statistical significance are very similar for all explanatory variables across all three models.

Table 17.2. Average Partial Effects for the Labor Force Participation Models

| Independent Variables | LPM | Logit | Probit |
|---|---|---|---|
| nwifeinc | −.0034 (.0015) | −.0038 (.0015) | −.0036 (.0014) |
| educ | .038 (.007) | .039 (.007) | .039 (.007) |
| exper | .027 (.002) | .025 (.002) | .026 (.002) |
| age | −.016 (.002) | −.016 (.002) | −.016 (.002) |
| kidslt6 | −.262 (.032) | −.258 (.032) | −.261 (.032) |
| kidsge6 | .013 (.014) | .011 (.013) | .011 (.013) |

Exploring Further 17.2. Using the probit estimates and the calculus approximation, what is the approximate change in the response probability when exper increases from 10 to 11?

The biggest difference between the LPM and the logit and probit models is that the LPM assumes constant marginal effects for educ, kidslt6, and so on, while the logit and probit models imply diminishing magnitudes of the partial effects. In the LPM, one more small child is estimated to reduce the probability of labor force participation by about .262, regardless of how many young children the woman already has (and regardless of the levels of the other explanatory variables). We can contrast this with the estimated marginal effect from probit. For concreteness, take a woman with nwifeinc = 20.13, educ = 12.3, exper = 10.6, and age = 42.5 (which are roughly the sample averages), and kidsge6 = 1. What is the estimated decrease in the probability of working in going from zero to one small child?
The biggest difference between the LPM and the logit and probit models is that the LPM assumes constant marginal effects for educ, kidslt6, and so on, while the logit and probit models imply diminishing magnitudes of the partial effects. In the LPM, one more small child is estimated to reduce the probability of labor force participation by about .262, regardless of how many young children the woman already has and regardless of the levels of the other explanatory variables. We can contrast this with the estimated marginal effect from probit. For concreteness, take a woman with nwifeinc = 20.13, educ = 12.3, exper = 10.6, and age = 42.5 (which are roughly the sample averages) and kidsge6 = 1. What is the estimated decrease in the probability of working in going from zero to one small child? We evaluate the standard normal cdf $\Phi(\hat\beta_0 + \hat\beta_1 x_1 + \cdots + \hat\beta_k x_k)$ with kidslt6 = 1 and kidslt6 = 0, and the other independent variables set at the preceding values. We get roughly .373 - .707 = -.334, which means that the labor force participation probability is about .334 lower when a woman has one young child. If the woman goes from one to two young children, the probability falls even more, but the marginal effect is not as large: .117 - .373 = -.256. Interestingly, the estimate from the linear probability model, which is supposed to estimate the effect near the average, is in fact between these two estimates. (Note that the calculations provided here, which use coefficients mostly rounded to the third decimal place, will differ somewhat from calculations obtained within a statistical package, which would be subject to less rounding error.)

Figure 17.2 illustrates how the estimated response probabilities from nonlinear binary response models can differ from the linear probability model. The estimated probability of labor force participation is graphed against years of education for the linear probability model and the probit model. (The graph for the logit model is very similar to that for the probit model.) In both cases, the explanatory variables, other than educ, are set at their sample averages. In particular, the two equations graphed are inlf = .102 + .038 educ for the linear model and inlf = Φ(-1.403 + .131 educ) for the probit model. At lower levels of education, the linear probability model estimates higher labor force participation probabilities than the probit model. For example, at eight years of education, the linear probability model estimates a .406 labor force participation probability, while the probit model estimates about .361.

[Figure 17.2: Estimated response probabilities with respect to education for the linear probability and probit models. The vertical axis is the estimated probability of labor force participation (0 to 1); the horizontal axis is years of education (0 to 20). The two curves graphed are inlf = .102 + .038 educ and inlf = Φ(-1.403 + .131 educ).]
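Because both fitted equations are given above, the comparison at any education level is a short calculation. A small sketch using scipy:

```python
import numpy as np
from scipy.stats import norm

educ = np.arange(0, 21)

# The two equations graphed in Figure 17.2, other variables at their means.
p_lpm = 0.102 + 0.038 * educ
p_probit = norm.cdf(-1.403 + 0.131 * educ)

# At eight years of education: about .406 for the LPM, .361 for probit.
print(p_lpm[8].round(3), p_probit[8].round(3))
```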
The estimates are the same at around 11⅓ years of education. At higher levels of education, the probit model gives higher labor force participation probabilities. In this sample, the smallest years of education is 5 and the largest is 17, so we really should not make comparisons outside this range.

The same issues concerning endogenous explanatory variables in linear models also arise in logit and probit models. We do not have the space to cover them, but it is possible to test and correct for endogenous explanatory variables using methods related to two stage least squares. Evans and Schwab (1995) estimated a probit model for whether a student attends college, where the key explanatory variable is a dummy variable for whether the student attends a Catholic school. Evans and Schwab estimated a model by maximum likelihood that allows attending a Catholic school to be considered endogenous. [See Wooldridge (2010, Chapter 15) for an explanation of these methods.]

Two other issues have received attention in the context of probit models. The first is nonnormality of e in the latent variable model (17.6). Naturally, if e does not have a standard normal distribution, the response probability will not have the probit form. Some authors tend to emphasize the inconsistency in estimating the β_j, but this is the wrong focus unless we are only interested in the direction of the effects. Because the response probability is unknown, we could not estimate the magnitude of partial effects even if we had consistent estimates of the β_j.

A second specification problem, also defined in terms of the latent variable model, is heteroskedasticity in e. If Var(e|x) depends on x, the response probability no longer has the form G(β₀ + xβ); instead, it depends on the form of the variance and requires more general estimation. Such models are not often used in practice, since logit and probit with flexible functional forms in the independent variables tend to work well.

Binary response models apply with little modification to independently pooled cross sections or to other data sets where the observations are independent but not necessarily identically distributed. Often, year or other time period dummy variables are included to account for aggregate time effects. Just as with linear models, logit and probit can be used to evaluate the impact of certain policies in the context of a natural experiment. The linear probability model can be applied with panel data; typically, it would be estimated by fixed effects (see Chapter 14). Logit and probit models with unobserved effects have recently become popular. These models are complicated by the nonlinear nature of the response probabilities, and they are difficult to estimate and interpret. [See Wooldridge (2010, Chapter 15).]

17.2 The Tobit Model for Corner Solution Responses

As mentioned in the chapter introduction, another important kind of limited dependent variable is a corner solution response. Such a variable is zero for a nontrivial fraction of the population but is roughly continuously distributed over positive values. An example is the amount an individual spends on alcohol in a given month. In the population of people over age 21 in the United States, this variable takes on a wide range of values; for some significant fraction, the amount spent on alcohol is zero. The following treatment omits verification of some details concerning the Tobit model. [These are given in Wooldridge (2010, Chapter 17).]

Let y be a variable that is essentially continuous over strictly positive values but that takes on a value of zero with positive probability. Nothing prevents us from using a linear model for y. In fact, a linear model might be a good approximation to E(y|x₁, x₂, ..., x_k), especially for x_j near the mean values. But we would possibly obtain negative fitted values, which leads to negative predictions for y; this is analogous to the problems with the LPM for binary outcomes. Also, the assumption that an explanatory variable appearing in level form has a constant partial effect on E(y|x) can be misleading. Probably, Var(y|x) would be heteroskedastic, although we can easily deal with general heteroskedasticity by computing robust standard errors and test statistics. Because the distribution of y piles up at zero, y clearly cannot have a conditional normal distribution. So all inference would have only asymptotic justification, as with the linear probability model.
In some cases, it is important to have a model that implies nonnegative predicted values for y and that has sensible partial effects over a wide range of the explanatory variables. Plus, we sometimes want to estimate features of the distribution of y given x₁, ..., x_k other than the conditional expectation. The Tobit model is quite convenient for these purposes. Typically, the Tobit model expresses the observed response, y, in terms of an underlying latent variable:

$$y^* = \beta_0 + x\beta + u, \qquad u|x \sim \mathrm{Normal}(0, \sigma^2) \tag{17.18}$$

$$y = \max(0, y^*). \tag{17.19}$$

The latent variable y* satisfies the classical linear model assumptions; in particular, it has a normal, homoskedastic distribution with a linear conditional mean. Equation (17.19) implies that the observed variable, y, equals y* when y* ≥ 0, but y = 0 when y* < 0. Because y* is normally distributed, y has a continuous distribution over strictly positive values. In particular, the density of y given x is the same as the density of y* given x for positive values. Further,

$$P(y = 0|x) = P(y^* < 0|x) = P(u < -x\beta|x) = P(u/\sigma < -x\beta/\sigma|x) = \Phi(-x\beta/\sigma) = 1 - \Phi(x\beta/\sigma),$$

because u/σ has a standard normal distribution and is independent of x; we have absorbed the intercept into x for notational simplicity. Therefore, if (x_i, y_i) is a random draw from the population, the density of y_i given x_i is

$$(2\pi\sigma^2)^{-1/2}\exp[-(y - x_i\beta)^2/(2\sigma^2)] = (1/\sigma)\phi[(y - x_i\beta)/\sigma], \qquad y > 0 \tag{17.20}$$

$$P(y_i = 0|x_i) = 1 - \Phi(x_i\beta/\sigma), \tag{17.21}$$

where φ is the standard normal density function. From (17.20) and (17.21), we can obtain the log-likelihood function for each observation i:

$$\ell_i(\beta, \sigma) = 1(y_i = 0)\log[1 - \Phi(x_i\beta/\sigma)] + 1(y_i > 0)\log\{(1/\sigma)\phi[(y_i - x_i\beta)/\sigma]\}; \tag{17.22}$$

notice how this depends on σ, the standard deviation of u, as well as on the β_j. The log-likelihood for a random sample of size n is obtained by summing (17.22) across all i. The maximum likelihood estimates of β and σ are obtained by maximizing the log-likelihood; this requires numerical methods, although in most cases this is easily done using a packaged routine.

As in the case of logit and probit, each Tobit estimate comes with a standard error, and these can be used to construct t statistics for each β̂_j; the matrix formula used to find the standard errors is complicated and will not be presented here. [See, for example, Wooldridge (2010, Chapter 17).]

Testing multiple exclusion restrictions is easily done using the Wald test or the likelihood ratio test. The Wald test has a form similar to that of the logit or probit case; the LR test is always given by (17.12), where, of course, we use the Tobit log-likelihood functions for the restricted and unrestricted models.
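Although packaged Tobit routines are widely available, the log-likelihood in (17.22) is simple enough to maximize directly. The following is a minimal sketch in Python using scipy, parameterizing with log(σ) so that σ stays positive during optimization; the function names are our own:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

def tobit_negll(params, y, X):
    """Negative Tobit log-likelihood from (17.22); the last parameter
    is log(sigma) so that sigma stays positive."""
    b, log_s = params[:-1], params[-1]
    s = np.exp(log_s)
    xb = X @ b
    ll_zero = norm.logcdf(-xb / s)               # log[1 - Phi(x*b/s)]
    ll_pos = norm.logpdf((y - xb) / s) - log_s   # log[(1/s) phi((y - x*b)/s)]
    return -np.sum(np.where(y > 0, ll_pos, ll_zero))

def tobit_fit(y, X):
    """MLE via BFGS, starting from OLS estimates; X includes a constant."""
    b0, *_ = np.linalg.lstsq(X, y, rcond=None)
    start = np.append(b0, np.log((y - X @ b0).std()))
    opt = minimize(tobit_negll, start, args=(y, X), method="BFGS")
    return opt.x[:-1], np.exp(opt.x[-1])
```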
17.2a Interpreting the Tobit Estimates

Using modern computers, it is usually not much more difficult to obtain the maximum likelihood estimates for Tobit models than the OLS estimates of a linear model. Further, the outputs from Tobit and OLS are often similar. This makes it tempting to interpret the β̂_j from Tobit as if these were estimates from a linear regression. Unfortunately, things are not so easy.

Exploring Further 17.3: Let y be the number of extramarital affairs for a married woman from the U.S. population; we would like to explain this variable in terms of other characteristics of the woman (in particular, whether she works outside of the home), her husband, and her family. Is this a good candidate for a Tobit model?

From equation (17.18), we see that the β_j measure the partial effects of the x_j on E(y*|x), where y* is the latent variable. Sometimes, y* has an interesting economic meaning, but more often it does not. The variable we want to explain is y, as this is the observed outcome (such as hours worked or amount of charitable contributions). For example, as a policy matter, we are interested in the sensitivity of hours worked to changes in marginal tax rates.

We can estimate P(y = 0|x) from (17.21), which, of course, allows us to estimate P(y > 0|x). What happens if we want to estimate the expected value of y as a function of x? In Tobit models, two expectations are of particular interest: E(y|y > 0, x), which is sometimes called the "conditional expectation" because it is conditional on y > 0, and E(y|x), which is, unfortunately, called the "unconditional expectation." (Both expectations are conditional on the explanatory variables.) The expectation E(y|y > 0, x) tells us, for given values of x, the expected value of y for the subpopulation where y is positive. Given E(y|y > 0, x), we can easily find E(y|x):

$$E(y|x) = P(y > 0|x)\cdot E(y|y > 0, x) = \Phi(x\beta/\sigma)\cdot E(y|y > 0, x). \tag{17.23}$$

To obtain E(y|y > 0, x), we use a result for normally distributed random variables: if z ~ Normal(0, 1), then E(z|z > c) = φ(c)/[1 - Φ(c)] for any constant c. But

$$E(y|y > 0, x) = x\beta + E(u|u > -x\beta) = x\beta + \sigma E[(u/\sigma)|(u/\sigma) > -x\beta/\sigma] = x\beta + \sigma\phi(x\beta/\sigma)/\Phi(x\beta/\sigma),$$

because φ(-c) = φ(c), 1 - Φ(-c) = Φ(c), and u/σ has a standard normal distribution independent of x. We can summarize this as

$$E(y|y > 0, x) = x\beta + \sigma\lambda(x\beta/\sigma), \tag{17.24}$$

where λ(c) = φ(c)/Φ(c) is called the inverse Mills ratio; it is the ratio between the standard normal pdf and standard normal cdf, each evaluated at c.

Equation (17.24) is important. It shows that the expected value of y, conditional on y > 0, is equal to xβ plus a strictly positive term, which is σ times the inverse Mills ratio evaluated at xβ/σ. This equation also shows why using OLS only for observations where y_i > 0 will not always consistently estimate β; essentially, the inverse Mills ratio is an omitted variable, and it is generally correlated with the elements of x.

Combining (17.23) and (17.24) gives

$$E(y|x) = \Phi(x\beta/\sigma)[x\beta + \sigma\lambda(x\beta/\sigma)] = \Phi(x\beta/\sigma)x\beta + \sigma\phi(x\beta/\sigma), \tag{17.25}$$

where the second equality follows because Φ(xβ/σ)λ(xβ/σ) = φ(xβ/σ). This equation shows that when y follows a Tobit model, E(y|x) is a nonlinear function of x and β. Although it is not obvious, the right-hand side of equation (17.25) can be shown to be positive for any values of x and β. Therefore, once we have estimates of β, we can be sure that predicted values for y, that is, estimates of E(y|x), are positive. The cost of ensuring positive predictions for y is that equation (17.25) is more complicated than a linear model for E(y|x). Even more importantly, the partial effects from (17.25) are more complicated than for a linear model. As we will see, the partial effects of x_j on E(y|y > 0, x) and E(y|x) have the same sign as the coefficient β_j, but the magnitude of the effects depends on the values of all explanatory variables and parameters. Because σ appears in (17.25), it is not surprising that the partial effects depend on σ, too.
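Equations (17.24) and (17.25) are straightforward to compute once estimates of β and σ are available. A small sketch (the helper functions are our own):

```python
import numpy as np
from scipy.stats import norm

def inv_mills(c):
    """Inverse Mills ratio lambda(c) = phi(c)/Phi(c), computed on the
    log scale for numerical stability."""
    return np.exp(norm.logpdf(c) - norm.logcdf(c))

def tobit_means(xb, sigma):
    """E(y | y > 0, x) from (17.24) and E(y | x) from (17.25)."""
    c = xb / sigma
    e_pos = xb + sigma * inv_mills(c)
    e_all = norm.cdf(c) * xb + sigma * norm.pdf(c)
    return e_pos, e_all
```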
If x_j is a continuous variable, we can find the partial effects using calculus. First,

$$\partial E(y|y > 0, x)/\partial x_j = \beta_j + \beta_j\,\frac{d\lambda}{dc}(x\beta/\sigma),$$

assuming that x_j is not functionally related to other regressors. By differentiating λ(c) = φ(c)/Φ(c) and using dΦ/dc = φ(c) and dφ/dc = -cφ(c), it can be shown that dλ/dc = -λ(c)[c + λ(c)]. Therefore,

$$\partial E(y|y > 0, x)/\partial x_j = \beta_j\{1 - \lambda(x\beta/\sigma)[x\beta/\sigma + \lambda(x\beta/\sigma)]\}. \tag{17.26}$$

This shows that the partial effect of x_j on E(y|y > 0, x) is not determined just by β_j. The adjustment factor is given by the term in braces, {·}, and depends on a linear function of x, xβ/σ = (β₀ + β₁x₁ + ... + β_k x_k)/σ. It can be shown that the adjustment factor is strictly between zero and one. In practice, we can estimate (17.26) by plugging in the MLEs of the β_j and σ. As with logit and probit models, we must plug in values for the x_j, usually the mean values or other interesting values.

Equation (17.26) reveals a subtle point that is sometimes lost in applying the Tobit model to corner solution responses: the parameter σ appears directly in the partial effects, so having an estimate of σ is crucial for estimating the partial effects. Sometimes, σ is called an "ancillary" parameter, which means it is auxiliary, or unimportant. Although it is true that the value of σ does not affect the sign of the partial effects, it does affect the magnitudes, and we are often interested in the economic importance of the explanatory variables. Therefore, characterizing σ as ancillary is misleading and comes from a confusion between the Tobit model for corner solution applications and applications to true data censoring. (For the latter, see Section 17.4.)

All of the usual economic quantities, such as elasticities, can be computed. For example, the elasticity of y with respect to x₁, conditional on y > 0, is

$$\frac{\partial E(y|y > 0, x)}{\partial x_1}\cdot\frac{x_1}{E(y|y > 0, x)}. \tag{17.27}$$

This can be computed when x₁ appears in various functional forms, including level, logarithmic, and quadratic forms. If x₁ is a binary variable, the effect of interest is obtained as the difference between E(y|y > 0, x) with x₁ = 1 and x₁ = 0. Partial effects involving other discrete variables (such as number of children) can be handled similarly.

We can use (17.25) to find the partial derivative of E(y|x) with respect to continuous x_j. This derivative accounts for the fact that people starting at y = 0 might choose y > 0 when x_j changes:

$$\frac{\partial E(y|x)}{\partial x_j} = \frac{\partial P(y > 0|x)}{\partial x_j}\cdot E(y|y > 0, x) + P(y > 0|x)\cdot\frac{\partial E(y|y > 0, x)}{\partial x_j}. \tag{17.28}$$

Because P(y > 0|x) = Φ(xβ/σ),

$$\frac{\partial P(y > 0|x)}{\partial x_j} = (\beta_j/\sigma)\phi(x\beta/\sigma), \tag{17.29}$$

so we can estimate each term in (17.28) once we plug in the MLEs of the β_j and σ and particular values of the x_j. Remarkably, when we plug (17.26) and (17.29) into (17.28) and use the fact that Φ(c)λ(c) = φ(c) for any c, we obtain

$$\frac{\partial E(y|x)}{\partial x_j} = \beta_j\Phi(x\beta/\sigma). \tag{17.30}$$

Equation (17.30) allows us to roughly compare OLS and Tobit estimates. [Equation (17.30) also can be derived directly from equation (17.25) using the fact that dφ(z)/dz = -zφ(z).] The OLS slope coefficients, say, γ̂_j, from the regression of y_i on x_i1, x_i2, ..., x_ik, i = 1, ..., n (that is, using all of the data) are direct estimates of ∂E(y|x)/∂x_j. To make the Tobit coefficient, β̂_j, comparable to γ̂_j, we must multiply β̂_j by an adjustment factor.
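The partial effects in (17.26) and (17.30) can likewise be coded directly; the factors multiplying β_j are the quantities a package evaluates at the means or averages across observations. A sketch:

```python
import numpy as np
from scipy.stats import norm

def pe_conditional(b_j, xb, sigma):
    """Partial effect of x_j on E(y | y > 0, x), equation (17.26); the
    adjustment factor in braces lies strictly between zero and one."""
    lam = np.exp(norm.logpdf(xb / sigma) - norm.logcdf(xb / sigma))
    return b_j * (1.0 - lam * (xb / sigma + lam))

def pe_unconditional(b_j, xb, sigma):
    """Partial effect of x_j on E(y | x), equation (17.30)."""
    return b_j * norm.cdf(xb / sigma)
```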
As in the probit and logit cases, there are two common approaches for computing an adjustment factor for obtaining partial effects, at least for continuous explanatory variables. Both are based on equation (17.30). First, the partial effect at the average (PEA) is obtained by evaluating Φ(xβ̂/σ̂) at the sample averages of the explanatory variables, which we denote Φ(x̄β̂/σ̂). We can then use this single factor to multiply the coefficients on the continuous explanatory variables. The PEA has the same drawbacks here as in the probit and logit cases: we may not be interested in the partial effect for the "average" because the average is either uninteresting or meaningless. Plus, we must decide whether to use averages of nonlinear functions or plug the averages into the nonlinear functions.

The average partial effect (APE) is preferred in most cases. Here, we compute the scale factor as $n^{-1}\sum_{i=1}^n \Phi(x_i\hat\beta/\hat\sigma)$. Unlike the PEA, the APE does not require us to plug in a fictitious or nonexistent unit from the population, and there are no decisions to make about plugging averages into nonlinear functions. Like the PEA, the APE scale factor is always between zero and one because 0 < Φ(xβ̂/σ̂) < 1 for any values of the explanatory variables. In fact, P̂(y_i > 0|x_i) = Φ(x_iβ̂/σ̂), and so the APE scale factor and the PEA scale factor tend to be closer to one when there are few observations with y_i = 0. In the case that y_i > 0 for all i, the Tobit and OLS estimates of the parameters are identical. (Of course, if y_i > 0 for all i, we cannot justify the use of a Tobit model anyway; using log(y_i) in a linear regression model makes much more sense.)

Unfortunately, for discrete explanatory variables, comparing OLS and Tobit estimates is not so easy (although using the scale factor for continuous explanatory variables often is a useful approximation). For Tobit, the partial effect of a discrete explanatory variable, for example, a binary variable, should really be obtained by estimating E(y|x) from equation (17.25). For example, if x₁ is binary, we should first plug in x₁ = 1 and then x₁ = 0. If we set the other explanatory variables at their sample averages, we obtain a measure analogous to (17.16) for the logit and probit cases. If we compute the difference in expected values for each individual, and then average the difference, we get an APE analogous to (17.17). Fortunately, many modern statistical packages routinely compute the APEs for fairly complicated models, including the Tobit model, and allow both continuous and discrete explanatory variables.

Example 17.2: Married Women's Annual Labor Supply

The file MROZ includes data on hours worked for 753 married women, 428 of whom worked for a wage outside the home during the year; 325 of the women worked zero hours. For the women who worked positive hours, the range is fairly broad, extending from 12 to 4,950. Thus, annual hours worked is a good candidate for a Tobit model. We also estimate a linear model (using all 753 observations) by OLS and compute the heteroskedasticity-robust standard errors. The results are given in Table 17.3.

This table has several noteworthy features. First, the Tobit coefficient estimates have the same sign as the corresponding OLS estimates, and the statistical significance of the estimates is similar.
(Possible exceptions are the coefficients on nwifeinc and kidsge6, but the t statistics have similar magnitudes.) Second, though it is tempting to compare the magnitudes of the OLS and Tobit estimates, this is not very informative. We must be careful not to think that, because the Tobit coefficient on kidslt6 is roughly twice that of the OLS coefficient, the Tobit model implies a much greater response of hours worked to young children.

Table 17.3: OLS and Tobit Estimation of Annual Hours Worked
Dependent variable: hours

| Independent Variables | Linear (OLS) | Tobit (MLE) |
|---|---|---|
| nwifeinc | -3.45 (2.24) | -8.81 (4.46) |
| educ | 28.76 (13.04) | 80.65 (21.58) |
| exper | 65.67 (10.79) | 131.56 (17.28) |
| exper² | -.700 (.372) | -1.86 (.54) |
| age | -30.51 (4.24) | -54.41 (7.42) |
| kidslt6 | -442.09 (57.46) | -894.02 (111.88) |
| kidsge6 | -32.78 (22.80) | -16.22 (38.64) |
| constant | 1,330.48 (274.88) | 965.31 (446.44) |
| Log-likelihood value | | -3,819.09 |
| R-squared | .266 | .274 |
| σ̂ | 750.18 | 1,122.02 |

We can multiply the Tobit estimates by appropriate adjustment factors to make them roughly comparable to the OLS estimates. The APE scale factor $n^{-1}\sum_{i=1}^n \Phi(x_i\hat\beta/\hat\sigma)$ turns out to be about .589, which we can use to obtain the average partial effects for the Tobit estimation. If, for example, we multiply the educ coefficient by .589, we get .589(80.65) ≈ 47.50, that is, about 47.5 hours more, which is quite a bit larger than the OLS partial effect, about 28.8 hours.

Table 17.4 contains the APEs for all variables, where the APEs for the linear model are simply the OLS coefficients, except for the variable exper, which appears as a quadratic. The APEs and their standard errors (obtained from Stata 13) are rounded to two decimal places; because of rounding, they can differ slightly from what is obtained by multiplying .589 by the reported Tobit coefficient. The Tobit APEs for nwifeinc, educ, and kidslt6 are all substantially larger in magnitude than the corresponding OLS coefficients. The APEs for exper and age are similar, and, for kidsge6 (which is nowhere close to being statistically significant), the Tobit APE is smaller in magnitude.

Table 17.4: Average Partial Effects for the Hours Worked Models

| Independent Variables | Linear | Tobit |
|---|---|---|
| nwifeinc | -3.45 (2.24) | -5.19 (2.62) |
| educ | 28.76 (13.04) | 47.47 (12.62) |
| exper | 50.78 (4.45) | 48.79 (3.59) |
| age | -30.51 (4.24) | -32.03 (4.29) |
| kidslt6 | -442.09 (57.46) | -526.28 (64.71) |
| kidsge6 | -32.78 (22.80) | -9.55 (22.75) |

If, instead, we want the estimated effect of another year of education starting at the average values of all explanatory variables, then we compute the PEA scale factor Φ(x̄β̂/σ̂). This turns out to be about .645 when we use the squared average of experience, (avg. exper)², rather than the average of exper². This partial effect, which is about 52 hours, is almost twice as large as the OLS estimate.
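The two scale factors can be reproduced from the Tobit estimates. A sketch; note that taking column means of X averages exper² directly, which is why the text obtains .645 only after substituting the squared average of exper:

```python
import numpy as np
from scipy.stats import norm

def tobit_scale_factors(X, b, sigma):
    """APE and PEA scale factors based on (17.30); X includes a constant."""
    ape = norm.cdf(X @ b / sigma).mean()        # about .589 in Example 17.2
    pea = norm.cdf(X.mean(axis=0) @ b / sigma)  # averages exper^2, not (avg exper)^2
    return ape, pea

# Scaled Tobit coefficient on educ: .589 * 80.65, about 47.5 hours,
# versus the OLS estimate of 28.76 hours.
print(0.589 * 80.65)
```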
We have reported an R-squared for both the linear regression and the Tobit models. The R-squared for OLS is the usual one. For Tobit, the R-squared is the square of the correlation coefficient between y_i and ŷ_i, where ŷ_i = Φ(x_iβ̂/σ̂)x_iβ̂ + σ̂φ(x_iβ̂/σ̂) is the estimate of E(y|x = x_i). This is motivated by the fact that the usual R-squared for OLS is equal to the squared correlation between the y_i and the fitted values [see equation (3.29)]. In nonlinear models such as the Tobit model, the squared correlation coefficient is not identical to an R-squared based on a sum of squared residuals as in (3.28). This is because the fitted values, as defined earlier, and the residuals, y_i - ŷ_i, are not uncorrelated in the sample. An R-squared defined as the squared correlation coefficient between y_i and ŷ_i has the advantage of always being between zero and one; an R-squared based on a sum of squared residuals need not have this feature.

We can see that, based on the R-squared measures, the Tobit conditional mean function fits the hours data somewhat, but not substantially, better. However, we should remember that the Tobit estimates are not chosen to maximize an R-squared (they maximize the log-likelihood function), whereas the OLS estimates are the values that do produce the highest R-squared, given the linear functional form.

By construction, all of the Tobit fitted values for hours are positive. By contrast, 39 of the OLS fitted values are negative. Although negative predictions are of some concern, 39 out of 753 is just over 5% of the observations. It is not entirely clear how negative fitted values for OLS translate into differences in estimated partial effects.

Figure 17.3 plots estimates of E(hours|x) as a function of education, for the Tobit and linear models; the other explanatory variables are set at their average values. For the linear model, the equation graphed is hours = 387.19 + 28.76 educ. For the Tobit model, the equation graphed is

$$\widehat{hours} = \Phi[(-694.12 + 80.65\,educ)/1{,}122.02]\cdot(-694.12 + 80.65\,educ) + 1{,}122.02\cdot\phi[(-694.12 + 80.65\,educ)/1{,}122.02].$$

As can be seen from the figure, the linear model gives notably higher estimates of the expected hours worked at even fairly high levels of education. For example, at eight years of education, the OLS predicted value of hours is about 617.5, while the Tobit estimate is about 423.9. At 12 years of education, the predicted hours are about 732.7 and 598.3, respectively. The two prediction lines cross after 17 years of education, but no woman in the sample has more than 17 years of education. The increasing slope of the Tobit line clearly indicates the increasing marginal effect of education on expected hours worked.

[Figure 17.3: Estimated expected values of hours with respect to education for the linear and Tobit models. The vertical axis is estimated expected hours (0 to 1,050); the horizontal axis is years of education (0 to 20).]
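The Tobit fitted values and the correlation-based R-squared just described can be computed as follows (a sketch under the same notation):

```python
import numpy as np
from scipy.stats import norm

def tobit_rsquared(y, X, b, sigma):
    """Squared correlation between y_i and the Tobit fitted values
    yhat_i = Phi(x_i b/sigma) * x_i b + sigma * phi(x_i b/sigma)."""
    xb = X @ b
    c = xb / sigma
    yhat = norm.cdf(c) * xb + sigma * norm.pdf(c)  # estimate of E(y | x_i)
    return np.corrcoef(y, yhat)[0, 1] ** 2
```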
17.2b Specification Issues in Tobit Models

The Tobit model, and in particular the formulas for the expectations in (17.24) and (17.25), rely crucially on normality and homoskedasticity in the underlying latent variable model. When E(y|x) = β₀ + β₁x₁ + ... + β_k x_k, we know from Chapter 5 that conditional normality of y does not play a role in unbiasedness, consistency, or large sample inference. Heteroskedasticity does not affect unbiasedness or consistency of OLS, although we must compute robust standard errors and test statistics to perform approximate inference. In a Tobit model, if any of the assumptions in (17.18) fail, then it is hard to know what the Tobit MLE is estimating. Nevertheless, for moderate departures from the assumptions, the Tobit model is likely to provide good estimates of the partial effects on the conditional means. It is possible to allow for more general assumptions in (17.18), but such models are much more complicated to estimate and interpret.

One potentially important limitation of the Tobit model, at least in certain applications, is that the expected value conditional on y > 0 is closely linked to the probability that y > 0. This is clear from equations (17.26) and (17.29). In particular, the effect of x_j on P(y > 0|x) is proportional to β_j, as is the effect on E(y|y > 0, x), where both functions multiplying β_j are positive and depend on x only through xβ/σ. This rules out some interesting possibilities. For example, consider the relationship between amount of life insurance coverage and a person's age. Young people may be less likely to have life insurance at all, so the probability that y > 0 increases with age (at least up to a point). Conditional on having life insurance, the value of policies might decrease with age, since life insurance becomes less important as people near the end of their lives. This possibility is not allowed for in the Tobit model.

One way to informally evaluate whether the Tobit model is appropriate is to estimate a probit model where the binary outcome, say, w, equals one if y > 0, and w = 0 if y = 0. Then, from (17.21), w follows a probit model where the coefficient on x_j is γ_j = β_j/σ. This means we can estimate the ratio of β_j to σ by probit, for each j. If the Tobit model holds, the probit estimate, γ̂_j, should be "close" to β̂_j/σ̂, where β̂_j and σ̂ are the Tobit estimates. These will never be identical because of sampling error. But we can look for certain problematic signs. For example, if γ̂_j is significant and negative, but β̂_j is positive, the Tobit model might not be appropriate. Or, if γ̂_j and β̂_j are the same sign, but |β̂_j/σ̂| is much larger or smaller than |γ̂_j|, this could also indicate problems. We should not worry too much about sign changes or magnitude differences on explanatory variables that are insignificant in both models.

In the annual hours worked example, σ̂ = 1,122.02. When we divide the Tobit coefficient on nwifeinc by σ̂, we obtain -8.81/1,122.02 ≈ -.0079; the probit coefficient on nwifeinc is about -.012, which is different, but not dramatically so. On kidslt6, the Tobit coefficient estimate over σ̂ is about -.797, compared with the probit estimate of -.868. Again, this is not a huge difference, but it indicates that having small children has a larger effect on the initial labor force participation decision than on how many hours a woman chooses to work once she is in the labor force. (Tobit effectively averages these two effects together.) We do not know whether the effects are statistically different, but they are of the same order of magnitude.
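This informal specification check requires nothing more than the two sets of estimates. Using the numbers from Tables 17.1 and 17.3:

```python
# Informal Tobit specification check: compare the probit estimates gamma_j
# with beta_j / sigma from Tobit (sigma-hat = 1122.02 in Example 17.2).
sigma_hat = 1122.02
tobit = {"nwifeinc": -8.81, "kidslt6": -894.02}     # Table 17.3
probit = {"nwifeinc": -0.012, "kidslt6": -0.868}    # Table 17.1

for name, b in tobit.items():
    print(name, round(b / sigma_hat, 4), "vs probit", probit[name])
# nwifeinc: -0.0079 vs -0.012; kidslt6: -0.7968 vs -0.868
```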
What happens if we conclude that the Tobit model is inappropriate? There are models, usually called hurdle or two-part models, that can be used when Tobit seems unsuitable. These all have the property that P(y > 0|x) and E(y|y > 0, x) depend on different parameters, so x_j can have dissimilar effects on these two functions. [See Wooldridge (2010, Chapter 17) for a description of these models.]

17.3 The Poisson Regression Model

Another kind of nonnegative dependent variable is a count variable, which can take on nonnegative integer values, {0, 1, 2, ...}. We are especially interested in cases where y takes on relatively few values, including zero. Examples include the number of children ever born to a woman, the number of times someone is arrested in a year, or the number of patents applied for by a firm in a year. For the same reasons discussed for binary and Tobit responses, a linear model for E(y|x₁, ..., x_k) might not provide the best fit over all values of the explanatory variables. (Nevertheless, it is always informative to start with a linear model, as we did in Example 3.5.)

As with a Tobit outcome, we cannot take the logarithm of a count variable because it takes on the value zero. A profitable approach is to model the expected value as an exponential function:

$$E(y|x_1, x_2, \ldots, x_k) = \exp(\beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k). \tag{17.31}$$

Because exp(·) is always positive, (17.31) ensures that predicted values for y will also be positive. (The exponential function is graphed in Figure A.5 of Appendix A.)

Although (17.31) is more complicated than a linear model, we basically already know how to interpret the coefficients. Taking the log of equation (17.31) shows that

$$\log[E(y|x_1, x_2, \ldots, x_k)] = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k, \tag{17.32}$$

so that the log of the expected value is linear. Therefore, using the approximation properties of the log function that we have used often in previous chapters,

$$\%\Delta E(y|x) \approx (100\beta_j)\Delta x_j.$$

In other words, 100β_j is roughly the percentage change in E(y|x), given a one-unit increase in x_j. Sometimes, a more accurate estimate is needed, and we can easily find one by looking at discrete changes in the expected value. Keep all explanatory variables except x_k fixed and let x_k⁰ be the initial value and x_k¹ the subsequent value. Then, the proportionate change in the expected value is

$$[\exp(\beta_0 + x_{k-1}\beta_{k-1} + \beta_k x_k^1)/\exp(\beta_0 + x_{k-1}\beta_{k-1} + \beta_k x_k^0)] - 1 = \exp(\beta_k\Delta x_k) - 1,$$

where x_{k-1}β_{k-1} is shorthand for β₁x₁ + ... + β_{k-1}x_{k-1}, and Δx_k = x_k¹ - x_k⁰. When Δx_k = 1 (for example, if x_k is a dummy variable that we change from zero to one), then the change is exp(β_k) - 1. Given β̂_k, we obtain exp(β̂_k) - 1 and multiply this by 100 to turn the proportionate change into a percentage change.

If, say, x_j = log(z_j) for some variable z_j > 0, then its coefficient, β_j, is interpreted as an elasticity with respect to z_j. (Technically, it is an elasticity of the expected value of y with respect to z_j, because we cannot compute the percentage change in cases where y = 0. For our purposes, the distinction is unimportant.) The bottom line is that, for practical purposes, we can interpret the coefficients in equation (17.31) as if we have a linear model with log(y) as the dependent variable. (There are some subtle differences that we need not study here.)
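The approximate and exact percentage effects are worth comparing whenever |β_j| is large. For instance, using the coefficient on the black dummy from Example 17.3 below:

```python
import numpy as np

beta = 0.661   # Poisson coefficient on black in Table 17.5

print(100 * beta)                  # approximation: 66.1 percent
print(100 * (np.exp(beta) - 1))    # exact: about 93.7 percent
```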
Because (17.31) is nonlinear in its parameters (remember, exp(·) is a nonlinear function), we cannot use linear regression methods. We could use nonlinear least squares, which, just as with OLS, minimizes the sum of squared residuals. It turns out, however, that all standard count data distributions exhibit heteroskedasticity, and nonlinear least squares does not exploit this [see Wooldridge (2010, Chapter 12)]. Instead, we will rely on maximum likelihood and the important related method of quasi-maximum likelihood estimation.

In Chapter 4, we introduced normality as the standard distributional assumption for linear regression. The normality assumption is reasonable for (roughly) continuous dependent variables that can take on a large range of values. A count variable cannot have a normal distribution (because the normal distribution is for continuous variables that can take on all values), and if it takes on very few values, the distribution can be very different from normal. Instead, the nominal distribution for count data is the Poisson distribution.

Because we are interested in the effect of explanatory variables on y, we must look at the Poisson distribution conditional on x. The Poisson distribution is entirely determined by its mean, so we only need to specify E(y|x). We assume this has the same form as (17.31), which we write in shorthand as exp(xβ). Then, the probability that y equals the value h, conditional on x, is

$$P(y = h|x) = \exp[-\exp(x\beta)][\exp(x\beta)]^h/h!, \qquad h = 0, 1, \ldots,$$

where h! denotes factorial (see Appendix B). This distribution, which is the basis for the Poisson regression model, allows us to find conditional probabilities for any values of the explanatory variables. For example, P(y = 0|x) = exp[-exp(xβ)]. Once we have estimates of the β_j, we can plug them into the probabilities for various values of x.

Given a random sample {(x_i, y_i): i = 1, 2, ..., n}, we can construct the log-likelihood function:

$$\mathcal{L}(\beta) = \sum_{i=1}^n \ell_i(\beta) = \sum_{i=1}^n \{y_i x_i\beta - \exp(x_i\beta)\}, \tag{17.33}$$

where we drop the term log(y_i!) because it does not depend on β. This log-likelihood function is simple to maximize, although the Poisson MLEs are not obtained in closed form.

The standard errors of the Poisson estimates β̂_j are easy to obtain after the log-likelihood function has been maximized; the formula is in Appendix 17B. These are reported, along with the β̂_j, by any software package.

As with the probit, logit, and Tobit models, we cannot directly compare the magnitudes of the Poisson estimates of an exponential function with the OLS estimates of a linear function. Nevertheless, a rough comparison is possible, at least for continuous explanatory variables. If (17.31) holds, then the partial effect of x_j with respect to E(y|x₁, x₂, ..., x_k) is

$$\partial E(y|x_1, x_2, \ldots, x_k)/\partial x_j = \exp(\beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k)\cdot\beta_j.$$

(This expression follows from the chain rule in calculus, because the derivative of the exponential function is just the exponential function.) If we let γ̂_j denote an OLS slope coefficient from the regression of y on x₁, x₂, ..., x_k, then we can roughly compare the magnitude of the γ̂_j and the average partial effect for an exponential regression function. Interestingly, the APE scale factor in this case, $n^{-1}\sum_{i=1}^n \exp(\hat\beta_0 + \hat\beta_1 x_{i1} + \cdots + \hat\beta_k x_{ik}) = n^{-1}\sum_{i=1}^n \hat y_i$, is simply the sample average, ȳ, of the y_i, where we define the fitted values as ŷ_i = exp(β̂₀ + x_iβ̂). In other words, for Poisson regression with an exponential mean function, the average of the fitted values is the same as the average of the original outcomes on y_i (just as in the linear regression case). This makes it simple to scale the Poisson estimates, β̂_j, to make them comparable to the corresponding OLS estimates, γ̂_j: for a continuous explanatory variable, we can compare γ̂_j to ȳ·β̂_j.
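Poisson regression is available in most packages. A minimal statsmodels sketch, assuming a count response y and a regressor matrix X (without a constant) are already loaded, which also verifies the average-of-fitted-values property just noted:

```python
import statsmodels.api as sm

# Assumed: y (counts) and X (regressors, no constant) already loaded.
Xc = sm.add_constant(X)
pois = sm.Poisson(y, Xc).fit(disp=0)

yhat = pois.predict(Xc)         # exp(x_i * bhat)
print(yhat.mean(), y.mean())    # equal: the APE scale factor is ybar
```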
Although Poisson MLE analysis is a natural first step for count data, it is often much too restrictive. All of the probabilities and higher moments of the Poisson distribution are determined entirely by the mean. In particular, the variance is equal to the mean:

$$\mathrm{Var}(y|x) = E(y|x). \tag{17.34}$$

This is restrictive and has been shown to be violated in many applications. Fortunately, the Poisson distribution has a very nice robustness property: whether or not the Poisson distribution holds, we still get consistent, asymptotically normal estimators of the β_j. [See Wooldridge (2010, Chapter 18) for details.] This is analogous to the OLS estimator, which is consistent and asymptotically normal whether or not the normality assumption holds; yet OLS is the MLE under normality.

When we use Poisson MLE, but we do not assume that the Poisson distribution is entirely correct, we call the analysis quasi-maximum likelihood estimation (QMLE). The Poisson QMLE is very handy because it is programmed in many econometrics packages. However, unless the Poisson variance assumption (17.34) holds, the standard errors need to be adjusted.

A simple adjustment to the standard errors is available when we assume that the variance is proportional to the mean:

$$\mathrm{Var}(y|x) = \sigma^2 E(y|x), \tag{17.35}$$

where σ² > 0 is an unknown parameter. When σ² = 1, we obtain the Poisson variance assumption. When σ² > 1, the variance is greater than the mean for all x; this is called overdispersion because the variance is larger than in the Poisson case, and it is observed in many applications of count regressions. The case σ² < 1, called underdispersion, is less common but is allowed in (17.35).

Under (17.35), it is easy to adjust the usual Poisson MLE standard errors. Let β̂_j denote the Poisson QMLEs and define the residuals as û_i = y_i - ŷ_i, where ŷ_i = exp(β̂₀ + β̂₁x_i1 + ... + β̂_k x_ik) is the fitted value. As usual, the residual for observation i is the difference between y_i and its fitted value. A consistent estimator of σ² is $(n - k - 1)^{-1}\sum_{i=1}^n \hat u_i^2/\hat y_i$, where the division by ŷ_i is the proper heteroskedasticity adjustment, and n - k - 1 is the df, given n observations and k + 1 estimates β̂₀, β̂₁, ..., β̂_k. Letting σ̂ be the positive square root of σ̂², we multiply the usual Poisson standard errors by σ̂. If σ̂ is notably greater than one, the corrected standard errors can be much bigger than the nominal, generally incorrect, Poisson MLE standard errors.

Even (17.35) is not entirely general. Just as in the linear model, we can obtain standard errors for the Poisson QMLE that do not restrict the variance at all. [See Wooldridge (2010, Chapter 18) for further explanation.]
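Continuing the sketch above, the adjustment under (17.35) takes a few lines:

```python
import numpy as np

n, kp1 = Xc.shape                       # kp1 = k + 1 estimated parameters
u = y - yhat
sigma2 = np.sum(u**2 / yhat) / (n - kp1)
se_adj = np.sqrt(sigma2) * pois.bse     # corrected standard errors
print(np.sqrt(sigma2))                  # sigma-hat; 1.232 in Example 17.3
```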
Under the Poisson distributional assumption, we can use the likelihood ratio statistic to test exclusion restrictions, which, as always, has the form in (17.12). If we have q exclusion restrictions, the statistic is distributed approximately as χ²_q under the null. Under the less restrictive assumption (17.35), a simple adjustment is available, and then we call the statistic the quasi-likelihood ratio statistic: we divide (17.12) by σ̂², where σ̂² is obtained from the unrestricted model.

Exploring Further 17.4: Suppose that we obtain σ̂² = 2. How will the adjusted standard errors compare with the usual Poisson MLE standard errors? How will the quasi-LR statistic compare with the usual LR statistic?

Example 17.3: Poisson Regression for Number of Arrests

We now apply the Poisson regression model to the arrest data in CRIME1, used, among other places, in Example 9.1. The dependent variable, narr86, is the number of times a man is arrested during 1986. This variable is zero for 1,970 of the 2,725 men in the sample, and only eight values of narr86 are greater than five. Thus, a Poisson regression model is more appropriate than a linear regression model. Table 17.5 also presents the results of OLS estimation of a linear regression model.

The standard errors for OLS are the usual ones; we could certainly have made these robust to heteroskedasticity. The standard errors for Poisson regression are the usual maximum likelihood standard errors. Because σ̂ = 1.232, the standard errors for Poisson regression should be inflated by this factor, so each corrected standard error is about 23% higher. For example, a more reliable standard error for tottime is 1.232(.015) ≈ .0185, which gives a t statistic of about 1.3. The adjustment to the standard errors reduces the significance of all variables, but several of them are still very statistically significant.

The OLS and Poisson coefficients are not directly comparable, and they have very different meanings. For example, the coefficient on pcnv implies that, if Δpcnv = .10, the expected number of arrests falls by .013 (pcnv is the proportion of prior arrests that led to conviction). The Poisson coefficient implies that Δpcnv = .10 reduces expected arrests by about 4% [.402(.10) = .0402, and we multiply this by 100 to get the percentage effect]. As a policy matter, this suggests we can reduce overall arrests by about 4% if we can increase the probability of conviction by .1.

The Poisson coefficient on black implies that, other factors being equal, the expected number of arrests for a black man is estimated to be about 100·[exp(.661) - 1] ≈ 93.7% higher than for a white man with the same values for the other explanatory variables.

As with the Tobit application in Table 17.3, we report an R-squared for Poisson regression: the squared correlation coefficient between y_i and ŷ_i = exp(β̂₀ + β̂₁x_i1 + ... + β̂_k x_ik). The motivation for this goodness-of-fit measure is the same as for the Tobit model. We see that the exponential regression model, estimated by Poisson QMLE, fits slightly better. (Remember that the OLS estimates are chosen to maximize the R-squared, but the Poisson estimates are not: they are selected to maximize the log-likelihood function.)
Table 17.5: Determinants of Number of Arrests for Young Men
Dependent variable: narr86

| Independent Variables | Linear (OLS) | Exponential (Poisson QMLE) |
|---|---|---|
| pcnv | -.132 (.040) | -.402 (.085) |
| avgsen | -.011 (.012) | -.024 (.020) |
| tottime | .012 (.009) | .024 (.015) |
| ptime86 | -.041 (.009) | -.099 (.021) |
| qemp86 | -.051 (.014) | -.038 (.029) |
| inc86 | -.0015 (.0003) | -.0081 (.0010) |
| black | .327 (.045) | .661 (.074) |
| hispan | .194 (.040) | .500 (.074) |
| born60 | -.022 (.033) | -.051 (.064) |
| constant | .577 (.038) | -.600 (.067) |
| Log-likelihood value | | -2,248.76 |
| R-squared | .073 | .077 |
| σ̂ | .829 | 1.232 |

Other count data regression models have been proposed and used in applications, which generalize the Poisson distribution in a variety of ways. If we are interested in the effects of the x_j on the mean response, there is little reason to go beyond Poisson regression: it is simple, often gives good results, and has the robustness property discussed earlier. [In fact, we could apply Poisson regression to a y that is a Tobit-like outcome, provided (17.31) holds. This might give good estimates of the mean effects.] Extensions of Poisson regression are more useful when we are interested in estimating probabilities, such as P(y > 1|x). [See, for example, Cameron and Trivedi (1998).]

17.4 Censored and Truncated Regression Models

The models in Sections 17.1, 17.2, and 17.3 apply to various kinds of limited dependent variables that arise frequently in applied econometric work. In using these methods, it is important to remember that we use a probit or logit model for a binary response, a Tobit model for a corner solution outcome, or a Poisson regression model for a count response because we want models that account for important features of the distribution of y. There is no issue of data observability. For example, in the Tobit application to women's labor supply in Example 17.2, there is no problem with observing hours worked: it is simply the case that a nontrivial fraction of married women in the population choose not to work for a wage. In the Poisson regression application to annual arrests, we observe the dependent variable for every young man in a random sample from the population, but the dependent variable can be zero as well as other small integer values.

Unfortunately, the distinction between lumpiness in an outcome variable (such as taking on the value zero for a nontrivial fraction of the population) and problems of data censoring can be confusing. This is particularly true when applying the Tobit model. In this book, the standard Tobit model described in Section 17.2 is only for corner solution outcomes. But the literature on Tobit models usually treats another situation within the same framework: the response variable has been censored above or below some threshold. Typically, the censoring is due to survey design and, in some cases, institutional constraints. Rather than treat data censoring problems along with corner solution outcomes, we solve data censoring by applying a censored regression model. Essentially, the problem solved by a censored regression model is one of missing data on the response variable, y. Although we are able to randomly draw units from the population and obtain information on the explanatory variables for all units, the outcome on y_i is missing for some i. Still, we know whether the missing values are above or below a given threshold, and this knowledge provides useful information for estimating the parameters.
A truncated regression model arises when we exclude, on the basis of y, a subset of the population in our sampling scheme. In other words, we do not have a random sample from the underlying population, but we know the rule that was used to include units in the sample. This rule is determined by whether y is above or below a certain threshold. We explain more fully the difference between censored and truncated regression models later.

17.4a Censored Regression Models

While censored regression models can be defined without distributional assumptions, in this subsection we study the censored normal regression model. The variable we would like to explain, y, follows the classical linear model. For emphasis, we put an i subscript on a random draw from the population:

$$y_i = \beta_0 + x_i\beta + u_i, \qquad u_i|x_i, c_i \sim \mathrm{Normal}(0, \sigma^2) \tag{17.36}$$

$$w_i = \min(y_i, c_i). \tag{17.37}$$

Rather than observing y_i, we observe it only if it is less than a censoring value, c_i. Notice that (17.36) includes the assumption that u_i is independent of c_i. (For concreteness, we explicitly consider censoring from above, or right censoring; the problem of censoring from below, or left censoring, is handled similarly.)

One example of right data censoring is top coding. When a variable is top coded, we know its value only up to a certain threshold. For responses greater than the threshold, we only know that the variable is at least as large as the threshold. For example, in some surveys, family wealth is top coded. Suppose that respondents are asked their wealth, but people are allowed to respond with "more than $500,000." Then, we observe actual wealth for those respondents whose wealth is less than $500,000, but not for those whose wealth is greater than $500,000. In this case, the censoring threshold, c_i, is the same for all i. In many situations, the censoring threshold changes with individual or family characteristics.

Exploring Further 17.5: Let mvp_i be the marginal value product for worker i; this is the price of a firm's good multiplied by the marginal product of the worker. Assume mvp_i is a linear function of exogenous variables, such as education, experience, and so on, and an unobservable error. Under perfect competition and without institutional constraints, each worker is paid his or her marginal value product. Let minwage_i denote the minimum wage for worker i, which varies by state. We observe wage_i, which is the larger of mvp_i and minwage_i. Write the appropriate model for the observed wage.

If we observed a random sample for (x, y), we would simply estimate β by OLS, and statistical inference would be standard. (We again absorb the intercept into x for simplicity.) The censoring causes problems. Using arguments similar to the Tobit model, an OLS regression using only the uncensored observations (that is, those with y_i < c_i) produces inconsistent estimators of the β_j. An OLS regression of w_i on x_i, using all observations, does not consistently estimate the β_j, unless there is no censoring. This is similar to the Tobit case, but the problem is much different. In the Tobit model, we are modeling economic behavior, which often yields zero outcomes; the Tobit model is supposed to reflect this. With censored regression, we have a data collection problem because, for some reason, the data are censored.
Under the assumptions in (17.36) and (17.37), we can estimate β and σ² by maximum likelihood, given a random sample on (x_i, w_i). For this, we need the density of w_i, given (x_i, c_i). For uncensored observations, w_i = y_i, and the density of w_i is the same as that for y_i: Normal(x_iβ, σ²). For censored observations, we need the probability that w_i equals the censoring value, c_i, given x_i:

$$P(w_i = c_i|x_i) = P(y_i \ge c_i|x_i) = P(u_i \ge c_i - x_i\beta) = 1 - \Phi[(c_i - x_i\beta)/\sigma].$$

We can combine these two parts to obtain the density of w_i, given x_i and c_i:

$$f(w|x_i, c_i) = 1 - \Phi[(c_i - x_i\beta)/\sigma], \qquad w = c_i \tag{17.38}$$

$$f(w|x_i, c_i) = (1/\sigma)\phi[(w - x_i\beta)/\sigma], \qquad w < c_i. \tag{17.39}$$

The log-likelihood for observation i is obtained by taking the natural log of the density for each i. We can maximize the sum of these across i, with respect to the β_j and σ, to obtain the MLEs. It is important to know that we can interpret the β_j just as in a linear regression model under random sampling. This is much different than Tobit applications to corner solution responses, where the expectations of interest are nonlinear functions of the β_j.
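As with Tobit, the censored-regression log-likelihood built from (17.38) and (17.39) can be maximized directly if a packaged routine is unavailable. A sketch for right censoring (the function names are our own):

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

def cens_negll(params, w, X, cens):
    """Negative log-likelihood for right-censored normal regression from
    (17.38)-(17.39); cens[i] is True when w_i equals the censoring value c_i."""
    b, log_s = params[:-1], params[-1]
    s = np.exp(log_s)
    z = (w - X @ b) / s
    ll = np.where(cens, norm.logsf(z), norm.logpdf(z) - log_s)
    return -np.sum(ll)

def cens_fit(w, X, cens):
    """MLE via BFGS, starting from OLS on (w, X); X includes a constant."""
    b0, *_ = np.linalg.lstsq(X, w, rcond=None)
    start = np.append(b0, np.log((w - X @ b0).std()))
    opt = minimize(cens_negll, start, args=(w, X, cens), method="BFGS")
    return opt.x[:-1], np.exp(opt.x[-1])
```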
has a duration until next arrest that is almost 14 less A year of time served reduces duration by about 100 1210192 5 228 A somewhat surprising finding is that a man serving time for a felony has an Copyright 2016 Cengage Learning All Rights Reserved May not be copied scanned or duplicated in whole or in part Due to electronic rights some third party content may be suppressed from the eBook andor eChapters Editorial review has deemed that any suppressed content does not materially affect the overall learning experience Cengage Learning reserves the right to remove additional content at any time if subsequent rights restrictions require it PART 3 Advanced Topics 550 estimated expected duration that is almost 56 3exp14442 2 1 564 longer than a man serving time for a nonfelony Those with a history of drug or alcohol abuse have substantially shorter expected durations until the next arrest The variables alcohol and drugs are binary variables Older men and men who were married at the time of incarceration are expected to have significantly longer durations until their next arrest Black men have substantially shorter durations on the order of 42 3exp125432 2 1 2424 The key policy variable workprg does not have the desired effect The point estimate is that other things being equal men who participated in the work program have estimated recidivism dura tions that are about 63 shorter than men who did not participate The coefficient has a small t statistic so we would probably conclude that the work program has no effect This could be due to a selfselection problem or it could be a product of the way men were assigned to the program Of course it may simply be that the program was ineffective In this example it is crucial to account for the censoring especially because almost 62 of the durations are censored If we apply straight OLS to the entire sample and treat the censored durations as if they were uncensored the coefficient estimates are markedly different In fact they are all shrunk toward zero For example the coefficient on priors becomes 2059 1se 5 0092 and that on alcohol becomes 2262 1se 5 0602 Although the directions of the effects are the same the importance of these variables is greatly diminished The censored regression estimates are much more reliable TAblE 176 Censored Regression Estimation of Criminal Recidivism Dependent Variable logdurat Independent Variables Coefficient Standard Error workprg 2063 120 priors 2137 021 tserved 2019 003 felon 444 145 alcohol 2635 144 drugs 2298 133 black 2543 117 married 341 140 educ 023 025 age 0039 0006 constant 4099 348 Loglikelihood value s 2159706 1810 Copyright 2016 Cengage Learning All Rights Reserved May not be copied scanned or duplicated in whole or in part Due to electronic rights some third party content may be suppressed from the eBook andor eChapters Editorial review has deemed that any suppressed content does not materially affect the overall learning experience Cengage Learning reserves the right to remove additional content at any time if subsequent rights restrictions require it CHAPTER 17 Limited Dependent Variable Models and Sample Selection Corrections 551 There are other ways of measuring the effects of each of the explanatory variables in Table 176 on the duration rather than focusing only on the expected duration A treatment of modern duration analysis is beyond the scope of this text For an introduction see Wooldridge 2010 Chapter 22 If any of the assumptions of the censored normal regression model are violatedin particular 
If any of the assumptions of the censored normal regression model are violated (in particular, if there is heteroskedasticity or nonnormality in uᵢ), the MLEs are generally inconsistent. This shows that the censoring is potentially very costly, as OLS using an uncensored sample requires neither normality nor homoskedasticity for consistency. There are methods that do not require us to assume a distribution, but they are more advanced. [See Wooldridge (2010), Chapter 19.]

17.4b Truncated Regression Models

The truncated regression model differs in an important respect from the censored regression model. In the case of data censoring, we do randomly sample units from the population. The censoring problem is that, while we always observe the explanatory variables for each randomly drawn unit, we observe the outcome on y only when it is not censored above or below a given threshold. With data truncation, we restrict attention to a subset of the population prior to sampling, so there is a part of the population for which we observe no information. In particular, we have no information on explanatory variables. The truncated sampling scenario typically arises when a survey targets a particular subset of the population and, perhaps due to cost considerations, entirely ignores the other part of the population. Subsequently, researchers might want to use the truncated sample to answer questions about the entire population, but one must recognize that the sampling scheme did not generate a random sample from the whole population.

As an example, Hausman and Wise (1977) used data from a negative income tax experiment to study various determinants of earnings. To be included in the study, a family had to have income less than 1.5 times the 1967 poverty line, where the poverty line depended on family size. Hausman and Wise wanted to use the data to estimate an earnings equation for the entire population.

The truncated normal regression model begins with an underlying population model that satisfies the classical linear model assumptions:

y = β₀ + xβ + u,  u|x ~ Normal(0, σ²).   (17.40)

Recall that this is a strong set of assumptions, because u must not only be independent of x, but also normally distributed. We focus on this model because relaxing the assumptions is difficult.

Under (17.40), we know that, given a random sample from the population, OLS is the most efficient estimation procedure. The problem arises because we do not observe a random sample from the population: Assumption MLR.2 is violated. In particular, a random draw (xᵢ, yᵢ) is observed only if yᵢ ≤ cᵢ, where cᵢ is the truncation threshold that can depend on exogenous variables (in particular, the xᵢ). In the Hausman and Wise example, cᵢ depends on family size. This means that, if {(xᵢ, yᵢ): i = 1, …, n} is our observed sample, then yᵢ is necessarily less than or equal to cᵢ. This differs from the censored regression model: in a censored regression model, we observe xᵢ for any randomly drawn observation from the population; in the truncated model, we only observe xᵢ if yᵢ ≤ cᵢ.

To estimate the βⱼ (along with σ), we need the distribution of yᵢ given that yᵢ ≤ cᵢ and xᵢ. This is written as

g(y|xᵢ, cᵢ) = f(y|xᵢβ, σ²)/F(cᵢ|xᵢβ, σ²),  y ≤ cᵢ,   (17.41)

where f(y|xᵢβ, σ²) denotes the normal density with mean β₀ + xᵢβ and variance σ², and F(cᵢ|xᵢβ, σ²) is the normal cdf with the same mean and variance, evaluated at cᵢ. This expression for the density, conditional on yᵢ ≤ cᵢ, makes intuitive sense: it is the population density for y given x, divided by the probability that yᵢ is less than or equal to cᵢ given xᵢ, P(yᵢ ≤ cᵢ|xᵢ). In effect, we renormalize the density by dividing by the area under f(·|xᵢβ, σ²) that is to the left of cᵢ.

If we take the log of (17.41), sum across all i, and maximize the result with respect to the βⱼ and σ², we obtain the maximum likelihood estimators. This leads to consistent, approximately normal estimators. The inference, including standard errors and log-likelihood statistics, is standard and treated in Wooldridge (2010), Chapter 19.
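The estimation just described can be sketched in a few lines. The following minimal example (with invented simulated data and a single truncation point) codes the log of (17.41) and maximizes it numerically; it is an illustration, not production code:

```python
import numpy as np
from scipy import optimize, stats

def truncated_nll(params, X, y, c):
    # negative log of (17.41), summed over the truncated sample (y_i <= c)
    beta, log_sigma = params[:-1], params[-1]
    sigma = np.exp(log_sigma)
    xb = X @ beta
    log_f = stats.norm.logpdf((y - xb) / sigma) - np.log(sigma)
    log_F = stats.norm.logcdf((c - xb) / sigma)
    return -(log_f - log_F).sum()

# Illustration: simulate a population, then truncate from above at c = 2
rng = np.random.default_rng(7)
n = 20000
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([1.0, 0.8]) + rng.normal(size=n)
keep = y <= 2.0

res = optimize.minimize(truncated_nll, np.zeros(3),
                        args=(X[keep], y[keep], 2.0), method="BFGS")
print(res.x[:2], np.exp(res.x[-1]))   # roughly (1.0, 0.8) and 1.0
```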
We could analyze the data from Example 17.4 as a truncated sample if we drop all data on an observation whenever it is censored. This would give us 552 observations from a truncated normal distribution, where the truncation point differs across i. However, we would never analyze duration data (or top-coded data) in this way, as it eliminates useful information. The fact that we know a lower bound for 893 durations, along with the explanatory variables, is useful information; censored regression uses this information, while truncated regression does not.

A better example of truncated regression is given in Hausman and Wise (1977), where they emphasize that OLS applied to a sample truncated from above generally produces estimators biased toward zero. Intuitively, this makes sense. Suppose that the relationship of interest is between income and education levels. If we only observe people whose income is below a certain threshold, we are lopping off the upper end. This tends to flatten the estimated line relative to the true regression line in the whole population. Figure 17.4 illustrates the problem when income is truncated from above at $50,000. Although we observe the data points represented by the open circles, we do not observe the data points represented by the darkened circles. A regression analysis using the truncated sample does not lead to consistent estimators.

[Figure 17.4 A true, or population, regression line and the incorrect regression line for the truncated population, with observed incomes below $50,000. Vertical axis: income in thousands of dollars; horizontal axis: education in years.]

Incidentally, if the sample in Figure 17.4 were censored rather than truncated (that is, if we had top-coded data), we would observe education levels for all points in Figure 17.4, but for individuals with incomes above $50,000 we would not know the exact income amount. We would only know that income was at least $50,000. In effect, all observations represented by the darkened circles would be brought down to the horizontal line at income = 50.

As with censored regression, if the underlying homoskedastic normal assumption in (17.40) is violated, the truncated normal MLE is biased and inconsistent. Methods that do not require these assumptions are available; see Wooldridge (2010), Chapter 19, for discussion and references.
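The flattening visible in Figure 17.4 is easy to reproduce by simulation. In this sketch (all numbers invented for illustration), truncating incomes above 50 attenuates the OLS slope relative to the population slope:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5000
educ = rng.uniform(8, 20, size=n)
income = -30 + 5 * educ + rng.normal(scale=15, size=n)   # true slope = 5

keep = income < 50                       # truncated sample: high incomes dropped
X = np.column_stack([np.ones(n), educ])
b_full = np.linalg.lstsq(X, income, rcond=None)[0]
b_trunc = np.linalg.lstsq(X[keep], income[keep], rcond=None)[0]
print(b_full[1], b_trunc[1])             # truncated-sample slope is biased toward zero
```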
17.5 Sample Selection Corrections

Truncated regression is a special case of a general problem known as nonrandom sample selection. But survey design is not the only cause of nonrandom sample selection. Often, respondents fail to provide answers to certain questions, which leads to missing data for the dependent or independent variables. Because we cannot use these observations in our estimation, we should wonder whether dropping them leads to bias in our estimators.

Another general example is usually called incidental truncation. Here, we do not observe y because of the outcome of another variable. The leading example is estimating the so-called wage offer function from labor economics. Interest lies in how various factors, such as education, affect the wage an individual could earn in the labor force. For people who are in the workforce, we observe the wage offer as the current wage. But for those currently out of the workforce, we do not observe the wage offer. Because working may be systematically correlated with unobservables that affect the wage offer, using only working people (as we have in all wage examples so far) might produce biased estimators of the parameters in the wage offer equation.

Nonrandom sample selection can also arise when we have panel data. In the simplest case, we have two years of data, but, due to attrition, some people leave the sample. This is particularly a problem in policy analysis, where attrition may be related to the effectiveness of a program.

17.5a When Is OLS on the Selected Sample Consistent?

In Section 9.4, we provided a brief discussion of the kinds of sample selection that can be ignored. The key distinction is between exogenous and endogenous sample selection. In the truncated Tobit case, we clearly have endogenous sample selection, and OLS is biased and inconsistent. On the other hand, if our sample is determined solely by an exogenous explanatory variable, we have exogenous sample selection. Cases between these extremes are less clear, and we now provide careful definitions and assumptions for them.

The population model is

y = β₀ + β₁x₁ + ⋯ + βₖxₖ + u,  E(u|x₁, x₂, …, xₖ) = 0.   (17.42)

It is useful to write the population model for a random draw as

yᵢ = xᵢβ + uᵢ,   (17.43)

where we use xᵢβ as shorthand for β₀ + β₁xᵢ₁ + β₂xᵢ₂ + ⋯ + βₖxᵢₖ. Now, let n be the size of a random sample from the population. If we could observe yᵢ and each xᵢⱼ for all i, we would simply use OLS. Assume that, for some reason, either yᵢ or some of the independent variables are not observed for certain i. For at least some observations, we observe the full set of variables. Define a selection indicator, sᵢ, for each i, by sᵢ = 1 if we observe all of (yᵢ, xᵢ), and sᵢ = 0 otherwise. Thus, sᵢ = 1 indicates that we will use the observation in our analysis; sᵢ = 0 means the observation will not be used. We are interested in the statistical properties of the OLS estimators using the selected sample, that is, using observations for which sᵢ = 1. Therefore, we use fewer than n observations, say n₁.

It turns out to be easy to obtain conditions under which OLS is consistent (and even unbiased). Effectively, rather than estimating (17.43), we can only estimate the equation

sᵢyᵢ = sᵢxᵢβ + sᵢuᵢ.   (17.44)

When sᵢ = 1, we simply have (17.43); when sᵢ = 0, we simply have 0 = 0 + 0, which clearly tells us nothing about β. Regressing sᵢyᵢ on sᵢxᵢ for i = 1, 2, …, n is the same as regressing yᵢ on xᵢ using the observations for which sᵢ = 1. Thus, we can learn about the consistency of the β̂ⱼ by studying (17.44) on a random sample.
From our analysis in Chapter 5, the OLS estimators from (17.44) are consistent if the error term has zero mean and is uncorrelated with each explanatory variable. In the population, the zero mean assumption is E(su) = 0, and the zero correlation assumptions can be stated as

E[(sxⱼ)(su)] = E(sxⱼu) = 0,   (17.45)

where s, xⱼ, and u are random variables representing the population; we have used the fact that s² = s because s is a binary variable. Condition (17.45) is different from what we need if we observe all variables for a random sample, E(xⱼu) = 0. Therefore, in the population, we need u to be uncorrelated with sxⱼ.

The key condition for unbiasedness is E(su|sx₁, …, sxₖ) = 0. As usual, this is a stronger assumption than that needed for consistency.

If s is a function only of the explanatory variables, then sxⱼ is just a function of x₁, x₂, …, xₖ; by the conditional mean assumption in (17.42), sxⱼ is also uncorrelated with u. In fact, E(su|sx₁, …, sxₖ) = sE(u|sx₁, …, sxₖ) = 0, because E(u|x₁, …, xₖ) = 0. This is the case of exogenous sample selection, where sᵢ = 1 is determined entirely by xᵢ₁, …, xᵢₖ. As an example, if we are estimating a wage equation where the explanatory variables are education, experience, tenure, gender, marital status, and so on (which are assumed to be exogenous), we can select the sample on the basis of any or all of the explanatory variables.

If sample selection is entirely random in the sense that sᵢ is independent of (xᵢ, uᵢ), then E(sxⱼu) = E(s)E(xⱼu) = 0, because E(xⱼu) = 0 under (17.42). Therefore, if we begin with a random sample and randomly drop observations, OLS is still consistent. In fact, OLS is again unbiased in this case, provided there is no perfect multicollinearity in the selected sample.

If s depends on the explanatory variables and additional random terms that are independent of x and u, OLS is also consistent and unbiased. For example, suppose that IQ score is an explanatory variable in a wage equation, but IQ is missing for some people. Suppose we think that selection can be described by s = 1 if IQ ≥ v, and s = 0 if IQ < v, where v is an unobserved random variable that is independent of IQ, u, and the other explanatory variables. This means that we are more likely to observe an IQ that is high, but there is always some chance of not observing any IQ. Conditional on the explanatory variables, s is independent of u, which means that E(u|x₁, …, xₖ, s) = E(u|x₁, …, xₖ), and the last expectation is zero by assumption on the population model. If we add the homoskedasticity assumption E(u²|x, s) = E(u²) = σ², then the usual OLS standard errors and test statistics are valid.

So far, we have shown several situations where OLS on the selected sample is unbiased, or at least consistent. When is OLS on the selected sample inconsistent? We already saw one example: regression using a truncated sample. When the truncation is from above, sᵢ = 1 if yᵢ ≤ cᵢ, where cᵢ is the truncation threshold. Equivalently, sᵢ = 1 if uᵢ ≤ cᵢ − xᵢβ. Because sᵢ depends directly on uᵢ, sᵢ and uᵢ will not be uncorrelated, even conditional on xᵢ. This is why OLS on the selected sample does not consistently estimate the βⱼ.
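A short simulation, with an invented model, illustrates both cases: selecting on an exogenous explanatory variable leaves OLS consistent, while selecting on y (truncation) does not:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 10000
x = rng.normal(size=n)
u = rng.normal(size=n)
y = 1 + x + u                        # population model with slope 1

def ols_slope(xs, ys):
    X = np.column_stack([np.ones(xs.size), xs])
    return np.linalg.lstsq(X, ys, rcond=None)[0][1]

s_exog = x < 1.0                     # selection based only on an explanatory variable
s_endog = y < 1.0                    # selection based on the outcome (depends on u)
print(ols_slope(x[s_exog], y[s_exog]))     # close to 1
print(ols_slope(x[s_endog], y[s_endog]))   # attenuated: s is correlated with u
```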
There are less obvious ways that s and u can be correlated; we consider this in the next subsection.

The results on consistency of OLS extend to instrumental variables estimation. If the IVs are denoted zₕ in the population, the key condition for consistency of 2SLS is E(szₕu) = 0, which holds if E(u|z, s) = 0. Therefore, if selection is determined entirely by the exogenous variables z, or if s depends on other factors that are independent of u and z, then 2SLS on the selected sample is generally consistent. We do need to assume that the explanatory and instrumental variables are appropriately correlated in the selected part of the population. Wooldridge (2010), Chapter 19, contains precise statements of these assumptions.

It can also be shown that, when selection is entirely a function of the exogenous variables, maximum likelihood estimation of a nonlinear model (such as a logit or probit model) produces consistent, asymptotically normal estimators, and the usual standard errors and test statistics are valid. [Again, see Wooldridge (2010), Chapter 19.]

17.5b Incidental Truncation

As we mentioned earlier, a common form of sample selection is called incidental truncation. We again start with the population model in (17.42). However, we assume that we will always observe the explanatory variables xⱼ. The problem is that we only observe y for a subset of the population. The rule determining whether we observe y does not depend directly on the outcome of y. A leading example is when y = log(wageᵒ), where wageᵒ is the wage offer, or the hourly wage that an individual could receive in the labor market. If the person is actually working at the time of the survey, then we observe the wage offer, because we assume it is the observed wage. But for people out of the workforce, we cannot observe wageᵒ. Therefore, the truncation of wage offer is incidental, because it depends on another variable, namely, labor force participation. Importantly, we would generally observe all other information about an individual, such as education, prior experience, gender, marital status, and so on.

The usual approach to incidental truncation is to add an explicit selection equation to the population model of interest:

y = xβ + u,  E(u|x) = 0   (17.46)

s = 1[zγ + v ≥ 0],   (17.47)

where s = 1 if we observe y, and zero otherwise. We assume that elements of x and z are always observed, and we write xβ = β₀ + β₁x₁ + ⋯ + βₖxₖ and zγ = γ₀ + γ₁z₁ + ⋯ + γₘzₘ.

The equation of primary interest is (17.46), and we could estimate β by OLS given a random sample. The selection equation, (17.47), depends on observed variables, zₕ, and an unobserved error, v. A standard assumption, which we will make, is that z is exogenous in (17.46): E(u|x, z) = 0. In fact, for the following proposed methods to work well, we will require that x be a strict subset of z: any xⱼ is also an element of z, and we have some elements of z that are not also in x. We will see later why this is crucial.

The error term v in the sample selection equation is assumed to be independent of z (and therefore x). We also assume that v has a standard normal distribution.
We can easily see that correlation between u and v generally causes a sample selection problem. To see why, assume that (u, v) is independent of z. Then, taking the expectation of (17.46), conditional on z and v, and using the fact that x is a subset of z gives

E(y|z, v) = xβ + E(u|z, v) = xβ + E(u|v),

where E(u|z, v) = E(u|v) because (u, v) is independent of z. Now, if u and v are jointly normal (with zero mean), then E(u|v) = ρv for some parameter ρ. Therefore,

E(y|z, v) = xβ + ρv.

We do not observe v, but we can use this equation to compute E(y|z, s) and then specialize this to s = 1. We now have

E(y|z, s) = xβ + ρE(v|z, s).

Because s and v are related by (17.47), and v has a standard normal distribution, we can show that E(v|z, s) is simply the inverse Mills ratio, λ(zγ), when s = 1. This leads to the important equation

E(y|z, s = 1) = xβ + ρλ(zγ).   (17.48)

Equation (17.48) shows that the expected value of y, given z and observability of y, is equal to xβ, plus an additional term that depends on the inverse Mills ratio evaluated at zγ. Remember, we hope to estimate β. This equation shows that we can do so using only the selected sample, provided we include the term λ(zγ) as an additional regressor.

If ρ = 0, λ(zγ) does not appear, and OLS of y on x using the selected sample consistently estimates β. Otherwise, we have effectively omitted a variable, λ(zγ), which is generally correlated with x. When does ρ = 0? The answer is when u and v are uncorrelated.

Because γ is unknown, we cannot evaluate λ(zᵢγ) for each i. However, from the assumptions we have made, s given z follows a probit model:

P(s = 1|z) = Φ(zγ).   (17.49)

Therefore, we can estimate γ by probit of sᵢ on zᵢ, using the entire sample. In a second step, we can estimate β. We summarize the procedure, which has recently been dubbed the Heckit method in the econometrics literature, after the work of Heckman (1976).

Sample Selection Correction:
(i) Using all n observations, estimate a probit model of sᵢ on zᵢ and obtain the estimates γ̂ₕ. Compute the inverse Mills ratio, λ̂ᵢ = λ(zᵢγ̂), for each i. (Actually, we need these only for the i with sᵢ = 1.)
(ii) Using the selected sample, that is, the observations for which sᵢ = 1 (say, n₁ of them), run the regression of

yᵢ on xᵢ, λ̂ᵢ.   (17.50)

The β̂ⱼ are consistent and approximately normally distributed.

A simple test of selection bias is available from regression (17.50). Namely, we can use the usual t statistic on λ̂ᵢ as a test of H₀: ρ = 0. Under H₀, there is no sample selection problem.

When ρ ≠ 0, the usual OLS standard errors reported from (17.50) are not correct. This is because they do not account for estimation of γ, which uses the same observations in regression (17.50), and more. Some econometrics packages compute corrected standard errors. [Unfortunately, it is not as simple as a heteroskedasticity adjustment. See Wooldridge (2010), Chapter 6, for further discussion.] In many cases, the adjustments do not lead to important differences, but it is hard to know that beforehand (unless ρ̂ is small and insignificant).
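The key fact behind (17.48), that E(v|z, s = 1) equals the inverse Mills ratio λ(zγ) = φ(zγ)/Φ(zγ), is easy to verify numerically. A small simulation sketch (the value of zγ is arbitrary):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(6)
zg = 0.3                               # an arbitrary value of z*gamma
v = rng.normal(size=1_000_000)         # v ~ Normal(0, 1)
s = zg + v >= 0                        # selection rule (17.47)
print(v[s].mean())                     # simulated E(v | z, s = 1)
print(norm.pdf(zg) / norm.cdf(zg))     # inverse Mills ratio lambda(z*gamma)
```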
We recently mentioned that x should be a strict subset of z. This has two implications. First, any element that appears as an explanatory variable in (17.46) should also be an explanatory variable in the selection equation. Although in rare cases it makes sense to exclude elements from the selection equation, including all elements of x in z is not very costly; excluding them can lead to inconsistency if they are incorrectly excluded.

A second major implication is that we have at least one element of z that is not also in x. This means that we need a variable that affects selection but does not have a partial effect on y. This is not absolutely necessary to apply the procedure (in fact, we can mechanically carry out the two steps when z = x), but the results are usually less than convincing unless we have an exclusion restriction in (17.46). The reason for this is that, while the inverse Mills ratio is a nonlinear function of z, it is often well approximated by a linear function. If z = x, λ̂ᵢ can be highly correlated with the elements of xᵢ. As we know, such multicollinearity can lead to very high standard errors for the β̂ⱼ. Intuitively, if we do not have a variable that affects selection but not y, it is extremely difficult, if not impossible, to distinguish sample selection from a misspecified functional form in (17.46).

Example 17.5 Wage Offer Equation for Married Women

We apply the sample selection correction to the data on married women in MROZ. Recall that of the 753 women in the sample, 428 worked for a wage during the year. The wage offer equation is standard, with log(wage) as the dependent variable and educ, exper, and exper² as the explanatory variables. In order to test and correct for sample selection bias (due to unobservability of the wage offer for nonworking women), we need to estimate a probit model for labor force participation. In addition to the education and experience variables, we include the factors in Table 17.1: other income, age, number of young children, and number of older children. The fact that these four variables are excluded from the wage offer equation is an assumption: we assume that, given the productivity factors, nwifeinc, age, kidslt6, and kidsge6 have no effect on the wage offer. It is clear from the probit results in Table 17.1 that at least age and kidslt6 have a strong effect on labor force participation.

Table 17.7 contains the results from OLS and Heckit. [The standard errors reported for the Heckit results are just the usual OLS standard errors from regression (17.50).] There is no evidence of a sample selection problem in estimating the wage offer equation. The coefficient on λ̂ has a very small t statistic (.239), so we fail to reject H₀: ρ = 0. Just as importantly, there are no practically large differences in the estimated slope coefficients in Table 17.7. The estimated returns to education differ by only one-tenth of a percentage point.

Table 17.7 Wage Offer Equation for Married Women
Dependent Variable: log(wage)

Independent Variable   OLS                Heckit
educ                   .108 (.014)        .109 (.016)
exper                  .042 (.012)        .044 (.016)
exper²                 −.00081 (.00039)   −.00086 (.00044)
constant               −.522 (.199)       −.578 (.307)
λ̂                      —                  .032 (.134)
Sample size            428                428
R-squared              .157               .157
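As a concrete sketch of the two-step procedure applied to Example 17.5, the following Python fragment assumes a pandas DataFrame df holding MROZ-style columns (inlf, lwage, educ, exper, nwifeinc, age, kidslt6, kidsge6); the column names and the statsmodels calls are assumptions of this sketch, not the text's own code, and the second-step standard errors are the uncorrected OLS ones discussed above:

```python
import numpy as np
import statsmodels.api as sm
from scipy.stats import norm

# Step (i): probit for labor force participation, using the full sample
df["exper2"] = df["exper"] ** 2
Z = sm.add_constant(df[["educ", "exper", "exper2", "nwifeinc",
                        "age", "kidslt6", "kidsge6"]])
probit = sm.Probit(df["inlf"], Z).fit(disp=0)
zg = np.asarray(Z @ probit.params)           # fitted index z*gamma-hat
df["imr"] = norm.pdf(zg) / norm.cdf(zg)      # inverse Mills ratio

# Step (ii): OLS on the selected sample, adding the inverse Mills ratio
work = df[df["inlf"] == 1]
X = sm.add_constant(work[["educ", "exper", "exper2", "imr"]])
heckit = sm.OLS(work["lwage"], X).fit()
print(heckit.summary())                      # the t statistic on imr tests H0: rho = 0
```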
An alternative to the preceding two-step estimation method is full maximum likelihood estimation. This is more complicated, as it requires obtaining the joint distribution of y and s. It often makes sense to test for sample selection using the previous procedure; if there is no evidence of sample selection, there is no reason to continue. If we detect sample selection bias, we can either use the two-step estimates or estimate the regression and selection equations jointly by MLE. [See Wooldridge (2010), Chapter 19.]

In Example 17.5, we know more than just whether a woman worked during the year: we know how many hours each woman worked. It turns out that we can use this information in an alternative sample selection procedure. In place of the inverse Mills ratio λ̂ᵢ, we use the Tobit residuals, say v̂ᵢ, which are computed as v̂ᵢ = yᵢ − xᵢβ̂ whenever yᵢ > 0. It can be shown that the regression in (17.50), with v̂ᵢ in place of λ̂ᵢ, also produces consistent estimates of the βⱼ, and the standard t statistic on v̂ᵢ is a valid test for sample selection bias. This approach has the advantage of using more information, but it is less widely applicable. [See Wooldridge (2010), Chapter 19.]

There are many more topics concerning sample selection. One worth mentioning is models with endogenous explanatory variables in addition to possible sample selection bias. Write a model with a single endogenous explanatory variable as

y₁ = α₁y₂ + z₁β₁ + u₁,   (17.51)

where y₁ is only observed when s = 1, and y₂ may only be observed along with y₁. An example is when y₁ is the percentage of votes received by an incumbent and y₂ is the percentage of total expenditures accounted for by the incumbent. For incumbents who do not run, we cannot observe y₁ or y₂. If we have exogenous factors that affect the decision to run and that are correlated with campaign expenditures, we can consistently estimate α₁ and the elements of β₁ by instrumental variables. To be convincing, we need two exogenous variables that do not appear in (17.51). Effectively, one should affect the selection decision, and one should be correlated with y₂ [the usual requirement for estimating (17.51) by 2SLS]. Briefly, the method is to estimate the selection equation by probit, where all exogenous variables appear in the probit equation. Then, we add the inverse Mills ratio to (17.51) and estimate the equation by 2SLS. The inverse Mills ratio acts as its own instrument, as it depends only on exogenous variables. We use all exogenous variables as the other instruments. As before, we can use the t statistic on λ̂ᵢ as a test for selection bias. [See Wooldridge (2010), Chapter 19, for further information.]
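The following simulation sketches the procedure just outlined for (17.51). The data-generating process and all names are invented, the 2SLS step is done by hand, and the printed second-stage standard errors would not be the correct ones:

```python
import numpy as np
import statsmodels.api as sm
from scipy.stats import norm

rng = np.random.default_rng(3)
n = 5000
z1 = rng.normal(size=n)             # exogenous regressor appearing in (17.51)
z2 = rng.normal(size=n)             # excluded instrument correlated with y2
z3 = rng.normal(size=n)             # excluded variable driving selection
u1 = rng.normal(size=n)
v = 0.5 * u1 + rng.normal(size=n)   # selection error correlated with u1
y2 = 1 + z1 + z2 + 0.8 * u1 + rng.normal(size=n)   # endogenous regressor
y1 = 1 + 0.5 * y2 + z1 + u1                        # true alpha1 = 0.5
s = (0.5 + z3 + v >= 0).astype(int)

# probit for selection on all exogenous variables, then inverse Mills ratio
Z = sm.add_constant(np.column_stack([z1, z2, z3]))
gamma = sm.Probit(s, Z).fit(disp=0).params
idx = Z @ gamma
imr = norm.pdf(idx) / norm.cdf(idx)

# 2SLS on the selected sample, with the inverse Mills ratio added
sel = s == 1
exog = np.column_stack([np.ones(sel.sum()), z1[sel], imr[sel]])
instr = np.column_stack([Z[sel], imr[sel]])        # all exogenous vars + IMR
y2_hat = instr @ np.linalg.lstsq(instr, y2[sel], rcond=None)[0]   # first stage
X2 = np.column_stack([y2_hat, exog])
a1, b0, b1, rho = np.linalg.lstsq(X2, y1[sel], rcond=None)[0]
print(a1)                                          # roughly 0.5
```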
supply amount of life insurance and amount of pension fund invested in stocks have this feature As with logit and probit the expected val ues of y given xeither conditional on y 0 or unconditionallydepend on x and b in nonlinear ways We gave the expressions for these expectations as well as formulas for the partial effects of each xj on the expectations These can be estimated after the Tobit model has been estimated by maximum likelihood When the dependent variable is a count variablethat is it takes on nonnegative integer valuesa Poisson regression model is appropriate The expected value of y given the xj has an exponential form This gives the parameter interpretations as semielasticities or elasticities depending on whether xj is in level or logarithmic form In short we can interpret the parameters as if they are in a linear model with logy as the dependent variable The parameters can be estimated by MLE However because the Poisson distribu tion imposes equality of the variance and mean it is often necessary to compute standard errors and test statistics that allow for over or underdispersion These are simple adjustments to the usual MLE standard errors and statistics Censored and truncated regression models handle specific kinds of missing data problems In cen sored regression the dependent variable is censored above or below a threshold We can use information on the censored outcomes because we always observe the explanatory variables as in duration applications or top coding of observations A truncated regression model arises when a part of the population is excluded entirely we observe no information on units that are not covered by the sampling scheme This is a special case of a sample selection problem Section 175 gave a systematic treatment of nonrandom sample selection We showed that exogenous sample selection does not affect consistency of OLS when it is applied to the subsample but endogenous sample selection does We showed how to test and correct for sample selection bias for the general problem of incidental truncation where observations are missing on y due to the outcome of another variable such as labor force participation Heckmans method is relatively easy to implement in these situations Key Terms Average Marginal Effect AME Average Partial Effect APE Binary Response Models Censored Normal Regression Model Censored Regression Model Corner Solution Response Count Variable Duration Analysis Exogenous Sample Selection Heckit Method Incidental Truncation Inverse Mills Ratio Latent Variable Model Likelihood Ratio Statistic Limited Dependent Variable LDV Logit Model LogLikelihood Function Maximum Likelihood Estimation MLE Nonrandom Sample Selection Overdispersion Partial Effect at the Average PEA Percent Correctly Predicted Poisson Distribution Poisson Regression Model Copyright 2016 Cengage Learning All Rights Reserved May not be copied scanned or duplicated in whole or in part Due to electronic rights some third party content may be suppressed from the eBook andor eChapters Editorial review has deemed that any suppressed content does not materially affect the overall learning experience Cengage Learning reserves the right to remove additional content at any time if subsequent rights restrictions require it CHAPTER 17 Limited Dependent Variable Models and Sample Selection Corrections 559 Problems 1 i For a binary response y let y be the proportion of ones in the sample which is equal to the sample average of the yj Let q 0 be the percent correctly predicted for the outcome y 
Problems

1 (i) For a binary response y, let ȳ be the proportion of ones in the sample (which is equal to the sample average of the yᵢ). Let q̂₀ be the percent correctly predicted for the outcome y = 0 and let q̂₁ be the percent correctly predicted for the outcome y = 1. If p̂ is the overall percent correctly predicted, show that p̂ is a weighted average of q̂₀ and q̂₁:

p̂ = (1 − ȳ)q̂₀ + ȳq̂₁.

(ii) In a sample of 300, suppose that ȳ = .70, so that there are 210 outcomes with yᵢ = 1 and 90 with yᵢ = 0. Suppose that the percent correctly predicted when y = 0 is 80, and the percent correctly predicted when y = 1 is 40. Find the overall percent correctly predicted.

2 Let grad be a dummy variable for whether a student-athlete at a large university graduates in five years. Let hsGPA and SAT be high school grade point average and SAT score, respectively. Let study be the number of hours spent per week in an organized study hall. Suppose that, using data on 420 student-athletes, the following logit model is obtained:

P̂(grad = 1|hsGPA, SAT, study) = Λ(−1.17 + .24 hsGPA + .00058 SAT + .073 study),

where Λ(z) = exp(z)/[1 + exp(z)] is the logit function. Holding hsGPA fixed at 3.0 and SAT fixed at 1,200, compute the estimated difference in the graduation probability for someone who spent 10 hours per week in study hall and someone who spent 5 hours per week.

3 (Requires calculus)
(i) Suppose in the Tobit model that x₁ = log(z₁), and this is the only place z₁ appears in x. Show that

∂E(y|y > 0, x)/∂z₁ = (β₁/z₁){1 − λ(xβ/σ)[xβ/σ + λ(xβ/σ)]},   (17.52)

where β₁ is the coefficient on log(z₁).
(ii) If x₁ = z₁ and x₂ = z₁², show that

∂E(y|y > 0, x)/∂z₁ = (β₁ + 2β₂z₁){1 − λ(xβ/σ)[xβ/σ + λ(xβ/σ)]},

where β₁ is the coefficient on z₁ and β₂ is the coefficient on z₁².

4 Let mvpᵢ be the marginal value product for worker i, which is the price of a firm's good multiplied by the marginal product of the worker. Assume that

log(mvpᵢ) = β₀ + β₁xᵢ₁ + ⋯ + βₖxᵢₖ + uᵢ
wageᵢ = max(mvpᵢ, minwageᵢ),

where the explanatory variables include education, experience, and so on, and minwageᵢ is the minimum wage relevant for person i. Write log(wageᵢ) in terms of log(mvpᵢ) and log(minwageᵢ).

5 (Requires calculus) Let patents be the number of patents applied for by a firm during a given year. Assume that the conditional expectation of patents given sales and RD is

E(patents|sales, RD) = exp[β₀ + β₁log(sales) + β₂RD + β₃RD²],

where sales is annual firm sales and RD is total spending on research and development over the past 10 years.
(i) How would you estimate the βⱼ? Justify your answer by discussing the nature of patents.
(ii) How do you interpret β₁?
(iii) Find the partial effect of RD on E(patents|sales, RD).

6 Consider a family saving function for the population of all families in the United States:

sav = β₀ + β₁inc + β₂hhsize + β₃educ + β₄age + u,

where hhsize is household size, educ is years of education of the household head, and age is age of the household head. Assume that E(u|inc, hhsize, educ, age) = 0.
(i) Suppose that the sample includes only families whose head is over 25 years old. If we use OLS on such a sample, do we get unbiased estimators of the βⱼ? Explain.
(ii) Now, suppose our sample includes only married couples without children. Can we estimate all of the parameters in the saving equation? Which ones can we estimate?
(iii) Suppose we exclude from our sample families that save more than $25,000 per year. Does OLS produce consistent estimators of the βⱼ?

7 Suppose you are hired by a university to study the factors that determine whether students admitted to the university actually come to the university. You are given a large random sample of students who were admitted the previous year. You have information on whether each student chose to attend, high school performance, family income, financial aid offered, race, and geographic variables. Someone says to you, "Any analysis of that data will lead to biased results, because it is not a random sample of all college applicants, but only those who apply to this university." What do you think of this criticism?

Computer Exercises

C1 Use the data in PNTSPRD for this exercise.
(i) The variable favwin is a binary variable equal to one if the team favored by the Las Vegas point spread wins. A linear probability model to estimate the probability that the favored team wins is

P(favwin = 1|spread) = β₀ + β₁spread.

Explain why, if the spread incorporates all relevant information, we expect β₀ = .5.
(ii) Estimate the model from part (i) by OLS. Test H₀: β₀ = .5 against a two-sided alternative. Use both the usual and heteroskedasticity-robust standard errors.
(iii) Is spread statistically significant? What is the estimated probability that the favored team wins when spread = 10?
(iv) Now, estimate a probit model for P(favwin = 1|spread). Interpret and test the null hypothesis that the intercept is zero. [Hint: Remember that Φ(0) = .5.]
(v) Use the probit model to estimate the probability that the favored team wins when spread = 10. Compare this with the LPM estimate from part (iii).
(vi) Add the variables favhome, fav25, and und25 to the probit model and test joint significance of these variables using the likelihood ratio test. (How many df are in the chi-square distribution?) Interpret this result, focusing on the question of whether the spread incorporates all observable information prior to a game.

C2 Use the data in LOANAPP for this exercise; see also Computer Exercise C8 in Chapter 7.
(i) Estimate a probit model of approve on white. Find the estimated probability of loan approval for both whites and nonwhites. How do these compare with the linear probability estimates?
(ii) Now, add the variables hrat, obrat, loanprc, unem, male, married, dep, sch, cosign, chist, pubrec, mortlat1, mortlat2, and vr to the probit model. Is there statistically significant evidence of discrimination against nonwhites?
(iii) Estimate the model from part (ii) by logit. Compare the coefficient on white to the probit estimate.
(iv) Use equation (17.17) to estimate the sizes of the discrimination effects for probit and logit.

C3 Use the data in FRINGE for this exercise.
(i) For what percentage of the workers in the sample is pension equal to zero? What is the range of pension for workers with nonzero pension benefits? Why is a Tobit model appropriate for modeling pension?
(ii) Estimate a Tobit model explaining pension in terms of exper, age, tenure, educ, depends, married, white, and male. Do whites and males have statistically significant higher expected pension benefits?
(iii) Use the results from part (ii) to estimate the difference in expected pension benefits for a white male and a nonwhite female, both of whom are 35 years old, are single with no dependents, have 16 years of education, and have 10 years of experience.
(iv) Add union to the Tobit model and comment on its significance.
(v) Apply the Tobit model from part (iv) but with peratio, the pension-earnings ratio, as the dependent variable. (Notice that this is a fraction between zero and one, but, though it often takes on the value zero, it never gets close to being unity. Thus, a Tobit model is fine as an approximation.) Does gender or race have an effect on the pension-earnings ratio?

C4 In Example 9.1, we added the quadratic terms pcnv², ptime86², and inc86² to a linear model for narr86.
(i) Use the data in CRIME1 to add these same terms to the Poisson regression in Example 17.3.
(ii) Compute the estimate of σ² given by σ̂² = (n − k − 1)⁻¹ Σᵢ₌₁ⁿ ûᵢ²/ŷᵢ. Is there evidence of overdispersion? How should the Poisson MLE standard errors be adjusted?
(iii) Use the results from parts (i) and (ii) and Table 17.5 to compute the quasi-likelihood ratio statistic for joint significance of the three quadratic terms. What do you conclude?

C5 Refer to Table 13.1 in Chapter 13. There, we used the data in FERTIL1 to estimate a linear model for kids, the number of children ever born to a woman.
(i) Estimate a Poisson regression model for kids, using the same variables in Table 13.1. Interpret the coefficient on y82.
(ii) What is the estimated percentage difference in fertility between a black woman and a nonblack woman, holding other factors fixed?
(iii) Obtain σ̂. Is there evidence of over- or underdispersion?
(iv) Compute the fitted values from the Poisson regression and obtain the R-squared as the squared correlation between kidsᵢ and the fitted values. Compare this with the R-squared for the linear regression model.

C6 Use the data in RECID to estimate the model from Example 17.4 by OLS, using only the 552 uncensored durations. Comment generally on how these estimates compare with those in Table 17.6.

C7 Use the MROZ data for this exercise.
(i) Using the 428 women who were in the workforce, estimate the return to education by OLS, including exper, exper², nwifeinc, age, kidslt6, and kidsge6 as explanatory variables. Report your estimate on educ and its standard error.
(ii) Now, estimate the return to education by Heckit, where all exogenous variables show up in the second-stage regression. In other words, the regression is log(wage) on educ, exper, exper², nwifeinc, age, kidslt6, kidsge6, and λ̂. Compare the estimated return to education and its standard error to that from part (i).
(iii) Using only the 428 observations for working women, regress λ̂ on educ, exper, exper², nwifeinc, age, kidslt6, and kidsge6. How big is the R-squared? How does this help explain your findings from part (ii)? (Hint: Think multicollinearity.)
C8 The file JTRAIN2 contains data on a job training experiment for a group of men. Men could enter the program starting in January 1976 through about mid-1977. The program ended in December 1977. The idea is to test whether participation in the job training program had an effect on unemployment probabilities and earnings in 1978.
(i) The variable train is the job training indicator. How many men in the sample participated in the job training program? What was the highest number of months a man actually participated in the program?
(ii) Run a linear regression of train on several demographic and pretraining variables: unem74, unem75, age, educ, black, hisp, and married. Are these variables jointly significant at the 5% level?
(iii) Estimate a probit version of the linear model in part (ii). Compute the likelihood ratio test for joint significance of all variables. What do you conclude?
(iv) Based on your answers to parts (ii) and (iii), does it appear that participation in job training can be treated as exogenous for explaining 1978 unemployment status? Explain.
(v) Run a simple regression of unem78 on train and report the results in equation form. What is the estimated effect of participating in the job training program on the probability of being unemployed in 1978? Is it statistically significant?
(vi) Run a probit of unem78 on train. Does it make sense to compare the probit coefficient on train with the coefficient obtained from the linear model in part (v)?
(vii) Find the fitted probabilities from parts (v) and (vi). Explain why they are identical. Which approach would you use to measure the effect and statistical significance of the job training program?
(viii) Add all of the variables from part (ii) as additional controls to the models from parts (v) and (vi). Are the fitted probabilities now identical? What is the correlation between them?
(ix) Using the model from part (viii), estimate the average partial effect of train on the 1978 unemployment probability. Use (17.17) with cₖ = 0. How does the estimate compare with the OLS estimate from part (viii)?

C9 Use the data in APPLE for this exercise. These are telephone survey data attempting to elicit the demand for a (fictional) ecologically friendly apple. Each family was (randomly) presented with a set of prices for regular apples and the ecolabeled apples. They were asked how many pounds of each kind of apple they would buy.
(i) Of the 660 families in the sample, how many report wanting none of the ecolabeled apples at the set price?
(ii) Does the variable ecolbs seem to have a continuous distribution over strictly positive values? What implications does your answer have for the suitability of a Tobit model for ecolbs?
(iii) Estimate a Tobit model for ecolbs with ecoprc, regprc, faminc, and hhsize as explanatory variables. Which variables are significant at the 1% level?
(iv) Are faminc and hhsize jointly significant?
(v) Are the signs of the coefficients on the price variables from part (iii) what you expect? Explain.
(vi) Let β₁ be the coefficient on ecoprc and let β₂ be the coefficient on regprc. Test the hypothesis H₀: −β₁ = β₂ against the two-sided alternative. Report the p-value of the test. (You might want to refer to Section 4.4 if your regression package does not easily compute such tests.)
(vii) Obtain the estimates of E(ecolbs|x) for all observations in the sample. [See equation (17.25).] Call these the fitted values. What are the smallest and largest fitted values?
(viii) Compute the squared correlation between ecolbsᵢ and the fitted values from part (vii).
(ix) Now, estimate a linear model for ecolbs, using the same explanatory variables from part (iii). Why are the OLS estimates so much smaller than the Tobit estimates? In terms of goodness-of-fit, is the Tobit model better than the linear model?
(x) Evaluate the following statement: "Because the R-squared from the Tobit model is so small, the estimated price effects are probably inconsistent."

C10 Use the data in SMOKE for this exercise.
(i) The variable cigs is the number of cigarettes smoked per day. How many people in the sample do not smoke at all? What fraction of people claim to smoke 20 cigarettes a day? Why do you think there is a pileup of people at 20 cigarettes?
(ii) Given your answers to part (i), does cigs seem a good candidate for having a conditional Poisson distribution?
(iii) Estimate a Poisson regression model for cigs, including log(cigpric), log(income), white, educ, age, and age² as explanatory variables. What are the estimated price and income elasticities?
(iv) Using the maximum likelihood standard errors, are the price and income variables statistically significant at the 5% level?
(v) Obtain the estimate of σ² described after equation (17.35). What is σ̂? How should you adjust the standard errors from part (iv)?
(vi) Using the adjusted standard errors from part (v), are the price and income elasticities now statistically different from zero? Explain.
(vii) Are the education and age variables significant using the more robust standard errors? How do you interpret the coefficient on educ?
(viii) Obtain the fitted values, ŷᵢ, from the Poisson regression model. Find the minimum and maximum values and discuss how well the exponential model predicts heavy cigarette smoking.
(ix) Using the fitted values from part (viii), obtain the squared correlation coefficient between yᵢ and ŷᵢ.
(x) Estimate a linear model for cigs by OLS, using the explanatory variables (and same functional forms) as in part (iii). Does the linear model or exponential model provide a better fit? Is either R-squared very large?

C11 Use the data in CPS91 for this exercise. These data are for married women, where we also have information on each husband's income and demographics.
(i) What fraction of the women report being in the labor force?
(ii) Using only the data for working women (you have no choice), estimate the wage equation

log(wage) = β₀ + β₁educ + β₂exper + β₃exper² + β₄black + β₅hispanic + u

by ordinary least squares. Report the results in the usual form. Do there appear to be significant wage differences by race and ethnicity?
(iii) Estimate a probit model for inlf that includes the explanatory variables in the wage equation from part (ii) as well as nwifeinc and kidlt6. Do these last two variables have coefficients of the expected sign? Are they statistically significant?
(iv) Explain why, for the purposes of testing and, possibly, correcting the wage equation for selection into the workforce, it is important for nwifeinc and kidlt6 to help explain inlf. What must you assume about nwifeinc and kidlt6 in the wage equation?
(v) Compute the inverse Mills ratio (for each observation) and add it as an additional regressor to the wage equation from part (ii). What is its two-sided p-value? Do you think this is particularly small with 3,286 observations?
(vi) Does adding the inverse Mills ratio change the coefficients in the wage regression in important ways? Explain.

C12 Use the data in CHARITY to answer these questions.
(i) The variable respond is a binary variable equal to one if an individual responded with a donation to the most recent request. The database consists only of people who have responded at least once in the past. What fraction of people responded most recently?
(ii) Estimate a probit model for respond, using resplast, weekslast, propresp, mailsyear, and avggift as explanatory variables. Which of the explanatory variables is statistically significant?
(iii) Find the average partial effect for mailsyear and compare it with the coefficient from a linear probability model.
(iv) Using the same explanatory variables, estimate a Tobit model for gift, the amount of the most recent gift (in Dutch guilders). Now which explanatory variable is statistically significant?
(v) Compare the Tobit APE for mailsyear with that from a linear regression. Are they similar?
(vi) Are the estimates from parts (ii) and (iv) entirely compatible with a Tobit model? Explain.

C13 Use the data in HTV to answer this question.
(i) Using OLS on the full sample, estimate a model for log(wage) using explanatory variables educ, abil, exper, nc, west, south, and urban. Report the estimated return to education and its standard error.
(ii) Now, estimate the equation from part (i) using only people with educ < 16. What percentage of the sample is lost? Now, what is the estimated return to a year of schooling? How does it compare with part (i)?
(iii) Now, drop all observations with wage ≥ 20, so that everyone remaining in the sample earns less than $20 an hour. Run the regression from part (i) and comment on the coefficient on educ. (Because the normal truncated regression model assumes that y is continuous, it does not matter in theory whether we drop observations with wage ≥ 20 or wage > 20. In practice, including in this application, it can matter slightly because there are some people who earn exactly $20 per hour.)
(iv) Using the sample in part (iii), apply truncated regression [with the upper truncation point being log(20)]. Does truncated regression appear to recover the return to education in the full population, assuming the estimate from (i) is consistent? Explain.

C14 Use the data in HAPPINESS for this question. See also Computer Exercise C15 in Chapter 13.
(i) Estimate a probit probability model relating vhappy to occattend and regattend, and include a full set of year dummies. Find the average partial effects for occattend and regattend. How do these compare with those from estimating a linear probability model?
(ii) Define a variable, highinc, equal to one if family income is above $25,000. Add highinc, unem10, educ, and teens to the probit estimation in part (i). Is the APE of regattend affected much? What about its statistical significance?
(iii) Discuss the APEs and statistical significance of the four new variables in part (ii). Do the estimates make sense?
(iv) Controlling for the factors in part (ii), do there appear to be differences in happiness by gender or race? Justify your answer.
C15 Use the data set in ALCOHOL, obtained from Terza (2002), to answer this question. The data, on 9,822 men, include labor market information, whether the man abuses alcohol, and demographic and background variables. In this question, you will study the effects of alcohol abuse on employ, which is a binary variable equal to one if the man has a job. If employ = 0, the man is either unemployed or not in the workforce.
(i) What fraction of the sample is employed at the time of the interview? What fraction of the sample has abused alcohol?
(ii) Run the simple regression of employ on abuse and report the results in the usual form, obtaining the heteroskedasticity-robust standard errors. Interpret the estimated equation. Is the relationship as you expected? Is it statistically significant?
(iii) Run a probit of employ on abuse. Do you get the same sign and statistical significance as in part (ii)? How does the average partial effect for the probit compare with that for the linear probability model?
(iv) Obtain the fitted values for the LPM estimated in part (ii) and report what they are when abuse = 0 and when abuse = 1. How do these compare to the probit fitted values, and why?
(v) To the LPM in part (ii), add the variables age, agesq, educ, educsq, married, famsize, white, northeast, midwest, south, centcity, outercity, qrt1, qrt2, and qrt3. What happens to the coefficient on abuse and its statistical significance?
(vi) Estimate a probit model using the variables in part (v). Find the APE of abuse and its t statistic. Is the estimated effect now identical to that for the linear model? Is it close?
(vii) Variables indicating the overall health of each man are also included in the data set. Is it obvious that such variables should be included as controls? Explain.
(viii) Why might abuse be properly thought of as endogenous in the employ equation? Do you think the variables mothalc and fathalc, indicating whether a man's mother or father were alcoholics, are sensible instrumental variables for abuse?
(ix) Estimate the LPM underlying part (v) by 2SLS, where mothalc and fathalc act as IVs for abuse. Is the difference between the 2SLS and OLS coefficients practically large?
(x) Use the test described in Section 15.5 to test whether abuse is endogenous in the LPM.

C16 Use the data in CRIME1 to answer this question.
(i) For the OLS estimates reported in Table 17.5, find the heteroskedasticity-robust standard errors. In terms of statistical significance of the coefficients, are there any notable changes?
(ii) Obtain the fully robust standard errors, that is, those that do not even require assumption (17.35), for the Poisson regression estimates in the second column. (This requires that you have a statistical package that computes the fully robust standard errors.) Compare the fully robust 95% confidence interval for the coefficient on pcnv with that obtained using the standard error in Table 17.5.
(iii) Compute the average partial effects for each variable in the Poisson regression model. (Use the formula for binary explanatory variables for black, hispan, and born60.) Compare the APEs for qemp86 and inc86 with the corresponding OLS coefficients.
(iv) If your statistical package reports the robust standard errors for the APEs in part (iii), compare the robust t statistic for the OLS estimate of the coefficient on pcnv with the robust t statistic for the APE of pcnv in the Poisson regression.

Appendix 17A

17A.1 Maximum Likelihood Estimation with Explanatory Variables

Appendix C provides a review of maximum likelihood estimation (MLE) in the simplest case of estimating the parameters in an unconditional distribution. But most models in econometrics have explanatory variables, whether we estimate those models by OLS or MLE. The latter is indispensable for nonlinear models, and here we provide a very brief description of the general approach.

All of the models covered in this chapter can be put in the following form. Let f(y|x; β) denote the density function for a random draw yᵢ from the population, conditional on xᵢ = x. The maximum likelihood estimator (MLE) of β maximizes the log-likelihood function,

max_b Σᵢ₌₁ⁿ log f(yᵢ|xᵢ; b),   (17.53)

where the vector b is the dummy argument in the maximization problem. In most cases, the MLE, which we write as β̂, is consistent and has an approximate normal distribution in large samples. This is true even though we cannot write down a formula for β̂ except in very special circumstances.

For the binary response case (logit and probit), the conditional density is determined by two values: f(1|x; β) = P(yᵢ = 1|xᵢ) = G(xβ) and f(0|x; β) = P(yᵢ = 0|xᵢ) = 1 − G(xβ). In fact, a succinct way to write the density is f(y|x; β) = [1 − G(xβ)]^(1−y) [G(xβ)]^y for y = 0, 1. Thus, we can write (17.53) as

max_b Σᵢ₌₁ⁿ {(1 − yᵢ)log[1 − G(xᵢb)] + yᵢ log[G(xᵢb)]}.   (17.54)

Generally, the solutions to (17.54) are quickly found by modern computers using iterative methods to maximize a function. The total computation time, even for fairly large data sets, is typically quite low.

The log-likelihood functions for the Tobit model and for censored and truncated regression are only slightly more complicated, depending on an additional variance parameter in addition to β. They are easily derived from the densities obtained in the text. See Wooldridge (2010) for details.
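As an illustration of how (17.54) is maximized by iterative methods, here is a minimal probit MLE on simulated data (a sketch, not production code):

```python
import numpy as np
from scipy import optimize, stats

def neg_loglik(b, X, y):
    # (17.54) with G(.) the standard normal cdf (probit); clipped for stability
    G = np.clip(stats.norm.cdf(X @ b), 1e-12, 1 - 1e-12)
    return -(y * np.log(G) + (1 - y) * np.log(1 - G)).sum()

rng = np.random.default_rng(5)
n = 3000
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = (X @ np.array([0.2, 1.0]) + rng.normal(size=n) > 0).astype(float)

res = optimize.minimize(neg_loglik, np.zeros(2), args=(X, y), method="BFGS")
print(res.x)     # should be close to (0.2, 1.0)
```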
Without the terms involving $g(\cdot)$ and $G(\cdot)$, this formula looks a lot like the estimated variance matrix for the OLS estimator, minus the term $\hat{\sigma}^2$. The expression in (17.55) accounts for the nonlinear nature of the response probability (that is, the nonlinear nature of $G(\cdot)$) as well as the particular form of heteroskedasticity in a binary response model: $\mathrm{Var}(y|\mathbf{x}) = G(\mathbf{x}\boldsymbol{\beta})[1 - G(\mathbf{x}\boldsymbol{\beta})]$.

The square roots of the diagonal elements of (17.55) are the asymptotic standard errors of the $\hat{\beta}_j$, and they are routinely reported by econometrics software that supports logit and probit analysis. Once we have these, asymptotic t statistics and confidence intervals are obtained in the usual ways. The matrix in (17.55) is also the basis for Wald tests of multiple restrictions on $\boldsymbol{\beta}$; see Wooldridge (2010, Chapter 15).

The asymptotic variance matrix for Tobit is more complicated but has a similar structure. Note that we can obtain a standard error for $\hat{\sigma}$ as well. The asymptotic variance for Poisson regression, allowing for $\sigma^2 \neq 1$ in (17.35), has a form much like (17.55):

$$\widehat{\mathrm{Avar}}(\hat{\boldsymbol{\beta}}) = \hat{\sigma}^2\left(\sum_{i=1}^{n} \exp(\mathbf{x}_i\hat{\boldsymbol{\beta}})\,\mathbf{x}_i'\mathbf{x}_i\right)^{-1}. \tag{17.56}$$

The square roots of the diagonal elements of this matrix are the asymptotic standard errors. If the Poisson assumption holds, we can drop $\hat{\sigma}^2$ from the formula because $\sigma^2 = 1$.

The formula for the fully robust variance matrix estimator is obtained in Wooldridge (2010, Chapter 18):

$$\widehat{\mathrm{Avar}}(\hat{\boldsymbol{\beta}}) = \left[\sum_{i=1}^{n} \exp(\mathbf{x}_i\hat{\boldsymbol{\beta}})\,\mathbf{x}_i'\mathbf{x}_i\right]^{-1}\left(\sum_{i=1}^{n} \hat{u}_i^2\,\mathbf{x}_i'\mathbf{x}_i\right)\left[\sum_{i=1}^{n} \exp(\mathbf{x}_i\hat{\boldsymbol{\beta}})\,\mathbf{x}_i'\mathbf{x}_i\right]^{-1},$$

where $\hat{u}_i = y_i - \exp(\mathbf{x}_i\hat{\boldsymbol{\beta}})$ are the residuals from the Poisson regression. This expression has a structure similar to the heteroskedasticity-robust variance matrix estimator for OLS, and it is computed routinely by many software packages to obtain the fully robust standard errors.

Asymptotic standard errors for censored regression, truncated regression, and the Heckit sample selection correction are more complicated, although they share features with the previous formulas. See Wooldridge (2010) for details.
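The sandwich formula above is easy to compute directly. The following sketch, again with simulated data and under the assumption that Python's statsmodels and numpy are available, estimates a Poisson regression and forms both the usual MLE standard errors and the fully robust ones.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 400
X = sm.add_constant(rng.normal(size=(n, 1)))
y = rng.poisson(np.exp(0.2 + 0.5 * X[:, 1]))

res = sm.Poisson(y, X).fit(disp=0)
mu = res.predict()                      # exp(x_i * beta_hat)
u = y - mu                              # Poisson residuals u_hat_i

A = np.linalg.inv((mu[:, None] * X).T @ X)   # [sum exp(xb) x'x]^(-1)
B = (u[:, None] ** 2 * X).T @ X              # sum u_hat^2 x'x
se_robust = np.sqrt(np.diag(A @ B @ A))      # sandwich standard errors

print(res.bse)       # usual MLE standard errors, sqrt of diag(A)
print(se_robust)     # fully robust standard errors
```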
Chapter 18

Advanced Time Series Topics

In this chapter, we cover some more advanced topics in time series econometrics. In Chapters 10, 11, and 12, we emphasized in several places that using time series data in regression analysis requires some care due to the trending, persistent nature of many economic time series. In addition to studying topics such as infinite distributed lag models and forecasting, we also discuss some recent advances in analyzing time series processes with unit roots.

In Section 18.1, we describe infinite distributed lag models, which allow a change in an explanatory variable to affect all future values of the dependent variable. Conceptually, these models are straightforward extensions of the finite distributed lag models in Chapter 10, but estimating these models poses some interesting challenges.

In Section 18.2, we show how to formally test for unit roots in a time series process. Recall from Chapter 11 that we excluded unit root processes to apply the usual asymptotic theory. Because the presence of a unit root implies that a shock today has a long-lasting impact, determining whether a process has a unit root is of interest in its own right.

We cover the notion of spurious regression between two time series processes, each of which has a unit root, in Section 18.3. The main result is that even if two unit root series are independent, it is quite likely that the regression of one on the other will yield a statistically significant t statistic. This emphasizes the potentially serious consequences of using standard inference when the dependent and independent variables are integrated processes.

The notion of cointegration applies when two series are I(1), but a linear combination of them is I(0); in this case, the regression of one on the other is not spurious but instead tells us something about the long-run relationship between them. Cointegration between two series also implies a particular kind of model, called an error correction model, for the short-term dynamics. We cover these models in Section 18.4.

In Section 18.5, we provide an overview of forecasting and bring together all of the tools in this and previous chapters to show how regression methods can be used to forecast future outcomes of a time series. The forecasting literature is vast, so we focus only on the most common regression-based methods. We also touch on the related topic of Granger causality.

18.1 Infinite Distributed Lag Models

Let $\{(y_t, z_t)\colon t = \ldots, -2, -1, 0, 1, 2, \ldots\}$ be a bivariate time series process, which is only partially observed. An infinite distributed lag (IDL) model relating $y_t$ to current and all past values of $z$ is

$$y_t = \alpha + \delta_0 z_t + \delta_1 z_{t-1} + \delta_2 z_{t-2} + \cdots + u_t, \tag{18.1}$$

where the sum on lagged $z$ extends back to the indefinite past. This model is only an approximation to reality, as no economic process started infinitely far into the past. Compared with a finite distributed lag model, an IDL model does not require that we truncate the lag at a particular value.

For model (18.1) to make sense, the lag coefficients, $\delta_j$, must tend to zero as $j \to \infty$. This is not to say that $\delta_2$ is smaller in magnitude than $\delta_1$; it only means that the impact of $z_{t-j}$ on $y_t$ must eventually become small as $j$ gets large. In most applications, this makes economic sense as well: the distant past of $z$ should be less important for explaining $y$ than the recent past of $z$.

Even if we decide that (18.1) is a useful model, we clearly cannot estimate it without some restrictions. For one, we only observe a finite history of data. Equation (18.1) involves an infinite number of parameters, $\delta_0, \delta_1, \delta_2, \ldots$, which cannot be estimated without restrictions. Later, we place restrictions on the $\delta_j$ that allow us to estimate (18.1).

As with finite distributed lag (FDL) models, the impact propensity in (18.1) is
simply $\delta_0$ (see Chapter 10). Generally, the $\delta_h$ have the same interpretation as in an FDL. Suppose that $z_s = 0$ for all $s < 0$ and that $z_0 = 1$ and $z_s = 0$ for all $s \geq 1$; in other words, at time $t = 0$, $z$ increases temporarily by one unit and then reverts to its initial level of zero. For any $h \geq 0$, we have $y_h = \alpha + \delta_h + u_h$, and so

$$E(y_h) = \alpha + \delta_h, \tag{18.2}$$

where we use the standard assumption that $u_h$ has zero mean. It follows that $\delta_h$ is the change in $E(y_h)$ given a one-unit, temporary change in $z$ at time zero. We just said that $\delta_h$ must be tending to zero as $h$ gets large for the IDL to make sense. This means that a temporary change in $z$ has no long-run effect on expected $y$: $E(y_h) = \alpha + \delta_h \to \alpha$ as $h \to \infty$.

We assumed that the process $z$ starts at $z_s = 0$ and that the one-unit increase occurred at $t = 0$. These were only for the purpose of illustration. More generally, if $z$ temporarily increases by one unit (from any initial level) at time $t$, then $\delta_h$ measures the change in the expected value of $y$ after $h$ periods. The lag distribution, which is $\delta_h$ plotted as a function of $h$, shows the expected path that future outcomes on $y$ follow given the one-unit, temporary increase in $z$.

The long-run propensity in model (18.1) is the sum of all of the lag coefficients,

$$\mathrm{LRP} = \delta_0 + \delta_1 + \delta_2 + \delta_3 + \cdots, \tag{18.3}$$

where we assume that the infinite sum is well defined. Because the $\delta_j$ must converge to zero, the LRP can often be well approximated by a finite sum of the form $\delta_0 + \delta_1 + \cdots + \delta_p$ for sufficiently large $p$. To interpret the LRP, suppose that the process $z_t$ is steady at $z_s = 0$ for $s < 0$. At $t = 0$, the process permanently increases by one unit. For example, if $z_t$ is the percentage change in the money supply and $y_t$ is the inflation rate, then we are interested in the effects of a permanent increase of one percentage point in money supply growth. Then, by substituting $z_s = 0$ for $s < 0$ and $z_t = 1$ for $t \geq 0$, we have $y_h = \alpha + \delta_0 + \delta_1 + \cdots + \delta_h + u_h$, where $h \geq 0$ is any horizon. Because $u_t$ has a zero mean for all $t$, we have

$$E(y_h) = \alpha + \delta_0 + \delta_1 + \cdots + \delta_h. \tag{18.4}$$

It is useful to compare (18.4) and (18.2). As the horizon increases, that is, as $h \to \infty$, the right-hand side of (18.4) is, by definition, the long-run propensity plus $\alpha$. Thus, the LRP measures the long-run change in the expected value of $y$ given a one-unit, permanent increase in $z$.

The previous derivation of the LRP, and the interpretation of $\delta_j$, used the fact that the errors have a zero mean; as usual, this is not much of an assumption, provided an intercept is included in the model. A closer examination of our reasoning shows that we assumed that the change in $z$ during any time period had no effect on the expected value of $u_t$. This is the infinite distributed lag version of the strict exogeneity assumption that we introduced in Chapter 10 (in particular, Assumption TS.3). Formally,

$$E(u_t|\ldots, z_{t-2}, z_{t-1}, z_t, z_{t+1}, \ldots) = 0, \tag{18.5}$$

so that the expected value of $u_t$ does not depend on the $z$ in any time period. Although (18.5) is natural for some applications, it rules out other important possibilities. In effect, (18.5) does not allow feedback from $y_t$ to future $z$ because $z_{t+h}$ must be uncorrelated with $u_t$ for $h > 0$.
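A few lines of code make the contrast between (18.2) and (18.4) concrete. In the sketch below (Python with numpy; the lag coefficients are hypothetical and chosen only for illustration), the path after a temporary change traces out the $\delta_h$, while the path after a permanent change accumulates toward the LRP.

```python
import numpy as np

# hypothetical lag coefficients delta_j, truncated at p = 20 and tending to zero
p = 20
delta = 0.8 * 0.6 ** np.arange(p + 1)

temp_path = delta              # E(y_h) - alpha after a temporary change, eq. (18.2)
perm_path = np.cumsum(delta)   # E(y_h) - alpha after a permanent change, eq. (18.4)

lrp = delta.sum()              # finite-sum approximation to the LRP in (18.3)
print(temp_path[-1])           # near zero: no long-run effect of a temporary change
print(perm_path[-1], lrp)      # the permanent path levels off at the LRP
```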
In the inflation and money supply growth example, where $y_t$ is inflation and $z_t$ is money supply growth, (18.5) rules out future changes in money supply growth that are tied to changes in today's inflation rate. Given that money supply policy often attempts to keep interest rates and inflation at certain levels, this might be unrealistic.

One approach to estimating the $\delta_j$, which we cover in the next subsection, requires a strict exogeneity assumption in order to produce consistent estimators of the $\delta_j$. A weaker assumption is

$$E(u_t|z_t, z_{t-1}, \ldots) = 0. \tag{18.6}$$

Under (18.6), the error is uncorrelated with current and past $z$, but it may be correlated with future $z$; this allows $z_t$ to be a variable that follows policy rules that depend on past $y$. Sometimes, (18.6) is sufficient to estimate the $\delta_j$; we explain this in the next subsection.

One thing to remember is that neither (18.5) nor (18.6) says anything about the serial correlation properties of $\{u_t\}$. This is just as in finite distributed lag models. If anything, we might expect the $\{u_t\}$ to be serially correlated because (18.1) is not generally dynamically complete in the sense discussed in Section 11.4. We will study the serial correlation problem later.

How do we interpret the lag coefficients and the LRP if (18.6) holds but (18.5) does not? The answer is: the same way as before. We can still do the previous thought (or counterfactual) experiment, even though the data we observe are generated by some feedback between $y_t$ and future $z$. For example, we can certainly ask about the long-run effect of a permanent increase in money supply growth on inflation, even though the data on money supply growth cannot be characterized as strictly exogenous.

[Exploring Further 18.1: Suppose that $z_s = 0$ for $s < 0$ and that $z_0 = 1$, $z_1 = 1$, and $z_s = 0$ for $s > 1$. Find $E(y_{-1})$, $E(y_0)$, and $E(y_h)$ for $h \geq 1$. What happens as $h \to \infty$?]

18.1a The Geometric (or Koyck) Distributed Lag

Because there are generally an infinite number of $\delta_j$, we cannot consistently estimate them without some restrictions. The simplest version of (18.1), which still makes the model depend on an infinite number of lags, is the geometric (or Koyck) distributed lag. In this model, the $\delta_j$ depend on only two parameters:

$$\delta_j = \gamma\rho^j, \quad |\rho| < 1, \quad j = 0, 1, 2, \ldots. \tag{18.7}$$

The parameters $\gamma$ and $\rho$ may be positive or negative, but $\rho$ must be less than one in absolute value. This ensures that $\delta_j \to 0$ as $j \to \infty$. In fact, this convergence happens at a very fast rate. (For example, with $\rho = .5$ and $j = 10$, $\rho^j = 1/1024 < .001$.)

The impact propensity (IP) in the GDL is simply $\delta_0 = \gamma$, so the sign of the IP is determined by the sign of $\gamma$. If $\gamma > 0$, say, and $\rho > 0$, then all lag coefficients are positive. If $\rho < 0$, the lag coefficients alternate in sign ($\rho^j$ is negative for odd $j$). The long-run propensity is more difficult to obtain, but we can use a standard result on the sum of a geometric series: for $|\rho| < 1$, $1 + \rho + \rho^2 + \cdots + \rho^j + \cdots = 1/(1 - \rho)$, and so

$$\mathrm{LRP} = \gamma/(1 - \rho).$$

The LRP has the same sign as $\gamma$.

If we plug (18.7) into (18.1), we still have a model that depends on the $z$ back to the indefinite past. Nevertheless, a simple subtraction yields an estimable model. Write the IDL at times $t$ and $t - 1$ as
$$y_t = \alpha + \gamma z_t + \gamma\rho z_{t-1} + \gamma\rho^2 z_{t-2} + \cdots + u_t \tag{18.8}$$

and

$$y_{t-1} = \alpha + \gamma z_{t-1} + \gamma\rho z_{t-2} + \gamma\rho^2 z_{t-3} + \cdots + u_{t-1}. \tag{18.9}$$

If we multiply the second equation by $\rho$ and subtract it from the first, all but a few of the terms cancel:

$$y_t - \rho y_{t-1} = (1 - \rho)\alpha + \gamma z_t + u_t - \rho u_{t-1},$$

which we can write as

$$y_t = \alpha_0 + \gamma z_t + \rho y_{t-1} + u_t - \rho u_{t-1}, \tag{18.10}$$

where $\alpha_0 = (1 - \rho)\alpha$. This equation looks like a standard model with a lagged dependent variable, where $z_t$ appears contemporaneously. Because $\gamma$ is the coefficient on $z_t$ and $\rho$ is the coefficient on $y_{t-1}$, it appears that we can estimate these parameters. (If, for some reason, we are interested in $\alpha$, we can always obtain $\hat{\alpha} = \hat{\alpha}_0/(1 - \hat{\rho})$ after estimating $\rho$ and $\alpha_0$.)

The simplicity of (18.10) is somewhat misleading. The error term in this equation, $u_t - \rho u_{t-1}$, is generally correlated with $y_{t-1}$. From (18.9), it is pretty clear that $u_{t-1}$ and $y_{t-1}$ are correlated. Therefore, if we write (18.10) as

$$y_t = \alpha_0 + \gamma z_t + \rho y_{t-1} + v_t, \tag{18.11}$$

where $v_t \equiv u_t - \rho u_{t-1}$, then we generally have correlation between $v_t$ and $y_{t-1}$. Without further assumptions, OLS estimation of (18.11) produces inconsistent estimates of $\gamma$ and $\rho$.

One case where $v_t$ must be correlated with $y_{t-1}$ occurs when $u_t$ is independent of $z_t$ and all past values of $z$ and $y$. Then, (18.8) is dynamically complete, so $u_t$ is uncorrelated with $y_{t-1}$. From (18.9), the covariance between $v_t$ and $y_{t-1}$ is $-\rho\,\mathrm{Var}(u_{t-1}) = -\rho\sigma_u^2$, which is zero only if $\rho = 0$. We can easily see that $v_t$ is serially correlated because $\{u_t\}$ is serially uncorrelated:

$$E(v_tv_{t-1}) = E(u_tu_{t-1}) - \rho E(u_{t-1}^2) - \rho E(u_tu_{t-2}) + \rho^2 E(u_{t-1}u_{t-2}) = -\rho\sigma_u^2.$$

For $j > 1$, $E(v_tv_{t-j}) = 0$. Thus, $\{v_t\}$ is a moving average process of order one (see Section 11.1). This, and equation (18.11), gives an example of a model, derived from the original model of interest, that has a lagged dependent variable and a particular kind of serial correlation.
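A short simulation illustrates this inconsistency. The sketch below (Python with statsmodels; the parameter values are hypothetical) generates data from (18.10) with serially uncorrelated $u_t$ and applies OLS to (18.11); even with a very long series, the estimate of $\rho$ stays away from its true value.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
T, gamma, rho = 10_000, 1.0, 0.5

z = rng.normal(size=T)
u = rng.normal(size=T)        # serially uncorrelated u_t
y = np.empty(T)
y[0] = 0.0
for t in range(1, T):
    # generate y_t from (18.10) with alpha = 0, so the error is u_t - rho*u_{t-1}
    y[t] = gamma * z[t] + rho * y[t - 1] + u[t] - rho * u[t - 1]

X = sm.add_constant(np.column_stack([z[1:], y[:-1]]))
res = sm.OLS(y[1:], X).fit()
print(res.params)   # the coefficient on y_{t-1} stays away from 0.5 even for large T
```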
If we make the strict exogeneity assumption (18.5), then $z_t$ is uncorrelated with $u_t$ and $u_{t-1}$, and therefore with $v_t$. Thus, if we can find a suitable instrumental variable for $y_{t-1}$, then we can estimate (18.11) by IV. What is a good IV candidate for $y_{t-1}$? By assumption, $u_t$ and $u_{t-1}$ are both uncorrelated with $z_{t-1}$, so $v_t$ is uncorrelated with $z_{t-1}$. If $\gamma \neq 0$, $z_{t-1}$ and $y_{t-1}$ are correlated, even after partialling out $z_t$. Therefore, we can use instruments $(z_t, z_{t-1})$ to estimate (18.11). Generally, the standard errors need to be adjusted for serial correlation in the $\{v_t\}$, as we discussed in Section 15.7.

An alternative to IV estimation exploits the fact that $\{u_t\}$ may contain a specific kind of serial correlation. In particular, in addition to (18.6), suppose that $\{u_t\}$ follows the AR(1) model

$$u_t = \rho u_{t-1} + e_t, \tag{18.12}$$

$$E(e_t|z_t, y_{t-1}, z_{t-1}, \ldots) = 0. \tag{18.13}$$

It is important to notice that the $\rho$ appearing in (18.12) is the same parameter multiplying $y_{t-1}$ in (18.11). If (18.12) and (18.13) hold, we can write equation (18.10) as

$$y_t = \alpha_0 + \gamma z_t + \rho y_{t-1} + e_t, \tag{18.14}$$

which is a dynamically complete model under (18.13). From Chapter 11, we can obtain consistent, asymptotically normal estimators of the parameters by OLS. This is very convenient, as there is no need to deal with serial correlation in the errors. If $e_t$ satisfies the homoskedasticity assumption $\mathrm{Var}(e_t|z_t, y_{t-1}) = \sigma_e^2$, the usual inference applies. Once we have estimated $\gamma$ and $\rho$, we can easily estimate the LRP: $\widehat{\mathrm{LRP}} = \hat{\gamma}/(1 - \hat{\rho})$. Many econometrics packages have simple commands that allow one to obtain a standard error for the estimated LRP.

The simplicity of this procedure relies on the potentially strong assumption that $\{u_t\}$ follows an AR(1) process with the same $\rho$ appearing in (18.7). This is usually no worse than assuming the $\{u_t\}$ are serially uncorrelated. Nevertheless, because consistency of the estimators relies heavily on this assumption, it is a good idea to test it. A simple test begins by specifying $\{u_t\}$ as an AR(1) process with a different parameter, say $u_t = \lambda u_{t-1} + e_t$. McClain and Wooldridge (1995) devised a simple Lagrange multiplier test of $H_0\colon \lambda = \rho$ that can be computed after OLS estimation of (18.14).

The geometric distributed lag model extends to multiple explanatory variables, so that we have an infinite DL in each explanatory variable, but then we must be able to write the coefficient on $z_{t-j,h}$ as $\gamma_h\rho^j$. In other words, though $\gamma_h$ is different for each explanatory variable, $\rho$ is the same. Thus, we can write

$$y_t = \alpha_0 + \gamma_1 z_{t1} + \cdots + \gamma_k z_{tk} + \rho y_{t-1} + v_t. \tag{18.15}$$

The same issues that arose in the case with one $z$ arise in the case with many $z$. Under the natural extension of (18.12) and (18.13), where we just replace $z_t$ with $\mathbf{z}_t = (z_{t1}, \ldots, z_{tk})$, OLS is consistent and asymptotically normal. Or, an IV method can be used.
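The following sketch (Python with statsmodels; data simulated so that (18.12) and (18.13) hold, making OLS on (18.14) consistent) shows the OLS estimation together with a delta-method standard error for the estimated LRP, which is essentially what the packaged commands mentioned above compute.

```python
import numpy as np
import statsmodels.api as sm

# data generated so that (18.12) and (18.13) hold: OLS on (18.14) is consistent
rng = np.random.default_rng(3)
T = 500
z = rng.normal(size=T)
y = np.empty(T)
y[0] = 0.0
for t in range(1, T):
    y[t] = 1.0 * z[t] + 0.5 * y[t - 1] + rng.normal()

X = sm.add_constant(np.column_stack([z[1:], y[:-1]]))  # [const, z_t, y_{t-1}]
res = sm.OLS(y[1:], X).fit()
g, r = res.params[1], res.params[2]
lrp = g / (1 - r)

# delta-method standard error: gradient of gamma/(1 - rho) wrt (const, gamma, rho)
grad = np.array([0.0, 1 / (1 - r), g / (1 - r) ** 2])
se_lrp = np.sqrt(grad @ res.cov_params() @ grad)
print(lrp, se_lrp)
```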
18.1b Rational Distributed Lag Models

The geometric DL implies a fairly restrictive lag distribution. When $\gamma > 0$ and $\rho > 0$, the $\delta_j$ are positive and monotonically declining to zero. It is possible to have more general infinite distributed lag models. The GDL is a special case of what is generally called a rational distributed lag (RDL) model. A general treatment is beyond our scope (Harvey, 1990, is a good reference), but we can cover one simple, useful extension.

Such an RDL model is most easily described by adding a lag of $z$ to equation (18.11):

$$y_t = \alpha_0 + \gamma_0 z_t + \rho y_{t-1} + \gamma_1 z_{t-1} + v_t, \tag{18.16}$$

where $v_t = u_t - \rho u_{t-1}$, as before. By repeated substitution, it can be shown that (18.16) is equivalent to the infinite distributed lag model

$$\begin{aligned} y_t &= \alpha + \gamma_0(z_t + \rho z_{t-1} + \rho^2 z_{t-2} + \cdots) + \gamma_1(z_{t-1} + \rho z_{t-2} + \rho^2 z_{t-3} + \cdots) + u_t \\ &= \alpha + \gamma_0 z_t + (\rho\gamma_0 + \gamma_1)z_{t-1} + \rho(\rho\gamma_0 + \gamma_1)z_{t-2} + \rho^2(\rho\gamma_0 + \gamma_1)z_{t-3} + \cdots + u_t, \end{aligned}$$

where we again need the assumption $|\rho| < 1$. From this last equation, we can read off the lag distribution. In particular, the impact propensity is $\gamma_0$, while the coefficient on $z_{t-h}$ is $\rho^{h-1}(\rho\gamma_0 + \gamma_1)$ for $h \geq 1$. Therefore, this model allows the impact propensity to differ in sign from the other lag coefficients, even if $\rho > 0$. However, if $\rho > 0$, the $\delta_h$ have the same sign as $(\rho\gamma_0 + \gamma_1)$ for all $h \geq 1$. The lag distribution is plotted in Figure 18.1 for $\rho = .5$, $\gamma_0 = -1$, and $\gamma_1 = 1$.

[Figure 18.1: Lag distribution for the rational distributed lag (18.16) with $\rho = .5$, $\gamma_0 = -1$, and $\gamma_1 = 1$; the lag coefficient is plotted against the lag, starting at $-1$ at lag zero and positive thereafter.]

The easiest way to compute the long-run propensity is to set $y$ and $z$ at their long-run values for all $t$, say $y^*$ and $z^*$, and then find the change in $y^*$ with respect to $z^*$ (see also Problem 3 in Chapter 10). We have $y^* = \alpha_0 + \gamma_0 z^* + \rho y^* + \gamma_1 z^*$, and solving gives $y^* = \alpha_0/(1 - \rho) + [(\gamma_0 + \gamma_1)/(1 - \rho)]z^*$. Now, we use the fact that $\mathrm{LRP} = \Delta y^*/\Delta z^*$:

$$\mathrm{LRP} = (\gamma_0 + \gamma_1)/(1 - \rho).$$

Because $|\rho| < 1$, the LRP has the same sign as $\gamma_0 + \gamma_1$, and the LRP is zero if and only if $\gamma_0 + \gamma_1 = 0$, as in Figure 18.1.

Example 18.1 Housing Investment and Residential Price Inflation

We estimate both the basic geometric and the rational distributed lag models by applying OLS to (18.14) and (18.16), respectively. The dependent variable is log(invpc) after a linear time trend has been removed (that is, we linearly detrend log(invpc)). For $z_t$, we use the growth in the price index, gprice. This allows us to estimate how residential price inflation affects movements in housing investment around its trend. The results of the estimation, using the data in HSEINV, are given in Table 18.1.

Table 18.1 Distributed Lag Models for Housing Investment
Dependent Variable: log(invpc), detrended (standard errors in parentheses)
Independent Variables     Geometric DL      Rational DL
gprice                    3.095 (.933)      3.256 (.970)
y(-1)                     .340 (.132)       .547 (.152)
gprice(-1)                ---               -2.936 (.973)
constant                  -.010 (.018)      .006 (.017)
Long-run propensity       4.689             .706
Sample size               41                40
Adjusted R-squared        .375              .504

The geometric distributed lag model is clearly rejected by the data, as gprice(-1) is very significant. The adjusted R-squareds also show that the RDL model fits much better.

The two models give very different estimates of the long-run propensity. If we incorrectly use the GDL, the estimated LRP is almost five: a permanent one percentage point increase in residential price inflation increases long-term housing investment by 4.7% (above its trend value). Economically, this seems implausible. The LRP estimated from the rational distributed lag model is below one. In fact, we cannot reject the null hypothesis $H_0\colon \gamma_0 + \gamma_1 = 0$ at any reasonable significance level (p-value = .83), so there is no evidence that the LRP is different from zero. This is a good example of how misspecifying the dynamics of a model (by omitting relevant lags) can lead to erroneous conclusions.

18.2 Testing for Unit Roots

We now turn to the important problem of testing whether a time series contains a unit root. In Chapter 11, we gave some vague, necessarily informal, guidelines to decide whether a series is I(1) or not. In many cases, it is useful to have a formal test for a unit root. As we will see, such tests must be applied with caution.

The simplest approach to testing for a unit root begins with an AR(1) model:

$$y_t = \alpha + \rho y_{t-1} + e_t, \quad t = 1, 2, \ldots, \tag{18.17}$$

where $y_0$ is the observed initial value. Throughout this section, we let $\{e_t\}$ denote a process that has zero mean, given past observed $y$:

$$E(e_t|y_{t-1}, y_{t-2}, \ldots, y_0) = 0. \tag{18.18}$$

[Under (18.18), $\{e_t\}$ is said to be a martingale difference sequence with respect to $\{y_{t-1}, y_{t-2}, \ldots\}$. If $\{e_t\}$ is assumed to be i.i.d. with zero mean and is independent of $y_0$, then it also satisfies (18.18).]

If $\{y_t\}$ follows (18.17), it has a unit root if, and only if, $\rho = 1$. If $\alpha = 0$ and $\rho = 1$, $\{y_t\}$ follows a random walk without drift [with the innovations $e_t$ satisfying (18.18)]. If $\alpha \neq 0$ and $\rho = 1$, $\{y_t\}$ is a random walk with drift, which means that $E(y_t)$ is a linear function of $t$. A unit root process with drift behaves very differently from one without drift. Nevertheless, it is common to leave $\alpha$ unspecified
under the null hypothesis, and this is the approach we take. Therefore, the null hypothesis is that $\{y_t\}$ has a unit root:

$$H_0\colon \rho = 1. \tag{18.19}$$

In almost all cases, we are interested in the one-sided alternative

$$H_1\colon \rho < 1. \tag{18.20}$$

(In practice, this means $0 < \rho < 1$, as $\rho < 0$ for a series that we suspect has a unit root would be very rare.) The alternative $H_1\colon \rho > 1$ is not usually considered, since it implies that $y_t$ is explosive. (In fact, if $\alpha > 0$, $y_t$ has an exponential trend in its mean when $\rho > 1$.)

When $|\rho| < 1$, $\{y_t\}$ is a stable AR(1) process, which means it is weakly dependent or asymptotically uncorrelated. Recall from Chapter 11 that $\mathrm{Corr}(y_t, y_{t+h}) = \rho^h \to 0$ when $|\rho| < 1$. Therefore, testing (18.19) in model (18.17), with the alternative given by (18.20), is really a test of whether $\{y_t\}$ is I(1) against the alternative that $\{y_t\}$ is I(0). [We do not take the null to be I(0) in this setup because $\{y_t\}$ is I(0) for any value of $\rho$ strictly between $-1$ and 1, something that classical hypothesis testing does not handle easily. There are tests where the null hypothesis is I(0) against the alternative of I(1), but these take a different approach. See, for example, Kwiatkowski, Phillips, Schmidt, and Shin (1992).]

A convenient equation for carrying out the unit root test is obtained by subtracting $y_{t-1}$ from both sides of (18.17) and defining $\theta = \rho - 1$:

$$\Delta y_t = \alpha + \theta y_{t-1} + e_t. \tag{18.21}$$

Under (18.18), this is a dynamically complete model, and so it seems straightforward to test $H_0\colon \theta = 0$ against $H_1\colon \theta < 0$. The problem is that, under $H_0$, $y_{t-1}$ is I(1), and so the usual central limit theorem that underlies the asymptotic standard normal distribution for the t statistic does not apply: the t statistic does not have an approximate standard normal distribution even in large sample sizes. The asymptotic distribution of the t statistic under $H_0$ has come to be known as the Dickey-Fuller distribution, after Dickey and Fuller (1979).

Although we cannot use the usual critical values, we can use the usual t statistic for $\hat{\theta}$ in (18.21), at least once the appropriate critical values have been tabulated. The resulting test is known as the Dickey-Fuller (DF) test for a unit root. The theory used to obtain the asymptotic critical values is rather complicated and is covered in advanced texts on time series econometrics. [See, for example, Banerjee, Dolado, Galbraith, and Hendry (1993), or BDGH for short.] By contrast, using these results is very easy. The critical values for the t statistic have been tabulated by several authors, beginning with the original work by Dickey and Fuller (1979). Table 18.2 contains the large-sample critical values for various significance levels, taken from BDGH (1993, Table 4.2). (Critical values adjusted for small sample sizes are available in BDGH.)

Table 18.2 Asymptotic Critical Values for Unit Root t Test: No Time Trend
Significance level     1%        2.5%      5%        10%
Critical value         -3.43     -3.12     -2.86     -2.57

We reject the null hypothesis $H_0\colon \theta = 0$ against $H_1\colon \theta < 0$ if $t_{\hat{\theta}} < c$, where $c$ is one of the negative values in Table 18.2. For example, to carry out the test at the 5% significance level, we reject if $t_{\hat{\theta}} < -2.86$. This requires a t statistic with a much larger magnitude than if we used the standard normal critical value, which would be $-1.65$. If we use the standard normal critical value to test for a unit root, we would reject $H_0$ much more often than 5% of the time when $H_0$ is true.
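The Dickey-Fuller critical values can be reproduced by simulation, which also shows why Table 18.2 differs from the standard normal table. The sketch below (Python with statsmodels; a minimal Monte Carlo for illustration, not the tabulation method used by BDGH) generates random walks under $H_0$ and collects the t statistics from (18.21).

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(4)
nrep, T = 5_000, 250
tstats = np.empty(nrep)
for i in range(nrep):
    y = np.cumsum(rng.normal(size=T))     # random walk, so H0: theta = 0 is true
    dy, ylag = np.diff(y), y[:-1]
    res = sm.OLS(dy, sm.add_constant(ylag)).fit()
    tstats[i] = res.tvalues[1]            # t statistic on theta in (18.21)

# close to -3.43, -3.12, -2.86, -2.57 (Table 18.2), far from normal percentiles
print(np.percentile(tstats, [1, 2.5, 5, 10]))
```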
Example 18.2 Unit Root Test for Three-Month T-Bill Rates

We use the quarterly data in INTQRT to test for a unit root in three-month T-bill rates. When we estimate (18.21), we obtain

$$\widehat{\Delta r3}_t = .625 - .091\, r3_{t-1}, \quad (.261)\ (.037) \tag{18.22}$$
$$n = 123, \quad R^2 = .048,$$

where we keep with our convention of reporting standard errors in parentheses below the estimates. We must remember that these standard errors cannot be used to construct usual confidence intervals or to carry out traditional t tests, because these do not behave in the usual ways when there is a unit root. The coefficient on $r3_{t-1}$ shows that the estimate of $\rho$ is $\hat{\rho} = 1 + \hat{\theta} = .909$. While this is less than unity, we do not know whether it is statistically less than one. The t statistic on $r3_{t-1}$ is $-.091/.037 = -2.46$. From Table 18.2, the 10% critical value is $-2.57$; therefore, we fail to reject $H_0\colon \rho = 1$ against $H_1\colon \rho < 1$ at the 10% significance level.

As with other hypothesis tests, when we fail to reject $H_0$, we do not say that we accept $H_0$. Why? Suppose we test $H_0\colon \rho = .9$ in the previous example using a standard t test, which is asymptotically valid because $y_t$ is I(0) under $H_0$. Then, we obtain $t = .009/.037 \approx .24$, which is very small and provides no evidence against $\rho = .9$. Yet it makes no sense to accept both $\rho = 1$ and $\rho = .9$.

When we fail to reject a unit root, as in the previous example, we should only conclude that the data do not provide strong evidence against $H_0$. In this example, the test does provide some evidence against $H_0$ because the t statistic is close to the 10% critical value. (Ideally, we would compute a p-value, but this requires special software because of the nonnormal distribution.) In addition, though $\hat{\rho} = .91$ implies a fair amount of persistence in $\{r3_t\}$, the correlation between observations that are 10 periods apart for an AR(1) model with $\rho = .9$ is about .35, rather than almost one if $\rho = 1$.

What happens if we now want to use $r3_t$ as an explanatory variable in a regression analysis? The outcome of the unit root test implies that we should be extremely cautious: if $r3_t$ does have a unit root, the usual asymptotic approximations need not hold (as we discussed in Chapter 11). One solution is to use the first difference of $r3_t$ in any analysis. As we will see in Section 18.4, that is not the only possibility.
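In practice, the test is automated in most econometrics packages. For example, assuming a recent version of statsmodels is available, its adfuller function carries out the DF regression; the sketch below applies it to a simulated AR(1) series standing in for the T-bill data.

```python
import numpy as np
from statsmodels.tsa.stattools import adfuller

# simulated stand-in for the T-bill series: a stable AR(1) with rho = 0.9
rng = np.random.default_rng(5)
T = 123
y = np.empty(T)
y[0] = 6.0
for t in range(1, T):
    y[t] = 0.625 + 0.9 * y[t - 1] + rng.normal()

# maxlag=0 with autolag=None gives the simple DF regression (18.21);
# regression="c" includes an intercept but no time trend
stat, pvalue, usedlag, nobs, crit, icbest = adfuller(
    y, maxlag=0, regression="c", autolag=None)
print(stat, pvalue, crit)   # crit holds the 1%, 5%, and 10% critical values
```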
We also need to test for unit roots in models with more complicated dynamics. If $\{y_t\}$ follows (18.17) with $\rho = 1$, then $\Delta y_t$ is serially uncorrelated. We can easily allow $\{\Delta y_t\}$ to follow an AR model by augmenting equation (18.21) with additional lags. For example,

$$\Delta y_t = \alpha + \theta y_{t-1} + \gamma_1\Delta y_{t-1} + e_t, \tag{18.23}$$

where $|\gamma_1| < 1$. This ensures that, under $H_0\colon \theta = 0$, $\{\Delta y_t\}$ follows a stable AR(1) model. Under the alternative $H_1\colon \theta < 0$, it can be shown that $\{y_t\}$ follows a stable AR(2) model.

More generally, we can add $p$ lags of $\Delta y_t$ to the equation to account for the dynamics in the process. The way we test the null hypothesis of a unit root is very similar: we run the regression of

$$\Delta y_t \ \text{ on } \ y_{t-1}, \Delta y_{t-1}, \ldots, \Delta y_{t-p} \tag{18.24}$$

and carry out the t test on $\hat{\theta}$, the coefficient on $y_{t-1}$, just as before. This extended version of the Dickey-Fuller test is usually called the augmented Dickey-Fuller test because the regression has been augmented with the lagged changes, $\Delta y_{t-h}$. The critical values and rejection rule are the same as before. The inclusion of the lagged changes in (18.24) is intended to clean up any serial correlation in $\Delta y_t$. The more lags we include in (18.24), the more initial observations we lose. If we include too many lags, the small sample power of the test generally suffers. But if we include too few lags, the size of the test will be incorrect, even asymptotically, because the validity of the critical values in Table 18.2 relies on the dynamics being completely modeled. Often, the lag length is dictated by the frequency of the data (as well as the sample size). For annual data, one or two lags usually suffice. For monthly data, we might include 12 lags. But there are no hard rules to follow in any case.

Interestingly, the t statistics on the lagged changes have approximate t distributions. The F statistics for joint significance of any group of terms $\Delta y_{t-h}$ are also asymptotically valid. (These maintain the homoskedasticity assumption discussed in Section 11.5.) Therefore, we can use standard tests to determine whether we have enough lagged changes in (18.24).

Example 18.3 Unit Root Test for Annual U.S. Inflation

We use annual data on U.S. inflation, based on the CPI, to test for a unit root in inflation (see PHILLIPS), restricting ourselves to the years from 1948 through 1996. Allowing for one lag of $\Delta inf_t$ in the augmented Dickey-Fuller regression gives

$$\widehat{\Delta inf}_t = 1.36 - .310\, inf_{t-1} + .138\, \Delta inf_{t-1}, \quad (.517)\ (.103)\ (.126)$$
$$n = 47, \quad R^2 = .172.$$

The t statistic for the unit root test is $-.310/.103 = -3.01$. Because the 5% critical value is $-2.86$, we reject the unit root hypothesis at the 5% level. The estimate of $\rho$ is about .690. Together, this is reasonably strong evidence against a unit root in inflation. The lag $\Delta inf_{t-1}$ has a t statistic of about 1.10, so we do not need to include it, but we could not know this ahead of time. (If we drop $\Delta inf_{t-1}$, the evidence against a unit root is slightly stronger: $\hat{\theta} = -.335$ ($\hat{\rho} = .665$) and $t_{\hat{\theta}} = -3.13$.)

For series that have clear time trends, we need to modify the test for unit roots. A trend-stationary process, which has a linear trend in its mean but is I(0) about its trend, can be mistaken for a unit root process if we do not control for a time trend in the Dickey-Fuller regression. In other words, if we carry out the usual DF or augmented DF test on a trending but I(0) series, we will probably have little power for rejecting a unit root.

To allow for series with time trends, we change the basic equation to

$$\Delta y_t = \alpha + \delta t + \theta y_{t-1} + e_t, \tag{18.25}$$

where, again, the null hypothesis is $H_0\colon \theta = 0$, and the alternative is $H_1\colon \theta < 0$. Under the alternative, $\{y_t\}$ is a trend-stationary process. If $y_t$ has a unit root, then $\Delta y_t = \alpha + \delta t + e_t$, and so the change in $y_t$ has a mean linear in $t$ unless $\delta = 0$. [It can be shown that $E(y_t)$ is actually a quadratic in $t$.] It is unusual for the first difference of an economic series to have a linear trend, so a more appropriate
null hypothesis is probably $H_0\colon \theta = 0, \delta = 0$. Although it is possible to test this joint hypothesis using an F test (but with modified critical values), it is common to test $H_0\colon \theta = 0$ using only a t test. We follow that approach here. [See BDGH (1993, Section 4.4) for more details on the joint test.]

When we include a time trend in the regression, the critical values of the test change. Intuitively, this occurs because detrending a unit root process tends to make it look more like an I(0) process. Therefore, we require a larger magnitude for the t statistic in order to reject $H_0$. The Dickey-Fuller critical values for the t test that includes a time trend are given in Table 18.3; they are taken from BDGH (1993, Table 4.2).

Table 18.3 Asymptotic Critical Values for Unit Root t Test: Linear Time Trend
Significance level     1%        2.5%      5%        10%
Critical value         -3.96     -3.66     -3.41     -3.12

For example, to reject a unit root at the 5% level, we need the t statistic on $\hat{\theta}$ to be less than $-3.41$, as compared with $-2.86$ without a time trend. We can augment equation (18.25) with lags of $\Delta y_t$ to account for serial correlation, just as in the case without a trend.

Example 18.4 Unit Root in the Log of U.S. Real Gross Domestic Product

We can apply the unit root test with a time trend to the U.S. GDP data in INVEN. These annual data cover the years from 1959 through 1995. We test whether $\log(GDP_t)$ has a unit root. This series has a pronounced trend that looks roughly linear. We include a single lag of $\Delta\log(GDP_t)$, which is simply the growth in GDP (in decimal form), to account for dynamics:

$$\widehat{gGDP}_t = 1.65 + .0059\, t - .210\, \log(GDP_{t-1}) + .264\, gGDP_{t-1}, \quad (.67)\ (.0027)\ (.087)\ (.165) \tag{18.26}$$
$$n = 35, \quad R^2 = .268.$$

From this equation, we get $\hat{\rho} = 1 - .21 = .79$, which is clearly less than one. But we cannot reject a unit root in the log of GDP: the t statistic on $\log(GDP_{t-1})$ is $-.210/.087 = -2.41$, which is well above the 10% critical value of $-3.12$. (The t statistic on $gGDP_{t-1}$ is 1.60, which is almost significant at the 10% level against a two-sided alternative.)

What should we conclude about a unit root? Again, we cannot reject a unit root, but the point estimate of $\rho$ is not especially close to one. When we have a small sample size (and $n = 35$ is considered to be pretty small), it is very difficult to reject the null hypothesis of a unit root if the process has something close to a unit root. Using more data over longer time periods, many researchers have concluded that there is little evidence against the unit root hypothesis for log(GDP). This has led most of them to assume that the growth in GDP is I(0), which means that log(GDP) is I(1). Unfortunately, given currently available sample sizes, we cannot have much confidence in this conclusion.

If we omit the time trend, there is much less evidence against $H_0$, as $\hat{\theta} = -.023$ and $t_{\hat{\theta}} = -1.92$. Here, the estimate of $\rho$ is much closer to one, but this is misleading due to the omitted time trend.
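The regression in (18.25), augmented with one lagged change as in (18.26), is also easy to run directly. The sketch below (Python with statsmodels; the simulated series is only a stand-in for log real GDP) constructs the regressors by hand and returns the t statistic to be compared with Table 18.3.

```python
import numpy as np
import statsmodels.api as sm

def df_trend_tstat(y):
    """t statistic on theta from (18.25) augmented with one lagged change."""
    dy = np.diff(y)
    X = np.column_stack([
        np.ones(len(y) - 2),      # intercept
        np.arange(2, len(y)),     # linear time trend
        y[1:-1],                  # y_{t-1}
        dy[:-1],                  # lagged change, as in (18.26)
    ])
    res = sm.OLS(dy[1:], X).fit()
    return res.tvalues[2]         # compare with the critical values in Table 18.3

rng = np.random.default_rng(6)
lgdp = np.cumsum(0.03 + 0.01 * rng.normal(size=37))  # simulated I(1) with drift
print(df_trend_tstat(lgdp))      # typically well above -3.41: no rejection
```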
It is tempting to compare the t statistic on the time trend in (18.26) with the critical value from a standard normal or t distribution, to see whether the time trend is significant. Unfortunately, the t statistic on the trend does not have an asymptotic standard normal distribution (unless $|\rho| < 1$). The asymptotic distribution of this t statistic is known, but it is rarely used. Typically, we rely on intuition (or plots of the time series) to decide whether to include a trend in the DF test.

There are many other variants on unit root tests. In one version that is applicable only to series that are clearly not trending, the intercept is omitted from the regression; that is, $\alpha$ is set to zero in (18.21). This variant of the Dickey-Fuller test is rarely used because of biases induced if $\alpha \neq 0$. Also, we can allow for more complicated time trends, such as quadratic. Again, this is seldom used. Another class of tests attempts to account for serial correlation in $\Delta y_t$ in a different manner than by including lags in (18.21) or (18.25). The approach is related to the serial correlation-robust standard errors for the OLS estimators that we discussed in Section 12.5. The idea is to be as agnostic as possible about serial correlation in $\Delta y_t$. In practice, the augmented Dickey-Fuller test has held up pretty well. [See BDGH (1993, Section 4.3) for a discussion on other tests.]

18.3 Spurious Regression

In a cross-sectional environment, we use the phrase "spurious correlation" to describe a situation where two variables are related through their correlation with a third variable. In particular, if we regress $y$ on $x$, we find a significant relationship. But when we control for another variable, say $z$, the partial effect of $x$ on $y$ becomes zero. Naturally, this can also happen in time series contexts with I(0) variables. As we discussed in Section 10.5, it is possible to find a spurious relationship between time series that have increasing or decreasing trends. Provided the series are weakly dependent about their time trends, the problem is effectively solved by including a time trend in the regression model.

When we are dealing with integrated processes of order one, there is an additional complication. Even if the two series have means that are not trending, a simple regression involving two independent I(1) series will often result in a significant t statistic. To be more precise, let $\{x_t\}$ and $\{y_t\}$ be random walks generated by

$$x_t = x_{t-1} + a_t, \quad t = 1, 2, \ldots, \tag{18.27}$$

and

$$y_t = y_{t-1} + e_t, \quad t = 1, 2, \ldots, \tag{18.28}$$

where $\{a_t\}$ and $\{e_t\}$ are independent, identically distributed innovations, with mean zero and variances $\sigma_a^2$ and $\sigma_e^2$, respectively. For concreteness, take the initial values to be $x_0 = y_0 = 0$. Assume further that $\{a_t\}$ and $\{e_t\}$ are independent processes. This implies that $\{x_t\}$ and $\{y_t\}$ are also independent. But what if we run the simple regression

$$\hat{y}_t = \hat{\beta}_0 + \hat{\beta}_1 x_t \tag{18.29}$$

and obtain the usual t statistic for $\hat{\beta}_1$ and the usual R-squared? Because $y_t$ and $x_t$ are independent, we would hope that $\mathrm{plim}\ \hat{\beta}_1 = 0$. Even more importantly, if we test $H_0\colon \beta_1 = 0$ against $H_1\colon \beta_1 \neq 0$ at the 5% level, we hope that the t statistic for $\hat{\beta}_1$ is insignificant 95% of the time. Through a simulation, Granger and Newbold (1974) showed that this is not the case: even though $y_t$ and $x_t$ are independent, the regression of $y_t$ on $x_t$ yields a statistically significant t statistic a large percentage of the time, much larger than the
nominal significance level. Granger and Newbold called this the spurious regression problem: there is no sense in which $y$ and $x$ are related, but an OLS regression using the usual t statistics will often indicate a relationship. Recent simulation results are given by Davidson and MacKinnon (1993, Table 19.1), where $a_t$ and $e_t$ are generated as independent, identically distributed normal random variables, and 10,000 different samples are generated. For a sample size of $n = 50$ at the 5% significance level, the standard t statistic for $H_0\colon \beta_1 = 0$ against the two-sided alternative rejects $H_0$ about 66.2% of the time under $H_0$, rather than 5% of the time. As the sample size increases, things get worse: with $n = 250$, the null is rejected 84.7% of the time!

Here is one way to see what is happening when we regress the level of $y$ on the level of $x$. Write the model underlying (18.29) as

$$y_t = \beta_0 + \beta_1 x_t + u_t. \tag{18.30}$$

For the t statistic of $\hat{\beta}_1$ to have an approximate standard normal distribution in large samples, at a minimum, $\{u_t\}$ should be a mean zero, serially uncorrelated process. But under $H_0\colon \beta_1 = 0$, $y_t = \beta_0 + u_t$, and, because $\{y_t\}$ is a random walk starting at $y_0 = 0$, equation (18.30) holds under $H_0$ only if $\beta_0 = 0$ and, more importantly, if $u_t = y_t = \sum_{j=1}^{t} e_j$. In other words, $\{u_t\}$ is a random walk under $H_0$. This clearly violates even the asymptotic version of the Gauss-Markov assumptions from Chapter 11.

Including a time trend does not really change the conclusion. If $y_t$ or $x_t$ is a random walk with drift and a time trend is not included, the spurious regression problem is even worse. The same qualitative conclusions hold if $\{a_t\}$ and $\{e_t\}$ are general I(0) processes, rather than i.i.d. sequences.

[Exploring Further 18.2: Under the preceding setup, where $\{x_t\}$ and $\{y_t\}$ are generated by (18.27) and (18.28) and $\{e_t\}$ and $\{a_t\}$ are i.i.d. sequences, what is the plim of the slope coefficient, say $\hat{\gamma}_1$, from the regression of $\Delta y_t$ on $\Delta x_t$? Describe the behavior of the t statistic of $\hat{\gamma}_1$.]
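A small Monte Carlo reproduces the Granger-Newbold phenomenon. The sketch below (Python with statsmodels; 2,000 replications rather than the 10,000 used by Davidson and MacKinnon) regresses one simulated random walk on another, independent one, and records how often the usual t statistic exceeds 1.96 in absolute value.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(7)
nrep, T, reject = 2_000, 50, 0
for _ in range(nrep):
    x = np.cumsum(rng.normal(size=T))    # random walk (18.27)
    y = np.cumsum(rng.normal(size=T))    # independent random walk (18.28)
    res = sm.OLS(y, sm.add_constant(x)).fit()
    reject += abs(res.tvalues[1]) > 1.96
print(reject / nrep)   # roughly two-thirds of the time, not the nominal 5%
```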
whose t statistics were very significant and whose Rsquareds were extremely high In the next section we show that regressing an I1 depend ent variable on an I1 independent variable can be informative but only if these variables are related in a precise sense 184 Cointegration and Error Correction Models The discussion of spurious regression in the previous section certainly makes one wary of using the levels of I1 variables in regression analysis In earlier chapters we suggested that I1 variables should be differenced before they are used in linear regression models whether they are estimated by OLS or instrumental variables This is certainly a safe course to follow and it is the approach used in many time series regressions after Granger and Newbolds original paper on the spurious regression problem Unfortunately always differencing I1 variables limits the scope of the questions that we can answer 184a Cointegration The notion of cointegration which was given a formal treatment in Engle and Granger 1987 makes regressions involving I1 variables potentially meaningful A full treatment of cointegration is mathematically involved but we can describe the basic issues and methods that are used in many applications If 5yt t 5 0 1 p6 and 5xt t 5 0 1 p6 are two I1 processes then in general yt 2 bxt is an I1 process for any number b Nevertheless it is possible that for some b 2 0 yt 2 bxt is an I0 process which means it has constant mean constant variance and autocorrelations that depend only on the time distance between any two variables in the series and it is asymptotically uncorrelated If such a b exists we say that y and x are cointe grated and we call b the cointegration parameter Alternatively we could look at xt 2 gyt for g 2 0 if yt 2 bxt is I0 then xt 2 11b2yt is I0 Therefore the linear combination of yt and xt is not unique but if we fix the coefficient on yt at unity then b is unique See Problem 3 For concreteness we consider linear combinations of the form yt 2 bxt For the sake of illustration take b 5 1 suppose that y0 5 x0 5 0 and write yt 5 yt21 1 rt xt 5 xt21 1 vt where 5rt6 and 5vt6 are two I0 processes with zero means Then yt and xt have a tendency to wander around and not return to the initial value of zero with any regularity By contrast if yt 2 xt is I0 it has zero mean and does return to zero with some regularity Let 5 1yt xt2 t 5 1 2 p6 be a bivariate time series where each series is I1 without drift Explain why if yt and xt are cointegrated yt and xt21 are also cointegrated Exploring FurthEr 183 Copyright 2016 Cengage Learning All Rights Reserved May not be copied scanned or duplicated in whole or in part Due to electronic rights some third party content may be suppressed from the eBook andor eChapters Editorial review has deemed that any suppressed content does not materially affect the overall learning experience Cengage Learning reserves the right to remove additional content at any time if subsequent rights restrictions require it CHAPTER 18 Advanced Time Series Topics 581 As a specific example let r6t be the annualized interest rate for sixmonth Tbills at the end of quarter t and let r3t be the annualized interest rate for threemonth Tbills These are typically called bond equivalent yields and they are reported in the financial pages In Example 182 using the data in INTQRT we found little evidence against the hypothesis that r3t has a unit root the same is true of r6t Define the spread between six and threemonth Tbill rates as sprt 5 r6t 2 r3t Then using equation 1821 
As a specific example, let $r6_t$ be the annualized interest rate for six-month T-bills (at the end of quarter $t$) and let $r3_t$ be the annualized interest rate for three-month T-bills. (These are typically called bond equivalent yields, and they are reported in the financial pages.) In Example 18.2, using the data in INTQRT, we found little evidence against the hypothesis that $r3_t$ has a unit root; the same is true of $r6_t$. Define the spread between six- and three-month T-bill rates as $spr_t = r6_t - r3_t$. Then, using equation (18.21), the Dickey-Fuller t statistic for $spr_t$ is $-7.71$ (with $\hat{\theta} = -.67$ or $\hat{\rho} = .33$). Therefore, we strongly reject a unit root for $spr_t$ in favor of I(0). The upshot of this is that though $r6_t$ and $r3_t$ each appear to be unit root processes, the difference between them is an I(0) process. In other words, $r6$ and $r3$ are cointegrated.

Cointegration in this example, as in many examples, has an economic interpretation. If $r6$ and $r3$ were not cointegrated, the difference between interest rates could become very large, with no tendency for them to come back together. Based on a simple arbitrage argument, this seems unlikely. Suppose that the spread $spr_t$ continues to grow for several time periods, making six-month T-bills a much more desirable investment. Then, investors would shift away from three-month and toward six-month T-bills, driving up the price of six-month T-bills while lowering the price of three-month T-bills. Because interest rates are inversely related to price, this would lower $r6$ and increase $r3$, until the spread is reduced. Therefore, large deviations between $r6$ and $r3$ are not expected to continue: the spread has a tendency to return to its mean value. (The spread actually has a slightly positive mean because long-term investors are more rewarded relative to short-term investors.)

There is another way to characterize the fact that $spr_t$ will not deviate for long periods from its average value: $r6$ and $r3$ have a long-run relationship. To describe what we mean by this, let $\mu = E(spr_t)$ denote the expected value of the spread. Then, we can write

$$r6_t = r3_t + \mu + e_t,$$

where $\{e_t\}$ is a zero mean, I(0) process. The equilibrium or long-run relationship occurs when $e_t = 0$, or $r6^* = r3^* + \mu$. At any time period, there can be deviations from equilibrium, but they will be temporary: there are economic forces that drive $r6$ and $r3$ back toward the equilibrium relationship.

In the interest rate example, we used economic reasoning to tell us the value of $\beta$ if $y_t$ and $x_t$ are cointegrated. If we have a hypothesized value of $\beta$, then testing whether two series are cointegrated is easy: we simply define a new variable, $s_t = y_t - \beta x_t$, and apply either the usual DF or augmented DF test to $\{s_t\}$. If we reject a unit root in $\{s_t\}$ in favor of the I(0) alternative, then we find that $y_t$ and $x_t$ are cointegrated. In other words, the null hypothesis is that $y_t$ and $x_t$ are not cointegrated.

Testing for cointegration is more difficult when the potential cointegration parameter $\beta$ is unknown. Rather than test for a unit root in $\{s_t\}$, we must first estimate $\beta$. If $y_t$ and $x_t$ are cointegrated, it turns out that the OLS estimator $\hat{\beta}$ from the regression

$$\hat{y}_t = \hat{\alpha} + \hat{\beta}x_t \tag{18.31}$$

is consistent for $\beta$. The problem is that the null hypothesis states that the two series are not cointegrated, which means that, under $H_0$, we are running a spurious regression. Fortunately, it is possible to tabulate critical values even when $\beta$ is estimated, where we apply the Dickey-Fuller or augmented Dickey-Fuller test to the residuals, say $\hat{u}_t = y_t - \hat{\alpha} - \hat{\beta}x_t$, from (18.31). The only difference is that the critical values account for estimation of $\beta$. The resulting test is called the Engle-Granger test, and the asymptotic critical values are given in Table 18.4. These are taken from Davidson and MacKinnon (1993, Table 20.2).

Table 18.4 Asymptotic Critical Values for Cointegration Test: No Time Trend
Significance level     1%        2.5%      5%        10%
Critical value         -3.90     -3.59     -3.34     -3.04
In the basic test, we run the regression of $\Delta\hat{u}_t$ on $\hat{u}_{t-1}$ and compare the t statistic on $\hat{u}_{t-1}$ to the desired critical value in Table 18.4. If the t statistic is below the critical value, we have evidence that $y_t - \beta x_t$ is I(0) for some $\beta$; that is, $y_t$ and $x_t$ are cointegrated. We can add lags of $\Delta\hat{u}_t$ to account for serial correlation. If we compare the critical values in Table 18.4 with those in Table 18.2, we must get a t statistic much larger in magnitude to find cointegration than if we used the usual DF critical values. This happens because OLS, which minimizes the sum of squared residuals, tends to produce residuals that look like an I(0) sequence, even if $y_t$ and $x_t$ are not cointegrated. As with the usual Dickey-Fuller test, we can augment the Engle-Granger test by including lags of $\Delta\hat{u}_t$ as additional regressors.

If $y_t$ and $x_t$ are not cointegrated, a regression of $y_t$ on $x_t$ is spurious and tells us nothing meaningful: there is no long-run relationship between $y$ and $x$. We can still run a regression involving the first differences, $\Delta y_t$ and $\Delta x_t$, including lags. But we should interpret these regressions for what they are: they explain the difference in $y$ in terms of the difference in $x$ and have nothing necessarily to do with a relationship in levels. If $y_t$ and $x_t$ are cointegrated, we can use this to specify more general dynamic models, as we will see in the next subsection.

The previous discussion assumes that neither $y_t$ nor $x_t$ has a drift. This is reasonable for interest rates but not for other time series. If $y_t$ and $x_t$ contain drift terms, $E(y_t)$ and $E(x_t)$ are linear (usually increasing) functions of time. The strict definition of cointegration requires $y_t - \beta x_t$ to be I(0) without a trend. To see what this entails, write $y_t = \delta t + g_t$ and $x_t = \lambda t + h_t$, where $\{g_t\}$ and $\{h_t\}$ are I(1) processes, $\delta$ is the drift in $y_t$ [$\delta = E(\Delta y_t)$], and $\lambda$ is the drift in $x_t$ [$\lambda = E(\Delta x_t)$]. Now, if $y_t$ and $x_t$ are cointegrated, there must exist $\beta$ such that $g_t - \beta h_t$ is I(0). But then

$$y_t - \beta x_t = (\delta - \beta\lambda)t + (g_t - \beta h_t),$$

which is generally a trend-stationary process. The strict form of cointegration requires that there not be a trend, which means $\delta = \beta\lambda$. For I(1) processes with drift, it is possible that the stochastic parts, that is, $g_t$ and $h_t$, are cointegrated but that the parameter $\beta$ that causes $g_t - \beta h_t$ to be I(0) does not eliminate the linear time trend.

We can test for cointegration between $g_t$ and $h_t$, without taking a stand on the trend part, by running the regression

$$\hat{y}_t = \hat{\alpha} + \hat{\eta}t + \hat{\beta}x_t \tag{18.32}$$

and applying the usual DF or augmented DF test to the residuals, $\hat{u}_t$. The asymptotic critical values are given in Table 18.5 [from Davidson and MacKinnon (1993, Table 20.2)].

Table 18.5 Asymptotic Critical Values for Cointegration Test: Linear Time Trend
Significance level     1%        2.5%      5%        10%
Critical value         -4.32     -4.03     -3.78     -3.50

A finding of cointegration in this case leaves open the possibility that $y_t - \beta x_t$ has a linear trend. But at least it is not I(1).
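The two-step Engle-Granger procedure is implemented in some packages. For example, assuming a recent version of statsmodels, its coint function estimates (18.31) by OLS, applies an augmented DF test to the residuals, and reports critical values that account for the estimation of $\beta$ (compare Tables 18.4 and 18.5); the sketch below illustrates it on simulated, cointegrated data.

```python
import numpy as np
from statsmodels.tsa.stattools import coint

rng = np.random.default_rng(9)
T = 500
x = np.cumsum(rng.normal(size=T))
y = 0.5 + 2.0 * x + rng.normal(size=T)   # cointegrated, beta = 2 by construction

# coint regresses y on x as in (18.31) and applies an augmented DF test to the
# residuals; trend="ct" would add a time trend, as in (18.32)
stat, pvalue, crit = coint(y, x, trend="c")
print(stat, pvalue, crit)   # crit: 1%, 5%, and 10% critical values
```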
Example 18.5 Cointegration between Fertility and Personal Exemption

In Chapters 10 and 11, we studied various models to estimate the relationship between the general fertility rate (gfr) and the real value of the personal tax exemption (pe) in the United States. The static regression results in levels and first differences are notably different. The regression in levels, with a time trend included, gives an OLS coefficient on pe equal to .187 (se = .035) and $R^2 = .500$. In first differences (without a trend), the coefficient on $\Delta pe$ is $-.043$ (se = .028), and $R^2 = .032$. Although there are other reasons for these differences (such as misspecified distributed lag dynamics), the discrepancy between the levels and changes regressions suggests that we should test for cointegration. Of course, this presumes that gfr and pe are I(1) processes. This appears to be the case: the augmented DF tests, with a single lagged change and a linear time trend, each yield t statistics of about $-1.47$, and the estimated AR(1) coefficients are close to one.

When we obtain the residuals from the regression of gfr on $t$ and pe and apply the augmented DF test with one lag, we obtain a t statistic on $\hat{u}_{t-1}$ of $-2.43$, which is nowhere near the 10% critical value, $-3.50$. Therefore, we must conclude that there is little evidence of cointegration between gfr and pe, even allowing for separate trends. It is very likely that the earlier regression results we obtained in levels suffer from the spurious regression problem.

The good news is that, when we used first differences and allowed for two lags [see equation (11.27)], we found an overall positive and significant long-run effect of $\Delta pe$ on $\Delta gfr$.

If we think two series are cointegrated, we often want to test hypotheses about the cointegrating parameter. For example, a theory may state that the cointegrating parameter is one. Ideally, we could use a t statistic to test this hypothesis. We explicitly cover the case without time trends, although the extension to the linear trend case is immediate.

When $y_t$ and $x_t$ are I(1) and cointegrated, we can write

$$y_t = \alpha + \beta x_t + u_t, \tag{18.33}$$

where $u_t$ is a zero mean, I(0) process. Generally, $\{u_t\}$ contains serial correlation, but we know from Chapter 11 that this does not affect consistency of OLS. As mentioned earlier, OLS applied to (18.33) consistently estimates $\beta$ (and $\alpha$). Unfortunately, because $x_t$ is I(1), the usual inference procedures do not necessarily apply: OLS is not asymptotically normally distributed, and the t statistic for $\hat{\beta}$ does not necessarily have an approximate t distribution. We do know from Chapter 10 that, if $\{x_t\}$ is strictly exogenous (see Assumption TS.3) and the errors are homoskedastic, serially uncorrelated, and normally distributed, the OLS estimator is also normally distributed (conditional on the explanatory variables), and the t statistic has an exact t distribution. Unfortunately, these assumptions are too strong to apply to most situations. The notion of cointegration implies nothing about the relationship between $\{x_t\}$ and $\{u_t\}$; indeed, they can be arbitrarily correlated. Further, except for requiring that $\{u_t\}$ is I(0), cointegration between $y_t$ and $x_t$ does not restrict the serial dependence in $\{u_t\}$.

Fortunately, the feature of (18.33) that makes inference the most difficult, the lack of strict exogeneity of $\{x_t\}$, can be fixed. Because $x_t$ is I(1), the proper notion of strict exogeneity is that $u_t$ is uncorrelated with $\Delta x_s$, for all $t$ and $s$. We can always arrange this for a new set of errors, at least approximately, by writing $u_t$ as a function of the $\Delta x_s$ for all $s$ close to $t$. For example,
If we think two series are cointegrated, we often want to test hypotheses about the cointegrating parameter. For example, a theory may state that the cointegrating parameter is one. Ideally, we could use a t statistic to test this hypothesis. We explicitly cover the case without time trends, although the extension to the linear trend case is immediate. When $y_t$ and $x_t$ are I(1) and cointegrated, we can write

$y_t = \alpha + \beta x_t + u_t$,   (18.33)

where $u_t$ is a zero mean I(0) process. Generally, $\{u_t\}$ contains serial correlation, but we know from Chapter 11 that this does not affect consistency of OLS. As mentioned earlier, OLS applied to (18.33) consistently estimates $\beta$ (and $\alpha$). Unfortunately, because $x_t$ is I(1), the usual inference procedures do not necessarily apply: OLS is not asymptotically normally distributed, and the t statistic for $\hat{\beta}$ does not necessarily have an approximate t distribution. We do know from Chapter 10 that, if $\{x_t\}$ is strictly exogenous (see Assumption TS.3) and the errors are homoskedastic, serially uncorrelated, and normally distributed, then the OLS estimator is normally distributed (conditional on the explanatory variables) and the t statistic has an exact t distribution. Unfortunately, these assumptions are too strong to apply to most situations. The notion of cointegration implies nothing about the relationship between $\{x_t\}$ and $\{u_t\}$; indeed, they can be arbitrarily correlated. Further, except for requiring that $\{u_t\}$ is I(0), cointegration between $y_t$ and $x_t$ does not restrict the serial dependence in $\{u_t\}$.

Fortunately, the feature of (18.33) that makes inference the most difficult, namely the lack of strict exogeneity of $\{x_t\}$, can be fixed. Because $x_t$ is I(1), the proper notion of strict exogeneity is that $u_t$ is uncorrelated with $\Delta x_s$ for all t and s. We can always arrange this for a new set of errors, at least approximately, by writing $u_t$ as a function of the $\Delta x_s$ for all s close to t. For example,

$u_t = \eta + \phi_0 \Delta x_t + \phi_1 \Delta x_{t-1} + \phi_2 \Delta x_{t-2} + \gamma_1 \Delta x_{t+1} + \gamma_2 \Delta x_{t+2} + e_t$,   (18.34)

where, by construction, $e_t$ is uncorrelated with each $\Delta x_s$ appearing in the equation. The hope is that $e_t$ is uncorrelated with further lags and leads of $\Delta x_s$. We know that, as $|s - t|$ gets large, the correlation between $e_t$ and $\Delta x_s$ approaches zero, because these are I(0) processes. Now, if we plug (18.34) into (18.33), we obtain

$y_t = \alpha_0 + \beta x_t + \phi_0 \Delta x_t + \phi_1 \Delta x_{t-1} + \phi_2 \Delta x_{t-2} + \gamma_1 \Delta x_{t+1} + \gamma_2 \Delta x_{t+2} + e_t$.   (18.35)

This equation looks a bit strange because future $\Delta x_s$ appear with both current and lagged $\Delta x_t$. The key is that the coefficient on $x_t$ is still $\beta$ and, by construction, $x_t$ is now strictly exogenous in this equation. The strict exogeneity assumption is the important condition needed to obtain an approximately normal t statistic for $\hat{\beta}$. If $u_t$ is uncorrelated with all $\Delta x_s$, $s \neq t$, then we can drop the leads and lags of the changes and simply include the contemporaneous change, $\Delta x_t$. Then, the equation we estimate looks more standard but still includes the first difference of $x_t$ along with its level: $y_t = \alpha_0 + \beta x_t + \phi_0 \Delta x_t + e_t$. In effect, adding $\Delta x_t$ solves any contemporaneous endogeneity between $x_t$ and $u_t$. (Remember, any endogeneity does not cause inconsistency. But we are trying to obtain an asymptotically normal t statistic.) Whether we need to include leads and lags of the changes, and how many, is really an empirical issue. Each time we add an additional lead or lag, we lose one observation, and this can be costly unless we have a large data set.

The OLS estimator of $\beta$ from (18.35) is called the leads and lags estimator of $\beta$ because of the way it employs $\Delta x$. [See, for example, Stock and Watson (1993).] The only issue we must worry about in (18.35) is the possibility of serial correlation in $\{e_t\}$. This can be dealt with by computing a serial correlation-robust standard error for $\hat{\beta}$ (as described in Section 12.5) or by using a standard AR(1) correction (such as Cochrane-Orcutt).

Example 18.6 Cointegrating Parameter for Interest Rates

Earlier, we tested for cointegration between r6 and r3 (six- and three-month T-bill rates) by assuming that the cointegrating parameter was equal to one. This led us to find cointegration and, naturally, to conclude that the cointegrating parameter is equal to unity. Nevertheless, let us estimate the cointegrating parameter directly and test H0: $\beta = 1$. We apply the leads and lags estimator with two leads and two lags of $\Delta r3$, as well as the contemporaneous change. The estimate of $\beta$ is $\hat{\beta} = 1.038$, and the usual OLS standard error is .0081. Therefore, the t statistic for H0: $\beta = 1$ is $(1.038 - 1)/.0081 \approx 4.69$, which is a strong statistical rejection of H0. (Of course, whether 1.038 is economically different from 1 is a relevant consideration.) There is little evidence of serial correlation in the residuals, so we can use this t statistic as having an approximate normal distribution. [For comparison, the OLS estimate of $\beta$ without the leads, lags, or contemporaneous $\Delta r3$ terms, and using five more observations, is 1.026 (se = .0077). But the t statistic from (18.33) is not necessarily valid.]

There are many other estimators of cointegrating parameters, and this continues to be a very active area of research.
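A sketch of the leads and lags estimator follows, assuming y and x are pandas Series holding the two I(1) series (hypothetical data). The HAC option is one way to obtain the serial correlation-robust standard error mentioned above; the lag choice of 2 is illustrative, not prescribed.

```python
# A minimal sketch of the leads and lags estimator of beta (hypothetical y, x).
import pandas as pd
import statsmodels.api as sm

dx = x.diff()
data = pd.DataFrame({
    'y': y, 'x': x,
    'dx0': dx, 'dx_lag1': dx.shift(1), 'dx_lag2': dx.shift(2),
    'dx_lead1': dx.shift(-1), 'dx_lead2': dx.shift(-2),
}).dropna()

rhs = sm.add_constant(data[['x', 'dx0', 'dx_lag1', 'dx_lag2',
                            'dx_lead1', 'dx_lead2']])
# HAC (Newey-West) standard errors guard against serial correlation in e_t;
# the maxlags value here is arbitrary.
res = sm.OLS(data['y'], rhs).fit(cov_type='HAC', cov_kwds={'maxlags': 2})
beta_hat, se_beta = res.params['x'], res.bse['x']
print((beta_hat - 1) / se_beta)   # t statistic for H0: beta = 1
```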
The notion of cointegration applies to more than two processes, but the interpretation, testing, and estimation are much more complicated. One issue is that, even after we normalize a coefficient to be one, there can be many cointegrating relationships. BDGH provide some discussion and several references.

18.4b Error Correction Models

In addition to learning about a potential long-run relationship between two series, the concept of cointegration enriches the kinds of dynamic models at our disposal. If $y_t$ and $x_t$ are I(1) processes and are not cointegrated, we might estimate a dynamic model in first differences. As an example, consider the equation

$\Delta y_t = \alpha_0 + \alpha_1 \Delta y_{t-1} + \gamma_0 \Delta x_t + \gamma_1 \Delta x_{t-1} + u_t$,   (18.36)

where $u_t$ has zero mean given $\Delta x_t$, $\Delta y_{t-1}$, $\Delta x_{t-1}$, and further lags. This is essentially equation (18.16), but in first differences rather than in levels. If we view this as a rational distributed lag model, we can find the impact propensity, long-run propensity, and lag distribution for $\Delta y$ as a distributed lag in $\Delta x$.

If $y_t$ and $x_t$ are cointegrated with parameter $\beta$, then we have additional I(0) variables that we can include in (18.36). Let $s_t = y_t - \beta x_t$, so that $s_t$ is I(0), and assume for the sake of simplicity that $s_t$ has zero mean. Now, we can include lags of $s_t$ in the equation. In the simplest case, we include one lag of $s_t$:

$\Delta y_t = \alpha_0 + \alpha_1 \Delta y_{t-1} + \gamma_0 \Delta x_t + \gamma_1 \Delta x_{t-1} + \delta s_{t-1} + u_t$
$\quad\; = \alpha_0 + \alpha_1 \Delta y_{t-1} + \gamma_0 \Delta x_t + \gamma_1 \Delta x_{t-1} + \delta(y_{t-1} - \beta x_{t-1}) + u_t$,   (18.37)

where $E(u_t|I_{t-1}) = 0$ and $I_{t-1}$ contains information on $\Delta x_t$ and all past values of x and y. The term $\delta(y_{t-1} - \beta x_{t-1})$ is called the error correction term, and (18.37) is an example of an error correction model. (In some error correction models, the contemporaneous change in x, $\Delta x_t$, is omitted. Whether it is included or not depends partly on the purpose of the equation. In forecasting, $\Delta x_t$ is rarely included, for reasons we will see in Section 18.5.)

An error correction model allows us to study the short-run dynamics in the relationship between y and x. For simplicity, consider the model without lags of $\Delta y_t$ and $\Delta x_t$:

$\Delta y_t = \alpha_0 + \gamma_0 \Delta x_t + \delta(y_{t-1} - \beta x_{t-1}) + u_t$,   (18.38)

where $\delta < 0$. If $y_{t-1} > \beta x_{t-1}$, then y in the previous period has overshot the equilibrium; because $\delta < 0$, the error correction term works to push y back toward the equilibrium. Similarly, if $y_{t-1} < \beta x_{t-1}$, the error correction term induces a positive change in y back toward the equilibrium.

How do we estimate the parameters of an error correction model? If we know $\beta$, this is easy. For example, in (18.38), we simply regress $\Delta y_t$ on $\Delta x_t$ and $s_{t-1}$, where $s_{t-1} = (y_{t-1} - \beta x_{t-1})$.
Example 18.7 Error Correction Model for Holding Yields

In Problem 6 in Chapter 11, we regressed $hy6_t$, the three-month holding yield (in percent) from buying a six-month T-bill at time t - 1 and selling it at time t as a three-month T-bill, on $hy3_{t-1}$, the three-month holding yield from buying a three-month T-bill at time t - 1. The expectations hypothesis implies that the slope coefficient should not be statistically different from one. It turns out that there is evidence of a unit root in $\{hy3_t\}$, which calls into question the standard regression analysis. We will assume that both holding yields are I(1) processes. The expectations hypothesis implies, at a minimum, that $hy6_t$ and $hy3_{t-1}$ are cointegrated with $\beta$ equal to one, which appears to be the case (see Computer Exercise C5). Under this assumption, an error correction model is

$\Delta hy6_t = \alpha_0 + \gamma_0 \Delta hy3_{t-1} + \delta(hy6_{t-1} - hy3_{t-2}) + u_t$,

where $u_t$ has zero mean given all hy3 and hy6 dated at time t - 1 and earlier. The lags on the variables in the error correction model are dictated by the expectations hypothesis.

Using the data in INTQRT gives

$\widehat{\Delta hy6}_t = .090 + 1.218\,\Delta hy3_{t-1} - .840\,(hy6_{t-1} - hy3_{t-2})$   (18.39)
(standard errors: .043, .264, and .244)
n = 122, $R^2$ = .790.

The error correction coefficient is negative and very significant. For example, if the holding yield on six-month T-bills is above that for three-month T-bills by one point, hy6 falls by .84 point, on average, in the next quarter. Interestingly, $\hat{\delta} = -.84$ is not statistically different from -1, as is easily seen by computing the 95% confidence interval.

[Exploring Further 18.4: How would you test H0: $\gamma_0 = 1$, $\delta = -1$ in the holding yield error correction model?]

In many other examples, the cointegrating parameter must be estimated. Then, we replace $s_{t-1}$ with $\hat{s}_{t-1} = y_{t-1} - \hat{\beta}x_{t-1}$, where $\hat{\beta}$ can be various estimators of $\beta$. We have covered the standard OLS estimator as well as the leads and lags estimator. This raises the issue of how sampling variation in $\hat{\beta}$ affects inference on the other parameters in the error correction model. Fortunately, as shown by Engle and Granger (1987), we can ignore the preliminary estimation of $\beta$ (asymptotically). This property is very convenient and implies that the asymptotic efficiency of the estimators of the parameters in the error correction model is unaffected by whether we use the OLS estimator or the leads and lags estimator for $\beta$. Of course, the choice of $\hat{\beta}$ will generally have an effect on the estimated error correction parameters in any particular sample, but we have no systematic way of deciding which preliminary estimator of $\beta$ to use. The procedure of replacing $\beta$ with $\hat{\beta}$ is called the Engle-Granger two-step procedure.
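The two-step procedure is easy to code. Below is a minimal sketch under the same hypothetical Series y and x; the static regression in step 1 could be replaced by the leads and lags regression.

```python
# A minimal sketch of the Engle-Granger two-step procedure (hypothetical y, x).
import pandas as pd
import statsmodels.api as sm

# Step 1: estimate the cointegrating parameter from the static regression.
step1 = sm.OLS(y, sm.add_constant(x)).fit()
beta_hat = step1.params.iloc[1]
s = y - beta_hat * x                     # s_t = y_t - beta_hat * x_t

# Step 2: OLS on the error correction model, as in (18.37).
df = pd.DataFrame({
    'dy': y.diff(), 'dy_lag': y.diff().shift(1),
    'dx': x.diff(), 'dx_lag': x.diff().shift(1),
    's_lag': s.shift(1),
}).dropna()
rhs = sm.add_constant(df[['dy_lag', 'dx', 'dx_lag', 's_lag']])
ecm = sm.OLS(df['dy'], rhs).fit()
print(ecm.params['s_lag'])               # the error correction coefficient
```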
18.5 Forecasting

Forecasting economic time series is very important in some branches of economics, and it is an area that continues to be actively studied. In this section, we focus on regression-based forecasting methods. Diebold (2001) provides a comprehensive introduction to forecasting, including recent developments. We assume in this section that the primary focus is on forecasting future values of a time series process and not necessarily on estimating causal or structural economic models.

It is useful to first cover some fundamentals of forecasting that do not depend on a specific model. Suppose that at time t we want to forecast the outcome of y at time t + 1, or $y_{t+1}$. The time period could correspond to a year, a quarter, a month, a week, or even a day. Let $I_t$ denote information that we can observe at time t. This information set includes $y_t$, earlier values of y, and often other variables dated at time t or earlier. We can combine this information in innumerable ways to forecast $y_{t+1}$. Is there one best way? The answer is yes, provided we specify the loss associated with forecast error.

Let $f_t$ denote the forecast of $y_{t+1}$ made at time t. We call $f_t$ a one-step-ahead forecast. The forecast error is $e_{t+1} = y_{t+1} - f_t$, which we observe once the outcome on $y_{t+1}$ is observed. The most common measure of loss is the same one that leads to ordinary least squares estimation of a multiple linear regression model: the squared error, $e_{t+1}^2$. The squared forecast error treats positive and negative prediction errors symmetrically, and larger forecast errors receive relatively more weight. For example, errors of +2 and -2 yield the same loss, and the loss is four times as great as forecast errors of +1 or -1. The squared forecast error is an example of a loss function. Another popular loss function is the absolute value of the prediction error, $|e_{t+1}|$. For reasons to be seen shortly, we focus now on squared error loss.

Given the squared error loss function, we can determine how to best use the information at time t to forecast $y_{t+1}$. But we must recognize that at time t, we do not know $e_{t+1}$: it is a random variable, because $y_{t+1}$ is a random variable. Therefore, any useful criterion for choosing $f_t$ must be based on what we know at time t. It is natural to choose the forecast to minimize the expected squared forecast error, given $I_t$:

$E(e_{t+1}^2|I_t) = E[(y_{t+1} - f_t)^2|I_t]$.   (18.40)

A basic fact from probability (see Property CE.6 in Appendix B) is that the conditional expectation, $E(y_{t+1}|I_t)$, minimizes (18.40). In other words, if we wish to minimize the expected squared forecast error given information at time t, our forecast should be the expected value of $y_{t+1}$ given variables we know at time t.

For many popular time series processes, the conditional expectation is easy to obtain. Suppose that $\{y_t: t = 0, 1, \dots\}$ is a martingale difference sequence (MDS) and take $I_t$ to be $\{y_t, y_{t-1}, \dots, y_0\}$, the observed past of y. By definition, $E(y_{t+1}|I_t) = 0$ for all t; the best prediction of $y_{t+1}$ at time t is always zero. Recall from Section 18.2 that an i.i.d. sequence with zero mean is a martingale difference sequence.

A martingale difference sequence is one in which the past is not useful for predicting the future. Stock returns are widely thought to be well approximated as an MDS, perhaps with a positive mean. The key is that $E(y_{t+1}|y_t, y_{t-1}, \dots) = E(y_{t+1})$: the conditional mean is equal to the unconditional mean, in which case past outcomes on y do not help to predict future y.

A process $\{y_t\}$ is a martingale if $E(y_{t+1}|y_t, y_{t-1}, \dots, y_0) = y_t$ for all $t \geq 0$. [If $\{y_t\}$ is a martingale, then $\{\Delta y_t\}$ is a martingale difference sequence, which is where the latter name comes from.] The predicted value of y for the next period is always the value of y for this period.

A more complicated example is

$E(y_{t+1}|I_t) = \alpha y_t + \alpha(1-\alpha)y_{t-1} + \dots + \alpha(1-\alpha)^t y_0$,   (18.41)

where $0 < \alpha < 1$ is a parameter that we must choose. This method of forecasting is called exponential smoothing because the weights on the lagged y decline to zero exponentially. The reason for writing the expectation as in (18.41) is that it leads to a very simple recurrence relation. Set $f_0 = y_0$. Then, for $t \geq 1$, the forecasts can be obtained as

$f_t = \alpha y_t + (1-\alpha)f_{t-1}$.

In other words, the forecast of $y_{t+1}$ is a weighted average of $y_t$ and the forecast of $y_t$ made at time t - 1.
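Because the recurrence is so simple, it is worth seeing directly. The following minimal function, with a made-up numeric example, computes the one-step-ahead exponential smoothing forecasts for a chosen smoothing parameter a.

```python
# Exponential smoothing recursion: f_t = a*y_t + (1 - a)*f_{t-1}, with f_0 = y_0.
def exp_smooth_forecasts(y, a):
    """Return one-step-ahead forecasts: f[t] is the forecast of y[t+1] made at t."""
    f = [y[0]]                          # f_0 = y_0
    for t in range(1, len(y)):
        f.append(a * y[t] + (1 - a) * f[t - 1])
    return f

y = [10.0, 11.0, 10.5, 12.0]            # hypothetical data
print(exp_smooth_forecasts(y, a=0.3))
```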
Exponential smoothing is suitable only for very specific time series and requires choosing $\alpha$. Regression methods, which we turn to next, are more flexible.

The previous discussion has focused on forecasting y only one period ahead. The general issues that arise in forecasting $y_{t+h}$ at time t, where h is any positive integer, are similar. In particular, if we use expected squared forecast error as our measure of loss, the best predictor is $E(y_{t+h}|I_t)$. When dealing with a multiple-step-ahead forecast, we use the notation $f_{t,h}$ to indicate the forecast of $y_{t+h}$ made at time t.

18.5a Types of Regression Models Used for Forecasting

There are many different regression models that we can use to forecast future values of a time series. The first regression model for time series data from Chapter 10 was the static model. To see how we can forecast with this model, assume that we have a single explanatory variable:

$y_t = \beta_0 + \beta_1 z_t + u_t$.   (18.42)

Suppose, for the moment, that the parameters $\beta_0$ and $\beta_1$ are known. Write this equation at time t + 1 as $y_{t+1} = \beta_0 + \beta_1 z_{t+1} + u_{t+1}$. Now, if $z_{t+1}$ is known at time t, so that it is an element of $I_t$, and $E(u_{t+1}|I_t) = 0$, then $E(y_{t+1}|I_t) = \beta_0 + \beta_1 z_{t+1}$, where $I_t$ contains $z_{t+1}, y_t, z_t, \dots, y_1, z_1$. The right-hand side of this equation is the forecast of $y_{t+1}$ at time t. This kind of forecast is usually called a conditional forecast because it is conditional on knowing the value of z at time t + 1.

Unfortunately, at any time, we rarely know the value of the explanatory variables in future time periods. Exceptions include time trends and seasonal dummy variables, which we cover explicitly below, but otherwise knowledge of $z_{t+1}$ at time t is rare. Sometimes, we wish to generate conditional forecasts for several hypothesized values of $z_{t+1}$.

Another problem with (18.42) as a model for forecasting is that $E(u_{t+1}|I_t) = 0$ means that $\{u_t\}$ cannot contain serial correlation, something we have seen to be false in most static regression models. (Problem 8 asks you to derive the forecast in a simple distributed lag model with AR(1) errors.)

If $z_{t+1}$ is not known at time t, we cannot include it in $I_t$. Then, we have $E(y_{t+1}|I_t) = \beta_0 + \beta_1 E(z_{t+1}|I_t)$. This means that, in order to forecast $y_{t+1}$, we must first forecast $z_{t+1}$ based on the same information set. This is usually called an unconditional forecast because we do not assume knowledge of $z_{t+1}$ at time t. Unfortunately, this is somewhat of a misnomer, as our forecast is still conditional on the information in $I_t$, but the name is entrenched in the forecasting literature.

For forecasting, unless we are wedded to the static model in (18.42) for other reasons, it makes more sense to specify a model that depends only on lagged values of y and z. This saves us the extra step of having to forecast a right-hand side variable before forecasting y. The kind of model we have in mind is

$y_t = \delta_0 + \alpha_1 y_{t-1} + \gamma_1 z_{t-1} + u_t$,
$E(u_t|I_{t-1}) = 0$,   (18.43)

where $I_{t-1}$ contains y and z dated at time t - 1 and earlier. Now, the forecast of $y_{t+1}$ at time t is $\delta_0 + \alpha_1 y_t + \gamma_1 z_t$; if we know the parameters, we can just plug in the values of $y_t$ and $z_t$.
If we only want to use past y to predict future y, then we can drop $z_{t-1}$ from (18.43). Naturally, we can add more lags of y or z and lags of other variables. Especially for forecasting one step ahead, such models can be very useful.

18.5b One-Step-Ahead Forecasting

Obtaining a forecast one period after the sample ends is relatively straightforward using models such as (18.43). As usual, let n be the sample size. The forecast of $y_{n+1}$ is

$\hat{f}_n = \hat{\delta}_0 + \hat{\alpha}_1 y_n + \hat{\gamma}_1 z_n$,   (18.44)

where we assume that the parameters have been estimated by OLS. We use a hat on $f_n$ to emphasize that we have estimated the parameters in the regression model. (If we knew the parameters, there would be no estimation error in the forecast.) The forecast error, which we will not know until time n + 1, is

$\hat{e}_{n+1} = y_{n+1} - \hat{f}_n$.   (18.45)

If we add more lags of y or z to the forecasting equation, we simply lose more observations at the beginning of the sample.

The forecast $\hat{f}_n$ of $y_{n+1}$ is usually called a point forecast. We can also obtain a forecast interval. A forecast interval is essentially the same as a prediction interval, which we studied in Section 6.4. There, we showed how, under the classical linear model assumptions, to obtain an exact 95% prediction interval. A forecast interval is obtained in exactly the same way. If the model does not satisfy the classical linear model assumptions (for example, if it contains lagged dependent variables, as in (18.44)), the forecast interval is still approximately valid, provided $u_t$ given $I_{t-1}$ is normally distributed with zero mean and constant variance. (This ensures that the OLS estimators are approximately normally distributed with the usual OLS variances and that $u_{n+1}$ is independent of the OLS estimators, with mean zero and variance $\sigma^2$.) Let $se(\hat{f}_n)$ be the standard error of the forecast and let $\hat{\sigma}$ be the standard error of the regression. [From Section 6.4, we can obtain $\hat{f}_n$ and $se(\hat{f}_n)$ as the intercept and its standard error from the regression of $y_t$ on $(y_{t-1} - y_n)$ and $(z_{t-1} - z_n)$, t = 1, 2, ..., n; that is, we subtract the time n value of y from each lagged y, and similarly for z, before doing the regression.] Then,

$se(\hat{e}_{n+1}) = \{[se(\hat{f}_n)]^2 + \hat{\sigma}^2\}^{1/2}$,   (18.46)

and the approximate 95% forecast interval is

$\hat{f}_n \pm 1.96\, se(\hat{e}_{n+1})$.   (18.47)

Because $se(\hat{f}_n)$ is roughly proportional to $1/\sqrt{n}$, $se(\hat{f}_n)$ is usually small relative to the uncertainty in the error $u_{n+1}$, as measured by $\hat{\sigma}$. Some econometrics packages compute forecast intervals routinely, but others require some simple manipulations to obtain (18.47).
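The following sketch implements (18.44) through (18.47) using the regression trick just described, assuming y and z are pandas Series ending at time n (hypothetical data): the intercept from the regression on deviations is the point forecast, and its standard error is $se(\hat{f}_n)$.

```python
# A minimal sketch of a one-step-ahead point forecast and 95% forecast interval.
import numpy as np
import pandas as pd
import statsmodels.api as sm

df = pd.DataFrame({'y': y, 'y_lag': y.shift(1), 'z_lag': z.shift(1)}).dropna()
yn, zn = y.iloc[-1], z.iloc[-1]                 # time-n values
X = sm.add_constant(pd.DataFrame({'y_dev': df['y_lag'] - yn,
                                  'z_dev': df['z_lag'] - zn}))
res = sm.OLS(df['y'], X).fit()

f_n = res.params['const']                       # point forecast of y_{n+1}
se_f = res.bse['const']                         # se(f_n)
sigma = np.sqrt(res.mse_resid)                  # standard error of the regression
se_e = np.sqrt(se_f**2 + sigma**2)              # (18.46)
print(f_n - 1.96 * se_e, f_n + 1.96 * se_e)     # (18.47)
```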
Example 18.8 Forecasting the U.S. Unemployment Rate

We use the data in PHILLIPS, but only for the years 1948 through 1996, to forecast the U.S. civilian unemployment rate for 1997. We use two models. The first is a simple AR(1) model for unem:

$\widehat{unem}_t = 1.572 + .732\, unem_{t-1}$   (18.48)
(standard errors: .577 and .097)
n = 48, $R^2$ = .544, $\hat{\sigma}$ = 1.049.

In a second model, we add inflation with a lag of one year:

$\widehat{unem}_t = 1.304 + .647\, unem_{t-1} + .184\, inf_{t-1}$   (18.49)
(standard errors: .490, .084, and .041)
n = 48, $R^2$ = .677, $\hat{\sigma}$ = .883.

The lagged inflation rate is very significant in (18.49) ($t \approx 4.5$), and the adjusted R-squared from the second equation is much higher than that from the first. Nevertheless, this does not necessarily mean that the second equation will produce a better forecast for 1997. All we can say so far is that, using the data up through 1996, a lag of inflation helps to explain variation in the unemployment rate.

To obtain the forecasts for 1997, we need to know unem and inf in 1996. These are 5.4 and 3.0, respectively. Therefore, the forecast of $unem_{1997}$ from equation (18.48) is 1.572 + .732(5.4), or about 5.52. The forecast from equation (18.49) is 1.304 + .647(5.4) + .184(3.0), or about 5.35. The actual civilian unemployment rate for 1997 was 4.9, so both equations overpredict the actual rate. The second equation does provide a somewhat better forecast.

We can easily obtain a 95% forecast interval. When we regress $unem_t$ on $(unem_{t-1} - 5.4)$ and $(inf_{t-1} - 3.0)$, we obtain 5.35 as the intercept, which we already computed as the forecast, and $se(\hat{f}_n) = .137$. Therefore, because $\hat{\sigma} = .883$, we have $se(\hat{e}_{n+1}) = [(.137)^2 + (.883)^2]^{1/2} \approx .894$. The 95% forecast interval from (18.47) is 5.35 ± 1.96(.894), or about [3.6, 7.1]. This is a wide interval, and the realized 1997 value, 4.9, is well within the interval. As expected, the standard error of $u_{n+1}$, which is .883, is a very large fraction of $se(\hat{e}_{n+1})$.

A professional forecaster must usually produce a forecast for every time period. For example, at time n, she or he produces a forecast of $y_{n+1}$. Then, when $y_{n+1}$ and $z_{n+1}$ become available, he or she must forecast $y_{n+2}$. Even if the forecaster has settled on model (18.43), there are two choices for forecasting $y_{n+2}$. The first is to use $\hat{\delta}_0 + \hat{\alpha}_1 y_{n+1} + \hat{\gamma}_1 z_{n+1}$, where the parameters are estimated using the first n observations. The second possibility is to reestimate the parameters using all n + 1 observations and then to use the same formula to forecast $y_{n+2}$. To forecast in subsequent time periods, we can generally use the parameter estimates obtained from the initial n observations, or we can update the regression parameters each time we obtain a new data point. Although the latter approach requires more computation, the extra burden is relatively minor, and it can (although it need not) work better because the regression coefficients adjust, at least somewhat, to the new data points.

As a specific example, suppose we wish to forecast the unemployment rate for 1998, using the model with a single lag of unem and inf. The first possibility is to just plug the 1997 values of unemployment and inflation into the right-hand side of (18.49). With unem = 4.9 and inf = 2.3 in 1997, we have a forecast for $unem_{1998}$ of about 4.90. (It is just a coincidence that this is the same as the 1997 unemployment rate.) The second possibility is to reestimate the equation by adding the 1997 observation and then using this new equation (see Computer Exercise C6).
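A sketch in the spirit of Example 18.8 follows, assuming df is a DataFrame with annual columns 'unem' and 'inf' restricted to 1948 through 1996 (as in the PHILLIPS data); with that sample, the printed forecasts should be roughly 5.52 and 5.35.

```python
# A minimal sketch of the two competing one-step-ahead forecasting models.
import pandas as pd
import statsmodels.formula.api as smf

d = pd.DataFrame({'unem': df['unem'],
                  'unem_1': df['unem'].shift(1),
                  'inf_1': df['inf'].shift(1)}).dropna()

ar1 = smf.ols('unem ~ unem_1', data=d).fit()            # as in (18.48)
var1 = smf.ols('unem ~ unem_1 + inf_1', data=d).fit()   # as in (18.49)

# Forecasts for 1997, given unem = 5.4 and inf = 3.0 in 1996.
new = pd.DataFrame({'unem_1': [5.4], 'inf_1': [3.0]})
print(ar1.predict(new).iloc[0])    # AR(1) forecast
print(var1.predict(new).iloc[0])   # forecast with lagged inflation
```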
The model in equation (18.43) is one equation in what is known as a vector autoregressive (VAR) model. We know what an autoregressive model is from Chapter 11: we model a single series, $\{y_t\}$, in terms of its own past. In vector autoregressive models, we model several series (which, if you are familiar with linear algebra, is where the word "vector" comes from) in terms of their own past. If we have two series, $y_t$ and $z_t$, a vector autoregression consists of equations that look like

$y_t = \delta_0 + \alpha_1 y_{t-1} + \gamma_1 z_{t-1} + \alpha_2 y_{t-2} + \gamma_2 z_{t-2} + \dots$   (18.50)

and

$z_t = \eta_0 + \beta_1 y_{t-1} + \rho_1 z_{t-1} + \beta_2 y_{t-2} + \rho_2 z_{t-2} + \dots$,

where each equation contains an error that has zero expected value given past information on y and z. In equation (18.43), and in the example estimated in (18.49), we assumed that one lag of each variable captured all of the dynamics. (An F test for joint significance of $unem_{t-2}$ and $inf_{t-2}$ confirms that only one lag of each is needed.)

As Example 18.8 illustrates, VAR models can be useful for forecasting. In many cases, we are interested in forecasting only one variable, y, in which case we only need to estimate and analyze the equation for y. Nothing prevents us from adding other lagged variables, say $w_{t-1}, w_{t-2}, \dots$, to equation (18.50). Such equations are efficiently estimated by OLS, provided we have included enough lags of all variables and the equation satisfies the homoskedasticity assumption for time series regressions.

Equations such as (18.50) allow us to test whether, after controlling for past values of y, past values of z help to forecast $y_t$. Generally, we say that z Granger causes y if

$E(y_t|I_{t-1}) \neq E(y_t|J_{t-1})$,   (18.51)

where $I_{t-1}$ contains past information on y and z, and $J_{t-1}$ contains only information on past y. When (18.51) holds, past z is useful, in addition to past y, for predicting $y_t$. The term "causes" in "Granger causes" should be interpreted with caution. The only sense in which z "causes" y is given in (18.51). In particular, it says nothing about contemporaneous causality between y and z, so it does not allow us to determine whether $z_t$ is an exogenous or endogenous variable in an equation relating $y_t$ to $z_t$. (This is also why the notion of Granger causality does not apply in pure cross-sectional contexts.)

Once we assume a linear model and decide how many lags of y should be included in $E(y_t|y_{t-1}, y_{t-2}, \dots)$, we can easily test the null hypothesis that z does not Granger cause y. To be more specific, suppose that $E(y_t|y_{t-1}, y_{t-2}, \dots)$ depends on only three lags:

$y_t = \delta_0 + \alpha_1 y_{t-1} + \alpha_2 y_{t-2} + \alpha_3 y_{t-3} + u_t$,
$E(u_t|y_{t-1}, y_{t-2}, \dots) = 0$.

Now, under the null hypothesis that z does not Granger cause y, any lags of z that we add to the equation should have zero population coefficients. If we add $z_{t-1}$, then we can simply do a t test on $z_{t-1}$. If we add two lags of z, then we can do an F test for joint significance of $z_{t-1}$ and $z_{t-2}$ in the equation

$y_t = \delta_0 + \alpha_1 y_{t-1} + \alpha_2 y_{t-2} + \alpha_3 y_{t-3} + \gamma_1 z_{t-1} + \gamma_2 z_{t-2} + u_t$.

If there is heteroskedasticity, we can use a robust form of the test. There cannot be serial correlation under H0 because the model is dynamically complete.

As a practical matter, how do we decide on which lags of y and z to include? First, we start by estimating an autoregressive model for y and performing t and F tests to determine how many lags of y should appear. With annual data, the number of lags is typically small, say one or two. With quarterly or monthly data, there are usually many more lags. Once an autoregressive model for y has been chosen, we can test for lags of z. The choice of lags of z is less important because, when z does not Granger cause y, no set of lagged z's should be significant. With annual data, one or two lags are typically used; with quarterly data, usually four or eight; and with monthly data, perhaps 6, 12, or maybe even 24, given enough data.

We have already done one example of testing for Granger causality in equation (18.49). The autoregressive model that best fits unemployment is an AR(1). In equation (18.49), we added a single lag of inflation, and it was very significant. Therefore, inflation Granger causes unemployment.
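Operationally, the Granger causality test is just an exclusion F test. A minimal sketch, again with hypothetical Series y and z and the three-lags-of-y specification above:

```python
# Do two lags of z help predict y after controlling for three lags of y?
import pandas as pd
import statsmodels.formula.api as smf

d = pd.DataFrame({'y': y,
                  'y1': y.shift(1), 'y2': y.shift(2), 'y3': y.shift(3),
                  'z1': z.shift(1), 'z2': z.shift(2)}).dropna()

unrestricted = smf.ols('y ~ y1 + y2 + y3 + z1 + z2', data=d).fit()
# H0: z does not Granger cause y  <=>  the z1 and z2 coefficients are both zero.
print(unrestricted.f_test('z1 = 0, z2 = 0'))
# Under heteroskedasticity, refit with cov_type='HC1' for a robust Wald test.
```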
There is an extended definition of Granger causality that is often useful. Let $\{w_t\}$ be a third series (or, it could represent several additional series). Then, z Granger causes y conditional on w if (18.51) holds, but now $I_{t-1}$ contains past information on y, z, and w, while $J_{t-1}$ contains past information on y and w. It is certainly possible that z Granger causes y, but z does not Granger cause y conditional on w. A test of the null that z does not Granger cause y conditional on w is obtained by testing for significance of lagged z in a model for y that also depends on lagged y and lagged w. For example, to test whether growth in the money supply Granger causes growth in real GDP, conditional on the change in interest rates, we would regress $gGDP_t$ on lags of gGDP, $\Delta int$, and gM and do significance tests on the lags of gM. [See, for example, Stock and Watson (1989).]

18.5c Comparing One-Step-Ahead Forecasts

In almost any forecasting problem, there are several competing methods for forecasting. Even when we restrict attention to regression models, there are many possibilities. Which variables should be included, and with how many lags? Should we use logs, levels of variables, or first differences?

In order to decide on a forecasting method, we need a way to choose which one is most suitable. Broadly, we can distinguish between in-sample criteria and out-of-sample criteria. In a regression context, in-sample criteria include R-squared and especially adjusted R-squared. There are many other model selection statistics, but we will not cover those here [see, for example, Ramanathan (1995, Chapter 4)].

For forecasting, it is better to use out-of-sample criteria, as forecasting is essentially an out-of-sample problem. A model might provide a good fit to y in the sample used to estimate the parameters, but this need not translate to good forecasting performance. An out-of-sample comparison involves using the first part of a sample to estimate the parameters of the model and saving the latter part of the sample to gauge its forecasting capabilities. This mimics what we would have to do in practice if we did not yet know the future values of the variables.

Suppose that we have n + m observations, where we use the first n observations to estimate the parameters in our model and save the last m observations for forecasting. Let $\hat{f}_{n+h}$ be the one-step-ahead forecast of $y_{n+h+1}$ for h = 0, 1, ..., m - 1. The m forecast errors are $\hat{e}_{n+h+1} = y_{n+h+1} - \hat{f}_{n+h}$. How should we measure how well our model forecasts y when it is out of sample? Two measures are most common. The first is the root mean squared error (RMSE):

$RMSE = \left(m^{-1}\sum_{h=0}^{m-1}\hat{e}_{n+h+1}^2\right)^{1/2}$.   (18.52)

This is essentially the sample standard deviation of the forecast errors (without any degrees of freedom adjustment). If we compute RMSE for two or more forecasting methods, then we prefer the method with the smallest out-of-sample RMSE.

A second common measure is the mean absolute error (MAE), which is the average of the absolute forecast errors:

$MAE = m^{-1}\sum_{h=0}^{m-1}|\hat{e}_{n+h+1}|$.   (18.53)

Again, we prefer a smaller MAE. Other possible criteria include minimizing the largest of the absolute values of the forecast errors.
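A sketch of the out-of-sample comparison follows for one candidate model, assuming hypothetical Series y and z; holding out m = 7 observations mirrors Example 18.9 below.

```python
# Estimate on the first n observations; compute one-step-ahead errors on the rest.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

d = pd.DataFrame({'y': y, 'y1': y.shift(1), 'z1': z.shift(1)}).dropna()
n = len(d) - 7                                  # hold out m = 7 observations
fit = smf.ols('y ~ y1 + z1', data=d.iloc[:n]).fit()

errors = d['y'].iloc[n:] - fit.predict(d.iloc[n:])
rmse = np.sqrt(np.mean(errors**2))              # (18.52)
mae = np.mean(np.abs(errors))                   # (18.53)
print(rmse, mae)
```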
Example 18.9 Out-of-Sample Comparisons of Unemployment Forecasts

In Example 18.8, we found that equation (18.49) fit notably better over the years 1948 through 1996 than did equation (18.48) and, at least for forecasting unemployment in 1997, the model that included lagged inflation worked better. Now, we use the two models, still estimated using the data only through 1996, to compare one-step-ahead forecasts for 1997 through 2003. This leaves seven out-of-sample observations (n = 48 and m = 7) to use in equations (18.52) and (18.53). For the AR(1) model, RMSE = .962 and MAE = .778. For the model that adds lagged inflation (a VAR model of order one), RMSE = .673 and MAE = .628. Thus, by either measure, the model that includes $inf_{t-1}$ produces better out-of-sample forecasts for 1997 through 2003. In this case, the in-sample and out-of-sample criteria choose the same model.

Rather than using only the first n observations to estimate the parameters of the model, we can reestimate the models each time we add a new observation and use the new model to forecast the next time period.

18.5d Multiple-Step-Ahead Forecasts

Forecasting more than one period ahead is generally more difficult than forecasting one period ahead. We can formalize this as follows. Suppose we consider forecasting $y_{t+1}$ at time t and at an earlier time period s (so that s < t). Then $Var[y_{t+1} - E(y_{t+1}|I_t)] \leq Var[y_{t+1} - E(y_{t+1}|I_s)]$, where the inequality is usually strict. We will not prove this result generally, but, intuitively, it makes sense: the forecast error variance in predicting $y_{t+1}$ is larger when we make that forecast based on less information.

If $\{y_t\}$ follows an AR(1) model (which includes a random walk, possibly with drift), we can easily show that the error variance increases with the forecast horizon. The model is

$y_t = \alpha + \rho y_{t-1} + u_t$,
$E(u_t|I_{t-1}) = 0$, $I_{t-1} = \{y_{t-1}, y_{t-2}, \dots\}$,

and $\{u_t\}$ has constant variance $\sigma^2$ conditional on $I_{t-1}$. At time t + h - 1, our forecast of $y_{t+h}$ is $\alpha + \rho y_{t+h-1}$, and the forecast error is simply $u_{t+h}$. Therefore, the one-step-ahead forecast variance is simply $\sigma^2$. To find multiple-step-ahead forecasts, we have, by repeated substitution,

$y_{t+h} = (1 + \rho + \dots + \rho^{h-1})\alpha + \rho^h y_t + \rho^{h-1}u_{t+1} + \rho^{h-2}u_{t+2} + \dots + u_{t+h}$.

At time t, the expected value of $u_{t+j}$ is zero for all $j \geq 1$, so

$E(y_{t+h}|I_t) = (1 + \rho + \dots + \rho^{h-1})\alpha + \rho^h y_t$,   (18.54)

and the forecast error is $e_{t,h} = \rho^{h-1}u_{t+1} + \rho^{h-2}u_{t+2} + \dots + u_{t+h}$. This is a sum of uncorrelated random variables, and so the variance of the sum is the sum of the variances: $Var(e_{t,h}) = \sigma^2[\rho^{2(h-1)} + \rho^{2(h-2)} + \dots + \rho^2 + 1]$. Because $\rho^2 > 0$, each term multiplying $\sigma^2$ is positive, so the forecast error variance increases with h. When $\rho^2 < 1$, as h gets large, the forecast variance converges to $\sigma^2/(1 - \rho^2)$, which is just the unconditional variance of $y_t$. In the case of a random walk ($\rho = 1$), $f_{t,h} = \alpha h + y_t$ and $Var(e_{t,h}) = \sigma^2 h$: the forecast variance grows without bound as the horizon h increases. This demonstrates that it is very difficult to forecast a random walk, with or without drift, far out into the future. For example, forecasts of interest rates farther into the future become dramatically less precise.
Equation (18.54) shows that using the AR(1) model for multistep forecasting is easy once we have estimated $\rho$ by OLS. The forecast of $y_{n+h}$ at time n is

$\hat{f}_{n,h} = (1 + \hat{\rho} + \dots + \hat{\rho}^{h-1})\hat{\alpha} + \hat{\rho}^h y_n$.   (18.55)

Obtaining forecast intervals is harder, unless h = 1, because obtaining the standard error of $\hat{f}_{n,h}$ is difficult. Nevertheless, the standard error of $\hat{f}_{n,h}$ is usually small compared with the standard deviation of the error term, and the latter can be estimated as $\hat{\sigma}[\hat{\rho}^{2(h-1)} + \hat{\rho}^{2(h-2)} + \dots + \hat{\rho}^2 + 1]^{1/2}$, where $\hat{\sigma}$ is the standard error of the regression from the AR(1) estimation. We can use this to obtain an approximate confidence interval. For example, when h = 2, an approximate 95% confidence interval (for large n) is

$\hat{f}_{n,2} \pm 1.96\,\hat{\sigma}(1 + \hat{\rho}^2)^{1/2}$.   (18.56)

Because we are underestimating the standard deviation of $y_{n+h}$, this interval is too narrow, but perhaps not by much, especially if n is large.

A less traditional, but useful, approach is to estimate a different model for each forecast horizon. For example, suppose we wish to forecast y two periods ahead. If $I_t$ depends only on y through time t, we might assume that $E(y_{t+2}|I_t) = \alpha_0 + \gamma_1 y_t$ [which, as we saw earlier, holds if $\{y_t\}$ follows an AR(1) model]. We can estimate $\alpha_0$ and $\gamma_1$ by regressing $y_t$ on an intercept and on $y_{t-2}$. Even though the errors in this equation contain serial correlation (errors in adjacent periods are correlated), we can obtain consistent and approximately normal estimators of $\alpha_0$ and $\gamma_1$. The forecast of $y_{n+2}$ at time n is simply $\hat{f}_{n,2} = \hat{\alpha}_0 + \hat{\gamma}_1 y_n$. Further, and very importantly, the standard error of the regression is just what we need for computing a confidence interval for the forecast. Unfortunately, to get the standard error of $\hat{f}_{n,2}$ using the trick for a one-step-ahead forecast requires us to obtain a serial correlation-robust standard error of the kind described in Section 12.5. This standard error goes to zero as n gets large, while the variance of the error is constant. Therefore, we can get an approximate interval by using (18.56) and by putting the SER from the regression of $y_t$ on $y_{t-2}$ in place of $\hat{\sigma}(1 + \hat{\rho}^2)^{1/2}$. But we should remember that this ignores the estimation error in $\hat{\alpha}_0$ and $\hat{\gamma}_1$.

We can also compute multiple-step-ahead forecasts with more complicated autoregressive models. For example, suppose $\{y_t\}$ follows an AR(2) model and that, at time n, we wish to forecast $y_{n+2}$. Now, $y_{n+2} = \alpha + \rho_1 y_{n+1} + \rho_2 y_n + u_{n+2}$, so $E(y_{n+2}|I_n) = \alpha + \rho_1 E(y_{n+1}|I_n) + \rho_2 y_n$. We can write this as $f_{n,2} = \alpha + \rho_1 f_{n,1} + \rho_2 y_n$, so that the two-step-ahead forecast at time n can be obtained once we get the one-step-ahead forecast. If the parameters of the AR(2) model have been estimated by OLS, then we operationalize this as

$\hat{f}_{n,2} = \hat{\alpha} + \hat{\rho}_1 \hat{f}_{n,1} + \hat{\rho}_2 y_n$.   (18.57)

Now, $\hat{f}_{n,1} = \hat{\alpha} + \hat{\rho}_1 y_n + \hat{\rho}_2 y_{n-1}$, which we can compute at time n. Then, we plug this into (18.57), along with $y_n$, to obtain $\hat{f}_{n,2}$. For any $h \geq 2$, the h-step-ahead forecast for an AR(2) model can be obtained recursively: $\hat{f}_{n,h} = \hat{\alpha} + \hat{\rho}_1 \hat{f}_{n,h-1} + \hat{\rho}_2 \hat{f}_{n,h-2}$.
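The recursion is easy to code. A minimal sketch for an estimated AR(2), with a hypothetical Series y:

```python
# Recursive multiple-step-ahead forecasts from an estimated AR(2), as in (18.57).
import pandas as pd
import statsmodels.formula.api as smf

d = pd.DataFrame({'y': y, 'y1': y.shift(1), 'y2': y.shift(2)}).dropna()
fit = smf.ols('y ~ y1 + y2', data=d).fit()
a, r1, r2 = fit.params['Intercept'], fit.params['y1'], fit.params['y2']

# f_{n,1} uses y_n and y_{n-1}; later horizons feed earlier forecasts back in.
f_prev2, f_prev1 = y.iloc[-2], y.iloc[-1]     # y_{n-1}, y_n
for h in range(1, 5):
    f = a + r1 * f_prev1 + r2 * f_prev2
    print(h, f)
    f_prev2, f_prev1 = f_prev1, f
```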
Similar reasoning can be used to obtain multiple-step-ahead forecasts for VAR models. To illustrate, suppose we have

$y_t = \delta_0 + \alpha_1 y_{t-1} + \gamma_1 z_{t-1} + u_t$   (18.58)

and

$z_t = \eta_0 + \beta_1 y_{t-1} + \rho_1 z_{t-1} + v_t$.

Now, if we wish to forecast $y_{n+1}$ at time n, we simply use $\hat{f}_{n,1} = \hat{\delta}_0 + \hat{\alpha}_1 y_n + \hat{\gamma}_1 z_n$. Likewise, the forecast of $z_{n+1}$ at time n is, say, $\hat{g}_{n,1} = \hat{\eta}_0 + \hat{\beta}_1 y_n + \hat{\rho}_1 z_n$. Now, suppose we wish to obtain a two-step-ahead forecast of y at time n. From (18.58), we have $E(y_{n+2}|I_n) = \delta_0 + \alpha_1 E(y_{n+1}|I_n) + \gamma_1 E(z_{n+1}|I_n)$ [because $E(u_{n+2}|I_n) = 0$], so we can write the forecast as

$\hat{f}_{n,2} = \hat{\delta}_0 + \hat{\alpha}_1 \hat{f}_{n,1} + \hat{\gamma}_1 \hat{g}_{n,1}$.   (18.59)

This equation shows that the two-step-ahead forecast for y depends on the one-step-ahead forecasts for y and z. Generally, we can build up multiple-step-ahead forecasts of y by using the recursive formula

$\hat{f}_{n,h} = \hat{\delta}_0 + \hat{\alpha}_1 \hat{f}_{n,h-1} + \hat{\gamma}_1 \hat{g}_{n,h-1}$, $h \geq 2$.

Example 18.10 Two-Year-Ahead Forecast for the Unemployment Rate

To use equation (18.49) to forecast unemployment two years out, say, the 1998 rate using the data through 1996, we need a model for inflation. The best model for inf in terms of lagged unem and inf appears to be a simple AR(1) model ($unem_{t-1}$ is not significant when added to the regression):

$\widehat{inf}_t = 1.277 + .665\, inf_{t-1}$
(standard errors: .558 and .107)
n = 48, $R^2$ = .457, $\bar{R}^2$ = .445.

If we plug the 1996 value of inf into this equation, we get the forecast of inf for 1997: $\widehat{inf}_{1997} = 3.27$. Now, we can plug this, along with $\widehat{unem}_{1997} = 5.35$ (which we obtained earlier), into (18.59) to forecast $unem_{1998}$:

$\widehat{unem}_{1998} = 1.304 + .647(5.35) + .184(3.27) \approx 5.37$.

Remember, this forecast uses information only through 1996. The one-step-ahead forecast of $unem_{1998}$, obtained by plugging the 1997 values of unem and inf into (18.49), was about 4.90. The actual unemployment rate in 1998 was 4.5, which means that, in this case, the one-step-ahead forecast does quite a bit better than the two-step-ahead forecast.

Just as with one-step-ahead forecasting, an out-of-sample root mean squared error or a mean absolute error can be used to choose among multiple-step-ahead forecasting methods.

18.5e Forecasting Trending, Seasonal, and Integrated Processes

We now turn to forecasting series that either exhibit trends, have seasonality, or have unit roots. Recall from Chapters 10 and 11 that one approach to handling trending dependent or independent variables in regression models is to include time trends, the most popular being a linear trend. Trends can be included in forecasting equations as well, although they must be used with caution.

In the simplest case, suppose that $\{y_t\}$ has a linear trend but is unpredictable around that trend. Then, we can write

$y_t = \alpha + \beta t + u_t$, $E(u_t|I_{t-1}) = 0$, t = 1, 2, ...,   (18.60)

where, as usual, $I_{t-1}$ contains information observed through time t - 1 (which includes at least past y). How do we forecast $y_{n+h}$ at time n for any $h \geq 1$? This is simple because $E(y_{n+h}|I_n) = \alpha + \beta(n + h)$. The forecast error variance is simply $\sigma^2 = Var(u_t)$ (assuming a constant variance over time). If we estimate $\alpha$ and $\beta$ by OLS using the first n observations, then our forecast for $y_{n+h}$ at time n is $\hat{f}_{n,h} = \hat{\alpha} + \hat{\beta}(n + h)$. In other words, we simply plug the time period corresponding to y into the estimated trend function.

For example, if we use the n = 131 observations in BARIUM to forecast monthly U.S. imports of Chinese barium chloride, we obtain $\hat{\alpha} = 249.56$ and $\hat{\beta} = 5.15$. The sample period ends in December 1988, so the forecast of imports six months later is 249.56 + 5.15(137) = 955.11, measured in short tons. For comparison, the December 1988 value is 1,087.81, which is greater than the forecasted value six months later. The series and its estimated trend line are shown in Figure 18.2.
[Figure 18.2: U.S. imports of Chinese barium chloride (in short tons) and its estimated linear trend line, 249.56 + 5.15 t.]

[Exploring Further 18.5: Suppose you model $\{y_t: t = 1, 2, \dots, 46\}$ as a linear time trend, where data are annual, starting in 1950 and ending in 1995. Define the variable $year_t$ as ranging from 50 when t = 1 to 95 when t = 46. If you estimate the equation $\hat{y}_t = \hat{\gamma} + \hat{\delta}\, year_t$, how do $\hat{\gamma}$ and $\hat{\delta}$ compare with $\hat{\alpha}$ and $\hat{\beta}$ in $\hat{y}_t = \hat{\alpha} + \hat{\beta} t$? How will forecasts from the two equations compare?]

As we discussed in Chapter 10, most economic time series are better characterized as having, at least approximately, a constant growth rate, which suggests that $\log(y_t)$ follows a linear time trend. Suppose we use n observations to obtain the equation

$\widehat{\log(y_t)} = \hat{\alpha} + \hat{\beta} t$, t = 1, 2, ..., n.   (18.61)

Then, to forecast log(y) at any future time period n + h, we just plug n + h into the trend equation, as before. But this does not allow us to forecast y, which is usually what we want. It is tempting to simply exponentiate $\hat{\alpha} + \hat{\beta}(n + h)$ to obtain the forecast for $y_{n+h}$, but this is not quite right, for the same reasons we gave in Section 6.4. We must properly account for the error implicit in (18.61). The simplest way to do this is to use the n observations to regress $y_t$ on $\exp(\widehat{\log y_t})$, without an intercept. Let $\hat{\gamma}$ be the slope coefficient on $\exp(\widehat{\log y_t})$. Then, the forecast of y in period n + h is simply

$\hat{f}_{n,h} = \hat{\gamma}\exp[\hat{\alpha} + \hat{\beta}(n + h)]$.   (18.62)

As an example, if we use the first 687 weeks of data on the New York Stock Exchange index in NYSE, we obtain $\hat{\alpha} = 3.782$ and $\hat{\beta} = .0019$ by regressing $\log(price_t)$ on a linear time trend; this shows that the index grows about .2% per week, on average. When we regress price on the exponentiated fitted values, we obtain $\hat{\gamma} = 1.018$. Now, we forecast price four weeks out, which is the last week in the sample, using (18.62): $1.018\exp[3.782 + .0019(691)] \approx 166.12$. The actual value turned out to be 164.25, so we have somewhat overpredicted. But this result is much better than if we estimate a linear time trend for the first 687 weeks: the forecasted value for week 691 is 152.23, which is a substantial underprediction.
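A sketch of this retransformation follows, assuming y is a positive-valued pandas Series observed for t = 1, ..., n (hypothetical data); the regression through the origin supplies the adjustment factor $\hat{\gamma}$ in (18.62).

```python
# Forecasting the level of y from a log-linear trend, using (18.61)-(18.62).
import numpy as np
import statsmodels.api as sm

n = len(y)
t = np.arange(1, n + 1)
trend = sm.OLS(np.log(y), sm.add_constant(t)).fit()
a, b = trend.params                                   # alpha-hat, beta-hat

fitted_level = np.exp(trend.fittedvalues)             # exp of fitted log values
gamma = sm.OLS(y, fitted_level).fit().params.iloc[0]  # no intercept: through origin

h = 4                                                 # illustrative horizon
forecast = gamma * np.exp(a + b * (n + h))            # (18.62)
print(forecast)
```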
Although trend models can be useful for prediction, they must be used with caution, especially when forecasting integrated series that have drift far into the future. The potential problem can be seen by considering a random walk with drift. At time t + h, we can write $y_{t+h}$ as

$y_{t+h} = \beta h + y_t + u_{t+1} + \dots + u_{t+h}$,

where $\beta$ is the drift term (usually $\beta > 0$) and each $u_{t+j}$ has zero mean given $I_t$ and constant variance $\sigma^2$. As we saw earlier, the forecast of $y_{t+h}$ at time t is $E(y_{t+h}|I_t) = \beta h + y_t$, and the forecast error variance is $\sigma^2 h$. What happens if we use a linear trend model? Let $y_0$ be the initial value of the process at time zero, which we take as nonrandom. Then, we can also write

$y_{t+h} = y_0 + \beta(t + h) + u_1 + u_2 + \dots + u_{t+h} = y_0 + \beta(t + h) + v_{t+h}$.

This looks like a linear trend model with the intercept $\alpha = y_0$. But the error, $v_{t+h}$, while having mean zero, has variance $\sigma^2(t + h)$. Therefore, if we use the linear trend $y_0 + \beta(t + h)$ to forecast $y_{t+h}$ at time t, the forecast error variance is $\sigma^2(t + h)$, compared with $\sigma^2 h$ when we use $\beta h + y_t$. The ratio of the forecast variances is (t + h)/h, which can be big for large t. The bottom line is that we should not use a linear trend to forecast a random walk with drift. (Computer Exercise C8 asks you to compare forecasts from a cubic trend line and those from the simple random walk model for the general fertility rate in the United States.)

Deterministic trends can also produce poor forecasts if the trend parameters are estimated using old data and the process has a subsequent shift in the trend line. Sometimes, exogenous shocks, such as the oil crises of the 1970s, can change the trajectory of trending variables. If an old trend line is used to forecast far into the future, the forecasts can be way off. This problem can be mitigated by using the most recent data available to obtain the trend line parameters.

Nothing prevents us from combining trends with other models for forecasting. For example, we can add a linear trend to an AR(1) model, which can work well for forecasting series with linear trends that are also stable AR processes around the trend.

It is also straightforward to forecast processes with deterministic seasonality (monthly or quarterly series). For example, the file BARIUM contains the monthly production of gasoline in the United States from 1978 through 1988. This series has no obvious trend, but it does have a strong seasonal pattern. (Gasoline production is higher in the summer months and in December.) In the simplest model, we would regress gas (measured in gallons) on 11 month dummies, say, for February through December. Then, the forecast for any future month is simply the intercept plus the coefficient on the appropriate month dummy. (For January, the forecast is just the intercept in the regression.) We can also add lags of variables and time trends to allow for general series with seasonality; a minimal sketch of the dummy variable approach follows.
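Here is the promised sketch, assuming df is a DataFrame with a numeric 'gas' column and a 'month' column coded 1 through 12 (hypothetical names); C(month) has the formula machinery create the 11 dummies, with January absorbed into the intercept.

```python
# Forecasting a monthly series with deterministic seasonality via month dummies.
import pandas as pd
import statsmodels.formula.api as smf

fit = smf.ols('gas ~ C(month)', data=df).fit()   # 11 month dummies + intercept

# The forecast for any future month is the intercept plus that month's
# coefficient; predict() handles the dummy bookkeeping automatically.
future = pd.DataFrame({'month': [1, 7, 12]})
print(fit.predict(future))
```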
Forecasting processes with unit roots also deserves special attention. Earlier, we obtained the expected value of a random walk conditional on information through time n. To forecast a random walk, with possible drift $\alpha$, h periods into the future at time n, we use $\hat{f}_{n,h} = \hat{\alpha}h + y_n$, where $\hat{\alpha}$ is the sample average of the $\Delta y_t$ up through t = n. (If there is no drift, we set $\hat{\alpha} = 0$.) This approach imposes the unit root. An alternative would be to estimate an AR(1) model for $\{y_t\}$ and to use the forecast formula (18.55). This approach does not impose a unit root, but, if one is present, $\hat{\rho}$ converges in probability to one as n gets large. Nevertheless, $\hat{\rho}$ can be substantially different from one, especially if the sample size is not very large. The matter of which approach produces better out-of-sample forecasts is an empirical issue. If, in the AR(1) model, $\rho$ is less than one, even slightly, the AR(1) model will tend to produce better long-run forecasts.

Generally, there are two approaches to producing forecasts for I(1) processes. The first is to impose a unit root. For a one-step-ahead forecast, we obtain a model to forecast the change in y, $\Delta y_{t+1}$, given information through time t. Then, because $y_{t+1} = \Delta y_{t+1} + y_t$, $E(y_{t+1}|I_t) = E(\Delta y_{t+1}|I_t) + y_t$. Therefore, our forecast of $y_{n+1}$ at time n is just $\hat{f}_n = \hat{g}_n + y_n$, where $\hat{g}_n$ is the forecast of $\Delta y_{n+1}$ at time n. Typically, an AR model (which is necessarily stable), or a vector autoregression, is used for $\Delta y_t$.

This can be extended to multiple-step-ahead forecasts by writing $y_{n+h}$ as $y_{n+h} = (y_{n+h} - y_{n+h-1}) + (y_{n+h-1} - y_{n+h-2}) + \dots + (y_{n+1} - y_n) + y_n$, or

$y_{n+h} = \Delta y_{n+h} + \Delta y_{n+h-1} + \dots + \Delta y_{n+1} + y_n$.

Therefore, the forecast of $y_{n+h}$ at time n is

$\hat{f}_{n,h} = \hat{g}_{n,h} + \hat{g}_{n,h-1} + \dots + \hat{g}_{n,1} + y_n$,   (18.63)

where $\hat{g}_{n,j}$ is the forecast of $\Delta y_{n+j}$ at time n. For example, we might model $\Delta y_t$ as a stable AR(1) and obtain the multiple-step-ahead forecasts from (18.55), but with $\hat{\alpha}$ and $\hat{\rho}$ obtained from the regression of $\Delta y_t$ on $\Delta y_{t-1}$ and with $y_n$ replaced by $\Delta y_n$, and then plug these into (18.63).

The second approach to forecasting I(1) variables is to use a general AR or VAR model for $\{y_t\}$. This does not impose the unit root. For example, if we use an AR(2) model,

$y_t = \alpha + \rho_1 y_{t-1} + \rho_2 y_{t-2} + u_t$,   (18.64)

then a unit root corresponds to $\rho_1 + \rho_2 = 1$. If we plug in $\rho_1 = 1 - \rho_2$ and rearrange, we obtain $\Delta y_t = \alpha - \rho_2\Delta y_{t-1} + u_t$, which is a stable AR(1) model in the difference that takes us back to the first approach described earlier. Nothing prevents us from estimating (18.64) directly by OLS. One nice thing about this regression is that we can use the usual t statistic on $\hat{\rho}_2$ to determine if $y_{t-2}$ is significant. (This assumes that the homoskedasticity assumption holds; if not, we can use the heteroskedasticity-robust form.) We will not show this formally, but, intuitively, it follows by rewriting the equation as $y_t = \alpha + \gamma y_{t-1} - \rho_2\Delta y_{t-1} + u_t$, where $\gamma = \rho_1 + \rho_2$. Even if $\gamma = 1$, $\rho_2$ is minus the coefficient on a stationary, weakly dependent process, $\{\Delta y_{t-1}\}$. Because the regression results will be identical to (18.64), we can use (18.64) directly.

As an example, let us estimate an AR(2) model for the general fertility rate in FERTIL3, using the observations through 1979. (In Computer Exercise C8, you are asked to use this model for forecasting, which is why we save some observations at the end of the sample.)

$\widehat{gfr}_t = 3.22 + 1.272\, gfr_{t-1} - .311\, gfr_{t-2}$   (18.65)
(standard errors: 2.92, .120, and .121)
n = 65, $R^2$ = .949, $\bar{R}^2$ = .947.

The t statistic on the second lag is about -2.57, which is statistically different from zero at about the 1% level. (The first lag also has a very significant t statistic, which has an approximate t distribution by the same reasoning used for $\hat{\rho}_2$.) The R-squared, adjusted or not, is not especially informative as a goodness-of-fit measure because gfr apparently contains a unit root, and it makes little sense to ask how much of the variance in gfr we are explaining. The coefficients on the two lags in (18.65) add up to .961, which is close to, and not statistically different from, one [as can be verified by applying the augmented Dickey-Fuller test to the equation $\Delta gfr_t = \alpha + \theta gfr_{t-1} + \delta_1\Delta gfr_{t-1} + u_t$].
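The first ("impose the unit root") approach is summarized in the following sketch, which models the change as a stable AR(1) and cumulates the forecasted changes back to levels, as in (18.63); the Series y is hypothetical.

```python
# Forecast the change with a stable AR(1), then cumulate back to levels.
import pandas as pd
import statsmodels.formula.api as smf

dy = y.diff()
d = pd.DataFrame({'dy': dy, 'dy1': dy.shift(1)}).dropna()
fit = smf.ols('dy ~ dy1', data=d).fit()
a, rho = fit.params['Intercept'], fit.params['dy1']

g = dy.iloc[-1]                  # Delta y_n, the last observed change
level = y.iloc[-1]               # y_n
for h in range(1, 4):
    g = a + rho * g              # g_{n,h}: forecast of Delta y_{n+h}
    level = level + g            # f_{n,h} = g_{n,h} + ... + g_{n,1} + y_n
    print(h, level)
```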
Even though we have not imposed the unit root restriction, we can still use (18.65) for forecasting, as we discussed earlier.

Before ending this section, we point out one potential improvement in forecasting in the context of vector autoregressive models with I(1) variables. Suppose $\{y_t\}$ and $\{z_t\}$ are each I(1) processes. One approach for obtaining forecasts of y is to estimate a bivariate autoregression in the variables $\Delta y_t$ and $\Delta z_t$ and then to use (18.63) to generate one- or multiple-step-ahead forecasts; this is essentially the first approach we described earlier. However, if $y_t$ and $z_t$ are cointegrated, we have more stationary, stable variables in the information set that can be used in forecasting $\Delta y$: namely, lags of $y_t - \beta z_t$, where $\beta$ is the cointegrating parameter. A simple error correction model is

$\Delta y_t = \alpha_0 + \alpha_1 \Delta y_{t-1} + \gamma_1 \Delta z_{t-1} + \delta(y_{t-1} - \beta z_{t-1}) + e_t$,
$E(e_t|I_{t-1}) = 0$.   (18.66)

To forecast $y_{n+1}$, we use observations up through n to estimate the cointegrating parameter, $\beta$, and then estimate the parameters of the error correction model by OLS, as described in Section 18.4. Forecasting $\Delta y_{n+1}$ is easy: we just plug $\Delta y_n$, $\Delta z_n$, and $y_n - \hat{\beta}z_n$ into the estimated equation. Having obtained the forecast of $\Delta y_{n+1}$, we add it to $y_n$.

By rearranging the error correction model, we can write

$y_t = \alpha_0 + \rho_1 y_{t-1} + \rho_2 y_{t-2} + \delta_1 z_{t-1} + \delta_2 z_{t-2} + u_t$,   (18.67)

where $\rho_1 = 1 + \alpha_1 + \delta$, $\rho_2 = -\alpha_1$, and so on, which is the first equation in a VAR model for $y_t$ and $z_t$. Notice that this depends on five parameters, just as many as in the error correction model. The point is that, for the purposes of forecasting, the VAR model in the levels and the error correction model are essentially the same. This is not the case in more general error correction models. For example, suppose that $\alpha_1 = \gamma_1 = 0$ in (18.66), but we have a second error correction term, $\delta_2(y_{t-2} - \beta z_{t-2})$. Then, the error correction model involves only four parameters, whereas (18.67), which has the same order of lags for y and z, contains five parameters. Thus, error correction models can economize on parameters; that is, they are generally more parsimonious than VARs in levels.

If $y_t$ and $z_t$ are I(1) but not cointegrated, the appropriate model is (18.66) without the error correction term. This can be used to forecast $\Delta y_{n+1}$, and we can add this to $y_n$ to forecast $y_{n+1}$.

Summary

The time series topics covered in this chapter are used routinely in empirical macroeconomics, empirical finance, and a variety of other applied fields. We began by showing how infinite distributed lag models can be interpreted and estimated. These can provide flexible lag distributions with fewer parameters than a similar finite distributed lag model. The geometric distributed lag and, more generally, rational distributed lag models are the most popular. They can be estimated using standard econometric procedures on simple dynamic equations.

Testing for a unit root has become very common in time series econometrics. If a series has a unit root, then, in many cases, the usual large sample normal approximations are no longer valid. In addition, a unit root process has the property that an innovation has a long-lasting effect, which is of interest in its own right.
While there are many tests for unit roots, the Dickey-Fuller t test, and its extension, the augmented Dickey-Fuller test, is probably the most popular and easiest to implement. We can allow for a linear trend when testing for unit roots by adding a trend to the Dickey-Fuller regression.

When an I(1) series, $y_t$, is regressed on another I(1) series, $x_t$, there is serious concern about spurious regression, even if the series do not contain obvious trends. This has been studied thoroughly in the case of a random walk: even if the two random walks are independent, the usual t test for significance of the slope coefficient, based on the usual critical values, will reject much more often than the nominal size of the test. In addition, the $R^2$ tends to a random variable, rather than to zero, as would be the case if we regressed the difference in $y_t$ on the difference in $x_t$.

In one important case, a regression involving I(1) variables is not spurious, and that is when the series are cointegrated. This means that a linear function of the two I(1) variables is I(0). If $y_t$ and $x_t$ are I(1) but $y_t - x_t$ is I(0), $y_t$ and $x_t$ cannot drift arbitrarily far apart. There are simple tests of the null of no cointegration against the alternative of cointegration, one of which is based on applying a Dickey-Fuller unit root test to the residuals from a static regression. There are also simple estimators of the cointegrating parameter that yield t statistics with approximate standard normal distributions (and asymptotically valid confidence intervals). We covered the leads and lags estimator in Section 18.4.

Cointegration between $y_t$ and $x_t$ implies that error correction terms may appear in a model relating $\Delta y_t$ to $\Delta x_t$; the error correction terms are lags in $y_t - \beta x_t$, where $\beta$ is the cointegrating parameter. A simple two-step estimation procedure is available for estimating error correction models: first, $\beta$ is estimated using a static regression (or the leads and lags regression); then, OLS is used to estimate a simple dynamic model in first differences that includes the error correction terms.

Section 18.5 contained an introduction to forecasting, with emphasis on regression-based forecasting methods. Static models, or, more generally, models that contain explanatory variables dated contemporaneously with the dependent variable, are limited because then the explanatory variables need to be forecasted. If we plug in hypothesized values of unknown future explanatory variables, we obtain a conditional forecast. Unconditional forecasts are similar to simply modeling $y_t$ as a function of past information we have observed at the time the forecast is needed. Dynamic regression models, including autoregressions and vector autoregressions, are used routinely. In addition to obtaining one-step-ahead point forecasts, we also discussed the construction of forecast intervals, which are very similar to prediction intervals.

Various criteria are used for choosing among forecasting methods.
Key Terms

Augmented Dickey-Fuller Test; Cointegration; Conditional Forecast; Dickey-Fuller Distribution; Dickey-Fuller (DF) Test; Engle-Granger Test; Engle-Granger Two-Step Procedure; Error Correction Model; Exponential Smoothing; Forecast Error; Forecast Interval; Geometric (or Koyck) Distributed Lag; Granger Causality; Infinite Distributed Lag (IDL) Model; Information Set; In-Sample Criteria; Leads and Lags Estimator; Loss Function; Martingale; Martingale Difference Sequence; Mean Absolute Error (MAE); Multiple-Step-Ahead Forecast; One-Step-Ahead Forecast; Out-of-Sample Criteria; Point Forecast; Rational Distributed Lag (RDL) Model; Root Mean Squared Error (RMSE); Spurious Regression Problem; Unconditional Forecast; Unit Roots; Vector Autoregressive (VAR) Model

Problems

1 Consider equation (18.15) with $k = 2$. Using the IV approach to estimating the $\gamma_h$ and $\rho$, what would you use as instruments for $y_{t-1}$?

2 An interesting economic model that leads to an econometric model with a lagged dependent variable relates $y_t$ to the expected value of $x_t$, say, $x_t^*$, where the expectation is based on all observed information at time $t-1$:
$$y_t = \alpha_0 + \alpha_1 x_t^* + u_t. \quad (18.68)$$
A natural assumption on $\{u_t\}$ is that $E(u_t \mid I_{t-1}) = 0$, where $I_{t-1}$ denotes all information on $y$ and $x$ observed at time $t-1$; this means that $E(y_t \mid I_{t-1}) = \alpha_0 + \alpha_1 x_t^*$. To complete this model, we need an assumption about how the expectation $x_t^*$ is formed. We saw a simple example of adaptive expectations in Section 11.2, where $x_t^* = x_{t-1}$. A more complicated adaptive expectations scheme is
$$x_t^* - x_{t-1}^* = \lambda(x_{t-1} - x_{t-1}^*), \quad (18.69)$$
where $0 < \lambda < 1$. This equation implies that the change in expectations reacts to whether last period's realized value was above or below its expectation. The assumption $0 < \lambda < 1$ implies that the change in expectations is a fraction of last period's error.
(i) Show that the two equations imply that
$$y_t = \lambda\alpha_0 + (1-\lambda)y_{t-1} + \lambda\alpha_1 x_{t-1} + u_t - (1-\lambda)u_{t-1}.$$
[Hint: Lag equation (18.68) one period, multiply it by $(1-\lambda)$, and subtract this from (18.68). Then, use (18.69).]
(ii) Under $E(u_t \mid I_{t-1}) = 0$, $\{u_t\}$ is serially uncorrelated. What does this imply about the new errors, $v_t = u_t - (1-\lambda)u_{t-1}$?
(iii) If we write the equation from part (i) as
$$y_t = \beta_0 + \beta_1 y_{t-1} + \beta_2 x_{t-1} + v_t,$$
how would you consistently estimate the $\beta_j$?
(iv) Given consistent estimators of the $\beta_j$, how would you consistently estimate $\lambda$ and $\alpha_1$?
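As a reminder of the mechanics behind the IV approach referenced in Problems 1 and 2(iii), and not as an answer to either problem, here is a sketch of two stage least squares done "by hand" for a model with an endogenous lagged dependent variable. The DataFrame `df` with columns `y`, `x`, and a candidate instrument `z` is a placeholder; the instrument choice itself is what the problems ask you to reason about.

```python
import statsmodels.api as sm

# Model sketch: y_t = a + g*x_t + r*y_{t-1} + v_t, with y_{t-1}
# endogenous because v_t is serially correlated; z is a placeholder
# instrument. df is a hypothetical DataFrame with columns y, x, z.
df = df.assign(y_lag1=df["y"].shift(1)).dropna()

# First stage: regress the endogenous regressor on the exogenous
# variable(s) and the instrument; save the fitted values.
first = sm.OLS(df["y_lag1"], sm.add_constant(df[["x", "z"]])).fit()
df["y_lag1_hat"] = first.fittedvalues

# Second stage: replace y_{t-1} with its fitted value. The point
# estimates equal 2SLS, but these OLS standard errors are NOT the
# correct 2SLS standard errors; use a proper IV routine in practice.
second = sm.OLS(df["y"], sm.add_constant(df[["x", "y_lag1_hat"]])).fit()
print(second.params)
```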
3 Suppose that $\{y_t\}$ and $\{z_t\}$ are I(1) series, but $y_t - \beta z_t$ is I(0) for some $\beta \neq 0$. Show that for any $\delta \neq \beta$, $y_t - \delta z_t$ must be I(1).

4 Consider the error correction model in equation (18.37). Show that if you add another lag of the error correction term, $y_{t-2} - \beta x_{t-2}$, the equation suffers from perfect collinearity. [Hint: Show that $y_{t-2} - \beta x_{t-2}$ is a perfect linear function of $y_{t-1} - \beta x_{t-1}$, $\Delta x_{t-1}$, and $\Delta y_{t-1}$.]

5 Suppose the process $\{(x_t, y_t): t = 0, 1, 2, \ldots\}$ satisfies the equations
$$y_t = \beta x_t + u_t$$
and
$$\Delta x_t = \gamma \Delta x_{t-1} + v_t,$$
where $E(u_t \mid I_{t-1}) = E(v_t \mid I_{t-1}) = 0$, $I_{t-1}$ contains information on $x$ and $y$ dated at time $t-1$ and earlier, $\beta \neq 0$, and $|\gamma| < 1$ [so that $x_t$, and therefore $y_t$, is I(1)]. Show that these two equations imply an error correction model of the form
$$\Delta y_t = \gamma_1 \Delta x_{t-1} + \delta(y_{t-1} - \beta x_{t-1}) + e_t,$$
where $\gamma_1 = \beta\gamma$, $\delta = -1$, and $e_t = u_t + \beta v_t$. [Hint: First, subtract $y_{t-1}$ from both sides of the first equation. Then, add and subtract $\beta x_{t-1}$ from the right-hand side and rearrange. Finally, use the second equation to get the error correction model that contains $\Delta x_{t-1}$.]

6 Using the monthly data in VOLAT, the following model was estimated:
$$\widehat{pcip} = 1.54 + .344\,pcip_{-1} + .074\,pcip_{-2} + .073\,pcip_{-3} + .031\,pcsp_{-1}$$
$$\qquad\quad (.56)\quad (.042)\qquad\ (.045)\qquad\ (.042)\qquad\ (.013)$$
$$n = 554,\quad R^2 = .174,\quad \bar{R}^2 = .168,$$
where pcip is the percentage change in monthly industrial production, at an annualized rate, and pcsp is the percentage change in the Standard & Poor's 500 Index, also at an annualized rate.
(i) If the past three months of pcip are zero and $pcsp_{-1} = 0$, what is the predicted growth in industrial production for this month? Is it statistically different from zero?
(ii) If the past three months of pcip are zero but $pcsp_{-1} = 10$, what is the predicted growth in industrial production?
(iii) What do you conclude about the effects of the stock market on real economic activity?

7 Let $gM_t$ be the annual growth in the money supply and let $unem_t$ be the unemployment rate. Assuming that $unem_t$ follows a stable AR(1) process, explain in detail how you would test whether $gM$ Granger causes $unem$.

8 Suppose that $y_t$ follows the model
$$y_t = \alpha + \delta_1 z_{t-1} + u_t$$
$$u_t = \rho u_{t-1} + e_t$$
$$E(e_t \mid I_{t-1}) = 0,$$
where $I_{t-1}$ contains $y$ and $z$ dated at $t-1$ and earlier.
(i) Show that $E(y_{t+1} \mid I_t) = (1-\rho)\alpha + \rho y_t + \delta_1 z_t - \rho\delta_1 z_{t-1}$. (Hint: Write $u_{t-1} = y_{t-1} - \alpha - \delta_1 z_{t-2}$ and plug this into the second equation; then, plug the result into the first equation and take the conditional expectation.)
(ii) Suppose that you use $n$ observations to estimate $\alpha$, $\delta_1$, and $\rho$. Write the equation for forecasting $y_{n+1}$.
(iii) Explain why the model with one lag of $z$ and AR(1) serial correlation is a special case of the model
$$y_t = \alpha_0 + \rho y_{t-1} + \gamma_1 z_{t-1} + \gamma_2 z_{t-2} + e_t.$$
(iv) What does part (iii) suggest about using models with AR(1) serial correlation for forecasting?

9 Let $\{y_t\}$ be an I(1) sequence. Suppose that $\hat{g}_n$ is the one-step-ahead forecast of $\Delta y_{n+1}$ and let $\hat{f}_n = \hat{g}_n + y_n$ be the one-step-ahead forecast of $y_{n+1}$. Explain why the forecast errors for forecasting $\Delta y_{n+1}$ and $y_{n+1}$ are identical.
Computer Exercises

C1 Use the data in WAGEPRC for this exercise. Problem 5 in Chapter 11 gave estimates of a finite distributed lag model of gprice on gwage, where 12 lags of gwage are used.
(i) Estimate a simple geometric DL model of gprice on gwage. In particular, estimate equation (18.11) by OLS. What are the estimated impact propensity and LRP? Sketch the estimated lag distribution.
(ii) Compare the estimated IP and LRP to those obtained in Problem 5 in Chapter 11. How do the estimated lag distributions compare?
(iii) Now, estimate the rational distributed lag model from (18.16). Sketch the lag distribution and compare the estimated IP and LRP to those obtained in part (ii).

C2 Use the data in HSEINV for this exercise.
(i) Test for a unit root in log(invpc), including a linear time trend and two lags of $\Delta\log(invpc_t)$. Use a 5% significance level.
(ii) Use the approach from part (i) to test for a unit root in log(price).
(iii) Given the outcomes in parts (i) and (ii), does it make sense to test for cointegration between log(invpc) and log(price)?

C3 Use the data in VOLAT for this exercise.
(i) Estimate an AR(3) model for pcip. Now, add a fourth lag and verify that it is very insignificant.
(ii) To the AR(3) model from part (i), add three lags of pcsp to test whether pcsp Granger causes pcip. Carefully state your conclusion.
(iii) To the model in part (ii), add three lags of the change in i3, the three-month T-bill rate. Does pcsp Granger cause pcip conditional on past $\Delta i3$?

C4 In testing for cointegration between gfr and pe in Example 18.5, add $t^2$ to equation (18.32) to obtain the OLS residuals. Include one lag in the augmented DF test. The 5% critical value for the test is $-4.15$.

C5 Use INTQRT for this exercise.
(i) In Example 18.7, we estimated an error correction model for the holding yield on six-month T-bills, where one lag of the holding yield on three-month T-bills is the explanatory variable. We assumed that the cointegration parameter was one in the equation $hy6_t = \alpha + \beta hy3_{t-1} + u_t$. Now, add the lead change, $\Delta hy3_t$, the contemporaneous change, $\Delta hy3_{t-1}$, and the lagged change, $\Delta hy3_{t-2}$, of $hy3_{t-1}$. That is, estimate the equation
$$hy6_t = \alpha + \beta hy3_{t-1} + \phi_0 \Delta hy3_t + \phi_1 \Delta hy3_{t-1} + \rho_1 \Delta hy3_{t-2} + e_t$$
and report the results in equation form. Test $H_0\colon \beta = 1$ against a two-sided alternative. Assume that the lead and lag are sufficient so that $\{hy3_{t-1}\}$ is strictly exogenous in this equation, and do not worry about serial correlation.
(ii) To the error correction model in (18.39), add $\Delta hy3_{t-2}$ and $(hy6_{t-2} - hy3_{t-3})$. Are these terms jointly significant? What do you conclude about the appropriate error correction model?
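For an exercise like C3, the Granger causality test amounts to an F test of the joint significance of the pcsp lags. A sketch of the mechanics in Python follows; it assumes a hypothetical DataFrame `df` in which the lags (`pcip_1`, ..., `pcsp_3`) have already been created with `shift()` and rows with missing lags dropped, so both models use the same sample.

```python
import statsmodels.formula.api as smf

# Restricted model: pcip on three of its own lags.
restricted = smf.ols("pcip ~ pcip_1 + pcip_2 + pcip_3", data=df).fit()

# Unrestricted model: add three lags of pcsp.
unrestricted = smf.ols(
    "pcip ~ pcip_1 + pcip_2 + pcip_3 + pcsp_1 + pcsp_2 + pcsp_3",
    data=df).fit()

# F test of H0: the pcsp lags are jointly zero (no Granger causality).
f_stat, p_value, df_diff = unrestricted.compare_f_test(restricted)
print(f"F = {f_stat:.3f}, p-value = {p_value:.4f}")
```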
C6 Use the data in PHILLIPS to answer these questions.
(i) Estimate the models in (18.48) and (18.49) using the data through 1997. Do the parameter estimates change much compared with (18.48) and (18.49)?
(ii) Use the new equations to forecast $unem_{1998}$; round to two places after the decimal. Which equation produces a better forecast?
(iii) As we discussed in the text, the forecast for $unem_{1998}$ using (18.49) is 4.90. Compare this with the forecast obtained using the data through 1997. Does using the extra year of data to obtain the parameter estimates produce a better forecast?
(iv) Use the model estimated in (18.48) to obtain a two-step-ahead forecast of unem. That is, forecast $unem_{1998}$ using equation (18.55) with $\hat{\alpha} = 1.572$, $\hat{\rho} = .732$, and $h = 2$. Is this better or worse than the one-step-ahead forecast obtained by plugging $unem_{1997} = 4.9$ into (18.48)?

C7 Use the data in BARIUM for this exercise.
(i) Estimate the linear trend model $chnimp_t = \alpha + \beta t + u_t$, using the first 119 observations (this excludes the last 12 months of observations for 1988). What is the standard error of the regression?
(ii) Now, estimate an AR(1) model for chnimp, again using all data but the last 12 months. Compare the standard error of the regression with that from part (i). Which model provides a better in-sample fit?
(iii) Use the models from parts (i) and (ii) to compute the one-step-ahead forecast errors for the 12 months in 1988. (You should obtain 12 forecast errors for each method.) Compute and compare the RMSEs and the MAEs for the two methods. Which forecasting method works better out-of-sample for one-step-ahead forecasts?
(iv) Add monthly dummy variables to the regression from part (i). Are these jointly significant? (Do not worry about the slight serial correlation in the errors from this regression when doing the joint test.)

C8 Use the data in FERTIL3 for this exercise.
(i) Graph gfr against time. Does it contain a clear upward or downward trend over the entire sample period?
(ii) Using the data through 1979, estimate a cubic time trend model for gfr (that is, regress gfr on $t$, $t^2$, and $t^3$, along with an intercept). Comment on the R-squared of the regression.
(iii) Using the model in part (ii), compute the mean absolute error of the one-step-ahead forecast errors for the years 1980 through 1984.
(iv) Using the data through 1979, regress $\Delta gfr_t$ on a constant only. Is the constant statistically different from zero? Does it make sense to assume that any drift term is zero, if we assume that $gfr_t$ follows a random walk?
(v) Now, forecast gfr for 1980 through 1984, using a random walk model: the forecast of $gfr_{n+1}$ is simply $gfr_n$. Find the MAE. How does it compare with the MAE from part (iii)? Which method of forecasting do you prefer?
(vi) Now, estimate an AR(2) model for gfr, again using the data only through 1979. Is the second lag significant?
(vii) Obtain the MAE for 1980 through 1984, using the AR(2) model. Does this more general model work better out-of-sample than the random walk model?

C9 Use CONSUMP for this exercise.
(i) Let $y_t$ be real per capita disposable income. Use the data through 1989 to estimate the model
$$y_t = \alpha + \beta t + \rho y_{t-1} + u_t$$
and report the results in the usual form.
(ii) Use the estimated equation from part (i) to forecast $y$ in 1990. What is the forecast error?
(iii) Compute the mean absolute error of the one-step-ahead forecasts for the 1990s, using the parameters estimated in part (i).
(iv) Now, compute the MAE over the same period, but drop $y_{t-1}$ from the equation. Is it better to include $y_{t-1}$ in the model or not?
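Exercises such as C7 and C9 ask you to compare forecasting methods by their out-of-sample RMSE and MAE. A small helper of the following kind does the bookkeeping; the inputs are hypothetical arrays of actual values and one-step-ahead forecasts for the holdout period.

```python
import numpy as np

def rmse_mae(actual, forecast):
    """RMSE and MAE of the forecast errors for a holdout period."""
    errors = np.asarray(actual) - np.asarray(forecast)
    rmse = np.sqrt(np.mean(errors ** 2))
    mae = np.mean(np.abs(errors))
    return rmse, mae

# Hypothetical usage: form each holdout forecast from parameters
# estimated on the earlier sample, collect the 12 errors per method,
# then compare rmse_mae(actual_1988, trend_forecasts) with
# rmse_mae(actual_1988, ar1_forecasts).
```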
C10 Use the data in INTQRT for this exercise.
(i) Using the data from all but the last four years (16 quarters), estimate an AR(1) model for $\Delta r6_t$. (We use the difference because it appears that $r6_t$ has a unit root.) Find the RMSE of the one-step-ahead forecasts for $\Delta r6$, using the last 16 quarters.
(ii) Now, add the error correction term $spr_{t-1} = r6_{t-1} - r3_{t-1}$ to the equation from part (i). (This assumes that the cointegrating parameter is one.) Compute the RMSE for the last 16 quarters. Does the error correction term help with out-of-sample forecasting in this case?
(iii) Now, estimate the cointegrating parameter, rather than setting it to one. Use the last 16 quarters again to produce the out-of-sample RMSE. How does this compare with the forecasts from parts (i) and (ii)?
(iv) Would your conclusions change if you wanted to predict r6, rather than $\Delta r6$? Explain.

C11 Use the data in VOLAT for this exercise.
(i) Confirm that lsp500 = log(sp500) and lip = log(ip) appear to contain unit roots. Use Dickey-Fuller tests with four lagged changes, and do the tests with and without a linear time trend.
(ii) Run a simple regression of lsp500 on lip. Comment on the sizes of the $t$ statistic and R-squared.
(iii) Use the residuals from part (ii) to test whether lsp500 and lip are cointegrated. Use the standard Dickey-Fuller test and the ADF test with two lags. What do you conclude?
(iv) Add a linear time trend to the regression from part (ii) and now test for cointegration using the same tests from part (iii).
(v) Does it appear that stock prices and real economic activity have a long-run equilibrium relationship?

C12 This exercise also uses the data from VOLAT. Computer Exercise C11 studies the long-run relationship between stock prices and industrial production. Here, you will study the question of Granger causality using the percentage changes.
(i) Estimate an AR(3) model for $pcip_t$, the percentage change in industrial production (reported at an annualized rate). Show that the second and third lags are jointly significant at the 2.5% level.
(ii) Add one lag of $pcsp_t$ to the equation estimated in part (i). Is the lag statistically significant? What does this tell you about Granger causality between the growth in industrial production and the growth in stock prices?
(iii) Redo part (ii), but obtain a heteroskedasticity-robust $t$ statistic. Does the robust test change your conclusions from part (ii)?

C13 Use the data in TRAFFIC2 for this exercise. These monthly data on traffic accidents in California over the years 1981 to 1989 were used in Computer Exercise C11 in Chapter 10.
(i) Using the standard Dickey-Fuller regression, test whether $ltotacc_t$ has a unit root. Can you reject a unit root at the 2.5% level?
(ii) Now, add two lagged changes to the test from part (i) and compute the augmented Dickey-Fuller test. What do you conclude?
(iii) Add a linear time trend to the ADF regression from part (ii). Now what happens?
(iv) Given the findings from parts (i) through (iii), what would you say is the best characterization of $ltotacc_t$: an I(1) process or an I(0) process about a linear time trend?
(v) Test the percentage of fatalities, $prcfat_t$, for a unit root, using two lags in an ADF regression. In this case, does it matter whether you include a linear time trend?
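Several of these exercises rely on the two-step error correction procedure summarized earlier in the chapter. A minimal sketch of the mechanics follows, with placeholder column names `y` and `x` in a hypothetical DataFrame `df` holding the two I(1) series.

```python
import statsmodels.api as sm

# Step 1: static cointegrating regression; the residuals measure the
# deviation from the estimated long-run relationship.
step1 = sm.OLS(df["y"], sm.add_constant(df["x"])).fit()
df["ec"] = step1.resid

# Step 2: regress the change in y on a lagged change in x and the
# lagged error correction term.
df["dy"] = df["y"].diff()
df["dx_lag1"] = df["x"].diff().shift(1)
df["ec_lag1"] = df["ec"].shift(1)
ecm_data = df.dropna()

step2 = sm.OLS(ecm_data["dy"],
               sm.add_constant(ecm_data[["dx_lag1", "ec_lag1"]])).fit()
print(step2.params)  # coefficient on ec_lag1 is the adjustment term
```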
C14 Use the data in MINWAGE.DTA for sector 232 to answer the following questions.
(i) Confirm that $lwage232_t$ and $lemp232_t$ are best characterized as I(1) processes. Use the augmented DF test with one lag of gwage232 and gemp232, respectively, and a linear time trend. Is there any doubt that these series should be assumed to have unit roots?
(ii) Regress $lemp232_t$ on $lwage232_t$ and test for cointegration, both with and without a time trend, allowing for two lags in the augmented Engle-Granger test. What do you conclude?
(iii) Now, regress $lemp232_t$ on the log of the real wage rate, $lrwage232_t = lwage232_t - lcpi_t$, and a time trend. Do you find cointegration? Are they "closer" to being cointegrated when you use real wages rather than nominal wages?
(iv) What are some factors that might be missing from the cointegrating regression in part (iii)?

C15 This question asks you to study the so-called Beveridge Curve from the perspective of cointegration analysis. The U.S. monthly data from December 2000 through February 2012 are in BEVERIDGE.
(i) Test for a unit root in urate using the usual Dickey-Fuller test (with a constant) and the augmented DF with two lags of curate. What do you conclude? Are the lags of curate in the augmented DF test statistically significant? Does it matter to the outcome of the unit root test?
(ii) Repeat part (i) but with the vacancy rate, vrate.
(iii) Assuming that urate and vrate are both I(1), the Beveridge Curve,
$$urate_t = \alpha + \beta vrate_t + u_t,$$
only makes sense if urate and vrate are cointegrated (with cointegrating parameter $\beta < 0$). Test for cointegration using the Engle-Granger test with no lags. Are urate and vrate cointegrated at the 10% significance level? What about at the 5% level?
(iv) Obtain the leads and lags estimator with $cvrate_t$, $cvrate_{t-1}$, and $cvrate_{t+1}$ as the I(0) explanatory variables added to the equation in part (iii). Obtain the Newey-West standard error for $\hat{\beta}$ using four lags (so $g = 4$ in the notation of Section 12.5). What is the resulting 95% confidence interval for $\beta$? How does it compare with the confidence interval that is not robust to serial correlation (or heteroskedasticity)?
(v) Redo the Engle-Granger test, but with two lags in the augmented DF regression. What happens? What do you conclude about the robustness of the claim that urate and vrate are cointegrated?

Chapter 19  Carrying Out an Empirical Project

In this chapter, we discuss the ingredients of a successful empirical analysis, with emphasis on completing a term project. In addition to reminding you of the important issues that have arisen throughout the text, we emphasize recurring themes that are important for applied research. We also provide suggestions for topics as a way of stimulating your imagination. Several sources of economic research and data are given as references.

19.1 Posing a Question

The importance of posing a very specific question that, in principle, can be answered with data cannot be overstated. Without being explicit about the goal of your analysis, you cannot know where to begin. The widespread availability of rich data sets makes it tempting to launch into data collection based on half-baked ideas, but this is often counterproductive.
It is likely that, without carefully formulating your hypotheses and the kind of model you will need to estimate, you will forget to collect information on important variables, obtain a sample from the wrong population, or collect data for the wrong time period.

This does not mean that you should pose your question in a vacuum. Especially for a one-term project, you cannot be too ambitious. Therefore, when choosing a topic, you should be reasonably sure that data sources exist that will allow you to answer your question in the allotted time.

You need to decide what areas of economics or other social sciences interest you when selecting a topic. For example, if you have taken a course in labor economics, you have probably seen theories that can be tested empirically or relationships that have some policy relevance. Labor economists are constantly coming up with new variables that can explain wage differentials. Examples include quality of high school [Card and Krueger (1992) and Betts (1995)], amount of math and science taken in high school [Levine and Zimmerman (1995)], and physical appearance [Hamermesh and Biddle (1994), Averett and Korenman (1996), Biddle and Hamermesh (1998), and Hamermesh and Parker (2005)]. Researchers in state and local public finance study how local economic activity depends on economic policy variables, such as property taxes, sales taxes, level and quality of services (such as schools, fire, and police), and so on. [See, for example, White (1986), Papke (1987), Bartik (1991), Netzer (1992), and Mark, McGuire, and Papke (2000).]

Economists that study education issues are interested in determining how spending affects performance [Hanushek (1986)], whether attending certain kinds of schools improves performance [for example, Evans and Schwab (1995)], and what factors affect where private schools choose to locate [Downes and Greenstein (1996)].

Macroeconomists are interested in relationships between various aggregate time series, such as the link between growth in gross domestic product and growth in fixed investment or machinery [see De Long and Summers (1991)] or the effect of taxes on interest rates [for example, Peek (1982)].

There are certainly reasons for estimating models that are mostly descriptive. For example, property tax assessors use models (called hedonic price models) to estimate housing values for homes that have not been sold recently. This involves a regression model relating the price of a house to its characteristics (size, number of bedrooms, number of bathrooms, and so on). As a topic for a term paper, this is not very exciting: we are unlikely to learn much that is surprising, and such an analysis has no obvious policy implications. Adding the crime rate in the neighborhood as an explanatory variable would allow us to determine how important a factor crime is on housing prices, something that would be useful in estimating the costs of crime.

Several relationships have been estimated using macroeconomic data that are mostly descriptive. For example, an aggregate saving function can be used to estimate the aggregate marginal propensity to save, as well as the response of saving to asset returns (such as interest rates).
Such an analysis could be made more interesting by using time series data on a country that has a history of political upheavals and determining whether savings rates decline during times of political uncertainty.

Once you decide on an area of research, there are a variety of ways to locate specific papers on the topic. The Journal of Economic Literature (JEL) has a detailed classification system in which each paper is given a set of identifying codes that places it within certain subfields of economics. The JEL also contains a list of articles published in a wide variety of journals, organized by topic, and it even contains short abstracts of some articles.

Especially convenient for finding published papers on various topics are Internet services such as EconLit, which many universities subscribe to. EconLit allows users to do a comprehensive search of almost all economics journals by author, subject, words in the title, and so on. The Social Sciences Citation Index is useful for finding papers on a broad range of topics in the social sciences, including popular papers that have been cited often in other published works. Google Scholar is an Internet search engine that can be very helpful for tracking down research on various topics or research by a particular author. This is especially true of work that has not been published in an academic journal or that has not yet been published.

In thinking about a topic, you should keep some things in mind. First, for a question to be interesting, it does not need to have broad-based policy implications; rather, it can be of local interest. For example, you might be interested in knowing whether living in a fraternity at your university causes students to have lower or higher grade point averages. This may or may not be of interest to people outside your university, but it is probably of concern to at least some people within the university. On the other hand, you might study a problem that starts by being of local interest but turns out to have widespread interest, such as determining which factors affect, and which university policies can stem, alcohol abuse on college campuses.

Second, it is very difficult, especially for a quarter or semester project, to do truly original research using the standard macroeconomic aggregates on the U.S. economy. For example, the question of whether money growth, government spending growth, and so on affect economic growth has been, and continues to be, studied by professional macroeconomists.
standard Phillips curve or an aggregate consumption function for the US economy or some other large economy are unlikely to yield additional insights although they can be instructive for the student Instead you might use data on a smaller country to estimate a static or dynamic Phillips curve or a Beveridge curve possibly allowing the slopes of the curves to depend on information known prior to the current time period or to test the efficient markets hypothesis and so on At the nonmacroeconomic level there are also plenty of questions that have been studied extensively For example labor economists have published many papers on estimating the return to education This question is still studied because it is very important and new data sets as well as new econometric approaches continue to be developed For example as we saw in Chapter 9 certain data sets have better proxy variables for unobserved ability than other data sets Compare WAGE1 and WAGE2 In other cases we can obtain panel data or data from a natural experimentsee Chapter 13that allow us to approach an old question from a different perspective As another example criminologists are interested in studying the effects of various laws on crime The question of whether capital punishment has a deterrent effect has long been debated Similarly economists have been interested in whether taxes on cigarettes and alcohol reduce consumption as always in a ceteris paribus sense As more years of data at the state level become available a richer panel data set can be created and this can help us better answer major policy questions Plus the effectiveness of fairly recent crimefighting innovationssuch as community policingcan be eval uated empirically While you are formulating your question it is helpful to discuss your ideas with your classmates instructor and friends You should be able to convince people that the answer to your question is of some interest Of course whether you can persuasively answer your question is another issue but you need to begin with an interesting question If someone asks you about your paper and you respond with Im doing my paper on crime or Im doing my paper on interest rates chances are you have only decided on a general area without formulating a true question You should be able to say something like Im studying the effects of community policing on city crime rates in the United States or Im looking at how inflation volatility affects shortterm interest rates in Brazil 192 Literature Review All papers even if they are relatively short should contain a review of relevant literature It is rare that one attempts an empirical project for which no published precedent exists If you search through journals or use online search services such as EconLit to come up with a topic you are already well on your way to a literature review If you select a topic on your ownsuch as studying the effects of drug usage on college performance at your universitythen you will probably have to work a little harder But online search services make that work a lot easier as you can search by keywords by words in the title by author and so on You can then read abstracts of papers to see how relevant they are to your own work When doing your literature search you should think of related topics that might not show up in a search using a handful of keywords For example if you are studying the effects of drug usage on wages or grade point average you should probably look at the literature on how alcohol usage affects such factors Knowing how to do a 
Researchers differ on how a literature review should be incorporated into a paper. Some like to have a separate section called "literature review," while others like to include the literature review as part of the introduction. This is largely a matter of taste, although an extensive literature review probably deserves its own section. If the term paper is the focus of the course, say, in a senior seminar or an advanced econometrics course, your literature review probably will be lengthy. Term papers at the end of a first course are typically shorter, and the literature reviews are briefer.

19.3 Data Collection

19.3a Deciding on the Appropriate Data Set

Collecting data for a term paper can be educational, exciting, and sometimes even frustrating. You must first decide on the kind of data needed to answer your posed question. As we discussed in the introduction and have covered throughout this text, data sets come in a variety of forms. The most common kinds are cross-sectional, time series, pooled cross sections, and panel data sets.

Many questions can be addressed using any of the data structures we have described. For example, to study whether more law enforcement lowers crime, we could use a cross section of cities, a time series for a given city, or a panel data set of cities, which consists of data on the same cities over two or more years.

Deciding on which kind of data to collect often depends on the nature of the analysis. To answer questions at the individual or family level, we often only have access to a single cross section; typically, these are obtained via surveys. Then, we must ask whether we can obtain a rich enough data set to do a convincing ceteris paribus analysis. For example, suppose we want to know whether families who save through individual retirement accounts (IRAs), which have certain tax advantages, have less non-IRA savings. In other words, does IRA saving simply crowd out other forms of saving? There are data sets, such as the Survey of Consumer Finances, that contain information on various kinds of saving for a different sample of families each year. Several issues arise in using such a data set. Perhaps the most important is whether there are enough controls, including income, demographics, and proxies for saving tastes, to do a reasonable ceteris paribus analysis. If these are the only kinds of data available, we must do what we can with them.

The same issues arise with cross-sectional data on firms, cities, states, and so on. In most cases, it is not obvious that we will be able to do a ceteris paribus analysis with a single cross section. For example, any study of the effects of law enforcement on crime must recognize the endogeneity of law enforcement expenditures. When using standard regression methods, it may be very hard to complete a convincing ceteris paribus analysis, no matter how many controls we have. (See Section 19.4 for more discussion.)

If you have read the advanced chapters on panel data methods, you know that having the same cross-sectional units at two or more different points in time can allow us to control for time-constant unobserved effects that would normally confound regression on a single cross section.
Panel data sets are relatively hard to obtain for individuals or families, although some important ones exist, such as the Panel Study of Income Dynamics, but they can be used in very convincing ways. Panel data sets on firms also exist. For example, Compustat and the Center for Research in Security Prices (CRSP) manage very large panel data sets of financial information on firms. Easier to obtain are panel data sets on larger units, such as schools, cities, counties, and states, as these tend not to disappear over time, and government agencies are responsible for collecting information on the same variables each year. For example, the Federal Bureau of Investigation collects and reports detailed information on crime rates at the city level. Sources of data are listed at the end of this chapter.

Data come in a variety of forms. Some data sets, especially historical ones, are available only in printed form. For small data sets, entering the data yourself from the printed source is manageable and convenient. Sometimes, articles are published with small data sets, especially time series applications. These can be used in an empirical study, perhaps by supplementing the data with more recent years.

Many data sets are available in electronic form. Various government agencies provide data on their websites. Private companies sometimes compile data sets to make them user friendly, and then they provide them for a fee. Authors of papers are often willing to provide their data sets in electronic form. More and more data sets are available on the Internet. The web is a vast resource of online databases. Numerous websites containing economic and related data sets have been created. Several other websites contain links to data sets that are of interest to economists; some of these are listed at the end of this chapter. Generally, searching the Internet for data sources is easy and will become even more convenient in the future.

19.3b Entering and Storing Your Data

Once you have decided on a data type and have located a data source, you must put the data into a usable format. If the data came in electronic form, they are already in some format, hopefully one in widespread use. The most flexible way to obtain data in electronic form is as a standard text (ASCII) file. All statistics and econometrics software packages allow raw data to be stored this way. Typically, it is straightforward to read a text file directly into an econometrics package, provided the file is properly structured. The data files we have used throughout the text provide several examples of how cross-sectional, time series, pooled cross sections, and panel data sets are usually stored. As a rule, the data should have a tabular form, with each observation representing a different row; the columns in the data set represent different variables.
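As an illustration of this tabular convention, here is a minimal sketch of reading a whitespace-separated raw text file in Python with pandas. The file name and variable names are hypothetical, and the `na_values` option handles the period-as-missing convention discussed below.

```python
import pandas as pd

# One row per observation, one column per variable; a period denotes
# a missing value. File and column names are hypothetical.
names = ["id", "wage", "educ", "exper"]
df = pd.read_csv("mydata.raw", sep=r"\s+", header=None,
                 names=names, na_values=["."])
print(df.head())
```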
Occasionally, you might encounter a data set stored with each column representing an observation and each row a different variable. This is not ideal, but most software packages allow data to be read in this form and then reshaped. Naturally, it is crucial to know how the data are organized before reading them into your econometrics package.

For time series data sets, there is only one sensible way to enter and store the data: namely, chronologically, with the earliest time period listed as the first observation and the most recent time period as the last observation. It is often useful to include variables indicating year and, if relevant, quarter or month. This facilitates estimation of a variety of models later on, including allowing for seasonality and breaks at different time periods. For cross sections pooled over time, it is usually best to have the cross section for the earliest year fill the first block of observations, followed by the cross section for the second year, and so on. (See FERTIL1 as an example.) This arrangement is not crucial, but it is very important to have a variable stating the year attached to each observation.

For panel data, as we discussed in Section 13.5, it is best if all the years for each cross-sectional observation are adjacent and in chronological order. With this ordering, we can use all of the panel data methods from Chapters 13 and 14. With panel data, it is important to include a unique identifier for each cross-sectional unit, along with a year variable.

If you obtain your data in printed form, you have several options for entering them into a computer. First, you can create a text file using a standard text editor. (This is how several of the raw data sets included with the text were initially created.) Typically, it is required that each row starts a new observation, that each row contains the same ordering of the variables (in particular, each row should have the same number of entries), and that the values are separated by at least one space. Sometimes, a different separator, such as a comma, is better, but this depends on the software you are using. If you have missing observations on some variables, you must decide how to denote that; simply leaving a blank does not generally work. Many regression packages accept a period as the missing value symbol. Some people prefer to use a number, presumably an impossible value for the variable of interest, to denote missing values. If you are not careful, this can be dangerous; we discuss this further later.

If you have nonnumerical data, for example, you want to include the names in a sample of colleges or the names of cities, then you should check the econometrics package you will use to see the best way to enter such variables (often called strings).
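The chronological and panel orderings recommended above are easy to enforce programmatically. A sketch in pandas follows; the identifier and year variable names are hypothetical.

```python
import pandas as pd

# Put a panel into the recommended order: all years for each
# cross-sectional unit adjacent and chronological.
df = df.sort_values(["city_id", "year"]).reset_index(drop=True)

# If a file arrived transposed (variables in rows, observations in
# columns), it can be reshaped before use:
# df = pd.read_csv("transposed.raw", header=None).T
```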
Typically, strings are put between double or single quotation marks. Or, the text file can follow a rigid formatting, which usually requires a small program to read in the text file. But you need to check your econometrics package for details.

Another generally available option is to use a spreadsheet to enter your data, such as Excel. This has a couple of advantages over a text file. First, because each observation on each variable is a cell, it is less likely that numbers will be run together (as would happen if you forget to enter a space in a text file). Second, spreadsheets allow manipulation of data, such as sorting or computing averages. This benefit is less important if you use a software package that allows sophisticated data management; many software packages, including EViews and Stata, fall into this category. If you use a spreadsheet for initial data entry, then you must often export the data in a form that can be read by your econometrics package. This is usually straightforward, as spreadsheets export to text files using a variety of formats.

A third alternative is to enter the data directly into your econometrics package. Although this obviates the need for a text editor or a spreadsheet, it can be more awkward if you cannot freely move across different observations to make corrections or additions.

Data downloaded from the Internet may come in a variety of forms. Often data come as text files, but different conventions are used for separating variables; for panel data sets, the conventions on how to order the data may differ. Some Internet data sets come as spreadsheet files, in which case you must use an appropriate spreadsheet to read them.

19.3c Inspecting, Cleaning, and Summarizing Your Data

It is extremely important to become familiar with any data set you will use in an empirical analysis. If you enter the data yourself, you will be forced to know everything about it. But if you obtain data from an outside source, you should still spend some time understanding its structure and conventions. Even data sets that are widely used and heavily documented can contain glitches. If you are using a data set obtained from the author of a paper, you must be aware that rules used for data set construction can be forgotten.

Earlier, we reviewed the standard ways that various data sets are stored. You also need to know how missing values are coded. Preferably, missing values are indicated with a nonnumeric character, such as a period. If a number is used as a missing value code (such as 999 or −1), you must be very careful when using these observations in computing any statistics. Your econometrics package will probably not know that a certain number really represents a missing value: it is likely that such observations will be used as if they are valid, and this can produce rather misleading results. The best approach is to set any numerical codes for missing values to some other character (such as a period) that cannot be mistaken for real data.

You must also know the nature of the variables in the data set. Which are binary variables? Which are ordinal variables (such as a credit rating)? What are the units of measurement of the variables? For example, are monetary values expressed in dollars, thousands of dollars, millions of dollars, or some other units? Are variables representing a rate, such as school dropout rates, inflation rates, unionization rates, or interest rates, measured as a percentage or a proportion?

Especially for time series data, it is crucial to know if monetary values are in nominal (current) or real (constant) dollars. If the values are in real terms, what is the base year or period? If you receive a data set from an author, some variables may already be transformed in certain ways. For example, sometimes only the log of a variable (such as wage or salary) is reported in the data set.

Detecting mistakes in a data set is necessary for preserving the integrity of any data analysis. It is always useful to find minimums, maximums, means, and standard deviations of all, or at least the most important, variables in the analysis.
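In Python, the checks just described take only a few lines. The variable names and the −99 missing-value code below are hypothetical; use whatever the data documentation specifies.

```python
import numpy as np

# Recode a numeric missing-value code to NaN before computing any
# statistics, then inspect mins, maxes, means, and std. deviations.
df = df.replace(-99, np.nan)
print(df[["educ", "wage", "convrte"]].describe())

# A rate stored as a proportion should lie in [0, 1]; values above
# one suggest entries that were recorded as percentages instead.
print((df["convrte"] > 1).sum(), "suspicious entries")
```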
analysis For example if you find that the minimum value of education in your sample is 99 you know that at least one entry on education needs to be set to a missing value If upon further inspection you find that several observations have 299 as the level of Copyright 2016 Cengage Learning All Rights Reserved May not be copied scanned or duplicated in whole or in part Due to electronic rights some third party content may be suppressed from the eBook andor eChapters Editorial review has deemed that any suppressed content does not materially affect the overall learning experience Cengage Learning reserves the right to remove additional content at any time if subsequent rights restrictions require it CHAPTER 19 Carrying Out an Empirical Project 611 education you can be confident that you have stumbled onto the missing value code for education As another example if you find that an average murder conviction rate across a sample of cities is 632 you know that conviction rate is measured as a proportion not a percentage Then if the maximum value is above one this is likely a typographical error It is not uncommon to find data sets where most of the entries on a rate variable were entered as a percentage but where some were entered as a proportion and vice versa Such data coding errors can be difficult to detect but it is important to try We must also be careful in using time series data If we are using monthly or quarterly data we must know which variables if any have been seasonally adjusted Transforming data also requires great care Suppose we have a monthly data set and we want to create the change in a variable from one month to the next To do this we must be sure that the data are ordered chronologically from earliest period to latest If for some reason this is not the case the differencing will result in garbage To be sure the data are properly ordered it is useful to have a time period indicator With annual data it is sufficient to know the year but then we should know whether the year is entered as four digits or two digits for example 1998 versus 98 With monthly or quarterly data it is also useful to have a variable or variables indicating month or quarter With monthly data we may have a set of dummy variables 11 or 12 or one variable indicating the month 1 through 12 or a string variable such as jan feb and so on With or without yearly monthly or quarterly indicators we can easily construct time trends in all econometrics software packages Creating seasonal dummy variables is easy if the month or quarter is indicated at a minimum we need to know the month or quarter of the first observation Manipulating panel data can be even more challenging In Chapter 13 we discussed pooled OLS on the differenced data as one general approach to controlling for unobserved effects In construct ing the differenced data we must be careful not to create phantom observations Suppose we have a balanced panel on cities from 1992 through 1997 Even if the data are ordered chronologically within each crosssectional unitsomething that should be done before proceedinga mindless differenc ing will create an observation for 1992 for all cities except the first in the sample This observation will be the 1992 value for city i minus the 1997 value for city i 2 1 this is clearly nonsense Thus we must make sure that 1992 is missing for all differenced variables 194 Econometric Analysis This text has focused on econometric analysis and we are not about to provide a review of econo metric methods in this section 
19.4 Econometric Analysis

This text has focused on econometric analysis, and we are not about to provide a review of econometric methods in this section. Nevertheless, we can give some general guidelines about the sorts of issues that need to be considered in an empirical analysis.

As we discussed earlier, after deciding on a topic, we must collect an appropriate data set. Assuming that this has also been done, we must next decide on the appropriate econometric methods.

If your course has focused on ordinary least squares estimation of a multiple linear regression model, using either cross-sectional or time series data, the econometric approach has pretty much been decided for you. This is not necessarily a weakness, as OLS is still the most widely used econometric method. Of course, you still have to decide whether any of the variants of OLS, such as weighted least squares or correcting for serial correlation in a time series regression, are warranted.

In order to justify OLS, you must also make a convincing case that the key OLS assumptions are satisfied for your model. As we have discussed at some length, the first issue is whether the error term is uncorrelated with the explanatory variables. Ideally, you have been able to control for enough other factors to assume that those that are left in the error are unrelated to the regressors. Especially when dealing with individual-, family-, or firm-level cross-sectional data, the self-selection problem, which we discussed in Chapters 7 and 15, is often relevant. For instance, in the IRA example from Section 19.3, it may be that families with an unobserved taste for saving are also the ones that open IRAs. You should also be able to argue that the other potential sources of endogeneity, namely, measurement error and simultaneity, are not a serious problem.

When specifying your model, you must also make functional form decisions. Should some variables appear in logarithmic form? (In econometric applications, the answer is often yes.) Should some variables be included in levels and squares, to possibly capture a diminishing effect? How should qualitative factors appear? Is it enough to just include binary variables for different attributes or groups? Or, do these need to be interacted with quantitative variables? (See Chapter 7 for details.)

A common mistake, especially among beginners, is to incorrectly include explanatory variables in a regression model that are listed as numerical values but have no quantitative meaning. For example, in an individual-level data set that contains information on wages, education, experience, and other variables, an "occupation" variable might be included. Typically, these are just arbitrary codes that have been assigned to different occupations; the fact that an elementary school teacher is given, say, the value 453 while a computer technician is, say, 751 is relevant only in that it allows us to distinguish between the two occupations. It makes no sense to include the raw occupational variable in a regression model. (What sense would it make to measure the effect of increasing occupation by one unit when the one-unit increase has no quantitative meaning?) Instead, different dummy variables should be defined for different occupations (or groups of occupations, if there are many occupations). Then, the dummy variables can be included in the regression model.
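A sketch of the recommended fix in pandas follows, with a hypothetical `occup` code variable.

```python
import pandas as pd

# Turn an arbitrary numeric occupation code into dummy variables
# instead of entering the raw code in a regression. drop_first=True
# omits one category to serve as the base group.
occ_dummies = pd.get_dummies(df["occup"], prefix="occ", drop_first=True)
df = pd.concat([df, occ_dummies], axis=1)
# The occ_* dummies can now be included as regressors in place of
# the raw occup variable.
```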
A less egregious problem occurs when an ordered qualitative variable is included as an explanatory variable. Suppose that in a wage data set a variable is included measuring "job satisfaction," defined on a scale from 1 to 7, with 7 being the most satisfied. Provided we have enough data, we would want to define a set of six dummy variables for, say, job satisfaction levels of 2 through 7, leaving job satisfaction level 1 as the base group. By including the six job satisfaction dummies in the regression, we allow a completely flexible relationship between the response variable and job satisfaction. Putting in the job satisfaction variable in raw form implicitly assumes that a one-unit increase in the ordinal variable has quantitative meaning. While the direction of the effect will often be estimated appropriately, interpreting the coefficient on an ordinal variable is difficult. If an ordinal variable takes on many values, then we can define a set of dummy variables for ranges of values. (See Section 17.3 for an example.)

Sometimes, we want to explain a variable that is an ordinal response. For example, one could think of using a job satisfaction variable of the type described above as the dependent variable in a regression model, with both worker and employer characteristics among the independent variables. Unfortunately, with the job satisfaction variable in its original form, the coefficients in the model are hard to interpret: each measures the change in job satisfaction given a unit increase in the independent variable. Certain models (ordered probit and ordered logit are the most common) are well suited for ordered responses. These models essentially extend the binary probit and logit models we discussed in Chapter 17. [See Wooldridge (2010, Chapter 16) for a treatment of ordered response models.] A simple solution is to turn any ordered response into a binary response. For example, we could define a variable equal to one if job satisfaction is at least four, and zero otherwise. Unfortunately, creating a binary variable throws away information and requires us to use a somewhat arbitrary cutoff.

For cross-sectional analysis, a secondary, but nevertheless important, issue is whether there is heteroskedasticity. In Chapter 8, we explained how this can be dealt with. The simplest way is to compute heteroskedasticity-robust statistics.

As we emphasized in Chapters 10, 11, and 12, time series applications require additional care. Should the equation be estimated in levels? If levels are used, are time trends needed? Is differencing the data more appropriate? If the data are monthly or quarterly, does seasonality have to be accounted for? If you are allowing for dynamics, for example, distributed lag dynamics, how many lags should be included? You must start with some lags based on intuition or common sense, but eventually it is an empirical matter.
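The heteroskedasticity-robust statistics mentioned above are usually a one-option change. For example, in statsmodels (the wage-equation variable names are hypothetical):

```python
import statsmodels.formula.api as smf

# Same OLS fit twice: once with the usual standard errors, once with
# heteroskedasticity-robust (HC1) standard errors for comparison.
usual = smf.ols("lwage ~ educ + exper + female", data=df).fit()
robust = smf.ols("lwage ~ educ + exper + female",
                 data=df).fit(cov_type="HC1")
print(usual.bse)   # usual standard errors
print(robust.bse)  # heteroskedasticity-robust standard errors
```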
If your model has some potential misspecification, such as omitted variables, and you use OLS, you should attempt some sort of misspecification analysis of the kinds we discussed in Chapters 3 and 5. Can you determine, based on reasonable assumptions, the direction of any bias in the estimators? If you have studied the method of instrumental variables, you know that it can be used to solve various forms of endogeneity, including omitted variables (Chapter 15), errors-in-variables (Chapter 15), and simultaneity (Chapter 16). Naturally, you need to think hard about whether the instrumental variables you are considering are likely to be valid.

Good papers in the empirical social sciences contain sensitivity analysis. Broadly, this means you estimate your original model and modify it in ways that seem reasonable. Hopefully, the important conclusions do not change. For example, if you use as an explanatory variable a measure of alcohol consumption (say, in a grade point average equation), do you get qualitatively similar results if you replace the quantitative measure with a dummy variable indicating alcohol usage? If the binary usage variable is significant but the alcohol quantity variable is not, it could be that usage reflects some unobserved attribute that affects GPA and is also correlated with alcohol usage. But this needs to be considered on a case-by-case basis.

If some observations are much different from the bulk of the sample, say, you have a few firms in a sample that are much larger than the other firms, do your results change much when those observations are excluded from the estimation? If so, you may have to alter functional forms to allow for these observations or argue that they follow a completely different model. (The issue of outliers was discussed in Chapter 9.)

Using panel data raises some additional econometric issues. Suppose you have collected two periods. There are at least four ways to use two periods of panel data without resorting to instrumental variables. You can pool the two years in a standard OLS analysis, as discussed in Chapter 13. Although this might increase the sample size relative to a single cross section, it does not control for time-constant unobservables. In addition, the errors in such an equation are almost always serially correlated because of an unobserved effect. Random effects estimation corrects the serial correlation problem and produces asymptotically efficient estimators, provided the unobserved effect has zero mean given values of the explanatory variables in all time periods.

Another possibility is to include a lagged dependent variable in the equation for the second year. In Chapter 9, we presented this as a way to at least mitigate the omitted variables problem, as we are in any event holding fixed the initial outcome of the dependent variable. This often leads to similar results as differencing the data, as we covered in Chapter 13.

With more years of panel data, we have the same options, plus an additional choice. We can use the fixed effects transformation to eliminate the unobserved effect. (With two years of data, this is the same as differencing.) In Chapter 15, we showed how instrumental variables techniques can be combined with panel data transformations to relax exogeneity assumptions even more. As a rule, it is a good idea to apply several reasonable econometric methods and compare the results. This often allows us to determine which of our assumptions are likely to be false.
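As a sketch of one of the panel options above, the fixed effects (within) transformation can be done by demeaning within units before OLS. The column names are hypothetical, and the simple standard errors below ignore the degrees-of-freedom correction that dedicated panel routines make.

```python
import statsmodels.api as sm

# Fixed effects "by hand": demean y and x within each unit, then run
# OLS on the demeaned data. With two periods this is equivalent to
# first differencing. df has columns unit, year, y, x (hypothetical).
demeaned = df[["y", "x"]] - df.groupby("unit")[["y", "x"]].transform("mean")
fe = sm.OLS(demeaned["y"], demeaned[["x"]]).fit()
print(fe.params)  # within estimate of the coefficient on x
```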
Even if you are very careful in devising your topic, postulating your model, collecting your data, and carrying out the econometrics, it is quite possible that you will obtain puzzling results, at least some of the time. When that happens, the natural inclination is to try different models, different estimation techniques, or perhaps different subsets of data until the results correspond more closely to what was expected. Virtually all applied researchers search over various models before finding the "best" model. Unfortunately, this practice of data mining violates the assumptions we have made in our econometric analysis. The results on unbiasedness of OLS and other estimators, as well as the t and F distributions we derived for hypothesis testing, assume that we observe a sample following the population model and we estimate that model once. Estimating models that are variants of our original model violates that assumption because we are using the same set of data in a specification search. In effect, we use the outcome of tests by using the data to respecify our model. The estimates and tests from different model specifications are not independent of one another.

Some specification searches have been programmed into standard software packages. A popular one is known as stepwise regression, where different combinations of explanatory variables are used in multiple regression analysis in an attempt to come up with the best model. There are various ways that stepwise regression can be used, and we have no intention of reviewing them here. The general idea is either to start with a large model and keep variables whose p-values are below a certain significance level or to start with a simple model and add variables that have significant p-values. Sometimes, groups of variables are tested with an F test. Unfortunately, the final model often depends on the order in which variables were dropped or added. [For more on stepwise regression, see Draper and Smith (1981).] In addition, this is a severe form of data mining, and it is difficult to interpret t and F statistics in the final model. One might argue that stepwise regression simply automates what researchers do anyway in searching over various models. However, in most applications, one or two explanatory variables are of primary interest, and then the goal is to see how robust the coefficients on those variables are to either adding or dropping other variables, or to changing functional form.

In principle, it is possible to incorporate the effects of data mining into our statistical inference; in practice, this is very difficult and is rarely done, especially in sophisticated empirical work. [See Leamer (1983) for an engaging discussion of this problem.] But we can try to minimize data mining by not searching over numerous models or estimation methods until a significant result is found and then reporting only that result. If a variable is statistically significant in only a small fraction of the models estimated, it is quite likely that the variable has no effect in the population.
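A small simulation makes the data-mining point concrete. The sketch below (hypothetical data; numpy and statsmodels assumed) regresses an outcome on 50 candidate variables, one at a time, when none of them truly matters; at the 5% level, roughly 5% of them will nevertheless appear "significant" by chance alone:

```python
import numpy as np
import statsmodels.api as sm

# None of the 50 candidate regressors is related to y by construction.
rng = np.random.default_rng(2)
n, k = 200, 50
X = rng.normal(size=(n, k))
y = rng.normal(size=n)

significant = []
for j in range(k):
    res = sm.OLS(y, sm.add_constant(X[:, j])).fit()
    if res.pvalues[1] < 0.05:        # p-value on the slope
        significant.append(j)

print(f"{len(significant)} of {k} irrelevant regressors are 'significant' at 5%")
```

Reporting only the handful of "significant" specifications from such a search, and discarding the rest, is precisely the practice the text warns against.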
19.5 Writing an Empirical Paper

Writing a paper that uses econometric analysis is very challenging, but it can also be rewarding. A successful paper combines a careful, convincing data analysis with good explanations and exposition. Therefore, you must have a good grasp of your topic, a good understanding of econometric methods, and solid writing skills. Do not be discouraged if you find writing an empirical paper difficult; most professional researchers have spent many years learning how to craft an empirical analysis and to write the results in a convincing form.

While writing styles vary, many papers follow the same general outline. The following paragraphs include ideas for section headings and explanations about what each section should contain. These are only suggestions and hardly need to be strictly followed. In the final paper, each section would be given a number, usually starting with one for the introduction.

19.5a Introduction

The introduction states the basic objectives of the study and explains why it is important. It generally entails a review of the literature, indicating what has been done and how previous work can be improved upon. (As discussed in Section 19.2, an extensive literature review can be put in a separate section.) Presenting simple statistics or graphs that reveal a seemingly paradoxical relationship is a useful way to introduce the paper's topic. For example, suppose that you are writing a paper about factors affecting fertility in a developing country, with the focus on education levels of women. An appealing way to introduce the topic would be to produce a table or a graph showing that fertility has been falling, say, over time, along with a brief explanation of how you hope to examine the factors contributing to the decline. At this point, you may already know that, ceteris paribus, more highly educated women have fewer children and that average education levels have risen over time.

Most researchers like to summarize the findings of their paper in the introduction. This can be a useful device for grabbing the reader's attention. For example, you might state that your best estimate of the effect of missing 10 hours of lecture during a 30-hour term is about one-half a grade point. But the summary should not be too involved because neither the methods nor the data used to obtain the estimates have yet been introduced.

19.5b Conceptual (or Theoretical) Framework

In this section, you describe the general approach to answering the question you have posed. It can be formal economic theory, but in many cases, it is an intuitive discussion about what conceptual problems arise in answering your question.

Suppose you are studying the effects of economic opportunities and severity of punishment on criminal behavior. One approach to explaining participation in crime is to specify a utility maximization problem where the individual chooses the amount of time spent in legal and illegal activities, given wage rates in both kinds of activities, as well as variables measuring probability and severity of punishment for criminal activity. The usefulness of such an exercise is that it suggests which variables should be included in the empirical analysis; it gives guidance (but rarely specifics) as to how the variables should appear in the econometric model.

Often, there is no need to write down an economic theory. For econometric policy analysis, common sense usually suffices for specifying a model.
For example, suppose you are interested in estimating the effects of participation in Aid to Families with Dependent Children (AFDC) on child performance in school. AFDC provides supplemental income, but participation also makes it easier to receive Medicaid and other benefits. The hard part of such an analysis is deciding on the set of variables that should be controlled for. In this example, we could control for family income (including AFDC and any other welfare income), mother's education, whether the family lives in an urban area, and other variables. Then, the inclusion of an AFDC participation indicator (hopefully) measures the nonincome benefits of AFDC participation. A discussion of which factors should be controlled for, and the mechanisms through which AFDC participation might improve school performance, substitutes for formal economic theory.

19.5c Econometric Models and Estimation Methods

It is very useful to have a section that contains a few equations of the sort you estimate and present in the results section of the paper. This allows you to fix ideas about what the key explanatory variable is and what other factors you will control for. Writing equations containing error terms allows you to discuss whether OLS is a suitable estimation method.

The distinction between a model and an estimation method should be made in this section. A model represents a population relationship (broadly defined to allow for time series equations). For example, we should write

    colGPA = β0 + β1·alcohol + β2·hsGPA + β3·SAT + β4·female + u   (19.1)

to describe the relationship between college GPA and alcohol consumption, with some other controls in the equation. Presumably, this equation represents a population, such as all undergraduates at a particular university. There are no "hats" (^) on the βj or on colGPA because this is a model, not an estimated equation. We do not put in numbers for the βj because we do not know (and never will know) these numbers. Later, we will estimate them. In this section, do not anticipate the presentation of your empirical results. In other words, do not start with a general model and then say that you omitted certain variables because they turned out to be insignificant. Such discussions should be left for the results section.

A time series model to relate city-level car thefts to the unemployment rate and conviction rates could look like

    thefts_t = β0 + β1·unem_t + β2·unem_{t-1} + β3·cars_t + β4·convrate_t + β5·convrate_{t-1} + u_t,   (19.2)

where the t subscript is useful for emphasizing any dynamics in the equation (in this case, allowing for unemployment and the automobile theft conviction rate to have lagged effects).
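Returning to (19.1), the model-versus-estimate distinction can be made concrete in a short sketch. Python with pandas and statsmodels is assumed, and the data are simulated stand-ins for an actual student survey, so all coefficient values in the data-generating step are invented for illustration:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# The population model (19.1) has unknown parameters beta_j; OLS applied to a
# sample produces the estimates beta_j-hat, which is a different object.
rng = np.random.default_rng(3)
n = 400
df = pd.DataFrame({
    "alcohol": rng.uniform(0, 10, n),        # drinks per week (hypothetical)
    "hsGPA": rng.uniform(2.0, 4.0, n),
    "SAT": rng.normal(1100, 150, n),
    "female": rng.integers(0, 2, n),
})
u = rng.normal(0, 0.3, n)                    # the unobserved error term
df["colGPA"] = (1.0 - 0.02 * df["alcohol"] + 0.45 * df["hsGPA"]
                + 0.0005 * df["SAT"] + 0.05 * df["female"] + u)

ols = smf.ols("colGPA ~ alcohol + hsGPA + SAT + female", data=df).fit()
print(ols.summary())   # the estimated equation, with hats on everything
```

The formula string corresponds to the population model; the fitted object holds the sample estimates, standard errors, and R-squared that would later be reported in the results section.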
After specifying a model or models, it is appropriate to discuss estimation methods. In most cases, this will be OLS, but, for example, in a time series equation you might use feasible GLS to do a serial correlation correction, as in Chapter 12. However, the method for estimating a model is quite distinct from the model itself. It is not meaningful, for instance, to talk about an "OLS model." Ordinary least squares is a method of estimation, and so are weighted least squares, Cochrane-Orcutt, and so on. There are usually several ways to estimate any model. You should explain why the method you are choosing is warranted.

Any assumptions that are used in obtaining an estimable econometric model from an underlying economic model should be clearly discussed. For example, in the quality of high school example mentioned in Section 19.1, the issue of how to measure school quality is central to the analysis. Should it be based on average SAT scores, percentage of graduates attending college, student-teacher ratios, average education level of teachers, some combination of these, or possibly other measures?

We always have to make assumptions about functional form, whether or not a theoretical model has been presented. As you know, constant elasticity and constant semi-elasticity models are attractive because the coefficients are easy to interpret (as percentage effects). There are no hard rules on how to choose functional form, but the guidelines discussed in Section 6.2 seem to work well in practice. You do not need an extensive discussion of functional form, but it is useful to mention whether you will be estimating elasticities or a semi-elasticity. For example, if you are estimating the effect of some variable on wage or salary, the dependent variable will almost surely be in logarithmic form, and you might as well include this in any equations from the beginning. You do not have to present every one, or even most, of the functional form variations that you will report later in the results section.

Often, the data used in empirical economics are at the city or county level. For example, suppose that for the population of small to midsize cities, you wish to test the hypothesis that having a minor league baseball team causes a city to have a lower divorce rate. In this case, you must account for the fact that larger cities will have more divorces. One way to account for the size of the city is to scale divorces by the city or adult population. Thus, a reasonable model is

    log(div/pop) = β0 + β1·mlb + β2·perCath + β3·log(inc/pop) + other factors,   (19.3)

where mlb is a dummy variable equal to one if the city has a minor league baseball team, and perCath is the percentage of the population that is Catholic (so a number such as 34.6 means 34.6%). Note that div/pop is a divorce rate, which is generally easier to interpret than the absolute number of divorces.

Another way to control for population is to estimate the model

    log(div) = γ0 + γ1·mlb + γ2·perCath + γ3·log(inc) + γ4·log(pop) + other factors.   (19.4)

The parameter of interest, γ1, when multiplied by 100, gives the percentage difference between divorce rates, holding population, percent Catholic, income, and whatever else is in "other factors" constant. In equation (19.3), β1 measures the percentage effect of minor league baseball on div/pop, which can change either because the number of divorces or the population changes. Using the fact that log(div/pop) = log(div) − log(pop) and log(inc/pop) = log(inc) − log(pop), we can rewrite (19.3) as

    log(div) = β0 + β1·mlb + β2·perCath + β3·log(inc) + (1 − β3)·log(pop) + other factors,

which shows that (19.3) is a special case of (19.4), with γ4 = (1 − β3) and γj = βj, j = 0, 1, 2, 3. Alternatively, (19.4) is equivalent to adding log(pop) as an additional explanatory variable to (19.3). This makes it easy to test for a separate population effect on the divorce rate.
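One way to carry out that test is sketched below, again with hypothetical city data and the statsmodels package assumed. Since (19.3) imposes γ3 + γ4 = 1 on (19.4), estimating (19.4) and testing that restriction asks whether the rate specification is adequate:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated city-level data; in the data-generating process the coefficients
# on log(inc) and log(pop) sum to 1, so the restriction from (19.3) holds.
rng = np.random.default_rng(4)
n = 250
df = pd.DataFrame({
    "mlb": rng.integers(0, 2, n),
    "perCath": rng.uniform(5, 60, n),
    "linc": rng.normal(10, 0.5, n),   # log(inc)
    "lpop": rng.normal(11, 0.8, n),   # log(pop)
})
df["ldiv"] = (0.5 - 0.05 * df["mlb"] - 0.002 * df["perCath"]
              + 0.3 * df["linc"] + 0.7 * df["lpop"] + rng.normal(0, 0.1, n))

# Estimate the unrestricted model (19.4).
eq_4 = smf.ols("ldiv ~ mlb + perCath + linc + lpop", data=df).fit()

# Test H0: coef(linc) + coef(lpop) = 1, the restriction implied by (19.3).
print(eq_4.t_test("linc + lpop = 1"))
```

Failing to reject the restriction supports scaling by population as in (19.3); rejecting it argues for keeping log(pop) as a separate regressor, as in (19.4).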
If you are using a more advanced estimation method, such as two stage least squares (2SLS), you need to provide some reasons for doing so. If you use 2SLS, you must provide a careful discussion on why your IV choices for the endogenous explanatory variable (or variables) are valid. As we mentioned in Chapter 15, there are two requirements for a variable to be considered a good IV. First, it must be omitted from, and exogenous to, the equation of interest (the structural equation); this is something we must assume. Second, it must have some partial correlation with the endogenous explanatory variable; this we can test. For example, in equation (19.1), you might use a binary variable for whether a student lives in a dormitory (dorm) as an IV for alcohol consumption. This requires that living situation has no direct impact on colGPA (so that it is omitted from (19.1)) and that it is uncorrelated with unobserved factors in u that have an effect on colGPA. We would also have to verify that dorm is partially correlated with alcohol by regressing alcohol on dorm, hsGPA, SAT, and female. (See Chapter 15 for details.)
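The first-stage check and the 2SLS idea can be sketched as follows, on simulated data with pandas and statsmodels assumed. Doing the two stages "by hand" reproduces the 2SLS point estimate but not the correct standard errors, so in practice a dedicated routine (for example, IV2SLS in the linearmodels package) should be used; the manual version is shown here only for transparency:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated data: unobserved "ability" sits in u and is correlated with
# alcohol, so OLS on (19.1) would be biased; dorm shifts alcohol but is
# unrelated to ability by construction.
rng = np.random.default_rng(5)
n = 600
df = pd.DataFrame({
    "dorm": rng.integers(0, 2, n),
    "hsGPA": rng.uniform(2.0, 4.0, n),
    "SAT": rng.normal(1100, 150, n),
    "female": rng.integers(0, 2, n),
})
ability = rng.normal(0, 1, n)
df["alcohol"] = 4 + 2 * df["dorm"] - 0.5 * ability + rng.normal(0, 1, n)
df["colGPA"] = (1.0 - 0.02 * df["alcohol"] + 0.45 * df["hsGPA"]
                + 0.0005 * df["SAT"] + 0.3 * ability + rng.normal(0, 0.3, n))

# First stage: is dorm partially correlated with alcohol?
first = smf.ols("alcohol ~ dorm + hsGPA + SAT + female", data=df).fit()
print(first.tvalues["dorm"])            # should be comfortably large

# Second stage: replace alcohol with its first-stage fitted values.
df["alcohol_hat"] = first.fittedvalues
second = smf.ols("colGPA ~ alcohol_hat + hsGPA + SAT + female", data=df).fit()
print(second.params["alcohol_hat"])     # 2SLS point estimate of beta_1
```

A first-stage t statistic on the instrument that is small would signal a weak instrument, in which case the 2SLS estimates should not be trusted.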
You might account for the omitted variable problem (or omitted heterogeneity) by using panel data. Again, this is easily described by writing an equation or two. In fact, it is useful to show how to difference the equations over time to remove time-constant unobservables; this gives an equation that can be estimated by OLS. Or, if you are using fixed effects estimation instead, you simply state so. As a simple example, suppose you are testing whether higher county tax rates reduce economic activity, as measured by per capita manufacturing output. Suppose that, for the years 1982, 1987, and 1992, the model is

    log(manuf_it) = β0 + δ1·d87_t + δ2·d92_t + β1·tax_it + … + a_i + u_it,

where d87_t and d92_t are year dummy variables and tax_it is the tax rate for county i at time t (in percent form). We would have other variables that change over time in the equation, including measures for costs of doing business (such as average wages), measures of worker productivity (as measured by average education), and so on. The term a_i is the fixed effect, containing all factors that do not vary over time, and u_it is the idiosyncratic error term. To remove a_i, we can either difference across the years or use time-demeaning (the fixed effects transformation).

19.5d The Data

You should always have a section that carefully describes the data used in the empirical analysis. This is particularly important if your data are nonstandard or have not been widely used by other researchers. Enough information should be presented so that a reader could, in principle, obtain the data and redo your analysis. In particular, all applicable public data sources should be included in the references, and short data sets can be listed in an appendix. If you used your own survey to collect the data, a copy of the questionnaire should be presented in an appendix. Along with a discussion of the data sources, be sure to discuss the units of each of the variables (for example, is income measured in hundreds or thousands of dollars?). Including a table of variable definitions is very useful to the reader. The names in the table should correspond to the names used in describing the econometric results in the following section.

It is also very informative to present a table of summary statistics, such as minimum and maximum values, means, and standard deviations, for each variable. Having such a table makes it easier to interpret the coefficient estimates in the next section, and it emphasizes the units of measurement of the variables. For binary variables, the only necessary summary statistic is the fraction of ones in the sample (which is the same as the sample mean). For trending variables, things like means are less interesting. It is often useful to compute the average growth rate in a variable over the years in your sample.
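A summary-statistics table in the style of Table 19.2 (shown later) can be produced in a few lines; the sketch below assumes pandas, with made-up data standing in for the analysis sample:

```python
import numpy as np
import pandas as pd

# Hypothetical analysis sample; in practice df would be read from a file.
rng = np.random.default_rng(6)
df = pd.DataFrame({
    "prate": rng.uniform(0.2, 1.0, 100),   # participation rate
    "mrate": rng.uniform(0.0, 3.0, 100),   # match rate
    "sole": rng.integers(0, 2, 100),       # binary indicator
})

# One row per variable: mean, standard deviation, minimum, maximum.
stats = df.agg(["mean", "std", "min", "max"]).T
stats.columns = ["Mean", "Std. Dev.", "Minimum", "Maximum"]
print(stats.round(3))
# For a binary variable such as sole, the mean is the fraction of ones.
```

Rounding to a small number of digits here, rather than in the word processor, also helps avoid a false sense of precision in the final table.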
You should always clearly state how many observations you have. For time series data sets, identify the years that you are using in the analysis, including a description of any special periods in history (such as World War II). If you use a pooled cross section or a panel data set, be sure to report how many cross-sectional units (people, cities, and so on) you have for each year.

19.5e Results

The results section should include your estimates of any models formulated in the models section. You might start with a very simple analysis. For example, suppose that percentage of students attending college from the graduating class (percoll) is used as a measure of the quality of the high school a person attended. Then, an equation to estimate is

    log(wage) = β0 + β1·percoll + u.

Of course, this does not control for several other factors that may determine wages and that may be correlated with percoll. But a simple analysis can draw the reader into the more sophisticated analysis and reveal the importance of controlling for other factors.

If only a few equations are estimated, you can present the results in equation form with standard errors in parentheses below estimated coefficients. If your model has several explanatory variables and you are presenting several variations on the general model, it is better to report the results in tabular rather than equation form. Most of your papers should have at least one table, which should always include at least the R-squared and the number of observations for each equation. Other statistics, such as the adjusted R-squared, can also be listed.

The most important thing is to discuss the interpretation and strength of your empirical results. Do the coefficients have the expected signs? Are they statistically significant? If a coefficient is statistically significant but has a counterintuitive sign, why might this be true? It might be revealing a problem with the data or the econometric method (for example, OLS may be inappropriate due to omitted variables problems). Be sure to describe the magnitudes of the coefficients on the major explanatory variables. Often, one or two policy variables are central to the study. Their signs, magnitudes, and statistical significance should be treated in detail. Remember to distinguish between economic and statistical significance. If a t statistic is small, is it because the coefficient is practically small or because its standard error is large?

In addition to discussing estimates from the most general model, you can provide interesting special cases, especially those needed to test certain multiple hypotheses. For example, in a study to determine wage differentials across industries, you might present the equation without the industry dummies; this allows the reader to easily test whether the industry differentials are statistically significant (using the R-squared form of the F test). Do not worry too much about dropping various variables to find the "best" combination of explanatory variables. As we mentioned earlier, this is a difficult and not even very well-defined task. Only if eliminating a set of variables substantially alters the magnitudes and/or significance of the coefficients of interest is this important. Dropping a group of variables to simplify the model (such as quadratics or interactions) can be justified via an F test.
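The R-squared form of the F statistic is easy to compute from the restricted and unrestricted regressions, F = [(R²_ur − R²_r)/q] / [(1 − R²_ur)/(n − k − 1)]. A sketch with simulated wage data (statsmodels assumed; the industry categories are hypothetical):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(7)
n = 300
df = pd.DataFrame({
    "educ": rng.integers(8, 21, n),
    "industry": rng.integers(0, 4, n),     # four hypothetical industries
})
df["lwage"] = (0.5 + 0.08 * df["educ"] + 0.1 * (df["industry"] == 2)
               + rng.normal(0, 0.4, n))

restricted = smf.ols("lwage ~ educ", data=df).fit()
unrestricted = smf.ols("lwage ~ educ + C(industry)", data=df).fit()

q = 3                                       # number of industry dummies
dfree = unrestricted.df_resid               # n - k - 1
F = ((unrestricted.rsquared - restricted.rsquared) / q) / \
    ((1 - unrestricted.rsquared) / dfree)
print(F)

# statsmodels computes the same test directly: returns (F, p-value, df_diff).
print(unrestricted.compare_f_test(restricted))
```

Presenting both the restricted and unrestricted columns in the results table lets a reader reproduce exactly this calculation from the reported R-squareds and sample size.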
If you have used at least two different methods (such as OLS and 2SLS, or levels and differencing for a time series, or pooled OLS versus differencing with a panel data set), then you should comment on any critical differences. If OLS gives counterintuitive results, did using 2SLS or panel data methods improve the estimates? Or did the opposite happen?

19.5f Conclusions

This can be a short section that summarizes what you have learned. For example, you might want to present the magnitude of a coefficient that was of particular interest. The conclusion should also discuss caveats to the conclusions drawn, and it might even suggest directions for further research. It is useful to imagine readers turning first to the conclusion to decide whether to read the rest of the paper.

19.5g Style Hints

You should give your paper a title that reflects its topic, but make sure the title is not so long as to be cumbersome. The title should be on a separate title page that also includes your name, affiliation, and (if relevant) the course number. The title page can also include a short abstract, or an abstract can be included on a separate page.

Papers should be typed and double-spaced. All equations should begin on a new line, and they should be centered and numbered consecutively, that is, (1), (2), (3), and so on. Large graphs and tables may be included after the main body. In the text, refer to papers by author and date, for example, White (1980). The reference section at the end of the paper should be done in standard format. Several examples are given in the references at the back of the text.

When you introduce an equation in the "econometric models" section, you should describe the important variables: the dependent variable and the key independent variable or variables. To focus on a single independent variable, you can write an equation such as

    GPA = β0 + β1·alcohol + xδ + u

or

    log(wage) = β0 + β1·educ + xδ + u,

where the notation xδ is shorthand for several other explanatory variables. At this point, you need only describe them generally; they can be described specifically in the data section or in a table. For example, in a study of the factors affecting chief executive officer salaries, you might include a table like Table 19.1. A table of summary statistics, obtained from Table I in Papke and Wooldridge (1996) and similar to the data in 401K, might be set up as shown in Table 19.2.

In the results section, you can write the estimates either in equation form, as we often have done, or in a table. Especially when several models have been estimated with different sets of explanatory variables, tables are very useful. If you write out the estimates as an equation, for example,

    log(salary)^ = 2.45 + .236 log(sales) + .008 roe + .061 ceoten
                  (.093)  (.115)           (.003)    (.028)
    n = 204, R² = .351,

be sure to state near the first equation that standard errors are in parentheses. It is acceptable to report the t statistics for testing H0: βj = 0, or their absolute values, but it is most important to state what you are doing.

If you report your results in tabular form, make sure the dependent and independent variables are clearly indicated. Again, state whether standard errors or t statistics are reported with the coefficients (with the former preferred). Some authors like to use asterisks to indicate statistical significance at different significance levels (for example, one star means "significant at 5%," two stars mean "significant at 10% but not 5%," and so on). This is not necessary if you carefully discuss the significance of the explanatory variables in the text. A sample table of results, derived from Table II in Papke and Wooldridge (1996), is shown in Table 19.3.

Your results will be easier to read and interpret if you choose the units of both your dependent and independent variables so that coefficients are not too large or too small. You should never report numbers such as 1.051e−007 or 3.524e+006 for your coefficients or standard errors, and you should not use scientific notation. If coefficients are either extremely small or large, rescale the dependent or independent variables, as we discussed in Chapter 6. You should limit the number of digits reported after the decimal point so as not to convey a false sense of precision. For example, if your regression package estimates a coefficient to be .54821059, you should report this as .548, or even .55, in the paper.

TABLE 19.1 Variable Descriptions

salary      annual salary, including bonuses, in 1990 (in thousands)
sales       firm sales in 1990 (in millions)
roe         average return on equity, 1988-1990 (in percent)
pcsal       percentage change in salary, 1988-1990
pcroe       percentage change in roe, 1988-1990
indust      = 1 if an industrial company, 0 otherwise
finance     = 1 if a financial company, 0 otherwise
consprod    = 1 if a consumer products company, 0 otherwise
util        = 1 if a utility company, 0 otherwise
ceoten      number of years as CEO of the company

TABLE 19.2 Summary Statistics

Variable    Mean        Standard Deviation    Minimum    Maximum
prate       .869        .167                  .023       1
mrate       .746        .844                  .011       5
employ      4,621.01    16,299.64             53         443,040
age         13.14       9.63                  4          76
sole        .415        .493                  0          1
Number of observations = 3,784

TABLE 19.3 OLS Results. Dependent Variable: Participation Rate

Independent Variables    (1)                 (2)                 (3)
mrate                    .156 (.012)         .239 (.042)         .218 (.342)
mrate²                                       −.087 (.043)        −.096 (.073)
log(emp)                 −.112 (.014)        −.112 (.014)        −.098 (.111)
[log(emp)]²              .0057 (.0009)       .0057 (.0009)       .0052 (.0007)
age                      .0060 (.0010)       .0059 (.0010)       .0050 (.0021)
age²                     −.00007 (.00002)    −.00007 (.00002)    −.00006 (.00002)
sole                     −.0001 (.0058)      .0008 (.0058)       .0006 (.0061)
constant                 1.213 (.051)        .198 (.052)         .085 (.041)
industry dummies?        no                  no                  yes
Observations             3,784               3,784               3,784
R-squared                .143                .152                .162
Note: The quantities in parentheses are the standard errors.
As a rule, the commands that your particular econometrics package uses to produce results should not appear in the paper; only the results are important. If some special command was used to carry out a certain estimation method, this can be given in an appendix. An appendix is also a good place to include extra results that support your analysis but are not central to it.

Summary

In this chapter, we have discussed the ingredients of a successful empirical study and have provided hints that can improve the quality of an analysis. Ultimately, the success of any study depends crucially on the care and effort put into it.

Key Terms

Data Mining
Internet
Misspecification Analysis
Online Databases
Online Search Services
Sensitivity Analysis
Spreadsheet
Text Editor
Text (ASCII) File

Sample Empirical Projects

Throughout the text, we have seen examples of econometric analysis that either came from or were motivated by published works. We hope these have given you a good idea about the scope of empirical analysis. We include the following list as additional examples of questions that others have found (or are likely to find) interesting. These are intended to stimulate your imagination; no attempt is made to fill in all the details of specific models, data requirements, or alternative estimation methods. It should be possible to complete these projects in one term.

1. Do your own campus survey to answer a question of interest at your university. For example: What is the effect of working on college GPA? You can ask students about high school GPA, college GPA, ACT or SAT scores, hours worked per week, participation in athletics, major, gender, race, and so on. Then, use these variables to create a model that explains GPA. How much of an effect, if any, does another hour worked per week have on GPA? One issue of concern is that hours worked might be endogenous: it might be correlated with unobserved factors that affect college GPA, or lower GPAs might cause students to work more. A better approach would be to collect cumulative GPA prior to the semester and then to obtain GPA for the most recent semester, along with amount worked during that semester, and the other variables. Now, cumulative GPA could be used as a control (explanatory variable) in the equation.

2. There are many variants on the preceding topic. You can study the effects of drug or alcohol usage, or of living in a fraternity, on grade point average. You would want to control for many family background variables, as well as previous performance variables.

3. Do gun control laws at the city level reduce violent crimes? Such questions can be difficult to answer with a single cross section because city and state laws are often endogenous. [See Kleck and Patterson (1993) for an example. They used cross-sectional data and instrumental variables methods, but their IVs are questionable.] Panel data can be very useful for inferring causality in these contexts. At a minimum, you could control for a previous year's violent crime rate.
4. Low and McPheters (1983) used city cross-sectional data on wage rates and estimates of risk of death for police officers, along with other controls. The idea is to determine whether police officers are compensated for working in cities with a higher risk of on-the-job injury or death.

5. Do parental consent laws increase the teenage birthrate? You can use state-level data for this: either a time series for a given state or, even better, a panel data set of states. Do the same laws reduce abortion rates among teenagers? The Statistical Abstract of the United States contains all kinds of state-level data. Levine, Trainor, and Zimmerman (1996) studied the effects of abortion funding restrictions on similar outcomes. Other factors, such as access to abortions, may affect teen birth and abortion rates. There is also recent interest in the effects of abstinence-only sex education curricula. One can again use state-level panel data, or maybe even panel data at the school district level, to determine the effects of abstinence-only approaches to sex education on various outcomes, including rates of sexually transmitted diseases and teen birthrates.

6. Do changes in traffic laws affect traffic fatalities? McCarthy (1994) contains an analysis of monthly time series data for the state of California. A set of dummy variables can be used to indicate the months in which certain laws were in effect. The file TRAFFIC2 contains the data used by McCarthy. An alternative is to obtain a panel data set on states in the United States, where you can exploit variation in laws across states, as well as across time. Freeman (2007) is a good example of a state-level analysis, using 25 years of data that straddle changes in various state drunk driving, seat belt, and speed limit laws. The data can be found in the file DRIVING. Mullahy and Sindelar (1994) used individual-level data matched with state laws and taxes on alcohol to estimate the effects of laws and taxes on the probability of driving drunk.

7. Are blacks discriminated against in the lending market? Hunter and Walker (1996) looked at this question; in fact, we used their data in Computer Exercises C8 in Chapter 7 and C2 in Chapter 17.
8. Is there a marriage premium for professional athletes? Korenman and Neumark (1991) found a significant wage premium for married men after using a variety of econometric methods, but their analysis is limited because they cannot directly observe productivity. (Plus, Korenman and Neumark used men in a variety of occupations.) Professional athletes provide an interesting group in which to study the marriage premium because we can easily collect data on various productivity measures, in addition to salary. The data set NBASAL, on players in the National Basketball Association (NBA), is one example. For each player, we have information on points scored, rebounds, assists, playing time, and demographics. As in Computer Exercise C9 in Chapter 6, we can use multiple regression analysis to test whether the productivity measures differ by marital status. We can also use this kind of data to test whether married men are paid more after we account for productivity differences. (For example, NBA owners may think that married men bring stability to the team or are better for the team image.) For individual sports, such as golf and tennis, annual earnings directly reflect productivity. Such data, along with age and experience, are relatively easy to collect.

9. Answer this question: Are cigarette smokers less productive? A variant on this is: Do workers who smoke take more sick days (everything else being equal)? Mullahy and Portney (1990) use individual-level data to evaluate this question. You could use data at, say, the metropolitan level. Something like average productivity in manufacturing can be related to percentage of manufacturing workers who smoke. Other variables, such as average worker education, capital per worker, and size of the city (you can think of more), should be controlled for.

10. Do minimum wages alleviate poverty? You can use state or county data to answer this question. The idea is that the minimum wage varies across states because some states have higher minimums than the federal minimum. Further, there are changes over time in the nominal minimum within a state, some due to changes at the federal level and some because of changes at the state level. Neumark and Wascher (1995) used a panel data set on states to estimate the effects of the minimum wage on the employment rates of young workers, as well as on school enrollment rates.

11. What factors affect student performance at public schools? It is fairly easy to get school-level, or at least district-level, data in most states. Does spending per student matter? Do student-teacher ratios have any effects? It is difficult to estimate ceteris paribus effects because spending is related to other factors, such as family incomes or poverty rates. The data set MEAP93, for Michigan high schools, contains a measure of the poverty rate. Another possibility is to use panel data, or at least to control for a previous year's performance measure (such as average test score or percentage of students passing an exam).

You can look at less obvious factors that affect student performance. For example, after controlling for income, does family structure matter? Perhaps families with two parents, but only one working for a wage, have a positive effect on performance. (There could be at least two channels: parents spend more time with the children, and they might also volunteer at school.) What about the effect of single-parent households, controlling for income and other factors? You can merge census data for one or two years with school district data.

Do public schools with more charter or private schools nearby better educate their students because of competition? There is a tricky simultaneity issue here because private schools are probably located in areas where the public schools are already poor. Hoxby (1994) used an instrumental variables approach, where population proportions of various religions were IVs for the number of private schools. Rouse (1998) studied a different question: Did students who were able to attend a private school due to the Milwaukee voucher program perform better than those who did not? She used panel data and was able to control for an unobserved student effect. A subset of Rouse's data is contained in the file VOUCHER.
12. Can excess returns on a stock, or a stock index, be predicted by the lagged price/dividend ratio? Or by lagged interest rates or weekly monetary policy? It would be interesting to pick a foreign stock index, or one of the less well-known U.S. indexes. Cochrane (1997) provides a nice survey of recent theories and empirical results for explaining excess stock returns.

13. Is there racial discrimination in the market for baseball cards? This involves relating the prices of baseball cards to factors that should affect their prices, such as career statistics, whether the player is in the Hall of Fame, and so on. Holding other factors fixed, do cards of black or Hispanic players sell at a discount?

14. You can test whether the market for gambling on sports is "efficient." For example, does the spread on football or basketball games contain all usable information for picking against the spread? The data set PNTSPRD contains information on men's college basketball games. The outcome variable is binary: Was the spread covered or not? Then, you can try to find information that was known prior to each game's being played in order to predict whether the spread is covered. (Good luck!) A useful website that contains historical spreads and outcomes for college football and men's basketball games is www.goldsheet.com.

15. What effect, if any, does success in college athletics have on other aspects of the university (applications, quality of students, quality of nonathletic departments)? McCormick and Tinsley (1987) looked at the effects of athletic success at major colleges on changes in SAT scores of entering freshmen. Timing is important here: presumably, it is recent past success that affects current applications and student quality. One must control for many other factors (such as tuition and measures of school quality) to make the analysis convincing because, without controlling for other factors, there is a negative correlation between academics and athletic performance. A more recent examination of the link between academic and athletic performance is provided by Tucker (2004), who also looks at how alumni contributions are affected by athletic success. A variant is to match natural rivals in football or men's basketball and to look at differences across schools as a function of which school won the football game or one or more basketball games. ATHLET1 and ATHLET2 are small data sets that could be expanded and updated.

16. Collect murder rates for a sample of counties (say, from the FBI Uniform Crime Reports) for two years. Make the latter year such that economic and demographic variables are easy to obtain from the County and City Data Book. You can obtain the total number of people on death row, plus executions for intervening years, at the county level. If the years are 1990 and 1985, you might estimate

    mrdrte90 = β0 + β1·mrdrte85 + β2·executions + other factors,
where interest is in the coefficient on executions. The lagged murder rate and other factors serve as controls. If more than two years of data are obtained, then the panel data methods in Chapters 13 and 14 can be applied.

Other factors may also act as a deterrent to crime. For example, Cloninger (1991) presented a cross-sectional analysis of the effects of lethal police response on crime rates.

As a different twist, what factors affect crime rates on college campuses? Does the fraction of students living in fraternities or sororities have an effect? Does the size of the police force matter, or the kind of policing used? (Be careful about inferring causality here.) Does having an escort program help reduce crime? What about crime rates in nearby communities? Recently, colleges and universities have been required to report crime statistics; in previous years, reporting was voluntary.

17. What factors affect manufacturing productivity at the state level? In addition to levels of capital and worker education, you could look at degree of unionization. A panel data analysis would be most convincing here, using multiple years of census data, say, 1980, 1990, 2000, and 2010. Clark (1984) provides an analysis of how unionization affects firm performance and productivity. What other variables might explain productivity? Firm-level data can be obtained from Compustat. For example, other factors being fixed, do changes in unionization affect the stock price of a firm?

18. Use state- or county-level data or, if possible, school district-level data to look at the factors that affect education spending per pupil. An interesting question is: Other things being equal (such as income and education levels of residents), do districts with a larger percentage of elderly people spend less on schools? Census data can be matched with school district spending data to obtain a very large cross section. The U.S. Department of Education compiles such data.

19. What are the effects of state regulations, such as motorcycle helmet laws, on motorcycle fatalities? Or do differences in boating laws (such as minimum operating age) help to explain boating accident rates? The U.S. Department of Transportation compiles such information. This can be merged with data from the Statistical Abstract of the United States. A panel data analysis seems to be warranted here.

20. What factors affect output growth? Two factors of interest are inflation and investment [for example, Blomström, Lipsey, and Zejan (1996)]. You might use time series data on a country you find interesting. Or you could use a cross section of countries, as in De Long and Summers (1991). Friedman and Kuttner (1992) found evidence that, at least in the 1980s, the spread between the commercial paper rate and the Treasury bill rate affects real output.

21. What is the behavior of mergers in the U.S. economy (or some other economy)? Shughart and Tollison (1984) characterize (the log of) annual mergers in the U.S. economy as a random walk by showing that the difference in logs (roughly, the growth rate) is unpredictable given past growth rates. Does this still hold? Does it hold across various industries? What past measures of economic activity can be used to forecast mergers?

22. What factors might explain racial and gender differences in employment and wages? For example, Holzer (1991) reviewed the evidence on the "spatial mismatch hypothesis" to explain differences in employment rates between blacks and whites. Korenman and Neumark (1992) examined the effects of childbearing on women's wages, while Hersch and Stratton (1997) looked at the effects of household responsibilities on men's and women's wages.
23. Obtain monthly or quarterly data on teenage employment rates, the minimum wage, and factors that affect teen employment to estimate the effects of the minimum wage on teen employment. Solon (1985) used quarterly U.S. data, while Castillo-Freeman and Freeman (1992) used annual data on Puerto Rico. It might be informative to analyze time series data on a low-wage state in the United States, where changes in the minimum wage are likely to have the largest effect.

24. At the city level, estimate a time series model for crime. An example is Cloninger and Sartorius (1979). As a twist, you might estimate the effects of community policing or midnight basketball programs, relatively new innovations in fighting crime. Inferring causality is tricky. Including a lagged dependent variable might be helpful. Because you are using time series data, you should be aware of the spurious regression problem. Grogger (1990) used data on daily homicide counts to estimate the deterrent effects of capital punishment. Might there be other factors, such as news on lethal response by police, that have an effect on daily crime counts?

25. Are there aggregate productivity effects of computer usage? You would need to obtain time series data, perhaps at the national level, on productivity, percentage of employees using computers, and other factors. What about spending (probably as a fraction of total sales) on research and development? What sociological factors (for example, alcohol usage or divorce rates) might affect productivity?

26. What factors affect chief executive officer salaries? The files CEOSAL1 and CEOSAL2 are data sets that have various firm performance measures, as well as information such as tenure and education. You can certainly update these data files and look for other interesting factors. Rose and Shepard (1997) considered firm diversification as one important determinant of CEO compensation.

27. Do differences in tax codes across states affect the amount of foreign direct investment? Hines (1996) studied the effects of state corporate taxes, along with the ability to apply foreign tax credits, on investment from outside the United States.

28. What factors affect election outcomes? Does spending matter? Do votes on specific issues matter? Does the state of the local economy matter? See, for example, Levitt (1994) and the data sets VOTE1 and VOTE2. Fair (1996) performed a time series analysis of U.S. presidential elections.

29. Test whether stores or restaurants practice price discrimination based on race or ethnicity. Graddy (1997) used data on fast-food restaurants in New Jersey and Pennsylvania, along with ZIP code-level characteristics, to see whether prices vary by characteristics of the local population. She found that prices of standard items, such as sodas, increase when the fraction of black residents increases. (Her data are contained in the file DISCRIM.) You can collect similar data in your local area by surveying stores or restaurants for prices of common items and matching those with recent census data. See Graddy's paper for details of her analysis.
30. Do your own "audit" study to test for race or gender discrimination in hiring. (One such study is described in Example C.3 of Appendix C.) Have pairs of equally qualified friends, say, one male and one female, apply for job openings in local bars or restaurants. You can provide them with phony résumés that give each the same experience and background, where the only difference is gender (or race). Then, you can keep track of who gets the interviews and job offers. Neumark (1996) described one such study conducted in Philadelphia. A variant would be to test whether general physical attractiveness or a specific characteristic, such as being obese or having visible tattoos or body piercings, plays a role in hiring decisions. You would want to use the same gender in the matched pairs, and it may not be easy to get volunteers for such a study.

31. Following Hamermesh and Parker (2005), try to establish a link between the physical appearance of college instructors and student evaluations. This can be done on campus via a survey. Somewhat crude data can be obtained from websites that allow students to rank their professors and provide some information about appearance. Ideally, though, any evaluations of attractiveness are not done by current or former students, as those evaluations can be influenced by the grade received.

32. Use panel data to study the effects of various economic policies on regional economic growth. Studying the effects of taxes and spending is natural, but other policies may be of interest. For example, Craig, Jackson, and Thomson (2007) study the effects of Small Business Association Loan Guarantee programs on per capita income growth.

33. Blinder and Watson (2014) have recently studied explanations for systematic differences in economic variables, particularly growth in real GDP, in the United States, based on the political party of the sitting president. One might update the data to the most recent quarters and also study variables other than GDP, such as unemployment.

List of Journals

The following is a partial list of popular journals containing empirical research in business, economics, and other social sciences. A complete list of journals can be found on the Internet at http://www.econlit.org.

American Economic Journal: Applied Economics
American Economic Journal: Economic Policy
American Economic Review
American Journal of Agricultural Economics
American Political Science Review
Applied Economics
Brookings Papers on Economic Activity
Canadian Journal of Economics
Demography
Economic Development and Cultural Change
Economic Inquiry
Economica
Economics of Education Review
Education Finance and Policy
Economics Letters
Empirical Economics
Federal Reserve Bulletin
International Economic Review
International Tax and Public Finance
Journal of Applied Econometrics
Journal of Business and Economic Statistics
Journal of Development Economics
Journal of Economic Education
Journal of Empirical Finance
Journal of Environmental Economics and Management
Journal of Finance
Journal of Health Economics
Journal of Human Resources
Journal of Industrial Economics
Journal of International Economics
Journal of Labor Economics
Journal of Monetary Economics
Journal of Money, Credit and Banking
Journal of Political Economy
Journal of Public Economics
Journal of Quantitative Criminology
Journal of Urban Economics
National Bureau of Economic Research Working Papers Series
National Tax Journal
Public Finance Quarterly
Quarterly Journal of Economics
Regional Science and Urban Economics
Review of Economic Studies
Review of Economics and Statistics

Data Sources

Numerous data sources are available throughout the world. Governments of most countries compile a wealth of data; some general and easily accessible data sources for the United States, such as the Economic Report of the President, the Statistical Abstract of the United States, and the County and City Data Book, have already been mentioned. International financial data on many countries are published annually in International Financial Statistics. Various magazines, like BusinessWeek and U.S. News and World Report, often publish statistics, such as CEO salaries and firm performance, or rankings of academic programs, that are novel and can be used in an econometric analysis.

Rather than attempting to provide a list here, we instead give some Internet addresses that are comprehensive sources for economists. A very useful site for economists, called Resources for Economists on the Internet, is maintained by Bill Goffe at Pennsylvania State University. The address is http://www.rfe.org. This site provides links to journals, data sources, and lists of professional and academic economists. It is quite simple to use. Another very useful site is http://econometriclinks.com, which contains links to lots of data sources, as well as to other sites of interest to empirical economists. In addition, the Journal of Applied Econometrics and the Journal of Business and Economic Statistics have data archives that contain data sets used in most papers published in the journals over the past several years. If you find a data set that interests you, this is a good way to go, as much of the cleaning and formatting of the data have already been done. The downside is that some of these data sets are used in econometric analyses that are more advanced than we have learned about in this text. On the other hand, it is often useful to estimate simpler models using standard econometric methods, for comparison.

Many universities, such as the University of California-Berkeley, the University of Michigan, and the University of Maryland, maintain very extensive data sets, as well as links to a variety of data sets. Your own library possibly contains an extensive set of links to databases in business, economics, and the other social sciences. The regional Federal Reserve banks, such as the one in St. Louis, manage a variety of data. The National Bureau of Economic Research posts data sets used by some of its researchers. State and federal governments now publish a wealth of data that can be accessed via the Internet. Census data are publicly available from the U.S. Census Bureau.
Two useful publications are the Economic Census (published in years ending with two and seven) and the Census of Population and Housing (published at the beginning of each decade). Other agencies, such as the U.S. Department of Justice, also make data available to the public.

Appendix A

Basic Mathematical Tools

This appendix covers some basic mathematics that are used in econometric analysis. We summarize various properties of the summation operator, study properties of linear and certain nonlinear equations, and review proportions and percentages. We also present some special functions that often arise in applied econometrics, including quadratic functions and the natural logarithm. The first four sections require only basic algebra skills. Section A.5 contains a brief review of differential calculus; although a knowledge of calculus is not necessary to understand most of the text, it is used in some end-of-chapter appendices and in several of the more advanced chapters in Part 3.

A.1 The Summation Operator and Descriptive Statistics

The summation operator is a useful shorthand for manipulating expressions involving the sums of many numbers, and it plays a key role in statistics and econometric analysis. If {x_i: i = 1, …, n} denotes a sequence of n numbers, then we write the sum of these numbers as

    ∑_{i=1}^{n} x_i ≡ x_1 + x_2 + … + x_n.   (A.1)

With this definition, the summation operator is easily shown to have the following properties:

Property Sum1: For any constant c,

    ∑_{i=1}^{n} c = nc.   (A.2)

Property Sum2: For any constant c,

    ∑_{i=1}^{n} c·x_i = c ∑_{i=1}^{n} x_i.   (A.3)
The sample average is an example of a descriptive statistic; in this case, the statistic describes the central tendency of the set of points $x_i$.

There are some basic properties about averages that are important to understand. First, suppose we take each observation on $x$ and subtract off the average: $d_i \equiv x_i - \bar{x}$ (the "d" here stands for deviation from the average). Then, the sum of these deviations is always zero:

$$\sum_{i=1}^{n} d_i = \sum_{i=1}^{n} (x_i - \bar{x}) = \sum_{i=1}^{n} x_i - \sum_{i=1}^{n} \bar{x} = \sum_{i=1}^{n} x_i - n\bar{x} = n\bar{x} - n\bar{x} = 0.$$

We summarize this as

$$\sum_{i=1}^{n} (x_i - \bar{x}) = 0. \qquad (A.6)$$

A simple numerical example shows how this works. Suppose $n = 5$ and $x_1 = 6$, $x_2 = 1$, $x_3 = -2$, $x_4 = 0$, and $x_5 = 5$. Then, $\bar{x} = 2$, and the demeaned sample is $\{4, -1, -4, -2, 3\}$. Adding these gives zero, which is just what equation (A.6) says.

In our treatment of regression analysis in Chapter 2, we need to know some additional algebraic facts involving deviations from sample averages. An important one is that the sum of squared deviations is the sum of the squared $x_i$ minus $n$ times the square of $\bar{x}$:

$$\sum_{i=1}^{n} (x_i - \bar{x})^2 = \sum_{i=1}^{n} x_i^2 - n(\bar{x})^2. \qquad (A.7)$$

This can be shown using basic properties of the summation operator:

$$\sum_{i=1}^{n} (x_i - \bar{x})^2 = \sum_{i=1}^{n} (x_i^2 - 2 x_i \bar{x} + \bar{x}^2) = \sum_{i=1}^{n} x_i^2 - 2\bar{x} \sum_{i=1}^{n} x_i + n(\bar{x})^2 = \sum_{i=1}^{n} x_i^2 - 2n(\bar{x})^2 + n(\bar{x})^2 = \sum_{i=1}^{n} x_i^2 - n(\bar{x})^2.$$

Given a data set on two variables, $\{(x_i, y_i) : i = 1, 2, \ldots, n\}$, it can also be shown that

$$\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) = \sum_{i=1}^{n} x_i (y_i - \bar{y}) = \sum_{i=1}^{n} (x_i - \bar{x}) y_i = \sum_{i=1}^{n} x_i y_i - n(\bar{x} \cdot \bar{y}); \qquad (A.8)$$

this is a generalization of equation (A.7). (There, $y_i = x_i$ for all $i$.)

The average is the measure of central tendency that we will focus on in most of this text. However, it is sometimes informative to use the median (or sample median) to describe the central value. To obtain the median of the $n$ numbers $\{x_1, \ldots, x_n\}$, we first order the values of the $x_i$ from smallest to largest. Then, if $n$ is odd, the sample median is the middle number of the ordered observations. For example, given the numbers $\{-4, 8, 2, 0, 21, -10, 18\}$, the median value is 2 (because the ordered sequence is $\{-10, -4, 0, 2, 8, 18, 21\}$). If we change the largest number in this list, 21, to twice its value, 42, the median is still 2. By contrast, the sample average would increase from 5 to 8, a sizable change. Generally, the median is less sensitive than the average to changes in the extreme values (large or small) in a list of numbers. This is why "median incomes" or "median housing values" are often reported, rather than averages, when summarizing income or housing values in a city or county.

If $n$ is even, there is no unique way to define the median, because there are two numbers at the center. Usually, the median is defined to be the average of the two middle values (again, after ordering the numbers from smallest to largest). Using this rule, the median for the set of numbers $\{4, 12, 2, 6\}$ would be $(4 + 6)/2 = 5$.
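The identities (A.6) and (A.7), and the median's insensitivity to outliers, are equally easy to verify numerically; a short sketch using the two small data sets from the text:

```python
from statistics import median

x = [6, 1, -2, 0, 5]
x_bar = sum(x) / len(x)              # 2.0

# (A.6): deviations from the average sum to zero.
print(sum(xi - x_bar for xi in x))   # 0.0

# (A.7): sum of squared deviations = sum of squares - n*(x_bar)^2.
lhs = sum((xi - x_bar) ** 2 for xi in x)
rhs = sum(xi ** 2 for xi in x) - len(x) * x_bar ** 2
print(lhs, rhs)                      # 46.0 46.0

# The median is less sensitive to extreme values than the average.
data = [-4, 8, 2, 0, 21, -10, 18]
print(sum(data) / len(data), median(data))  # 5.0 2
data[4] = 42                                # change 21 to 42
print(sum(data) / len(data), median(data))  # 8.0 2
```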
A.2 Properties of Linear Functions

Linear functions play an important role in econometrics because they are simple to interpret and manipulate. If $x$ and $y$ are two variables related by

$$y = \beta_0 + \beta_1 x, \qquad (A.9)$$

then we say that $y$ is a linear function of $x$, and $\beta_0$ and $\beta_1$ are two parameters (numbers) describing this relationship. The intercept is $\beta_0$, and the slope is $\beta_1$.

The defining feature of a linear function is that the change in $y$ is always $\beta_1$ times the change in $x$:

$$\Delta y = \beta_1 \Delta x, \qquad (A.10)$$

where $\Delta$ denotes "change." In other words, the marginal effect of $x$ on $y$ is constant and equal to $\beta_1$.

Example A.1: Linear Housing Expenditure Function

Suppose that the relationship between monthly housing expenditure and monthly income is

$$housing = 164 + .27\, income. \qquad (A.11)$$

Then, for each additional dollar of income, 27 cents is spent on housing. If family income increases by $200, then housing expenditure increases by $(.27)200 = \$54$. This function is graphed in Figure A.1. [Figure A.1 (not reproduced): graph of housing = 164 + .27 income, a line with intercept 164 and constant slope $\Delta housing/\Delta income = .27$.]

According to equation (A.11), a family with no income spends $164 on housing, which of course cannot be literally true. For low levels of income, this linear function would not describe the relationship between housing and income very well, which is why we will eventually have to use other types of functions to describe such relationships.

In (A.11), the marginal propensity to consume (MPC) housing out of income is .27. This is different from the average propensity to consume (APC), which is

$$housing/income = 164/income + .27.$$

The APC is not constant; it is always larger than the MPC, and it gets closer to the MPC as income increases.

Linear functions are easily defined for more than two variables. Suppose that $y$ is related to two variables, $x_1$ and $x_2$, in the general form

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2. \qquad (A.12)$$

It is rather difficult to envision this function because its graph is three-dimensional. Nevertheless, $\beta_0$ is still the intercept (the value of $y$ when $x_1 = 0$ and $x_2 = 0$), and $\beta_1$ and $\beta_2$ measure particular slopes. From (A.12), the change in $y$, for given changes in $x_1$ and $x_2$, is

$$\Delta y = \beta_1 \Delta x_1 + \beta_2 \Delta x_2. \qquad (A.13)$$

If $x_2$ does not change, that is, $\Delta x_2 = 0$, then we have $\Delta y = \beta_1 \Delta x_1$ if $\Delta x_2 = 0$, so that $\beta_1$ is the slope of the relationship in the direction of $x_1$:

$$\beta_1 = \frac{\Delta y}{\Delta x_1} \quad \text{if } \Delta x_2 = 0.$$

Because it measures how $y$ changes with $x_1$, holding $x_2$ fixed, $\beta_1$ is often called the partial effect of $x_1$ on $y$. Because the partial effect involves holding other factors fixed, it is closely linked to the notion of ceteris paribus. The parameter $\beta_2$ has a similar interpretation: $\beta_2 = \Delta y/\Delta x_2$ if $\Delta x_1 = 0$, so that $\beta_2$ is the partial effect of $x_2$ on $y$.

Example A.2: Demand for Compact Discs

For college students, suppose that the monthly quantity demanded of compact discs is related to the price of compact discs and monthly discretionary income by

$$quantity = 120 - 9.8\, price + .03\, income,$$

where price is dollars per disc and income is measured in dollars. The demand curve is the relationship between quantity and price, holding income (and other factors) fixed. This is graphed in two dimensions in Figure A.2 at an income level of $900. [Figure A.2 (not reproduced): graph of quantity = 120 − 9.8 price + .03 income with income fixed at $900, a line with intercept 147 and constant slope $\Delta quantity/\Delta price = -9.8$.] The slope of the demand curve, $-9.8$, is the partial effect of price on quantity: holding income fixed, if the price of compact discs increases by one dollar, then the quantity demanded falls by 9.8. (We abstract from the fact that CDs can only be purchased in discrete units.) An increase in income simply shifts the demand curve up (changes the intercept), but the slope remains the same.
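The partial effect interpretation in Example A.2 can be confirmed by direct evaluation. The sketch below codes the example's (hypothetical) demand function and shows that the price slope is the same at any income level, while income shifts only the intercept:

```python
def quantity(price, income):
    """Demand function from Example A.2 (illustrative numbers)."""
    return 120 - 9.8 * price + 0.03 * income

# Partial effect of price, holding income fixed at $900:
print(round(quantity(11, 900) - quantity(10, 900), 2))    # -9.8

# The slope is the same at any fixed income level:
print(round(quantity(11, 1500) - quantity(10, 1500), 2))  # -9.8

# An income increase shifts the intercept, not the slope:
print(round(quantity(10, 1000) - quantity(10, 900), 2))   # 3.0
```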
A.3 Proportions and Percentages

Proportions and percentages play such an important role in applied economics that it is necessary to become very comfortable in working with them. Many quantities reported in the popular press are in the form of percentages; a few examples are interest rates, unemployment rates, and high school graduation rates.

An important skill is being able to convert proportions to percentages and vice versa. A percentage is easily obtained by multiplying a proportion by 100. For example, if the proportion of adults in a county with a high school degree is .82, then we say that 82% (82 percent) of adults have a high school degree. Another way to think of percentages and proportions is that a proportion is the decimal form of a percentage. For example, if the marginal tax rate for a family earning $30,000 per year is reported as 28%, then the proportion of the next dollar of income that is paid in income taxes is .28 (or 28 cents).

When using percentages, we often need to convert them to decimal form. For example, if a state sales tax is 6% and $200 is spent on a taxable item, then the sales tax paid is $200(.06) = \$12$. If the annual return on a certificate of deposit (CD) is 7.6% and we invest $3,000 in such a CD at the beginning of the year, then our interest income is $3{,}000(.076) = \$228$. As much as we would like it, the interest income is not obtained by multiplying 3,000 by 7.6.

We must be wary of proportions that are sometimes incorrectly reported as percentages in the popular media. If we read, "The percentage of high school students who drink alcohol is .57," we know that this really means 57% (not just over one-half of a percent, as the statement literally implies). College volleyball fans are probably familiar with press clips containing statements such as "Her hitting percentage was .372." This really means that her hitting percentage was 37.2%.

In econometrics, we are often interested in measuring the changes in various quantities. Let $x$ denote some variable, such as an individual's income, the number of crimes committed in a community, or the profits of a firm. Let $x_0$ and $x_1$ denote two values for $x$: $x_0$ is the initial value, and $x_1$ is the subsequent value. For example, $x_0$ could be the annual income of an individual in 1994 and $x_1$ the income of the same individual in 1995. The proportionate change in $x$ in moving from $x_0$ to $x_1$, sometimes called the relative change, is simply

$$(x_1 - x_0)/x_0 = \Delta x / x_0, \qquad (A.14)$$

assuming, of course, that $x_0 \neq 0$. In other words, to get the proportionate change, we simply divide the change in $x$ by its initial value. This is a way of standardizing the change so that it is free of units. For example, if an individual's income goes from $30,000 per year to $36,000 per year, then the proportionate change is $6{,}000/30{,}000 = .20$.
It is more common to state changes in terms of percentages. The percentage change in $x$ in going from $x_0$ to $x_1$ is simply 100 times the proportionate change:

$$\%\Delta x = 100 \cdot (\Delta x / x_0); \qquad (A.15)$$

the notation "$\%\Delta x$" is read as "the percentage change in $x$." For example, when income goes from $30,000 to $33,750, income has increased by 12.5%; to get this, we simply multiply the proportionate change, .125, by 100.

Again, we must be on guard for proportionate changes that are reported as percentage changes. In the previous example, for instance, reporting the percentage change in income as .125 is incorrect and could lead to confusion.

When we look at changes in things like dollar amounts or population, there is no ambiguity about what is meant by a percentage change. By contrast, interpreting percentage change calculations can be tricky when the variable of interest is itself a percentage, something that happens often in economics and other social sciences. To illustrate, let $x$ denote the percentage of adults in a particular city having a college education. Suppose the initial value is $x_0 = 24$ (24% have a college education), and the new value is $x_1 = 30$. We can compute two quantities to describe how the percentage of college-educated people has changed. The first is the change in $x$, $\Delta x$. In this case, $\Delta x = x_1 - x_0 = 6$: the percentage of people with a college education has increased by six percentage points. On the other hand, we can compute the percentage change in $x$ using equation (A.15): $\%\Delta x = 100[(30 - 24)/24] = 25\%$.

In this example, the percentage point change and the percentage change are very different. The percentage point change is just the change in the percentages. The percentage change is the change relative to the initial value. Generally, we must pay close attention to which number is being computed. The careful researcher makes this distinction perfectly clear; unfortunately, in the popular press as well as in academic research, the type of reported change is often unclear.

Example A.3: Michigan Sales Tax Increase

In March 1994, Michigan voters approved a sales tax increase from 4% to 6%. In political advertisements, supporters of the measure referred to this as a two percentage point increase, or "an increase of two cents on the dollar." Opponents of the tax increase called it a "50% increase in the sales tax rate." Both claims are correct; they are simply different ways of measuring the increase in the sales tax. Naturally, each group reported the measure that made its position most favorable.

For a variable such as salary, it makes no sense to talk of a "percentage point change in salary," because salary is not measured as a percentage. We can describe a change in salary either in dollar or percentage terms.
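The distinction between a percentage change and a percentage point change is mechanical enough to capture in two small helper functions; a minimal sketch (the function names are our own, not the text's):

```python
def pct_change(x0, x1):
    """Percentage change, equation (A.15): 100 * (x1 - x0) / x0."""
    return 100 * (x1 - x0) / x0

def pct_point_change(x0, x1):
    """Percentage point change: the raw difference of two percentages."""
    return x1 - x0

# Michigan sales tax, Example A.3: 4% -> 6%.
print(pct_point_change(4, 6))    # 2 percentage points
print(pct_change(4, 6))          # 50.0 percent

# College education share: 24% -> 30%.
print(pct_point_change(24, 30))  # 6 percentage points
print(pct_change(24, 30))        # 25.0 percent
```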
A.4 Some Special Functions and Their Properties

In Section A.2, we reviewed the basic properties of linear functions. We already indicated one important feature of functions like $y = \beta_0 + \beta_1 x$: a one-unit change in $x$ results in the same change in $y$, regardless of the initial value of $x$. As we noted earlier, this is the same as saying the marginal effect of $x$ on $y$ is constant, something that is not realistic for many economic relationships. For example, the important economic notion of diminishing marginal returns is not consistent with a linear relationship.

In order to model a variety of economic phenomena, we need to study several nonlinear functions. A nonlinear function is characterized by the fact that the change in $y$ for a given change in $x$ depends on the starting value of $x$. Certain nonlinear functions appear frequently in empirical economics, so it is important to know how to interpret them. A complete understanding of nonlinear functions takes us into the realm of calculus. Here, we simply summarize the most significant aspects of the functions, leaving the details of some derivations for Section A.5.

A.4a Quadratic Functions

One simple way to capture diminishing returns is to add a quadratic term to a linear relationship. Consider the equation

$$y = \beta_0 + \beta_1 x + \beta_2 x^2, \qquad (A.16)$$

where $\beta_0$, $\beta_1$, and $\beta_2$ are parameters. When $\beta_1 > 0$ and $\beta_2 < 0$, the relationship between $y$ and $x$ has the parabolic shape given in Figure A.3, where $\beta_0 = 6$, $\beta_1 = 8$, and $\beta_2 = -2$. [Figure A.3 (not reproduced): graph of $y = 6 + 8x - 2x^2$, a parabola that peaks at the point (2, 14).]

When $\beta_1 > 0$ and $\beta_2 < 0$, it can be shown (using calculus in the next section) that the maximum of the function occurs at the point

$$x^* = \beta_1/(-2\beta_2). \qquad (A.17)$$

For example, if $y = 6 + 8x - 2x^2$ (so $\beta_1 = 8$ and $\beta_2 = -2$), then the largest value of $y$ occurs at $x^* = 8/4 = 2$, and this value is $6 + 8(2) - 2(2)^2 = 14$ (see Figure A.3).

The fact that equation (A.16) implies a diminishing marginal effect of $x$ on $y$ is easily seen from its graph. Suppose we start at a low value of $x$ and then increase $x$ by some amount, say, $c$. This has a larger effect on $y$ than if we start at a higher value of $x$ and increase $x$ by the same amount $c$. In fact, once $x > x^*$, an increase in $x$ actually decreases $y$.

The statement that $x$ has a diminishing marginal effect on $y$ is the same as saying that the slope of the function in Figure A.3 decreases as $x$ increases. Although this is clear from looking at the graph, we usually want to quantify how quickly the slope is changing. An application of calculus gives the approximate slope of the quadratic function as

$$\text{slope} = \frac{\Delta y}{\Delta x} \approx \beta_1 + 2\beta_2 x, \qquad (A.18)$$

for "small" changes in $x$. [The right-hand side of equation (A.18) is the derivative of the function in equation (A.16) with respect to $x$.] Another way to write this is

$$\Delta y \approx (\beta_1 + 2\beta_2 x)\,\Delta x \quad \text{for "small" } \Delta x. \qquad (A.19)$$

To see how well this approximation works, consider again the function $y = 6 + 8x - 2x^2$. Then, according to equation (A.19), $\Delta y \approx (8 - 4x)\Delta x$. Now, suppose we start at $x = 1$ and change $x$ by $\Delta x = .1$. Using (A.19), $\Delta y \approx (8 - 4)(.1) = .4$. Of course, we can compute the change exactly by finding the values of $y$ when $x = 1$ and $x = 1.1$: $y_0 = 6 + 8(1) - 2(1)^2 = 12$ and $y_1 = 6 + 8(1.1) - 2(1.1)^2 = 12.38$, so the exact change in $y$ is .38. The approximation is pretty close in this case.

Now, suppose we start at $x = 1$ but change $x$ by a larger amount: $\Delta x = .5$. Then, the approximation gives $\Delta y \approx 4(.5) = 2$. The exact change is determined by finding the difference in $y$ when $x = 1$ and $x = 1.5$. The former value of $y$ was 12, and the latter value is $6 + 8(1.5) - 2(1.5)^2 = 13.5$, so the actual change is 1.5 (not 2). The approximation is worse in this case because the change in $x$ is larger.

For many applications, equation (A.19) can be used to compute the approximate marginal effect of $x$ on $y$ for any initial value of $x$ and small changes. And, we can always compute the exact change if necessary.
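The two calculations above are easy to replicate; the sketch below compares the linear approximation (A.19) with the exact change for the quadratic used in the text:

```python
def y(x):
    """Quadratic from the text: y = 6 + 8x - 2x^2."""
    return 6 + 8 * x - 2 * x ** 2

def approx_dy(x, dx, b1=8.0, b2=-2.0):
    """Approximate change from (A.19): (b1 + 2*b2*x) * dx."""
    return (b1 + 2 * b2 * x) * dx

for dx in (0.1, 0.5):
    exact = y(1 + dx) - y(1)
    print(dx, round(approx_dy(1, dx), 4), round(exact, 4))
# dx = 0.1: approx 0.4, exact 0.38  (close)
# dx = 0.5: approx 2.0, exact 1.5   (worse: larger step)

# Turning point, equation (A.17): x* = b1/(-2*b2)
print(8.0 / (-2 * -2.0), y(2.0))  # 2.0 14.0
```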
Example A.4: A Quadratic Wage Function

Suppose the relationship between hourly wages and years in the workforce (exper) is given by

$$wage = 5.25 + .48\, exper - .008\, exper^2. \qquad (A.20)$$

This function has the same general shape as the one in Figure A.3. Using equation (A.17), exper has a positive effect on wage up to the turning point, $exper^* = .48/[2(.008)] = 30$. The first year of experience is worth approximately .48, or 48 cents [see (A.19) with $x = 0$, $\Delta x = 1$]. Each additional year of experience increases wage by less than the previous year, reflecting a diminishing marginal return to experience. At 30 years, an additional year of experience would actually lower the wage. This is not very realistic, but it is one of the consequences of using a quadratic function to capture a diminishing marginal effect: at some point, the function must reach a maximum and curve downward. For practical purposes, the point at which this happens is often large enough to be inconsequential, but not always.

The graph of the quadratic function in (A.16) has a U-shape if $\beta_1 < 0$ and $\beta_2 > 0$, in which case there is an increasing marginal return. The minimum of the function is at the point $-\beta_1/(2\beta_2)$.

A.4b The Natural Logarithm

The nonlinear function that plays the most important role in econometric analysis is the natural logarithm. In this text, we denote the natural logarithm, which we often refer to simply as the log function, as

$$y = \log(x). \qquad (A.21)$$

You might remember learning different symbols for the natural log; $\ln(x)$ or $\log_e(x)$ are the most common. These different notations are useful when logarithms with several different bases are being used. For our purposes, only the natural logarithm is important, and so $\log(x)$ denotes the natural logarithm throughout this text. This corresponds to the notational usage in many statistical packages, although some use $\ln(x)$ (and most calculators use ln). Economists use both $\log(x)$ and $\ln(x)$, which is useful to know when you are reading papers in applied economics.

The function $y = \log(x)$ is defined only for $x > 0$, and it is plotted in Figure A.4. [Figure A.4 (not reproduced): graph of $y = \log(x)$, an increasing, concave curve that crosses zero at $x = 1$.] It is not very important to know how the values of $\log(x)$ are obtained. For our purposes, the function can be thought of as a black box: we can plug in any $x > 0$ and obtain $\log(x)$ from a calculator or a computer.

Several things are apparent from Figure A.4. First, when $y = \log(x)$, the relationship between $y$ and $x$ displays diminishing marginal returns. One important difference between the log and the quadratic function in Figure A.3 is that when $y = \log(x)$, the effect of $x$ on $y$ never becomes negative: the slope of the function gets closer and closer to zero as $x$ gets large, but the slope never quite reaches zero and certainly never becomes negative.
The following are also apparent from Figure A.4:

$$\log(x) < 0 \text{ for } 0 < x < 1,$$
$$\log(1) = 0,$$
$$\log(x) > 0 \text{ for } x > 1.$$

In particular, $\log(x)$ can be positive or negative. Some useful algebraic facts about the log function are

$$\log(x_1 \cdot x_2) = \log(x_1) + \log(x_2), \quad x_1, x_2 > 0,$$
$$\log(x_1/x_2) = \log(x_1) - \log(x_2), \quad x_1, x_2 > 0,$$
$$\log(x^c) = c \log(x), \quad x > 0, \; c \text{ any number}.$$

Occasionally, we will need to rely on these properties.

The logarithm can be used for various approximations that arise in econometric applications. First, $\log(1 + x) \approx x$ for $x \approx 0$. You can try this with $x = .02$, .1, and .5 to see how the quality of the approximation deteriorates as $x$ gets larger. Even more useful is the fact that the difference in logs can be used to approximate proportionate changes. Let $x_0$ and $x_1$ be positive values. Then, it can be shown (using calculus) that

$$\log(x_1) - \log(x_0) \approx (x_1 - x_0)/x_0 = \Delta x/x_0 \qquad (A.22)$$

for small changes in $x$. If we multiply equation (A.22) by 100 and write $\Delta \log(x) = \log(x_1) - \log(x_0)$, then

$$100 \cdot \Delta \log(x) \approx \%\Delta x \qquad (A.23)$$

for small changes in $x$. The meaning of "small" depends on the context, and we will encounter several examples throughout this text.

Why should we approximate the percentage change using (A.23) when the exact percentage change is so easy to compute? Momentarily, we will see why the approximation in (A.23) is useful in econometrics. First, let us see how good the approximation is in two examples.

First, suppose $x_0 = 40$ and $x_1 = 41$. Then, the percentage change in $x$ in moving from $x_0$ to $x_1$ is 2.5%, using $100(x_1 - x_0)/x_0$. Now, $\log(41) - \log(40) = .0247$ to four decimal places, which when multiplied by 100 is very close to 2.5. The approximation works pretty well. Now, consider a much bigger change: $x_0 = 40$ and $x_1 = 60$. The exact percentage change is 50%. However, $\log(60) - \log(40) \approx .4055$, so the approximation gives 40.55%, which is much farther off.

Why is the approximation in (A.23) useful if it is only satisfactory for small changes? To build up to the answer, we first define the elasticity of $y$ with respect to $x$ as

$$\frac{\Delta y}{\Delta x} \cdot \frac{x}{y} = \frac{\%\Delta y}{\%\Delta x}. \qquad (A.24)$$

In other words, the elasticity of $y$ with respect to $x$ is the percentage change in $y$ when $x$ increases by 1%. This notion should be familiar from introductory economics.

If $y$ is a linear function of $x$, $y = \beta_0 + \beta_1 x$, then the elasticity is

$$\frac{\Delta y}{\Delta x} \cdot \frac{x}{y} = \beta_1 \cdot \frac{x}{y} = \beta_1 \cdot \frac{x}{\beta_0 + \beta_1 x}, \qquad (A.25)$$

which clearly depends on the value of $x$. (This is a generalization of the well-known result from basic demand theory: the elasticity is not constant along a straight-line demand curve.)
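The accuracy of (A.23) for small versus large changes is easy to see numerically; a short sketch using the two cases from the text:

```python
import math

def exact_pct_change(x0, x1):
    return 100 * (x1 - x0) / x0

def log_approx_pct_change(x0, x1):
    """Approximation (A.23): 100 * [log(x1) - log(x0)]."""
    return 100 * (math.log(x1) - math.log(x0))

for x0, x1 in [(40, 41), (40, 60)]:
    print(x0, x1,
          round(exact_pct_change(x0, x1), 2),       # 2.5, then 50.0
          round(log_approx_pct_change(x0, x1), 2))  # 2.47, then 40.55
```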
Elasticities are of critical importance in many areas of applied economics, not just in demand theory.

It is convenient in many situations to have constant elasticity models, and the log function allows us to specify such models. If we use the approximation in (A.23) for both $x$ and $y$, then the elasticity is approximately equal to $\Delta \log(y)/\Delta \log(x)$. Thus, a constant elasticity model is approximated by the equation

$$\log(y) = \beta_0 + \beta_1 \log(x), \qquad (A.26)$$

and $\beta_1$ is the elasticity of $y$ with respect to $x$ (assuming that $x, y > 0$).

Example A.5: Constant Elasticity Demand Function

If $q$ is quantity demanded and $p$ is price, and these variables are related by

$$\log(q) = 4.7 - 1.25 \log(p),$$

then the price elasticity of demand is $-1.25$. Roughly, a 1% increase in price leads to a 1.25% fall in the quantity demanded.

For our purposes, the fact that $\beta_1$ in (A.26) is only close to the elasticity is not important. In fact, when the elasticity is defined using calculus, as in Section A.5, the definition is exact. For the purposes of econometric analysis, (A.26) defines a constant elasticity model. Such models play a large role in empirical economics.

Other possibilities for using the log function often arise in empirical work. Suppose that $y > 0$ and

$$\log(y) = \beta_0 + \beta_1 x. \qquad (A.27)$$

Then, $\Delta \log(y) = \beta_1 \Delta x$, so $100 \cdot \Delta \log(y) = (100 \cdot \beta_1)\Delta x$. It follows that, when $y$ and $x$ are related by equation (A.27),

$$\%\Delta y \approx (100 \cdot \beta_1)\Delta x. \qquad (A.28)$$

Example A.6: Logarithmic Wage Equation

Suppose that hourly wage and years of education are related by

$$\log(wage) = 2.78 + .094\, educ.$$

Then, using equation (A.28),

$$\%\Delta wage \approx 100(.094)\,\Delta educ = 9.4\,\Delta educ.$$

It follows that one more year of education increases hourly wage by about 9.4%.

Generally, the quantity $\%\Delta y/\Delta x$ is called the semi-elasticity of $y$ with respect to $x$. The semi-elasticity is the percentage change in $y$ when $x$ increases by one unit. What we have just shown is that, in model (A.27), the semi-elasticity is constant and equal to $100 \cdot \beta_1$. In Example A.6, we can conveniently summarize the relationship between wages and education by saying that one more year of education, starting from any amount of education, increases the wage by about 9.4%. This is why such models play an important role in economics.

Another relationship of some interest in applied economics is

$$y = \beta_0 + \beta_1 \log(x), \qquad (A.29)$$

where $x > 0$. How can we interpret this equation? If we take the change in $y$, we get $\Delta y = \beta_1 \Delta \log(x)$, which can be rewritten as $\Delta y = (\beta_1/100)[100 \cdot \Delta \log(x)]$. Thus, using the approximation in (A.23), we have

$$\Delta y \approx (\beta_1/100)(\%\Delta x). \qquad (A.30)$$

In other words, $\beta_1/100$ is the unit change in $y$ when $x$ increases by 1%.

Example A.7: Labor Supply Function

Assume that the labor supply of a worker can be described by

$$hours = 33 + 45.1 \log(wage),$$

where wage is hourly wage and hours is hours worked per week. Then, from (A.30),

$$\Delta hours \approx (45.1/100)(\%\Delta wage) = .451\, \%\Delta wage.$$

In other words, a 1% increase in wage increases the weekly hours worked by about .45, or slightly less than one-half hour. If the wage increases by 10%, then $\Delta hours = .451(10) = 4.51$, or about four and one-half hours. We would not want to use this approximation for much larger percentage changes in wages.
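Because (A.28) is an approximation, it is instructive to compare it with the exact percentage change implied by the log wage equation in Example A.6; exponentiating both sides gives the exact effect (a sketch, anticipating the exponential function introduced in Section A.4c just below):

```python
import math

b0, b1 = 2.78, 0.094   # coefficients from Example A.6

def wage(educ):
    """Wage implied by log(wage) = 2.78 + .094*educ."""
    return math.exp(b0 + b1 * educ)

# Approximate semi-elasticity from (A.28): 100*b1 percent per year.
print(round(100 * b1, 2))                                # 9.4

# Exact percentage change for one more year, from any starting level:
print(round(100 * (wage(13) - wage(12)) / wage(12), 2))  # 9.86
print(round(100 * (math.exp(b1) - 1), 2))                # 9.86, same thing
```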
A.4c The Exponential Function

Before leaving this section, we need to discuss a special function that is related to the log. As motivation, consider equation (A.27). There, $\log(y)$ is a linear function of $x$. But how do we find $y$ itself as a function of $x$? The answer is given by the exponential function.

We will write the exponential function as $y = \exp(x)$, which is graphed in Figure A.5. [Figure A.5 (not reproduced): graph of $y = \exp(x)$, an increasing, convex curve that is always positive.] From Figure A.5, we see that $\exp(x)$ is defined for any value of $x$ and is always greater than zero. Sometimes, the exponential function is written as $y = e^x$, but we will not use this notation. Two important values of the exponential function are $\exp(0) = 1$ and $\exp(1) = 2.7183$ (to four decimal places).

The exponential function is the inverse of the log function in the following sense: $\log[\exp(x)] = x$ for all $x$, and $\exp[\log(x)] = x$ for $x > 0$. In other words, the log "undoes" the exponential, and vice versa. (This is why the exponential function is sometimes called the antilog function.) In particular, note that $\log(y) = \beta_0 + \beta_1 x$ is equivalent to

$$y = \exp(\beta_0 + \beta_1 x).$$

If $\beta_1 > 0$, the relationship between $x$ and $y$ has the same shape as in Figure A.5. Thus, if $\log(y) = \beta_0 + \beta_1 x$ with $\beta_1 > 0$, then $x$ has an increasing marginal effect on $y$. In Example A.6, this means that another year of education leads to a larger change in wage than the previous year of education.

Two useful facts about the exponential function are $\exp(x_1 + x_2) = \exp(x_1)\exp(x_2)$ and $\exp[c \cdot \log(x)] = x^c$.
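As a quick check of the inverse relationship and the two algebraic facts just stated (all equalities hold only up to floating-point rounding):

```python
import math

x = 3.7
print(math.log(math.exp(x)))  # 3.7: log undoes exp
print(math.exp(math.log(x)))  # 3.7: exp undoes log (requires x > 0)

# exp(x1 + x2) = exp(x1) * exp(x2)
x1, x2 = 1.2, 0.5
print(math.exp(x1 + x2), math.exp(x1) * math.exp(x2))

# exp(c * log(x)) = x ** c
c = 2.5
print(math.exp(c * math.log(x)), x ** c)
```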
A.5 Differential Calculus

In the previous section, we asserted several approximations that have foundations in calculus. Let $y = f(x)$ for some function $f$. Then, for small changes in $x$,

$$\Delta y \approx \frac{df}{dx} \cdot \Delta x, \qquad (A.31)$$

where $df/dx$ is the derivative of the function $f$, evaluated at the initial point $x_0$. We also write the derivative as $dy/dx$.

For example, if $y = \log(x)$, then $dy/dx = 1/x$. Using (A.31), with $dy/dx$ evaluated at $x_0$, we have $\Delta y \approx (1/x_0)\Delta x$, or $\Delta \log(x) \approx \Delta x/x_0$, which is the approximation given in (A.22).

In applying econometrics, it helps to recall the derivatives of a handful of functions, because we use the derivative to define the slope of a function at a given point. We can then use (A.31) to find the approximate change in $y$ for small changes in $x$. In the linear case, the derivative is simply the slope of the line, as we would hope: if $y = \beta_0 + \beta_1 x$, then $dy/dx = \beta_1$.

If $y = x^c$, then $dy/dx = c x^{c-1}$. The derivative of a sum of two functions is the sum of the derivatives: $d[f(x) + g(x)]/dx = df(x)/dx + dg(x)/dx$. The derivative of a constant times any function is that same constant times the derivative of the function: $d[cf(x)]/dx = c[df(x)/dx]$. These simple rules allow us to find derivatives of more complicated functions. Other rules, such as the product, quotient, and chain rules, will be familiar to those who have taken calculus, but we will not review those here.

Some functions that are often used in economics, along with their derivatives, are

$$y = \beta_0 + \beta_1 x + \beta_2 x^2; \quad dy/dx = \beta_1 + 2\beta_2 x$$
$$y = \beta_0 + \beta_1/x; \quad dy/dx = -\beta_1/x^2$$
$$y = \beta_0 + \beta_1 \sqrt{x}; \quad dy/dx = (\beta_1/2) x^{-1/2}$$
$$y = \beta_0 + \beta_1 \log(x); \quad dy/dx = \beta_1/x$$
$$y = \exp(\beta_0 + \beta_1 x); \quad dy/dx = \beta_1 \exp(\beta_0 + \beta_1 x).$$

If $\beta_0 = 0$ and $\beta_1 = 1$ in this last expression, we get $dy/dx = \exp(x)$ when $y = \exp(x)$. In Section A.4, we noted that equation (A.26) defines a constant elasticity model when calculus is used. The calculus definition of elasticity is $(dy/dx) \cdot (x/y)$. It can be shown, using properties of logs and exponentials, that when (A.26) holds, $(dy/dx) \cdot (x/y) = \beta_1$.

When $y$ is a function of multiple variables, the notion of a partial derivative becomes important. Suppose that

$$y = f(x_1, x_2). \qquad (A.32)$$

Then, there are two partial derivatives, one with respect to $x_1$ and one with respect to $x_2$. The partial derivative of $y$ with respect to $x_1$, denoted here by $\partial y/\partial x_1$, is just the usual derivative of (A.32) with respect to $x_1$, where $x_2$ is treated as a constant. Similarly, $\partial y/\partial x_2$ is just the derivative of (A.32) with respect to $x_2$, holding $x_1$ fixed.

Partial derivatives are useful for much the same reason as ordinary derivatives. We can approximate the change in $y$ as

$$\Delta y \approx \frac{\partial y}{\partial x_1} \cdot \Delta x_1, \quad \text{holding } x_2 \text{ fixed}. \qquad (A.33)$$

Thus, calculus allows us to define partial effects in nonlinear models just as we could in linear models. In fact, if $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2$, then

$$\frac{\partial y}{\partial x_1} = \beta_1, \quad \frac{\partial y}{\partial x_2} = \beta_2.$$

These can be recognized as the partial effects defined in Section A.2.

A more complicated example is

$$y = 5 + 4 x_1 + x_1^2 - 3 x_2 + 7 x_1 x_2. \qquad (A.34)$$

Now, the derivative of (A.34), with respect to $x_1$ (treating $x_2$ as a constant), is simply

$$\frac{\partial y}{\partial x_1} = 4 + 2 x_1 + 7 x_2;$$

note how this depends on $x_1$ and $x_2$. The derivative of (A.34), with respect to $x_2$, is $\partial y/\partial x_2 = -3 + 7 x_1$, so this depends only on $x_1$.

Example A.8: Wage Function with Interaction

A function relating wages to years of education and experience is

$$wage = 3.10 + .41\, educ + .19\, exper - .004\, exper^2 + .007\, educ \cdot exper. \qquad (A.35)$$

The partial effect of exper on wage is the partial derivative of (A.35):

$$\frac{\partial\, wage}{\partial\, exper} = .19 - .008\, exper + .007\, educ.$$

This is the approximate change in wage due to increasing experience by one year. Notice that this partial effect depends on the initial level of exper and educ. For example, for a worker who is starting with educ = 12 and exper = 5, the next year of experience increases wage by about $.19 - .008(5) + .007(12) = .234$, or 23.4 cents per hour. The exact change can be calculated by computing (A.35) at exper = 5, educ = 12 and at exper = 6, educ = 12, and then taking the difference. This turns out to be .23, which is very close to the approximation.
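The approximate and exact effects in Example A.8 can be checked directly; a small sketch:

```python
def wage(educ, exper):
    """Wage equation (A.35) from Example A.8."""
    return (3.10 + 0.41 * educ + 0.19 * exper
            - 0.004 * exper ** 2 + 0.007 * educ * exper)

def partial_wage_exper(educ, exper):
    """Partial derivative of (A.35) with respect to exper."""
    return 0.19 - 0.008 * exper + 0.007 * educ

# Approximate effect of one more year of experience at educ=12, exper=5:
print(round(partial_wage_exper(12, 5), 3))  # 0.234 (23.4 cents per hour)

# Exact change from exper = 5 to exper = 6, holding educ = 12:
print(round(wage(12, 6) - wage(12, 5), 3))  # 0.23
```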
Differential calculus plays an important role in minimizing and maximizing functions of one or more variables. If $f(x_1, x_2, \ldots, x_k)$ is a differentiable function of $k$ variables, then a necessary condition for $x_1^*, x_2^*, \ldots, x_k^*$ to either minimize or maximize $f$ over all possible values of $x_j$ is

$$\frac{\partial f}{\partial x_j}(x_1^*, x_2^*, \ldots, x_k^*) = 0, \quad j = 1, 2, \ldots, k. \qquad (A.36)$$

In other words, all of the partial derivatives of $f$ must be zero when they are evaluated at the $x_j^*$. These are called the first order conditions for minimizing or maximizing a function. Practically, we hope to solve equation (A.36) for the $x_j^*$. Then, we can use other criteria to determine whether we have minimized or maximized the function. We will not need those here. [See Sydsaeter and Hammond (1995) for a discussion of multivariable calculus and its use in optimizing functions.]

Summary

The math tools reviewed here are crucial for understanding regression analysis and the probability and statistics that are covered in Appendices B and C. The material on nonlinear functions, especially quadratic, logarithmic, and exponential functions, is critical for understanding modern applied economic research. The level of comprehension required of these functions does not include a deep knowledge of calculus, although calculus is needed for certain derivations.

Key Terms

Average; Ceteris Paribus; Constant Elasticity Model; Derivative; Descriptive Statistic; Diminishing Marginal Effect; Elasticity; Exponential Function; Intercept; Linear Function; Log Function; Marginal Effect; Median; Natural Logarithm; Nonlinear Function; Partial Derivative; Partial Effect; Percentage Change; Percentage Point Change; Proportionate Change; Relative Change; Semi-Elasticity; Slope; Summation Operator

Problems

1 The following table contains monthly housing expenditures for 10 families.

Family   Monthly Housing Expenditures (Dollars)
1        300
2        440
3        350
4        1,100
5        640
6        480
7        450
8        700
9        670
10       530

(i) Find the average monthly housing expenditure.
(ii) Find the median monthly housing expenditure.
(iii) If monthly housing expenditures were measured in hundreds of dollars, rather than in dollars, what would be the average and median expenditures?
(iv) Suppose that family number 8 increases its monthly housing expenditure to $900, but the expenditures of all other families remain the same. Compute the average and median housing expenditures.

2 Suppose the following equation describes the relationship between the average number of classes missed during a semester (missed) and the distance from school (distance, measured in miles):

missed = 3 + 0.2 distance.

(i) Sketch this line, being sure to label the axes. How do you interpret the intercept in this equation?
(ii) What is the average number of classes missed for someone who lives five miles away?
(iii) What is the difference in the average number of classes missed for someone who lives 10 miles away and someone who lives 20 miles away?

3 In Example A.2, quantity of compact discs was related to price and income by quantity = 120 − 9.8 price + .03 income. What is the demand for CDs if price = 15 and income = 200? What does this suggest about using linear functions to describe demand curves?

4 Suppose the unemployment rate in the United States goes from 6.4% in one year to 5.6% in the next.
(i) What is the percentage point decrease in the unemployment rate?
(ii) By what percentage has the unemployment rate fallen?

5 Suppose that the return from holding a particular firm's stock goes from 15% in one year to 18% in the following year. The majority shareholder claims that "the stock return only increased by 3%," while the chief executive officer claims that "the return on the firm's stock increased by 20%." Reconcile their disagreement.

6 Suppose that Person A earns $35,000 per year and Person B earns $42,000.
(i) Find the exact percentage by which Person B's salary exceeds Person A's.
(ii) Now, use the difference in natural logs to find the approximate percentage difference.

7 Suppose the following model describes the relationship between annual salary (salary) and the number of previous years of labor market experience (exper):

log(salary) = 10.6 + .027 exper.

(i) What is salary when exper = 0? When exper = 5? (Hint: You will need to exponentiate.)
(ii) Use equation (A.28) to approximate the percentage increase in salary when exper increases by five years.
(iii) Use the results of part (i) to compute the exact percentage difference in salary when exper = 5 and exper = 0. Comment on how this compares with the approximation in part (ii).

8 Let grthemp denote the proportionate growth in employment, at the county level, from 1990 to 1995, and let salestax denote the county sales tax rate, stated as a proportion. Interpret the intercept and slope in the equation

grthemp = .043 − .78 salestax.

9 Suppose the yield of a certain crop (in bushels per acre) is related to fertilizer amount (in pounds per acre) as

yield = 120 + 19√fertilizer.

(i) Graph this relationship by plugging in several values for fertilizer.
(ii) Describe how the shape of this relationship compares with a linear relationship between yield and fertilizer.

10 Suppose that in a particular state a standardized test is given to all graduating seniors. Let score denote a student's score on the test. Someone discovers that performance on the test is related to the size of the student's graduating high school class. The relationship is quadratic:

score = 45.6 + .082 class − .000147 class²,

where class is the number of students in the graduating class.
(i) How do you literally interpret the value 45.6 in the equation? By itself, is it of much interest? Explain.
(ii) From the equation, what is the optimal size of the graduating class (the size that maximizes the test score)? (Round your answer to the nearest integer.) What is the highest achievable test score?
(iii) Sketch a graph that illustrates your solution in part (ii).
(iv) Does it seem likely that score and class would have a deterministic relationship? That is, is it realistic to think that once you know the size of a student's graduating class you know, with certainty, his or her test score? Explain.

11 Consider the line $y = \beta_0 + \beta_1 x$.
(i) Let $(x_1, y_1)$ and $(x_2, y_2)$ be two points on the line. Show that $(\bar{x}, \bar{y})$ is also on the line, where $\bar{x} = (x_1 + x_2)/2$ is the average of the two values and $\bar{y} = (y_1 + y_2)/2$.
(ii) Extend the result of part (i) to $n$ points on the line, $\{(x_i, y_i) : i = 1, \ldots, n\}$.
Appendix B
Fundamentals of Probability

This appendix covers key concepts from basic probability. Appendices B and C are primarily for review; they are not intended to replace a course in probability and statistics. However, all of the probability and statistics concepts that we use in the text are covered in these appendices.

Probability is of interest in its own right for students in business, economics, and other social sciences. For example, consider the problem of an airline trying to decide how many reservations to accept for a flight that has 100 available seats. If fewer than 100 people want reservations, then these should all be accepted. But what if more than 100 people request reservations? A safe solution is to accept at most 100 reservations. However, because some people book reservations and then do not show up for the flight, there is some chance that the plane will not be full even if 100 reservations are booked. This results in lost revenue to the airline. A different strategy is to book more than 100 reservations and to hope that some people do not show up, so that the final number of passengers is as close to 100 as possible. This policy runs the risk of the airline having to compensate people who are necessarily bumped from an overbooked flight.

A natural question in this context is: Can we decide on the optimal (or best) number of reservations the airline should make? This is a nontrivial problem. Nevertheless, given certain information (on airline costs and how frequently people show up for reservations), we can use basic probability to arrive at a solution.

B.1 Random Variables and Their Probability Distributions

Suppose that we flip a coin 10 times and count the number of times the coin turns up heads. This is an example of an experiment. Generally, an experiment is any procedure that can, at least in theory, be infinitely repeated and has a well-defined set of outcomes. We could, in principle, carry out the coin-flipping procedure again and again. Before we flip the coin, we know that the number of heads appearing is an integer from 0 to 10, so the outcomes of the experiment are well defined.

A random variable is one that takes on numerical values and has an outcome that is determined by an experiment. In the coin-flipping example, the number of heads appearing in 10 flips of a coin is an example of a random variable. Before we flip the coin 10 times, we do not know how many times the coin will come up heads.
Once we flip the coin 10 times and count the number of heads, we obtain the outcome of the random variable for this particular trial of the experiment. Another trial can produce a different outcome. In the airline reservation example mentioned earlier, the number of people showing up for their flight is a random variable: before any particular flight, we do not know how many people will show up.

To analyze data collected in business and the social sciences, it is important to have a basic understanding of random variables and their properties. Following the usual conventions in probability and statistics throughout Appendices B and C, we denote random variables by uppercase letters, usually W, X, Y, and Z; particular outcomes of random variables are denoted by the corresponding lowercase letters, w, x, y, and z. For example, in the coin-flipping experiment, let X denote the number of heads appearing in 10 flips of a coin. Then, X is not associated with any particular value, but we know X will take on a value in the set $\{0, 1, 2, \ldots, 10\}$. A particular outcome is, say, $x = 6$.

We indicate large collections of random variables by using subscripts. For example, if we record last year's income of 20 randomly chosen households in the United States, we might denote these random variables by $X_1, X_2, \ldots, X_{20}$; the particular outcomes would be denoted $x_1, x_2, \ldots, x_{20}$.

As stated in the definition, random variables are always defined to take on numerical values, even when they describe qualitative events. For example, consider tossing a single coin, where the two outcomes are heads and tails. We can define a random variable as follows: $X = 1$ if the coin turns up heads, and $X = 0$ if the coin turns up tails.

A random variable that can only take on the values zero and one is called a Bernoulli (or binary) random variable. In basic probability, it is traditional to call the event $X = 1$ a "success" and the event $X = 0$ a "failure." For a particular application, the success-failure nomenclature might not correspond to our notion of a success or failure, but it is a useful terminology that we will adopt.

B.1a Discrete Random Variables

A discrete random variable is one that takes on only a finite or countably infinite number of values. The notion of "countably infinite" means that even though an infinite number of values can be taken on by a random variable, those values can be put in a one-to-one correspondence with the positive integers. Because the distinction between "countably infinite" and "uncountably infinite" is somewhat subtle, we will concentrate on discrete random variables that take on only a finite number of values. [Larsen and Marx (1986, Chapter 3) provide a detailed treatment.]

A Bernoulli random variable is the simplest example of a discrete random variable. The only thing we need to completely describe the behavior of a Bernoulli random variable is the probability that it takes on the value one. In the coin-flipping example, if the coin is "fair," then $P(X = 1) = 1/2$ (read as "the probability that X equals one is one-half"). Because probabilities must sum to one, $P(X = 0) = 1/2$, also.

Social scientists are interested in more than flipping coins, so we must allow for more general situations. Again, consider the example where the airline must decide how many people to book for a flight with 100 available seats. This problem can be analyzed in the context of several Bernoulli random variables as follows: for a randomly selected customer, define a Bernoulli random variable as $X = 1$ if the person shows up for the reservation, and $X = 0$ if not. There is no reason to think that the probability of any particular customer showing up is 1/2; in principle, the probability can be any number between 0 and 1. Call this number $\theta$, so that

$$P(X = 1) = \theta, \qquad (B.1)$$
$$P(X = 0) = 1 - \theta. \qquad (B.2)$$

For example, if $\theta = .75$, then there is a 75% chance that a customer shows up after making a reservation and a 25% chance that the customer does not show up. Intuitively, the value of $\theta$ is crucial in determining the airline's strategy for booking reservations. Methods for estimating $\theta$, given historical data on airline reservations, are a subject of mathematical statistics, something we turn to in Appendix C.
More generally, any discrete random variable is completely described by listing its possible values and the associated probability that it takes on each value. If X takes on the $k$ possible values $\{x_1, \ldots, x_k\}$, then the probabilities $p_1, p_2, \ldots, p_k$ are defined by

$$p_j = P(X = x_j), \quad j = 1, 2, \ldots, k, \qquad (B.3)$$

where each $p_j$ is between 0 and 1, and

$$p_1 + p_2 + \cdots + p_k = 1. \qquad (B.4)$$

Equation (B.3) is read as: "The probability that X takes on the value $x_j$ is equal to $p_j$."

Equations (B.1) and (B.2) show that the probabilities of success and failure for a Bernoulli random variable are determined entirely by the value of $\theta$. Because Bernoulli random variables are so prevalent, we have a special notation for them: $X \sim \text{Bernoulli}(\theta)$ is read as "X has a Bernoulli distribution with probability of success equal to $\theta$."

The probability density function (pdf) of X summarizes the information concerning the possible outcomes of X and the corresponding probabilities:

$$f(x_j) = p_j, \quad j = 1, 2, \ldots, k, \qquad (B.5)$$

with $f(x) = 0$ for any $x$ not equal to $x_j$ for some $j$. In other words, for any real number $x$, $f(x)$ is the probability that the random variable X takes on the particular value $x$. When dealing with more than one random variable, it is sometimes useful to subscript the pdf in question: $f_X$ is the pdf of X, $f_Y$ is the pdf of Y, and so on.

Given the pdf of any discrete random variable, it is simple to compute the probability of any event involving that random variable. For example, suppose that X is the number of free throws made by a basketball player out of two attempts, so that X can take on the three values $\{0, 1, 2\}$. Assume that the pdf of X is given by

$$f(0) = .20, \quad f(1) = .44, \quad \text{and} \quad f(2) = .36.$$

The three probabilities sum to one, as they must. Using this pdf, we can calculate the probability that the player makes at least one free throw:

$$P(X \geq 1) = P(X = 1) + P(X = 2) = .44 + .36 = .80.$$

The pdf of X is shown in Figure B.1. [Figure B.1 (not reproduced): the pdf of the number of free throws made out of two attempts, a bar chart with heights .20, .44, and .36 at $x = 0, 1, 2$.]
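Event probabilities for a discrete random variable are just sums over its pdf, as the following sketch of the free throw example shows (the dictionary representation is our own choice):

```python
# pdf of X = number of free throws made out of two attempts,
# stored as {value: probability}.
pdf = {0: 0.20, 1: 0.44, 2: 0.36}

# Probabilities must sum to one, equation (B.4):
assert abs(sum(pdf.values()) - 1.0) < 1e-12

# P(X >= 1): sum the pdf over the values in the event.
p_at_least_one = sum(p for x, p in pdf.items() if x >= 1)
print(p_at_least_one)  # 0.8
```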
B.1b Continuous Random Variables

A variable X is a continuous random variable if it takes on any real value with zero probability. This definition is somewhat counterintuitive, because in any application we eventually observe some outcome for a random variable. The idea is that a continuous random variable X can take on so many possible values that we cannot count them or match them up with the positive integers, so logical consistency dictates that X can take on each value with probability zero.

While measurements are always discrete in practice, random variables that take on numerous values are best treated as continuous. For example, the most refined measure of the price of a good is in terms of cents. We can imagine listing all possible values of price in order (even though the list may continue indefinitely), which technically makes price a discrete random variable. However, there are so many possible values of price that using the mechanics of discrete random variables is not feasible.

We can define a probability density function for continuous random variables, and, as with discrete random variables, the pdf provides information on the likely outcomes of the random variable. However, because it makes no sense to discuss the probability that a continuous random variable takes on a particular value, we use the pdf of a continuous random variable only to compute events involving a range of values. For example, if $a$ and $b$ are constants, where $a < b$, the probability that X lies between the numbers $a$ and $b$, $P(a \leq X \leq b)$, is the area under the pdf between points $a$ and $b$, as shown in Figure B.2. [Figure B.2 (not reproduced): a bell-shaped pdf with the area between $a$ and $b$ shaded, representing $P(a \leq X \leq b)$.] If you are familiar with calculus, you recognize this as the integral of the function $f$ between the points $a$ and $b$. The entire area under the pdf must always equal one.

When computing probabilities for continuous random variables, it is easiest to work with the cumulative distribution function (cdf). If X is any random variable, then its cdf is defined for any real number $x$ by

$$F(x) \equiv P(X \leq x). \qquad (B.6)$$

For discrete random variables, (B.6) is obtained by summing the pdf over all values $x_j$ such that $x_j \leq x$. For a continuous random variable, $F(x)$ is the area under the pdf, $f$, to the left of the point $x$. Because $F(x)$ is simply a probability, it is always between 0 and 1. Further, if $x_1 < x_2$, then $P(X \leq x_1) \leq P(X \leq x_2)$, that is, $F(x_1) \leq F(x_2)$. This means that a cdf is an increasing (or at least a nondecreasing) function of $x$.

Two important properties of cdfs that are useful for computing probabilities are the following:

For any number $c$, $P(X > c) = 1 - F(c)$. (B.7)

For any numbers $a < b$, $P(a < X \leq b) = F(b) - F(a)$. (B.8)

In our study of econometrics, we will use cdfs to compute probabilities only for continuous random variables, in which case it does not matter whether inequalities in probability statements are strict or not. That is, for a continuous random variable X,

$$P(X \geq c) = P(X > c), \qquad (B.9)$$

and

$$P(a < X < b) = P(a \leq X \leq b) = P(a \leq X < b) = P(a < X \leq b). \qquad (B.10)$$

Combined with (B.7) and (B.8), equations (B.9) and (B.10) greatly expand the probability calculations that can be done using continuous cdfs.

Cumulative distribution functions have been tabulated for all of the important continuous distributions in probability and statistics. The most well known of these is the normal distribution, which we cover, along with some related distributions, in Section B.5.
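Properties (B.7) and (B.8) are how interval probabilities are computed in practice. As an illustration, the sketch below builds the standard normal cdf (the distribution just previewed, covered in Section B.5) from the error function in Python's math module; the choice of the normal here is ours, purely for illustration:

```python
import math

def norm_cdf(x):
    """Standard normal cdf, F(x) = P(X <= x), via the error function."""
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

# (B.7): P(X > c) = 1 - F(c)
c = 1.0
print(1 - norm_cdf(c))            # about 0.1587

# (B.8): P(a < X <= b) = F(b) - F(a)
a, b = -1.0, 1.0
print(norm_cdf(b) - norm_cdf(a))  # about 0.6827
```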
B.2 Joint Distributions, Conditional Distributions, and Independence

In economics, we are usually interested in the occurrence of events involving more than one random variable. For example, in the airline reservation example referred to earlier, the airline might be interested in the probability that a person who makes a reservation shows up and is a business traveler; this is an example of a joint probability. Or, the airline might be interested in the following conditional probability: conditional on the person being a business traveler, what is the probability of his or her showing up? In the next two subsections, we formalize the notions of joint and conditional distributions and the important notion of independence of random variables.

B.2a Joint Distributions and Independence

Let X and Y be discrete random variables. Then, (X, Y) have a joint distribution, which is fully described by the joint probability density function of (X, Y):

$$f_{X,Y}(x, y) = P(X = x, Y = y), \qquad (B.11)$$

where the right-hand side is the probability that $X = x$ and $Y = y$. When X and Y are continuous, a joint pdf can also be defined, but we will not cover such details because joint pdfs for continuous random variables are not used explicitly in this text.

In one case, it is easy to obtain the joint pdf if we are given the pdfs of X and Y. In particular, random variables X and Y are said to be independent if, and only if,

$$f_{X,Y}(x, y) = f_X(x) f_Y(y) \qquad (B.12)$$

for all $x$ and $y$, where $f_X$ is the pdf of X and $f_Y$ is the pdf of Y. In the context of more than one random variable, the pdfs $f_X$ and $f_Y$ are often called marginal probability density functions to distinguish them from the joint pdf, $f_{X,Y}$. This definition of independence is valid for discrete and continuous random variables.

To understand the meaning of (B.12), it is easiest to deal with the discrete case. If X and Y are discrete, then (B.12) is the same as

$$P(X = x, Y = y) = P(X = x) P(Y = y); \qquad (B.13)$$

in other words, the probability that $X = x$ and $Y = y$ is the product of the two probabilities $P(X = x)$ and $P(Y = y)$. One implication of (B.13) is that joint probabilities are fairly easy to compute, since they only require knowledge of $P(X = x)$ and $P(Y = y)$.

If random variables are not independent, then they are said to be dependent.

Example B.1: Free Throw Shooting

Consider a basketball player shooting two free throws. Let X be the Bernoulli random variable equal to one if she or he makes the first free throw, and zero otherwise. Let Y be a Bernoulli random variable equal to one if he or she makes the second free throw. Suppose that she or he is an 80% free throw shooter, so that $P(X = 1) = P(Y = 1) = .8$. What is the probability of the player making both free throws?

If X and Y are independent, we can easily answer this question: $P(X = 1, Y = 1) = P(X = 1) P(Y = 1) = (.8)(.8) = .64$. Thus, there is a 64% chance of making both free throws. If the chance of making the second free throw depends on whether the first was made, that is, X and Y are not independent, then this simple calculation is not valid.

Independence of random variables is a very important concept. In the next subsection, we will show that if X and Y are independent, then knowing the outcome of X does not change the probabilities of the possible outcomes of Y, and vice versa. One useful fact about independence is that if X and Y are independent and we define new random variables $g(X)$ and $h(Y)$, for any functions $g$ and $h$, then these new random variables are also independent.
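Under independence, the joint pdf in (B.12) is just the product of the marginals; the sketch below constructs the joint distribution for the free throw example of B.1 under the independence assumption:

```python
from itertools import product

# Marginal pdfs for two Bernoulli(.8) free throws, assumed independent.
f_X = {0: 0.2, 1: 0.8}
f_Y = {0: 0.2, 1: 0.8}

# Joint pdf under independence, equation (B.12):
f_XY = {(x, y): f_X[x] * f_Y[y] for x, y in product(f_X, f_Y)}

print(round(f_XY[(1, 1)], 2))                  # 0.64: making both
assert abs(sum(f_XY.values()) - 1.0) < 1e-12   # probabilities sum to one
```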
There is no need to stop at two random variables. If $X_1, X_2, \dots, X_n$ are discrete random variables, then their joint pdf is $f(x_1, x_2, \dots, x_n) = P(X_1 = x_1, X_2 = x_2, \dots, X_n = x_n)$. The random variables $X_1, X_2, \dots, X_n$ are independent random variables if, and only if, their joint pdf is the product of the individual pdfs for any $(x_1, x_2, \dots, x_n)$. This definition of independence also holds for continuous random variables.

The notion of independence plays an important role in obtaining some of the classic distributions in probability and statistics. Earlier we defined a Bernoulli random variable as a zero-one random variable indicating whether or not some event occurs. Often, we are interested in the number of successes in a sequence of independent Bernoulli trials. A standard example of independent Bernoulli trials is flipping a coin again and again. Because the outcome on any particular flip has nothing to do with the outcomes on other flips, independence is an appropriate assumption.

Independence is often a reasonable approximation in more complicated situations. In the airline reservation example, suppose that the airline accepts n reservations for a particular flight. For each i = 1, 2, …, n, let $Y_i$ denote the Bernoulli random variable indicating whether customer i shows up: $Y_i = 1$ if customer i appears, and $Y_i = 0$ otherwise. Letting θ again denote the probability of success (using a reservation), each $Y_i$ has a Bernoulli(θ) distribution. As an approximation, we might assume that the $Y_i$ are independent of one another, although this is not exactly true in reality: some people travel in groups, which means that whether or not a person shows up is not truly independent of whether all others show up. Modeling this kind of dependence is complex, however, so we might be willing to use independence as an approximation.

The variable of primary interest is the total number of customers showing up out of the n reservations; call this variable X. Since each $Y_i$ is unity when a person shows up, we can write $X = Y_1 + Y_2 + \dots + Y_n$. Now, assuming that each $Y_i$ has probability of success θ and that the $Y_i$ are independent, X can be shown to have a binomial distribution. That is, the probability density function of X is

$$f(x) = \binom{n}{x}\, \theta^x (1 - \theta)^{n - x}, \quad x = 0, 1, 2, \dots, n, \tag{B.14}$$

where $\binom{n}{x} = \frac{n!}{x!\,(n - x)!}$ and, for any integer n, n! (read "n factorial") is defined as n! = n·(n − 1)·(n − 2)·…·1. By convention, 0! = 1. When a random variable X has the pdf given in (B.14), we write X ~ Binomial(n, θ).

Equation (B.14) can be used to compute P(X = x) for any value of x from 0 to n. If the flight has 100 available seats, the airline is interested in P(X > 100). Suppose, initially, that n = 120, so that the airline accepts 120 reservations, and the probability that each person shows up is θ = .85. Then, P(X > 100) = P(X = 101) + P(X = 102) + … + P(X = 120), and each of the probabilities in the sum can be found from equation (B.14) with n = 120, θ = .85, and the appropriate value of x (101 to 120). This is a difficult hand calculation, but many statistical packages have commands for computing this kind of probability. In this case, the probability that more than 100 people will show up is about .659, which is probably more risk of overbooking than the airline wants to tolerate. If, instead, the number of reservations is 110, the probability of more than 100 passengers showing up is only about .024.
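The overbooking probabilities just quoted are easy to reproduce with a statistics library; the sketch below uses scipy, which is our choice and not something the text prescribes.

```python
# P(X > 100) for X ~ Binomial(n, .85): overbooking risk for n reservations.
from scipy.stats import binom

theta = 0.85
for n in (120, 110):
    prob = 1 - binom.cdf(100, n, theta)  # P(X > 100) = 1 - P(X <= 100)
    print(n, round(prob, 3))             # about .659 for n=120, .024 for n=110
```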
B.2b Conditional Distributions

In econometrics, we are usually interested in how one random variable, call it Y, is related to one or more other variables. For now, suppose that there is only one variable whose effects we are interested in, call it X. The most we can know about how X affects Y is contained in the conditional distribution of Y given X. This information is summarized by the conditional probability density function, defined by

$$f_{Y|X}(y|x) = f_{X,Y}(x, y)/f_X(x) \tag{B.15}$$

for all values of x such that $f_X(x) > 0$. The interpretation of (B.15) is most easily seen when X and Y are discrete. Then,

$$f_{Y|X}(y|x) = P(Y = y \mid X = x), \tag{B.16}$$

where the right-hand side is read as "the probability that Y = y given that X = x." When Y is continuous, $f_{Y|X}(y|x)$ is not interpretable directly as a probability, for the reasons discussed earlier, but conditional probabilities are found by computing areas under the conditional pdf.

An important feature of conditional distributions is that, if X and Y are independent random variables, knowledge of the value taken on by X tells us nothing about the probability that Y takes on various values, and vice versa. That is, $f_{Y|X}(y|x) = f_Y(y)$, and $f_{X|Y}(x|y) = f_X(x)$.

Example B.2 (Free Throw Shooting). Consider again the basketball-shooting example, where two free throws are to be attempted. Assume that the conditional density is

$f_{Y|X}(1|1) = .85$, $f_{Y|X}(0|1) = .15$,
$f_{Y|X}(1|0) = .70$, $f_{Y|X}(0|0) = .30$.

This means that the probability of the player making the second free throw depends on whether the first free throw was made: if the first free throw is made, the chance of making the second is .85; if the first free throw is missed, the chance of making the second is .70. This implies that X and Y are not independent; they are dependent.

We can still compute P(X = 1, Y = 1) provided we know P(X = 1). Assume that the probability of making the first free throw is .8, that is, P(X = 1) = .8. Then, from (B.15), we have

P(X = 1, Y = 1) = P(Y = 1 | X = 1) · P(X = 1) = (.85)(.8) = .68.
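For the dependent case, equation (B.15) rearranges to $f_{X,Y}(x, y) = f_{Y|X}(y|x) f_X(x)$; a small sketch with the Example B.2 numbers:

```python
# Example B.2: build the joint pmf from f_{Y|X} and f_X, then check P(X=1, Y=1).
f_X = {0: 0.2, 1: 0.8}                      # marginal of the first free throw
f_Y_given_X = {(1, 1): 0.85, (0, 1): 0.15,  # keys are (y, x)
               (1, 0): 0.70, (0, 0): 0.30}

joint = {(x, y): f_Y_given_X[(y, x)] * f_X[x] for x in (0, 1) for y in (0, 1)}
print(joint[(1, 1)])  # 0.68, as computed from (B.15)
```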
B.3 Features of Probability Distributions

For many purposes, we will be interested in only a few aspects of the distributions of random variables. The features of interest can be put into three categories: measures of central tendency, measures of variability or spread, and measures of association between two random variables. We cover the last of these in Section B.4.

B.3a A Measure of Central Tendency: The Expected Value

The expected value is one of the most important probabilistic concepts that we will encounter in our study of econometrics. If X is a random variable, the expected value (or expectation) of X, denoted E(X) and sometimes $\mu_X$ or simply μ, is a weighted average of all possible values of X. The weights are determined by the probability density function. Sometimes, the expected value is called the population mean, especially when we want to emphasize that X represents some variable in a population.

The precise definition of expected value is simplest in the case that X is a discrete random variable taking on a finite number of values, say $\{x_1, \dots, x_k\}$. Let f(x) denote the probability density function of X. The expected value of X is the weighted average

$$E(X) = x_1 f(x_1) + x_2 f(x_2) + \dots + x_k f(x_k) \equiv \sum_{j=1}^{k} x_j f(x_j). \tag{B.17}$$

This is easily computed given the values of the pdf at each possible outcome of X.

Example B.3 (Computing an Expected Value). Suppose that X takes on the values −1, 0, and 2 with probabilities 1/8, 1/2, and 3/8, respectively. Then

E(X) = (−1)·(1/8) + 0·(1/2) + 2·(3/8) = 5/8.

This example illustrates something curious about expected values: the expected value of X can be a number that is not even a possible outcome of X. We know that X takes on the values −1, 0, or 2, yet its expected value is 5/8. This makes the expected value deficient for summarizing the central tendency of certain discrete random variables, but calculations such as those just mentioned can be useful, as we will see later.

If X is a continuous random variable, then E(X) is defined as an integral:

$$E(X) = \int_{-\infty}^{\infty} x f(x)\, dx, \tag{B.18}$$

which we assume is well defined. This can still be interpreted as a weighted average. For the most common continuous distributions, E(X) is a number that is a possible outcome of X. In this text, we will not need to compute expected values using integration, although we will draw on some well-known results from probability for expected values of special random variables.

Given a random variable X and a function g, we can create a new random variable g(X). For example, if X is a random variable, then so are X² and log(X) (if X > 0). The expected value of g(X) is, again, simply a weighted average:

$$E[g(X)] = \sum_{j=1}^{k} g(x_j) f_X(x_j) \tag{B.19}$$

or, for a continuous random variable,

$$E[g(X)] = \int_{-\infty}^{\infty} g(x) f_X(x)\, dx. \tag{B.20}$$

Example B.4 (Expected Value of X²). For the random variable in Example B.3, let g(X) = X². Then

E(X²) = (−1)²·(1/8) + (0)²·(1/2) + (2)²·(3/8) = 13/8.

In Example B.3, we computed E(X) = 5/8, so that [E(X)]² = 25/64. This shows that E(X²) is not the same as [E(X)]². In fact, for a nonlinear function g(X), E[g(X)] ≠ g[E(X)], except in very special cases.

If X and Y are random variables, then g(X, Y) is a random variable for any function g, and so we can define its expectation. When X and Y are both discrete, taking on values $\{x_1, x_2, \dots, x_k\}$ and $\{y_1, y_2, \dots, y_m\}$, respectively, the expected value is

$$E[g(X, Y)] = \sum_{h=1}^{k} \sum_{j=1}^{m} g(x_h, y_j) f_{X,Y}(x_h, y_j),$$

where $f_{X,Y}$ is the joint pdf of (X, Y). The definition is more complicated for continuous random variables, since it involves integration; we do not need it here. The extension to more than two random variables is straightforward.
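Equations (B.17) and (B.19) are just weighted sums, so Examples B.3 and B.4 can be verified in a few lines; exact fractions are used below to match 5/8 and 13/8.

```python
# Examples B.3 and B.4: expected values as pdf-weighted averages.
from fractions import Fraction as F

pdf = {-1: F(1, 8), 0: F(1, 2), 2: F(3, 8)}

EX  = sum(x * p for x, p in pdf.items())     # E(X)   = 5/8
EX2 = sum(x**2 * p for x, p in pdf.items())  # E(X^2) = 13/8
print(EX, EX2, EX**2)                        # note E(X^2) != [E(X)]^2 = 25/64
```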
B.3b Properties of Expected Values

In econometrics, we are not so concerned with computing expected values from various distributions; the major calculations have been done many times, and we will largely take these on faith. We will need to manipulate some expected values using a few simple rules. These are so important that we give them labels.

Property E.1: For any constant c, E(c) = c.

Property E.2: For any constants a and b, E(aX + b) = aE(X) + b.

One useful implication of E.2 is that, if μ = E(X) and we define a new random variable as Y = X − μ, then E(Y) = 0; in E.2, take a = 1 and b = −μ.

As an example of Property E.2, let X be the temperature measured in Celsius at noon on a particular day at a given location; suppose the expected temperature is E(X) = 25. If Y is the temperature measured in Fahrenheit, then Y = 32 + (9/5)X. From Property E.2, the expected temperature in Fahrenheit is E(Y) = 32 + (9/5)·E(X) = 32 + (9/5)·25 = 77.

Generally, it is easy to compute the expected value of a linear function of many random variables.

Property E.3: If $\{a_1, a_2, \dots, a_n\}$ are constants and $\{X_1, X_2, \dots, X_n\}$ are random variables, then E(a₁X₁ + a₂X₂ + … + aₙXₙ) = a₁E(X₁) + a₂E(X₂) + … + aₙE(Xₙ). Or, using summation notation,

$$E\left(\sum_{i=1}^{n} a_i X_i\right) = \sum_{i=1}^{n} a_i E(X_i). \tag{B.21}$$

As a special case of this, we have (with each $a_i = 1$)

$$E\left(\sum_{i=1}^{n} X_i\right) = \sum_{i=1}^{n} E(X_i), \tag{B.22}$$

so that the expected value of the sum is the sum of expected values. This property is used often for derivations in mathematical statistics.

Example B.5 (Finding Expected Revenue). Let $X_1$, $X_2$, and $X_3$ be the numbers of small, medium, and large pizzas, respectively, sold during the day at a pizza parlor. These are random variables with expected values $E(X_1) = 25$, $E(X_2) = 57$, and $E(X_3) = 40$. The prices of small, medium, and large pizzas are $5.50, $7.60, and $9.15. Therefore, the expected revenue from pizza sales on a given day is

E(5.50·X₁ + 7.60·X₂ + 9.15·X₃) = 5.50·E(X₁) + 7.60·E(X₂) + 9.15·E(X₃) = 5.50(25) + 7.60(57) + 9.15(40) = 936.70,

that is, $936.70. The actual revenue on any particular day will generally differ from this value, but this is the expected revenue.

We can also use Property E.3 to show that if X ~ Binomial(n, θ), then E(X) = nθ. That is, the expected number of successes in n Bernoulli trials is simply the number of trials times the probability of success on any particular trial. This is easily seen by writing X as X = Y₁ + Y₂ + … + Yₙ, where each Yᵢ ~ Bernoulli(θ). Then

$$E(X) = \sum_{i=1}^{n} E(Y_i) = \sum_{i=1}^{n} \theta = n\theta.$$

We can apply this to the airline reservation example, where the airline makes n = 120 reservations and the probability of showing up is θ = .85. The expected number of people showing up is 120(.85) = 102. Therefore, if there are 100 seats available, the expected number of people showing up is too large; this has some bearing on whether it is a good idea for the airline to make 120 reservations.

Actually, what the airline should do is define a profit function that accounts for the net revenue earned per seat sold and the cost per passenger bumped from the flight. This profit function is random because the actual number of people showing up is random. Let r be the net revenue from each passenger. (You can think of this as the price of the ticket, for simplicity.) Let c be the compensation owed to any passenger bumped from the flight. Neither r nor c is random; these are assumed to be known to the airline. Let Y denote profits for the flight. Then, with 100 seats available,

Y = rX if X ≤ 100,
Y = 100r − c(X − 100) if X > 100.

The first equation gives profit if no more than 100 people show up for the flight; the second equation is profit if more than 100 people show up. In the latter case, the net revenue from ticket sales is 100r, since all 100 seats are sold, and then c(X − 100) is the cost of making more than 100 reservations. Using the fact that X has a Binomial(n, .85) distribution, where n is the number of reservations made, expected profits, E(Y), can be found as a function of n (and r and c). Computing E(Y) directly would be quite difficult, but it can be found quickly using a computer. Once values for r and c are given, the value of n that maximizes expected profits can be found by searching over different values of n.
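A sketch of that search, using scipy's binomial pmf; the revenue r = 1 and compensation c = 2 are made-up values (the text leaves r and c unspecified), and the search range 100 to 140 is likewise our choice.

```python
# Expected profit E(Y) for the overbooking problem, maximized over n.
from scipy.stats import binom

theta, r, c = 0.85, 1.0, 2.0  # r and c are hypothetical values

def expected_profit(n):
    # E(Y) = sum over x of profit(x) * P(X = x), with X ~ Binomial(n, theta)
    return sum((r * x if x <= 100 else 100 * r - c * (x - 100)) * binom.pmf(x, n, theta)
               for x in range(n + 1))

best_n = max(range(100, 141), key=expected_profit)
print(best_n, round(expected_profit(best_n), 2))
```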
B.3c Another Measure of Central Tendency: The Median

The expected value is only one possibility for defining the central tendency of a random variable. Another measure of central tendency is the median. A general definition of median is too complicated for our purposes. If X is continuous, then the median of X, say m, is the value such that one-half of the area under the pdf is to the left of m and one-half of the area is to the right of m.

When X is discrete and takes on a finite, odd number of values, the median is obtained by ordering the possible values of X and then selecting the value in the middle. For example, if X can take on the values {−4, 0, 2, 8, 10, 13, 17}, then the median value of X is 8. If X takes on an even number of values, there are really two median values; sometimes, these are averaged to get a unique median value. Thus, if X takes on the values {−5, 3, 9, 17}, then the median values are 3 and 9; if we average these, we get a median equal to 6.

In general, the median, sometimes denoted Med(X), and the expected value, E(X), are different. Neither is "better" than the other as a measure of central tendency; they are both valid ways to measure the center of the distribution of X. In one special case, the median and expected value (or mean) are the same. If X has a symmetric distribution about the value μ, then μ is both the expected value and the median. Mathematically, the condition is f(μ + x) = f(μ − x) for all x. This case is illustrated in Figure B.3.

Figure B.3: A symmetric probability distribution.
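Both discrete examples can be checked with the standard library, which follows the same middle-value and averaging conventions:

```python
# Medians of the two discrete examples above.
from statistics import median

print(median([-4, 0, 2, 8, 10, 13, 17]))  # 8 (middle value, odd count)
print(median([-5, 3, 9, 17]))             # 6.0 (average of 3 and 9, even count)
```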
B.3d Measures of Variability: Variance and Standard Deviation

Although the central tendency of a random variable is valuable, it does not tell us everything we want to know about the distribution of a random variable. Figure B.4 shows the pdfs of two random variables with the same mean. Clearly, the distribution of X is more tightly centered about its mean than is the distribution of Y. We would like to have a simple way of summarizing differences in the spreads of distributions.

Figure B.4: Random variables with the same mean but different distributions.

B.3e Variance

For a random variable X, let μ = E(X). There are various ways to measure how far X is from its expected value, but the simplest one to work with algebraically is the squared difference, (X − μ)². The squaring eliminates the sign from the distance measure; the resulting positive value corresponds to our intuitive notion of distance and treats values above and below μ symmetrically. This distance is itself a random variable, since it can change with every outcome of X. Just as we needed a number to summarize the central tendency of X, we need a number that tells us how far X is from μ, on average. One such number is the variance, which is the expected squared distance from X to its mean:

$$\mathrm{Var}(X) \equiv E[(X - \mu)^2]. \tag{B.23}$$

Variance is sometimes denoted $\sigma_X^2$, or simply σ², when the context is clear. From (B.23), it follows that the variance is always nonnegative.

As a computational device, it is useful to observe that

$$\sigma^2 = E(X^2 - 2X\mu + \mu^2) = E(X^2) - 2\mu^2 + \mu^2 = E(X^2) - \mu^2. \tag{B.24}$$

In using either (B.23) or (B.24), we need not distinguish between discrete and continuous random variables: the definition of variance is the same in either case. Most often, we first compute E(X), then E(X²), and then we use the formula in (B.24). For example, if X ~ Bernoulli(θ), then E(X) = θ and, since X² = X, E(X²) = θ. It follows from equation (B.24) that Var(X) = E(X²) − μ² = θ − θ² = θ(1 − θ).

Two important properties of the variance follow.

Property VAR.1: Var(X) = 0 if, and only if, there is a constant c such that P(X = c) = 1, in which case E(X) = c. This first property says that the variance of any constant is zero, and if a random variable has zero variance, then it is essentially constant.

Property VAR.2: For any constants a and b, Var(aX + b) = a²Var(X). This means that adding a constant to a random variable does not change the variance, but multiplying a random variable by a constant increases the variance by a factor equal to the square of that constant. For example, if X denotes temperature in Celsius and Y = 32 + (9/5)X is temperature in Fahrenheit, then Var(Y) = (9/5)²·Var(X) = (81/25)·Var(X).

B.3f Standard Deviation

The standard deviation of a random variable, denoted sd(X), is simply the positive square root of the variance: sd(X) ≡ $+\sqrt{\mathrm{Var}(X)}$. The standard deviation is sometimes denoted $\sigma_X$, or simply σ, when the random variable is understood. Two standard deviation properties immediately follow from Properties VAR.1 and VAR.2.

Property SD.1: For any constant c, sd(c) = 0.

Property SD.2: For any constants a and b, sd(aX + b) = |a|·sd(X). In particular, if a > 0, then sd(aX) = a·sd(X).

This last property makes the standard deviation more natural to work with than the variance. For example, suppose that X is a random variable measured in thousands of dollars, say, income. If we define Y = 1,000·X, then Y is income measured in dollars. Suppose that E(X) = 20 and sd(X) = 6. Then, E(Y) = 1,000·E(X) = 20,000 and sd(Y) = 1,000·sd(X) = 6,000, so that the expected value and standard deviation both increase by the same factor, 1,000. If we worked with variance, we would have Var(Y) = (1,000)²·Var(X), so that the variance of Y is one million times larger than the variance of X.
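Property VAR.2 can be illustrated by simulation. The Normal(25, 9) distribution for Celsius temperatures below is our own assumption; only E(X) = 25 comes from the text.

```python
# Property VAR.2 by simulation: Var(32 + (9/5)X) = (9/5)^2 Var(X).
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(25, 3, size=1_000_000)  # Celsius temps (assumed distribution)
y = 32 + (9 / 5) * x                   # Fahrenheit

print(y.var() / x.var())               # close to (9/5)^2 = 3.24
```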
B.3g Standardizing a Random Variable

As an application of the properties of variance and standard deviation (and a topic of practical interest in its own right), suppose that, given a random variable X, we define a new random variable by subtracting off its mean μ and dividing by its standard deviation σ:

$$Z \equiv \frac{X - \mu}{\sigma}, \tag{B.25}$$

which we can write as Z = aX + b, where a ≡ (1/σ) and b ≡ −(μ/σ). Then, from Property E.2,

E(Z) = aE(X) + b = (μ/σ) − (μ/σ) = 0.

From Property VAR.2,

Var(Z) = a²Var(X) = (σ²/σ²) = 1.

Thus, the random variable Z has a mean of zero and a variance (and therefore a standard deviation) equal to one. This procedure is sometimes known as standardizing the random variable X, and Z is called a standardized random variable. (In introductory statistics courses, it is sometimes called the z-transform of X.) It is important to remember that the standard deviation, not the variance, appears in the denominator of (B.25). As we will see, this transformation is frequently used in statistical inference.

As a specific example, suppose that E(X) = 2 and Var(X) = 9. Then, Z = (X − 2)/3 has expected value zero and variance one.

B.3h Skewness and Kurtosis

We can use the standardized version of a random variable to define other features of the distribution of a random variable. These features are described by using what are called higher order moments. For example, the third moment of the random variable Z in (B.25) is used to determine whether a distribution is symmetric about its mean. We can write

E(Z³) = E[(X − μ)³]/σ³.

If X has a symmetric distribution about μ, then Z has a symmetric distribution about zero. (The division by σ³ does not change whether the distribution is symmetric.) That means the density of Z at any two points z and −z is the same, so that, in computing E(Z³), positive values z³ when z > 0 are exactly offset by the negative values (−z)³ = −z³. It follows that, if X is symmetric about its mean, then E(Z³) = 0. Generally, E[(X − μ)³]/σ³ is viewed as a measure of skewness in the distribution of X. In a statistical setting, we might use data to estimate E(Z³) to determine whether an underlying population distribution appears to be symmetric. (Computer Exercise C5.4 in Chapter 5 provides an illustration.)

It also can be informative to compute the fourth moment of Z,

E(Z⁴) = E[(X − μ)⁴]/σ⁴.

Because Z⁴ ≥ 0, E(Z⁴) ≥ 0 and, in any interesting case, strictly greater than zero. Without having a reference value, it is difficult to interpret values of E(Z⁴), but larger values mean that the tails in the distribution of X are thicker. The fourth moment E(Z⁴) is called a measure of kurtosis in the distribution of X. In Section B.5, we will obtain E(Z⁴) for the normal distribution.

B.4 Features of Joint and Conditional Distributions

B.4a Measures of Association: Covariance and Correlation

While the joint pdf of two random variables completely describes the relationship between them, it is useful to have summary measures of how, on average, two random variables vary with one another. As with the expected value and variance, this is similar to using a single number to summarize something about an entire distribution, which in this case is a joint distribution of two random variables.
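Estimating E(Z³) and E(Z⁴) from data, as suggested above, is straightforward; the exponential draws below are our choice, picked to show a clearly skewed, thick-tailed case.

```python
# Estimating skewness E(Z^3) and kurtosis E(Z^4) from standardized data.
import numpy as np

rng = np.random.default_rng(1)
x = rng.exponential(scale=1.0, size=1_000_000)  # a right-skewed distribution
z = (x - x.mean()) / x.std()

print(np.mean(z**3))  # about 2 (a symmetric distribution would give 0)
print(np.mean(z**4))  # about 9 (a normal distribution would give 3)
```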
B.4b Covariance

Let $\mu_X$ = E(X) and $\mu_Y$ = E(Y) and consider the random variable (X − $\mu_X$)(Y − $\mu_Y$). Now, if X is above its mean and Y is above its mean, then (X − $\mu_X$)(Y − $\mu_Y$) > 0. This is also true if X < $\mu_X$ and Y < $\mu_Y$. On the other hand, if X > $\mu_X$ and Y < $\mu_Y$, or vice versa, then (X − $\mu_X$)(Y − $\mu_Y$) < 0. How, then, can this product tell us anything about the relationship between X and Y?

The covariance between two random variables X and Y (sometimes called the population covariance to emphasize that it concerns the relationship between two variables describing a population) is defined as the expected value of the product (X − $\mu_X$)(Y − $\mu_Y$):

$$\mathrm{Cov}(X, Y) \equiv E[(X - \mu_X)(Y - \mu_Y)], \tag{B.26}$$

which is sometimes denoted $\sigma_{XY}$. If $\sigma_{XY} > 0$, then, on average, when X is above its mean, Y is also above its mean. If $\sigma_{XY} < 0$, then, on average, when X is above its mean, Y is below its mean.

Several expressions useful for computing Cov(X, Y) are as follows:

$$\mathrm{Cov}(X, Y) = E[(X - \mu_X)(Y - \mu_Y)] = E[(X - \mu_X)Y] = E[X(Y - \mu_Y)] = E(XY) - \mu_X \mu_Y. \tag{B.27}$$

It follows from (B.27) that, if E(X) = 0 or E(Y) = 0, then Cov(X, Y) = E(XY).

Covariance measures the amount of linear dependence between two random variables. A positive covariance indicates that two random variables move in the same direction, while a negative covariance indicates they move in opposite directions. Interpreting the magnitude of a covariance can be a little tricky, as we will see shortly.

Because covariance is a measure of how two random variables are related, it is natural to ask how covariance is related to the notion of independence. This is given by the following property.

Property COV.1: If X and Y are independent, then Cov(X, Y) = 0. This property follows from equation (B.27) and the fact that E(XY) = E(X)E(Y) when X and Y are independent. It is important to remember that the converse of COV.1 is not true: zero covariance between X and Y does not imply that X and Y are independent. In fact, there are random variables X such that, if Y = X², Cov(X, Y) = 0. (Any random variable with E(X) = 0 and E(X³) = 0 has this property.) If Y = X², then X and Y are clearly not independent: once we know X, we know Y. It seems rather strange that X and X² could have zero covariance, and this reveals a weakness of covariance as a general measure of association between random variables. The covariance is useful in contexts when relationships are at least approximately linear.

The second major property of covariance involves covariances between linear functions.

Property COV.2: For any constants $a_1$, $b_1$, $a_2$, and $b_2$,

$$\mathrm{Cov}(a_1 X + b_1, a_2 Y + b_2) = a_1 a_2 \mathrm{Cov}(X, Y). \tag{B.28}$$

An important implication of COV.2 is that the covariance between two random variables can be altered simply by multiplying one or both of the random variables by a constant. This is important in economics because monetary variables, inflation rates, and so on can be defined with different units of measurement without changing their meaning.

Finally, it is useful to know that the absolute value of the covariance between any two random variables is bounded by the product of their standard deviations; this is known as the Cauchy-Schwarz inequality.

Property COV.3: |Cov(X, Y)| ≤ sd(X)·sd(Y).
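The Y = X² example is easy to see numerically. A standard normal X satisfies E(X) = 0 and E(X³) = 0, so Cov(X, X²) = 0 even though Y is a deterministic function of X:

```python
# Zero covariance without independence: X symmetric about 0 and Y = X^2.
import numpy as np

rng = np.random.default_rng(2)
x = rng.standard_normal(1_000_000)  # E(X) = 0 and E(X^3) = 0
y = x**2                            # perfectly dependent on X

print(np.cov(x, y)[0, 1])           # approximately 0
```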
B.4c Correlation Coefficient

Suppose we want to know the relationship between amount of education and annual earnings in the working population. We could let X denote education and Y denote earnings and then compute their covariance. But the answer we get will depend on how we choose to measure education and earnings. Property COV.2 implies that the covariance between education and earnings depends on whether earnings are measured in dollars or thousands of dollars, or whether education is measured in months or years. It is pretty clear that how we measure these variables has no bearing on how strongly they are related. But the covariance between them does depend on the units of measurement.

The fact that the covariance depends on units of measurement is a deficiency that is overcome by the correlation coefficient between X and Y:

$$\mathrm{Corr}(X, Y) \equiv \frac{\mathrm{Cov}(X, Y)}{\mathrm{sd}(X)\cdot\mathrm{sd}(Y)} = \frac{\sigma_{XY}}{\sigma_X \sigma_Y}; \tag{B.29}$$

the correlation coefficient between X and Y is sometimes denoted $\rho_{XY}$ and is sometimes called the population correlation.

Because $\sigma_X$ and $\sigma_Y$ are positive, Cov(X, Y) and Corr(X, Y) always have the same sign, and Corr(X, Y) = 0 if, and only if, Cov(X, Y) = 0. Some of the properties of covariance carry over to correlation. If X and Y are independent, then Corr(X, Y) = 0, but zero correlation does not imply independence. Like the covariance, the correlation coefficient is also a measure of linear dependence. However, the magnitude of the correlation coefficient is easier to interpret than the size of the covariance, due to the following property.

Property CORR.1: −1 ≤ Corr(X, Y) ≤ 1.

If Corr(X, Y) = 0, or, equivalently, Cov(X, Y) = 0, then there is no linear relationship between X and Y, and X and Y are said to be uncorrelated random variables; otherwise, X and Y are correlated. Corr(X, Y) = 1 implies a perfect positive linear relationship, which means that we can write Y = a + bX for some constant a and some constant b > 0. Corr(X, Y) = −1 implies a perfect negative linear relationship, so that Y = a + bX for some b < 0. The extreme cases of positive or negative 1 rarely occur. Values of $\rho_{XY}$ closer to 1 or −1 indicate stronger linear relationships.

As mentioned earlier, the correlation between X and Y is invariant to the units of measurement of either X or Y. This is stated more generally as follows.

Property CORR.2: For constants $a_1$, $b_1$, $a_2$, and $b_2$, with $a_1 a_2 > 0$,

Corr(a₁X + b₁, a₂Y + b₂) = Corr(X, Y).

If $a_1 a_2 < 0$, then Corr(a₁X + b₁, a₂Y + b₂) = −Corr(X, Y).

As an example, suppose that the correlation between earnings and education in the working population is .15. This measure does not depend on whether earnings are measured in dollars, thousands of dollars, or any other unit; it also does not depend on whether education is measured in years, quarters, months, and so on.
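A simulation sketch of Property CORR.2; the education and earnings distributions below are invented purely to have something to rescale.

```python
# Property CORR.2: correlation is invariant to (positive) rescaling.
import numpy as np

rng = np.random.default_rng(3)
n = 100_000
educ = rng.uniform(8, 20, size=n)                 # years of education (simulated)
earn = 2 + 1.5 * educ + rng.normal(0, 8, size=n)  # earnings, made-up relation

print(np.corrcoef(educ, earn)[0, 1])
print(np.corrcoef(12 * educ, 1000 * earn)[0, 1])  # months, dollars: unchanged
```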
B.4d Variance of Sums of Random Variables

Now that we have defined covariance and correlation, we can complete our list of major properties of the variance.

Property VAR.3: For constants a and b,

Var(aX + bY) = a²Var(X) + b²Var(Y) + 2ab·Cov(X, Y).

It follows immediately that, if X and Y are uncorrelated, so that Cov(X, Y) = 0, then

$$\mathrm{Var}(X + Y) = \mathrm{Var}(X) + \mathrm{Var}(Y) \tag{B.30}$$

and

$$\mathrm{Var}(X - Y) = \mathrm{Var}(X) + \mathrm{Var}(Y). \tag{B.31}$$

In the latter case, note how the variance of the difference is the sum of the variances, not the difference in the variances.

As an example of (B.30), let X denote profits earned by a restaurant during a Friday night and let Y be profits earned on the following Saturday night. Then, Z = X + Y is profits for the two nights. Suppose X and Y each have an expected value of $300 and a standard deviation of $15 (so that the variance is 225). Expected profits for the two nights is E(Z) = E(X) + E(Y) = 2·(300) = 600 dollars. If X and Y are independent, and therefore uncorrelated, then the variance of total profits is the sum of the variances: Var(Z) = Var(X) + Var(Y) = 2·(225) = 450. It follows that the standard deviation of total profits is √450, or about $21.21.

Expressions (B.30) and (B.31) extend to more than two random variables. To state this extension, we need a definition. The random variables $\{X_1, \dots, X_n\}$ are pairwise uncorrelated random variables if each variable in the set is uncorrelated with every other variable in the set. That is, Cov($X_i$, $X_j$) = 0, for all i ≠ j.

Property VAR.4: If $\{X_1, \dots, X_n\}$ are pairwise uncorrelated random variables and $\{a_i: i = 1, \dots, n\}$ are constants, then

Var(a₁X₁ + … + aₙXₙ) = a₁²Var(X₁) + … + aₙ²Var(Xₙ).

In summation notation, we can write

$$\mathrm{Var}\left(\sum_{i=1}^{n} a_i X_i\right) = \sum_{i=1}^{n} a_i^2 \mathrm{Var}(X_i). \tag{B.32}$$

A special case of Property VAR.4 occurs when we take $a_i = 1$ for all i. Then, for pairwise uncorrelated random variables, the variance of the sum is the sum of the variances:

$$\mathrm{Var}\left(\sum_{i=1}^{n} X_i\right) = \sum_{i=1}^{n} \mathrm{Var}(X_i). \tag{B.33}$$

Because independent random variables are uncorrelated (see Property COV.1), the variance of a sum of independent random variables is the sum of the variances. If the $X_i$ are not pairwise uncorrelated, then the expression for Var($\sum_{i=1}^{n} a_i X_i$) is much more complicated; we must add to the right-hand side of (B.32) the terms $2 a_i a_j \mathrm{Cov}(X_i, X_j)$ for all i > j.

We can use (B.33) to derive the variance for a binomial random variable. Let X ~ Binomial(n, θ) and write X = Y₁ + … + Yₙ, where the $Y_i$ are independent Bernoulli(θ) random variables. Then, by (B.33), Var(X) = Var(Y₁) + … + Var(Yₙ) = nθ(1 − θ).

In the airline reservation example with n = 120 and θ = .85, the variance of the number of passengers arriving for their reservations is 120(.85)(.15) = 15.3, so the standard deviation is about 3.9.
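A quick check of Var(X) = nθ(1 − θ) for the reservation example:

```python
# Var(X) for X ~ Binomial(120, .85), checked by simulation.
import numpy as np

rng = np.random.default_rng(4)
x = rng.binomial(120, 0.85, size=1_000_000)  # simulated show-up counts

print(x.var(), 120 * 0.85 * 0.15)  # both about 15.3; sd about 3.9
```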
B.4e Conditional Expectation

Covariance and correlation measure the linear relationship between two random variables and treat them symmetrically. More often in the social sciences, we would like to explain one variable, called Y, in terms of another variable, say, X. Further, if Y is related to X in a nonlinear fashion, we would like to know this. Call Y the explained variable and X the explanatory variable. For example, Y might be hourly wage, and X might be years of formal education.

We have already introduced the notion of the conditional probability density function of Y given X. Thus, we might want to see how the distribution of wages changes with education level. However, we usually want to have a simple way of summarizing this distribution. A single number will no longer suffice, since the distribution of Y given X = x generally depends on the value of x. Nevertheless, we can summarize the relationship between Y and X by looking at the conditional expectation of Y given X, sometimes called the conditional mean. The idea is this. Suppose we know that X has taken on a particular value, say, x. Then, we can compute the expected value of Y, given that we know this outcome of X. We denote this expected value by E(Y | X = x), or sometimes E(Y | x) for shorthand. Generally, as x changes, so does E(Y | x).

When Y is a discrete random variable taking on values $\{y_1, \dots, y_m\}$, then

$$E(Y|x) = \sum_{j=1}^{m} y_j f_{Y|X}(y_j | x).$$

When Y is continuous, E(Y | x) is defined by integrating $y f_{Y|X}(y|x)$ over all possible values of y. As with unconditional expectations, the conditional expectation is a weighted average of possible values of Y, but now the weights reflect the fact that X has taken on a specific value. Thus, E(Y | x) is just some function of x, which tells us how the expected value of Y varies with x.

As an example, let (X, Y) represent the population of all working individuals, where X is years of education and Y is hourly wage. Then, E(Y | X = 12) is the average hourly wage for all people in the population with 12 years of education (roughly a high school education). E(Y | X = 16) is the average hourly wage for all people with 16 years of education. Tracing out the expected value for various levels of education provides important information on how wages and education are related. See Figure B.5 for an illustration.

Figure B.5: The expected value of hourly wage given various levels of education. (Horizontal axis: EDUC = 4, 8, 12, 16, 20; vertical axis: E(WAGE | EDUC).)

In principle, the expected value of hourly wage can be found at each level of education, and these expectations can be summarized in a table. Because education can vary widely (and can even be measured in fractions of a year), this is a cumbersome way to show the relationship between average wage and amount of education. In econometrics, we typically specify simple functions that capture this relationship. As an example, suppose that the expected value of WAGE given EDUC is the linear function

E(WAGE | EDUC) = 1.05 + .45·EDUC.

If this relationship holds in the population of working people, the average wage for people with eight years of education is 1.05 + .45(8) = 4.65, or $4.65. The average wage for people with 16 years of education is 8.25, or $8.25. The coefficient on EDUC implies that each year of education increases the expected hourly wage by .45, or 45 cents.
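The linear conditional mean above translates directly into code; the function below just evaluates E(WAGE | EDUC) = 1.05 + .45·EDUC at a few education levels.

```python
# The linear conditional mean E(WAGE | EDUC) from the text.
def expected_wage(educ):
    return 1.05 + 0.45 * educ

for educ in (8, 12, 16):
    print(educ, expected_wage(educ))  # 4.65, 6.45, 8.25 dollars per hour
```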
Conditional expectations can also be nonlinear functions. For example, suppose that E(Y | x) = 10/x, where X is a random variable that is always greater than zero. This function is graphed in Figure B.6. This could represent a demand function, where Y is quantity demanded and X is price. If Y and X are related in this way, an analysis of linear association, such as correlation analysis, would be incomplete.

Figure B.6: Graph of E(Y | x) = 10/x.

B.4f Properties of Conditional Expectation

Several basic properties of conditional expectations are useful for derivations in econometric analysis.

Property CE.1: E[c(X) | X] = c(X), for any function c(X). This first property means that functions of X behave as constants when we compute expectations conditional on X. For example, E(X² | X) = X². Intuitively, this simply means that, if we know X, then we also know X².

Property CE.2: For functions a(X) and b(X),

E[a(X)Y + b(X) | X] = a(X)·E(Y | X) + b(X).

For example, we can easily compute the conditional expectation of a function such as XY + 2X²: E(XY + 2X² | X) = X·E(Y | X) + 2X².

The next property ties together the notions of independence and conditional expectations.

Property CE.3: If X and Y are independent, then E(Y | X) = E(Y). This property means that, if X and Y are independent, then the expected value of Y given X does not depend on X, in which case E(Y | X) always equals the (unconditional) expected value of Y. In the wage and education example, if wages were independent of education, then the average wages of high school and college graduates would be the same. Since this is almost certainly false, we cannot assume that wage and education are independent.

A special case of Property CE.3 is the following: if U and X are independent and E(U) = 0, then E(U | X) = 0.

There are also properties of the conditional expectation that have to do with the fact that E(Y | X) is a function of X, say, E(Y | X) = m(X). Because X is a random variable, m(X) is also a random variable. Furthermore, m(X) has a probability distribution and, therefore, an expected value. Generally, the expected value of m(X) could be very difficult to compute directly. The law of iterated expectations says that the expected value of m(X) is simply equal to the expected value of Y. We write this as follows.

Property CE.4: E[E(Y | X)] = E(Y). This property is a little hard to grasp at first. It means that, if we first obtain E(Y | X) as a function of X, and then take the expected value of this (with respect to the distribution of X, of course), we end up with E(Y). This is hardly obvious, but it can be derived using the definition of expected values.

As an example of how to use Property CE.4, let Y = WAGE and X = EDUC, where WAGE is measured in dollars per hour and EDUC is measured in years. Suppose the expected value of WAGE given EDUC is E(WAGE | EDUC) = 4 + .60·EDUC. Further, E(EDUC) = 11.5. Then, the law of iterated expectations implies that E(WAGE) = E(4 + .60·EDUC) = 4 + .60·E(EDUC) = 4 + .60(11.5) = 10.90, or $10.90 an hour.
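A simulation sketch of Property CE.4 for the wage example. The text pins down only E(EDUC) = 11.5; the Normal(11.5, 4) distribution for EDUC and the error term below are our own assumptions.

```python
# Law of iterated expectations: E[E(WAGE | EDUC)] = E(WAGE).
import numpy as np

rng = np.random.default_rng(5)
educ = rng.normal(11.5, 2.0, size=1_000_000)           # assumed EDUC distribution
wage = 4 + 0.60 * educ + rng.normal(0, 3, educ.size)   # E(WAGE | EDUC) = 4 + .6*EDUC

m = 4 + 0.60 * educ            # m(X) = E(WAGE | EDUC)
print(m.mean(), wage.mean())   # both about 10.90
```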
The next property states a more general version of the law of iterated expectations.

Property CE.4′: E(Y | X) = E[E(Y | X, Z) | X]. In other words, we can find E(Y | X) in two steps. First, find E(Y | X, Z) for any other random variable Z. Then, find the expected value of E(Y | X, Z), conditional on X.

Property CE.5: If E(Y | X) = E(Y), then Cov(X, Y) = 0 (and so Corr(X, Y) = 0). In fact, every function of X is uncorrelated with Y. This property means that, if knowledge of X does not change the expected value of Y, then X and Y must be uncorrelated, which implies that, if X and Y are correlated, then E(Y | X) must depend on X. The converse of Property CE.5 is not true: if X and Y are uncorrelated, E(Y | X) could still depend on X. For example, suppose Y = X². Then, E(Y | X) = X², which is clearly a function of X. However, as we mentioned in our discussion of covariance and correlation, it is possible that X and X² are uncorrelated. The conditional expectation captures the nonlinear relationship between X and Y that correlation analysis would miss entirely.

Properties CE.4 and CE.5 have two important implications: if U and X are random variables such that E(U | X) = 0, then E(U) = 0, and U and X are uncorrelated.

Property CE.6: If E(Y²) < ∞ and E[g(X)²] < ∞ for some function g, then

E{[Y − m(X)]² | X} ≤ E{[Y − g(X)]² | X}

and

E{[Y − m(X)]²} ≤ E{[Y − g(X)]²}.

Property CE.6 is very useful in predicting or forecasting contexts. The first inequality says that, if we measure prediction inaccuracy as the expected squared prediction error, conditional on X, then the conditional mean is better than any other function of X for predicting Y. The conditional mean also minimizes the unconditional expected squared prediction error.

B.4g Conditional Variance

Given random variables X and Y, the variance of Y, conditional on X = x, is simply the variance associated with the conditional distribution of Y, given X = x: E{[Y − E(Y | x)]² | x}. The formula

Var(Y | X = x) = E(Y² | x) − [E(Y | x)]²

is often useful for calculations. Only occasionally will we have to compute a conditional variance. But we will have to make assumptions about and manipulate conditional variances for certain topics in regression analysis.

As an example, let Y = SAVING and X = INCOME (both of these measured annually) for the population of all families. Suppose that Var(SAVING | INCOME) = 400 + .25·INCOME. This says that, as income increases, the variance in saving levels also increases. It is important to see that the relationship between the variance of SAVING and INCOME is totally separate from that between the expected value of SAVING and INCOME.

We state one useful property about the conditional variance.

Property CV.1: If X and Y are independent, then Var(Y | X) = Var(Y). This property is pretty clear, since the distribution of Y given X does not depend on X, and Var(Y | X) is just one feature of this distribution.
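Before moving on, here is a simulation sketch of Property CE.6, under an invented data-generating process with Y = X² + U: predicting with the conditional mean m(X) = X² attains a mean squared error equal to Var(Y | X) = 1, while any other function of X does worse.

```python
# CE.6: the conditional mean minimizes expected squared prediction error.
import numpy as np

rng = np.random.default_rng(6)
x = rng.standard_normal(1_000_000)
y = x**2 + rng.normal(0, 1, x.size)  # so m(X) = E(Y | X) = X^2

mse_m = np.mean((y - x**2) ** 2)     # predict with the conditional mean
mse_g = np.mean((y - (1 + x)) ** 2)  # predict with an arbitrary linear g(X)
print(mse_m, mse_g)                  # mse_m (about 1) is smaller
```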
B.5 The Normal and Related Distributions

B.5a The Normal Distribution

The normal distribution, and those derived from it, are the most widely used distributions in statistics and econometrics. Assuming that random variables defined over populations are normally distributed simplifies probability calculations. In addition, we will rely heavily on the normal and related distributions to conduct inference in statistics and econometrics, even when the underlying population is not necessarily normal. We must postpone the details, but be assured that these distributions will arise many times throughout this text.

A normal random variable is a continuous random variable that can take on any value. Its probability density function has the familiar bell shape graphed in Figure B.7.

Figure B.7: The general shape of the normal probability density function.

Mathematically, the pdf of X can be written as

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\!\left[-\frac{(x - \mu)^2}{2\sigma^2}\right], \quad -\infty < x < \infty, \tag{B.34}$$

where μ = E(X) and σ² = Var(X). We say that X has a normal distribution with expected value μ and variance σ², written as X ~ Normal(μ, σ²). Because the normal distribution is symmetric about μ, μ is also the median of X. The normal distribution is sometimes called the Gaussian distribution after the famous mathematician C. F. Gauss.

Certain random variables appear to roughly follow a normal distribution. Human heights and weights, test scores, and county unemployment rates have pdfs roughly the shape in Figure B.7. Other distributions, such as income distributions, do not appear to follow the normal probability function. In most countries, income is not symmetrically distributed about any value; the distribution is skewed toward the upper tail. In some cases, a variable can be transformed to achieve normality. A popular transformation is the natural log, which makes sense for positive random variables. If X is a positive random variable, such as income, and Y = log(X) has a normal distribution, then we say that X has a lognormal distribution. It turns out that the lognormal distribution fits income distribution pretty well in many countries. Other variables, such as prices of goods, appear to be well described as lognormally distributed.

B.5b The Standard Normal Distribution

One special case of the normal distribution occurs when the mean is zero and the variance (and, therefore, the standard deviation) is unity. If a random variable Z has a Normal(0, 1) distribution, then we say it has a standard normal distribution. The pdf of a standard normal random variable is denoted φ(z); from (B.34), with μ = 0 and σ² = 1, it is given by

$$\phi(z) = \frac{1}{\sqrt{2\pi}} \exp(-z^2/2), \quad -\infty < z < \infty. \tag{B.35}$$

The standard normal cumulative distribution function is denoted Φ(z) and is obtained as the area under φ, to the left of z; see Figure B.8. Recall that Φ(z) = P(Z ≤ z); because Z is continuous, Φ(z) = P(Z < z) as well.

Figure B.8: The standard normal cumulative distribution function.

No simple formula can be used to obtain the values of Φ(z), because Φ(z) is the integral of the function in (B.35), and this integral has no closed form. Nevertheless, the values for Φ(z) are easily tabulated; they are given for z between −3.1 and 3.1 in Table G.1 in Appendix G. For z ≤ −3.1, Φ(z) is less than .001, and for z ≥ 3.1, Φ(z) is greater than .999. Most statistics and econometrics software packages include simple commands for computing values of the standard normal cdf, so we can often avoid printed tables entirely and obtain the probabilities for any value of z.

Using basic facts from probability (and, in particular, properties (B.7) and (B.8) concerning cdfs), we can use the standard normal cdf for computing the probability of any event involving a standard normal random variable. The most important formulas are

$$P(Z > z) = 1 - \Phi(z), \tag{B.36}$$

$$P(Z < -z) = P(Z > z), \tag{B.37}$$
and

$$P(a \le Z \le b) = \Phi(b) - \Phi(a). \tag{B.38}$$

Because Z is a continuous random variable, all three formulas hold whether or not the inequalities are strict. Some examples include P(Z > .44) = 1 − .67 = .33, P(Z < −.92) = P(Z > .92) = 1 − .821 = .179, and P(−1 < Z ≤ .5) = .692 − .159 = .533.

Another useful expression is that, for any c > 0,

$$P(|Z| > c) = P(Z > c) + P(Z < -c) = 2\,P(Z > c) = 2[1 - \Phi(c)]. \tag{B.39}$$

Thus, the probability that the absolute value of Z is bigger than some positive constant c is simply twice the probability P(Z > c); this reflects the symmetry of the standard normal distribution.

In most applications, we start with a normally distributed random variable, X ~ Normal(μ, σ²), where μ is different from zero and σ² ≠ 1. Any normal random variable can be turned into a standard normal using the following property.

Property Normal.1: If X ~ Normal(μ, σ²), then (X − μ)/σ ~ Normal(0, 1).

Property Normal.1 shows how to turn any normal random variable into a standard normal. Thus, suppose X ~ Normal(3, 4), and we would like to compute P(X ≤ 1). The steps always involve the normalization of X to a standard normal:

P(X ≤ 1) = P(X − 3 ≤ 1 − 3) = P((X − 3)/2 ≤ −1) = P(Z ≤ −1) = Φ(−1) = .159.

Example B.6 (Probabilities for a Normal Random Variable). First, let us compute P(2 < X ≤ 6) when X ~ Normal(4, 9). (Whether we use < or ≤ is irrelevant because X is a continuous random variable.) Now,

P(2 < X ≤ 6) = P((2 − 4)/3 < (X − 4)/3 ≤ (6 − 4)/3) = P(−2/3 < Z ≤ 2/3) = Φ(.67) − Φ(−.67) = .749 − .251 = .498.

Now, let us compute P(|X| > 2):

P(|X| > 2) = P(X > 2) + P(X < −2) = P[(X − 4)/3 > (2 − 4)/3] + P[(X − 4)/3 < (−2 − 4)/3] = 1 − Φ(−2/3) + Φ(−2) = 1 − .251 + .023 = .772.
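The Example B.6 probabilities can be obtained from software instead of Table G.1; small differences from the text come from rounding z to ±.67 there.

```python
# Example B.6 with a standard normal cdf: X ~ Normal(4, 9).
from scipy.stats import norm

mu, sigma = 4, 3
p1 = norm.cdf((6 - mu) / sigma) - norm.cdf((2 - mu) / sigma)
p2 = 1 - norm.cdf((2 - mu) / sigma) + norm.cdf((-2 - mu) / sigma)
print(p1)  # P(2 < X <= 6), close to .498
print(p2)  # P(|X| > 2), close to .772
```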
B.5c Additional Properties of the Normal Distribution

We end this subsection by collecting several other facts about normal distributions that we will later use.

Property Normal.2: If X ~ Normal(μ, σ²), then aX + b ~ Normal(aμ + b, a²σ²). Thus, if X ~ Normal(1, 9), then Y = 2X + 3 is distributed as normal with mean 2·E(X) + 3 = 5 and variance 2²·9 = 36; sd(Y) = 2·sd(X) = 2·3 = 6.

Earlier, we discussed how, in general, zero correlation and independence are not the same. In the case of normally distributed random variables, it turns out that zero correlation suffices for independence.

Property Normal.3: If X and Y are jointly normally distributed, then they are independent if, and only if, Cov(X, Y) = 0.

Property Normal.4: Any linear combination of independent, identically distributed normal random variables has a normal distribution.

For example, let $X_i$, for i = 1, 2, and 3, be independent random variables distributed as Normal(μ, σ²). Define W = X₁ + 2X₂ − 3X₃. Then, W is normally distributed; we must simply find its mean and variance. Now,

E(W) = E(X₁) + 2E(X₂) − 3E(X₃) = μ + 2μ − 3μ = 0.

Also,

Var(W) = Var(X₁) + 4Var(X₂) + 9Var(X₃) = 14σ².

Property Normal.4 also implies that the average of independent, normally distributed random variables has a normal distribution. If $Y_1, Y_2, \dots, Y_n$ are independent random variables and each is distributed as Normal(μ, σ²), then

$$\bar{Y} \sim \mathrm{Normal}(\mu, \sigma^2/n). \tag{B.40}$$

This result is critical for statistical inference about the mean in a normal population.

Other features of the normal distribution are worth knowing, although they do not play a central role in the text. Because a normal random variable is symmetric about its mean, it has zero skewness, that is, E[(X − μ)³] = 0. Further, it can be shown that E[(X − μ)⁴]/σ⁴ = 3, or E(Z⁴) = 3, where Z has a standard normal distribution. Because the normal distribution is so prevalent in probability and statistics, the measure of kurtosis for any given random variable X (whose fourth moment exists) is often defined to be E[(X − μ)⁴]/σ⁴ − 3, that is, relative to the value for the standard normal distribution. If E[(X − μ)⁴]/σ⁴ > 3, then the distribution of X has fatter tails than the normal distribution (a somewhat common occurrence, such as with the t distribution to be introduced shortly); if E[(X − μ)⁴]/σ⁴ < 3, then the distribution has thinner tails than the normal (a rarer situation).

B.5d The Chi-Square Distribution

The chi-square distribution is obtained directly from independent, standard normal random variables. Let $Z_i$, i = 1, 2, …, n, be independent random variables, each distributed as standard normal. Define a new random variable as the sum of the squares of the $Z_i$:

$$X = \sum_{i=1}^{n} Z_i^2. \tag{B.41}$$

Then, X has what is known as a chi-square distribution with n degrees of freedom (or df, for short). We write this as X ~ $\chi^2_n$. The df in a chi-square distribution corresponds to the number of terms in the sum in (B.41). The concept of degrees of freedom will play an important role in our statistical and econometric analyses.

The pdf for chi-square distributions with varying degrees of freedom is given in Figure B.9; we will not need the formula for this pdf, and so we do not reproduce it here. From equation (B.41), it is clear that a chi-square random variable is always nonnegative and that, unlike the normal distribution, the chi-square distribution is not symmetric about any point. It can be shown that if X ~ $\chi^2_n$, then the expected value of X is n (the number of terms in (B.41)), and the variance of X is 2n.

Figure B.9: The chi-square distribution with various degrees of freedom (df = 2, 4, 8).
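Definition (B.41) and the mean and variance claims can be checked by simulation:

```python
# A chi-square(n) draw as the sum of n squared standard normals, per (B.41).
import numpy as np

rng = np.random.default_rng(7)
n = 8
x = (rng.standard_normal((1_000_000, n)) ** 2).sum(axis=1)

print(x.mean(), x.var())  # about n = 8 and 2n = 16
```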
B.5e The t Distribution

The t distribution is the workhorse in classical statistics and multiple regression analysis. We obtain a t distribution from a standard normal and a chi-square random variable. Let Z have a standard normal distribution and let X have a chi-square distribution with n degrees of freedom. Further, assume that Z and X are independent. Then, the random variable

$$T = \frac{Z}{\sqrt{X/n}} \tag{B.42}$$

has a t distribution with n degrees of freedom. We will denote this by T ~ $t_n$. The t distribution gets its degrees of freedom from the chi-square random variable in the denominator of (B.42).

The pdf of the t distribution has a shape similar to that of the standard normal distribution, except that it is more spread out and therefore has more area in the tails. The expected value of a t distributed random variable is zero (strictly speaking, the expected value exists only for n > 1), and the variance is n/(n − 2) for n > 2. (The variance does not exist for n ≤ 2 because the distribution is so spread out.) The pdf of the t distribution is plotted in Figure B.10 for various degrees of freedom. As the degrees of freedom gets large, the t distribution approaches the standard normal distribution.

Figure B.10: The t distribution with various degrees of freedom (df = 1, 2, 24).
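That convergence to the standard normal is visible in the quantiles; a sketch using scipy:

```python
# As df grows, t quantiles approach standard normal quantiles.
from scipy.stats import t, norm

for df in (2, 24, 120):
    print(df, round(t.ppf(0.975, df), 3))   # 4.303, 2.064, 1.980
print("normal", round(norm.ppf(0.975), 3))  # 1.960
```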
Summary

In this appendix, we have reviewed the probability concepts that are needed in econometrics. Most of the concepts should be familiar from your introductory course in probability and statistics. Some of the more advanced topics, such as features of conditional expectations, do not need to be mastered now; there is time for that when these concepts arise in the context of regression analysis in Part 1.

In an introductory statistics course, the focus is on calculating means, variances, covariances, and so on, for particular distributions. In Part 1, we will not need such calculations: we mostly rely on the properties of expectations, variances, and so on, that have been stated in this appendix.

Key Terms

Bernoulli (or Binary) Random Variable; Binomial Distribution; Chi-Square Distribution; Conditional Distribution; Conditional Expectation; Continuous Random Variable; Correlation Coefficient; Covariance; Cumulative Distribution Function (cdf); Degrees of Freedom; Discrete Random Variable; Expected Value; Experiment; F Distribution; Independent Random Variables; Joint Distribution; Kurtosis; Law of Iterated Expectations; Median; Normal Distribution; Pairwise Uncorrelated Random Variables; Probability Density Function (pdf); Random Variable; Skewness; Standard Deviation; Standard Normal Distribution; Standardized Random Variable; Symmetric Distribution; t Distribution; Uncorrelated Random Variables; Variance

Problems

1 Suppose that a high school student is preparing to take the SAT exam. Explain why his or her eventual SAT score is properly viewed as a random variable.

2 Let X be a random variable distributed as Normal(5, 4). Find the probabilities of the following events:
(i) P(X ≤ 6)
(ii) P(X > 4)
(iii) P(|X − 5| > 1)

3 Much is made of the fact that certain mutual funds outperform the market year after year (that is, the return from holding shares in the mutual fund is higher than the return from holding a portfolio such as the S&P 500). For concreteness, consider a 10-year period and let the population be the 4,170 mutual funds reported in The Wall Street Journal on January 1, 1995. By saying that performance relative to the market is random, we mean that each fund has a 50-50 chance of outperforming the market in any year and that performance is independent from year to year.
(i) If performance relative to the market is truly random, what is the probability that any particular fund outperforms the market in all 10 years?
(ii) Of the 4,170 mutual funds, what is the expected number of funds that will outperform the market in all 10 years?
(iii) Find the probability that at least one fund out of 4,170 funds outperforms the market in all 10 years. What do you make of your answer?
(iv) If you have a statistical package that computes binomial probabilities, find the probability that at least five funds outperform the market in all 10 years.

4 For a randomly selected county in the United States, let X represent the proportion of adults over age 65 who are employed, or the elderly employment rate. Then, X is restricted to a value between zero and one. Suppose that the cumulative distribution function for X is given by F(x) = 3x² − 2x³ for 0 ≤ x ≤ 1. Find the probability that the elderly employment rate is at least .6 (60%).
5 Just prior to jury selection for O. J. Simpson's murder trial in 1995, a poll found that about 20% of the adult population believed Simpson was innocent (after much of the physical evidence in the case had been revealed to the public). Ignore the fact that this 20% is an estimate based on a subsample from the population; for illustration, take it as the true percentage of people who thought Simpson was innocent prior to jury selection. Assume that the 12 jurors were selected randomly and independently from the population (although this turned out not to be true).
(i) Find the probability that the jury had at least one member who believed in Simpson's innocence prior to jury selection. [Hint: Define the Binomial(12, .20) random variable X to be the number of jurors believing in Simpson's innocence.]
(ii) Find the probability that the jury had at least two members who believed in Simpson's innocence. [Hint: P(X ≥ 2) = 1 − P(X ≤ 1), and P(X ≤ 1) = P(X = 0) + P(X = 1).]

6 (Requires calculus) Let X denote the prison sentence, in years, for people convicted of auto theft in a particular state in the United States. Suppose that the pdf of X is given by f(x) = (1/9)x², 0 < x < 3. Use integration to find the expected prison sentence.

7 If a basketball player is a 74% free throw shooter, then, on average, how many free throws will he or she make in a game with eight free throw attempts?

8 Suppose that a college student is taking three courses: a two-credit course, a three-credit course, and a four-credit course. The expected grade in the two-credit course is 3.5, while the expected grade in the three- and four-credit courses is 3.0. What is the expected overall grade point average for the semester? (Remember that each course grade is weighted by its share of the total number of units.)

9 Let X denote the annual salary of university professors in the United States, measured in thousands of dollars. Suppose that the average salary is 52.3, with a standard deviation of 14.6. Find the mean and standard deviation when salary is measured in dollars.

10 Suppose that, at a large university, college grade point average, GPA, and SAT score, SAT, are related by the conditional expectation E(GPA|SAT) = .70 + .002 SAT.
(i) Find the expected GPA when SAT = 800. Find E(GPA|SAT = 1,400). Comment on the difference.
(ii) If the average SAT in the university is 1,100, what is the average GPA? (Hint: Use Property CE.4.)
(iii) If a student's SAT score is 1,100, does this mean he or she will have the GPA found in part (ii)? Explain.

11 (i) Let X be a random variable taking on the values −1 and 1, each with probability 1/2. Find E(X) and E(X²).
(ii) Now, let X be a random variable taking on the values 1 and 2, each with probability 1/2. Find E(X) and E(1/X).
(iii) Conclude from parts (i) and (ii) that, in general, E[g(X)] ≠ g[E(X)] for a nonlinear function g(·).
(iv) Given the definition of the F random variable in equation (B.43), show that E(F) = E[1/(X₂/k₂)]. Can you conclude that E(F) = 1?
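Problems 3(iv) and 5 call for binomial probabilities from a statistical package. As a brief illustration (not part of the original text, and shown only to indicate how such probabilities are computed), scipy's binomial distribution does the job:

from scipy.stats import binom

# Probability that at least one of 12 jurors believes in innocence,
# when each does so independently with probability .20 (Problem 5 setup).
print(1 - binom.pmf(0, 12, 0.20))        # 1 - P(X = 0)

# Probability that at least five of 4,170 funds beat the market in all
# 10 years, each succeeding with probability .5**10 (Problem 3 setup).
print(1 - binom.cdf(4, 4170, 0.5 ** 10))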
Appendix C

Fundamentals of Mathematical Statistics

C.1 Populations, Parameters, and Random Sampling

Statistical inference involves learning something about a population given the availability of a sample from that population. By population, we mean any well-defined group of subjects, which could be individuals, firms, cities, or many other possibilities. By "learning," we can mean several things, which are broadly divided into the categories of estimation and hypothesis testing.

A couple of examples may help you understand these terms. In the population of all working adults in the United States, labor economists are interested in learning about the return to education, as measured by the average percentage increase in earnings given another year of education. It would be impractical and costly to obtain information on earnings and education for the entire working population in the United States, but we can obtain data on a subset of the population. Using the data collected, a labor economist may report that his or her best estimate of the return to another year of education is 7.5%. This is an example of a point estimate. Or, she or he may report a range, such as "the return to education is between 5.6% and 9.4%." This is an example of an interval estimate.

An urban economist might want to know whether neighborhood crime watch programs are associated with lower crime rates. After comparing crime rates of neighborhoods with and without such programs in a sample from the population, he or she can draw one of two conclusions: neighborhood watch programs do affect crime, or they do not. This example falls under the rubric of hypothesis testing.

The first step in statistical inference is to identify the population of interest. This may seem obvious, but it is important to be very specific. Once we have identified the population, we can specify a model for the population relationship of interest. Such models involve probability distributions or features of probability distributions, and these depend on unknown parameters. Parameters are simply constants that determine the directions and strengths of relationships among variables. In the labor economics example just presented, the parameter of interest is the return to education in the population.

C.1a Sampling

For reviewing statistical inference, we focus on the simplest possible setting. Let Y be a random variable representing a population with a probability density function f(y; θ), which depends on the single parameter θ. The probability density function (pdf) of Y is assumed to be known except for the value of θ; different values of θ imply different population distributions, and therefore we are interested in the value of θ. If we can obtain certain kinds of samples from the population, then we can learn something about θ. The easiest sampling scheme to deal with is random sampling.

Random Sampling. If Y₁, Y₂, …, Yₙ are independent random variables with a common probability density function f(y; θ), then {Y₁, …, Yₙ} is said to be a random sample from f(y; θ) [or a random sample from the population represented by f(y; θ)].

When {Y₁, …, Yₙ} is a random sample from the density f(y; θ), we also say that the Yᵢ are independent, identically distributed (or i.i.d.) random variables from f(y; θ). In some cases, we will not need to entirely specify what the common distribution is.
The random nature of Y₁, Y₂, …, Yₙ in the definition of random sampling reflects the fact that many different outcomes are possible before the sampling is actually carried out. For example, if family income is obtained for a sample of n = 100 families in the United States, the incomes we observe will usually differ for each different sample of 100 families. Once a sample is obtained, we have a set of numbers, say {y₁, y₂, …, yₙ}, which constitute the data that we work with. Whether or not it is appropriate to assume the sample came from a random sampling scheme requires knowledge about the actual sampling process.

Random samples from a Bernoulli distribution are often used to illustrate statistical concepts, and they also arise in empirical applications. If Y₁, Y₂, …, Yₙ are independent random variables and each is distributed as Bernoulli(θ), so that P(Yᵢ = 1) = θ and P(Yᵢ = 0) = 1 − θ, then {Y₁, Y₂, …, Yₙ} constitutes a random sample from the Bernoulli(θ) distribution. As an illustration, consider the airline reservation example carried along in Appendix B. Each Yᵢ denotes whether customer i shows up for his or her reservation; Yᵢ = 1 if passenger i shows up, and Yᵢ = 0 otherwise. Here, θ is the probability that a randomly drawn person from the population of all people who make airline reservations shows up for his or her reservation.

For many other applications, random samples can be assumed to be drawn from a normal distribution. If {Y₁, …, Yₙ} is a random sample from the Normal(μ, σ²) population, then the population is characterized by two parameters, the mean μ and the variance σ². Primary interest usually lies in μ, but σ² is of interest in its own right because making inferences about μ often requires learning about σ².

C.2 Finite Sample Properties of Estimators

In this section, we study what are called finite sample properties of estimators. The term "finite sample" comes from the fact that the properties hold for a sample of any size, no matter how large or small. Sometimes, these are called small sample properties. In Section C.3, we cover "asymptotic properties," which have to do with the behavior of estimators as the sample size grows without bound.

C.2a Estimators and Estimates

To study properties of estimators, we must define what we mean by an estimator. Given a random sample {Y₁, Y₂, …, Yₙ} drawn from a population distribution that depends on an unknown parameter θ, an estimator of θ is a rule that assigns each possible outcome of the sample a value of θ. The rule is specified before any sampling is carried out; in particular, the rule is the same regardless of the data actually obtained.

As an example of an estimator, let {Y₁, …, Yₙ} be a random sample from a population with mean μ. A natural estimator of μ is the average of the random sample:

$\bar{Y} = n^{-1} \sum_{i=1}^{n} Y_i$.   (C.1)

Ȳ is called the sample average but, unlike in Appendix A, where we defined the sample average of a set of numbers as a descriptive statistic, Ȳ is now viewed as an estimator. Given any outcome of the random variables Y₁, …, Yₙ, we use the same rule to estimate μ: we simply average them. For actual data outcomes {y₁, …, yₙ}, the estimate is just the average in the sample: ȳ = (y₁ + y₂ + … + yₙ)/n.
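As a concrete sketch (mine, not the text's, with an arbitrary parameter choice), here is the rule (C.1) applied to a simulated Bernoulli random sample in Python; rerunning with a different seed produces a different sample, and hence a different estimate, which is exactly why Ȳ is viewed as a random variable:

import numpy as np

rng = np.random.default_rng(2)
theta = 0.75  # hypothetical show-up probability for the airline example

y = rng.binomial(1, theta, size=100)  # one Bernoulli(theta) random sample
print(y.mean())  # the estimate of theta computed from this particular sample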
Example C.1 (City Unemployment Rates). Suppose we obtain the following sample of unemployment rates for 10 cities in the United States:

City   Unemployment Rate
1      5.1
2      6.4
3      9.2
4      4.1
5      7.5
6      8.3
7      2.6
8      3.5
9      5.8
10     7.5

Our estimate of the average city unemployment rate in the United States is ȳ = 6.0. Each sample generally results in a different estimate. But the rule for obtaining the estimate is the same, regardless of which cities appear in the sample or how many.

More generally, an estimator W of a parameter θ can be expressed as an abstract mathematical formula:

$W = h(Y_1, Y_2, \ldots, Y_n)$,   (C.2)

for some known function h of the random variables Y₁, Y₂, …, Yₙ. As with the special case of the sample average, W is a random variable because it depends on the random sample: as we obtain different random samples from the population, the value of W can change. When a particular set of numbers, say {y₁, y₂, …, yₙ}, is plugged into the function h, we obtain an estimate of θ, denoted w = h(y₁, …, yₙ). Sometimes, W is called a point estimator and w a point estimate to distinguish these from interval estimators and estimates, which we will come to in Section C.5.

For evaluating estimation procedures, we study various properties of the probability distribution of the random variable W. The distribution of an estimator is often called its sampling distribution because this distribution describes the likelihood of various outcomes of W across different random samples. Because there are unlimited rules for combining data to estimate parameters, we need some sensible criteria for choosing among estimators, or at least for eliminating some estimators from consideration. Therefore, we must leave the realm of descriptive statistics, where we compute things such as the sample average simply to summarize a body of data. In mathematical statistics, we study the sampling distributions of estimators.

C.2b Unbiasedness

In principle, the entire sampling distribution of W can be obtained given the probability distribution of Yᵢ and the function h. It is usually easier to focus on a few features of the distribution of W in evaluating it as an estimator of θ. The first important property of an estimator involves its expected value.

Unbiased Estimator. An estimator, W, of θ, is an unbiased estimator if

$E(W) = \theta$,   (C.3)

for all possible values of θ.

If an estimator is unbiased, then its probability distribution has an expected value equal to the parameter it is supposed to be estimating. Unbiasedness does not mean that the estimate we get with any particular sample is equal to θ, or even very close to θ. Rather, if we could indefinitely draw random samples on Y from the population, compute an estimate each time, and then average these estimates over all random samples, we would obtain θ. This thought experiment is abstract because, in most applications, we just have one random sample to work with.
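The averaging-over-repeated-samples thought experiment is easy to mimic on a computer. A minimal Python sketch (my own, with an arbitrary normal population) draws many random samples and averages the resulting estimates:

import numpy as np

rng = np.random.default_rng(4)
mu, sigma, n, reps = 2.0, 1.0, 10, 50_000

# One estimate of mu per replication: the sample average of n draws.
samples = rng.normal(mu, sigma, size=(reps, n))
estimates = samples.mean(axis=1)

print(estimates.mean())  # close to mu = 2.0, consistent with E(Ybar) = mu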
For an estimator that is not unbiased, we define its bias as follows.

Bias of an Estimator. If W is a biased estimator of θ, its bias is defined as

$\mathrm{Bias}(W) \equiv E(W) - \theta$.   (C.4)

Figure C.1 shows two estimators; the first one is unbiased, and the second one has a positive bias. [Figure C.1: An unbiased estimator, W₁, whose pdf is centered at E(W₁) = θ, and an estimator with positive bias, W₂, whose pdf is centered at E(W₂) > θ.]

The unbiasedness of an estimator and the size of any possible bias depend on the distribution of Y and on the function h. The distribution of Y is usually beyond our control (although we often choose a model for this distribution): it may be determined by nature or social forces. But the choice of the rule h is ours, and if we want an unbiased estimator, then we must choose h accordingly.

Some estimators can be shown to be unbiased quite generally. We now show that the sample average Ȳ is an unbiased estimator of the population mean μ, regardless of the underlying population distribution. We use the properties of expected values (E.1 and E.2) that we covered in Section B.3:

$E(\bar{Y}) = E\Big((1/n)\sum_{i=1}^n Y_i\Big) = (1/n)E\Big(\sum_{i=1}^n Y_i\Big) = (1/n)\sum_{i=1}^n E(Y_i) = (1/n)\sum_{i=1}^n \mu = (1/n)(n\mu) = \mu$.

For hypothesis testing, we will need to estimate the variance σ² from a population with mean μ. Letting {Y₁, …, Yₙ} denote the random sample from the population with E(Y) = μ and Var(Y) = σ², define the estimator as

$S^2 = \dfrac{1}{n-1}\sum_{i=1}^{n}(Y_i - \bar{Y})^2$,   (C.5)

which is usually called the sample variance. It can be shown that S² is unbiased for σ²: E(S²) = σ². The division by n − 1, rather than n, accounts for the fact that the mean μ is estimated rather than known. If μ were known, an unbiased estimator of σ² would be $n^{-1}\sum_{i=1}^n (Y_i - \mu)^2$, but μ is rarely known in practice.

Although unbiasedness has a certain appeal as a property for an estimator (indeed, its antonym, "biased," has decidedly negative connotations), it is not without its problems. One weakness of unbiasedness is that some reasonable, and even some very good, estimators are not unbiased. We will see an example shortly.

Another important weakness of unbiasedness is that unbiased estimators exist that are actually quite poor estimators. Consider estimating the mean μ from a population. Rather than using the sample average Ȳ to estimate μ, suppose that, after collecting a sample of size n, we discard all of the observations except the first. That is, our estimator of μ is simply W ≡ Y₁. This estimator is unbiased because E(Y₁) = μ. Hopefully, you sense that ignoring all but the first observation is not a prudent approach to estimation: it throws out most of the information in the sample. For example, with n = 100, we obtain 100 outcomes of the random variable Y, but then we use only the first of these to estimate E(Y).
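The role of the n − 1 divisor in (C.5) can also be seen by simulation. The sketch below (my own) compares the n − 1 and n divisors across many random samples; only the former averages out to σ²:

import numpy as np

rng = np.random.default_rng(5)
mu, sigma2, n, reps = 0.0, 4.0, 8, 100_000

samples = rng.normal(mu, np.sqrt(sigma2), size=(reps, n))
s2_unbiased = samples.var(axis=1, ddof=1)  # divides by n - 1, as in (C.5)
s2_biased = samples.var(axis=1, ddof=0)    # divides by n instead

print(s2_unbiased.mean())  # close to sigma2 = 4
print(s2_biased.mean())    # close to sigma2 * (n - 1) / n = 3.5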
C.2d The Sampling Variance of Estimators

The example at the end of the previous subsection shows that we need additional criteria to evaluate estimators. Unbiasedness only ensures that the sampling distribution of an estimator has a mean value equal to the parameter it is supposed to be estimating. This is fine, but we also need to know how spread out the distribution of an estimator is. An estimator can be equal to θ, on average, but it can also be very far away with large probability. In Figure C.2, W₁ and W₂ are both unbiased estimators of θ. But the distribution of W₁ is more tightly centered about θ: the probability that W₁ is greater than any given distance from θ is less than the probability that W₂ is greater than that same distance from θ. Using W₁ as our estimator means that it is less likely that we will obtain a random sample that yields an estimate very far from θ. [Figure C.2: The sampling distributions of two unbiased estimators of θ.]

To summarize the situation shown in Figure C.2, we rely on the variance (or standard deviation) of an estimator. Recall that this gives a single measure of the dispersion in the distribution. The variance of an estimator is often called its sampling variance because it is the variance associated with a sampling distribution. Remember, the sampling variance is not a random variable; it is a constant, but it might be unknown.

We now obtain the variance of the sample average for estimating the mean μ from a population:

$\mathrm{Var}(\bar{Y}) = \mathrm{Var}\Big((1/n)\sum_{i=1}^n Y_i\Big) = (1/n^2)\mathrm{Var}\Big(\sum_{i=1}^n Y_i\Big) = (1/n^2)\sum_{i=1}^n \mathrm{Var}(Y_i) = (1/n^2)(n\sigma^2) = \sigma^2/n$.   (C.6)

Notice how we used the properties of variance from Sections B.3 and B.4 (VAR.2 and VAR.4), as well as the independence of the Yᵢ. To summarize: If {Yᵢ : i = 1, 2, …, n} is a random sample from a population with mean μ and variance σ², then Ȳ has the same mean as the population, but its sampling variance equals the population variance, σ², divided by the sample size.

An important implication of Var(Ȳ) = σ²/n is that it can be made very close to zero by increasing the sample size n. This is a key feature of a reasonable estimator, and we return to it in Section C.3.

As suggested by Figure C.2, among unbiased estimators, we prefer the estimator with the smallest variance. This allows us to eliminate certain estimators from consideration. For a random sample from a population with mean μ and variance σ², we know that Ȳ is unbiased and Var(Ȳ) = σ²/n. What about the estimator Y₁, which is just the first observation drawn? Because Y₁ is a random draw from the population, Var(Y₁) = σ². Thus, the difference between Var(Y₁) and Var(Ȳ) can be large even for small sample sizes. If n = 10, then Var(Y₁) is 10 times as large as Var(Ȳ) = σ²/10. This gives us a formal way of excluding Y₁ as an estimator of μ.

To emphasize this point, Table C.1 contains the outcome of a small simulation study. Using the statistical package Stata, 20 random samples of size 10 were generated from a normal distribution with μ = 2 and σ² = 1; we are interested in estimating μ here. For each of the 20 random samples, we compute two estimates, y₁ and ȳ; these values are listed in Table C.1. As can be seen from the table, the values for y₁ are much more spread out than those for ȳ: y₁ ranges from −0.64 to 4.27, while ȳ ranges only from 1.16 to 2.58. Further, in 16 out of 20 cases, ȳ is closer than y₁ to μ = 2. The average of y₁ across the simulations is about 1.89, while that for ȳ is 1.96. The fact that these averages are close to 2 illustrates the unbiasedness of both estimators (and we could get these averages closer to 2 by doing more than 20 replications). But comparing just the average outcomes across random draws masks the fact that the sample average Ȳ is far superior to Y₁ as an estimator of μ.
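The Table C.1 experiment is straightforward to reproduce. The following Python sketch mirrors the Stata simulation described above (with its own seed, so the particular draws will differ from those in the table):

import numpy as np

rng = np.random.default_rng(6)
mu, n, reps = 2.0, 10, 20

samples = rng.normal(mu, 1.0, size=(reps, n))
y1 = samples[:, 0]           # first observation from each sample
ybar = samples.mean(axis=1)  # sample average from each sample

print(y1.min(), y1.max())      # widely spread around 2
print(ybar.min(), ybar.max())  # much more tightly clustered around 2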
C.2e Efficiency

Comparing the variances of Ȳ and Y₁ in the previous subsection is an example of a general approach to comparing different unbiased estimators.

Relative Efficiency. If W₁ and W₂ are two unbiased estimators of θ, W₁ is efficient relative to W₂ when Var(W₁) ≤ Var(W₂) for all θ, with strict inequality for at least one value of θ.

Table C.1 (Simulation of Estimators for a Normal(μ, 1) Distribution with μ = 2):

Replication   y₁       ȳ
1            −0.64    1.98
2             1.06    1.43
3             4.27    1.65
4             1.03    1.88
5             3.16    2.34
6             2.77    2.58
7             1.68    1.58
8             2.98    2.23
9             2.25    1.96
10            2.04    2.11
11            0.95    2.15
12            1.36    1.93
13            2.62    2.02
14            2.97    2.10
15            1.93    2.18
16            1.14    2.10
17            2.08    1.94
18            1.52    2.21
19            1.33    1.16
20            1.21    1.75

Earlier, we showed that, for estimating the population mean μ, Var(Ȳ) < Var(Y₁) for any value of σ² whenever n > 1. Thus, Ȳ is efficient relative to Y₁ for estimating μ. We cannot always choose between unbiased estimators based on the smallest variance criterion: given two unbiased estimators of θ, one can have smaller variance for some values of θ, while the other can have smaller variance for other values of θ.

If we restrict our attention to a certain class of estimators, we can show that the sample average has the smallest variance. Problem C.2 asks you to show that Ȳ has the smallest variance among all unbiased estimators that are also linear functions of Y₁, Y₂, …, Yₙ. The assumptions are that the Yᵢ have common mean and variance, and that they are pairwise uncorrelated.

If we do not restrict our attention to unbiased estimators, then comparing variances is meaningless. For example, when estimating the population mean μ, we can use a trivial estimator that is equal to zero, regardless of the sample that we draw. Naturally, the variance of this estimator is zero (since it is the same value for every random sample). But the bias of this estimator is −μ, so it is a very poor estimator when |μ| is large.

One way to compare estimators that are not necessarily unbiased is to compute the mean squared error (MSE) of the estimators. If W is an estimator of θ, then the MSE of W is defined as MSE(W) = E[(W − θ)²]. The MSE measures how far, on average, the estimator is away from θ. It can be shown that MSE(W) = Var(W) + [Bias(W)]², so that MSE(W) depends on the variance and bias (if any is present). This allows us to compare two estimators when one or both are biased.
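The text states the MSE decomposition without proof; a short derivation (added here for completeness) only needs to add and subtract E(W) inside the square and note that the cross term vanishes:

\begin{align*}
\mathrm{MSE}(W) &= E[(W - \theta)^2]
                 = E\big[\big((W - E(W)) + (E(W) - \theta)\big)^2\big] \\
                &= E[(W - E(W))^2] + 2\,[E(W) - \theta]\,E[W - E(W)] + [E(W) - \theta]^2 \\
                &= \mathrm{Var}(W) + [\mathrm{Bias}(W)]^2,
\end{align*}

because $E[W - E(W)] = 0$ and $E(W) - \theta = \mathrm{Bias}(W)$ is a constant.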
C.3 Asymptotic or Large Sample Properties of Estimators

In Section C.2, we encountered the estimator Y₁ for the population mean μ, and we saw that, even though it is unbiased, it is a poor estimator because its variance can be much larger than that of the sample mean. One notable feature of Y₁ is that it has the same variance for any sample size. It seems reasonable to require any estimation procedure to improve as the sample size increases. For estimating a population mean μ, Ȳ improves in the sense that its variance gets smaller as n gets larger; Y₁ does not improve in this sense.

We can rule out certain silly estimators by studying the asymptotic or large sample properties of estimators. In addition, we can say something positive about estimators that are not unbiased and whose variances are not easily found.

Asymptotic analysis involves approximating the features of the sampling distribution of an estimator. These approximations depend on the size of the sample. Unfortunately, we are necessarily limited in what we can say about how "large" a sample size is needed for asymptotic analysis to be appropriate; this depends on the underlying population distribution. But large sample approximations have been known to work well for sample sizes as small as n = 20.

C.3a Consistency

The first asymptotic property of estimators concerns how far the estimator is likely to be from the parameter it is supposed to be estimating as we let the sample size increase indefinitely.

Consistency. Let Wₙ be an estimator of θ based on a sample Y₁, Y₂, …, Yₙ of size n. Then, Wₙ is a consistent estimator of θ if, for every ε > 0,

$P(|W_n - \theta| > \varepsilon) \to 0 \text{ as } n \to \infty$.   (C.7)

If Wₙ is not consistent for θ, then we say it is inconsistent. When Wₙ is consistent, we also say that θ is the probability limit of Wₙ, written as plim(Wₙ) = θ.

Unlike unbiasedness (which is a feature of an estimator for a given sample size), consistency involves the behavior of the sampling distribution of the estimator as the sample size n gets large. To emphasize this, we have indexed the estimator by the sample size in stating this definition, and we will continue with this convention throughout this section.

Equation (C.7) looks technical, and it can be rather difficult to establish based on fundamental probability principles. By contrast, interpreting (C.7) is straightforward. It means that the distribution of Wₙ becomes more and more concentrated about θ, which roughly means that, for larger sample sizes, Wₙ is less and less likely to be very far from θ. This tendency is illustrated in Figure C.3. [Figure C.3: The sampling distributions of a consistent estimator for three sample sizes, n = 4, 16, and 40.]

If an estimator is not consistent, then it does not help us to learn about θ, even with an unlimited amount of data. For this reason, consistency is a minimal requirement of an estimator used in statistics or econometrics. We will encounter estimators that are consistent under certain assumptions and inconsistent when those assumptions fail. When estimators are inconsistent, we can usually find their probability limits, and it will be important to know how far these probability limits are from θ.

As we noted earlier, unbiased estimators are not necessarily consistent, but those whose variances shrink to zero as the sample size grows are consistent. This can be stated formally: If Wₙ is an unbiased estimator of θ and Var(Wₙ) → 0 as n → ∞, then plim(Wₙ) = θ. Unbiased estimators that use the entire data sample will usually have a variance that shrinks to zero as the sample size grows, thereby being consistent.

A good example of a consistent estimator is the average of a random sample drawn from a population with mean μ and variance σ². We have already shown that the sample average is unbiased for μ.
In equation (C.6), we derived Var(Ȳₙ) = σ²/n for any sample size n. Therefore, Var(Ȳₙ) → 0 as n → ∞, so Ȳₙ is a consistent estimator of μ (in addition to being unbiased). The conclusion that Ȳₙ is consistent for μ holds even if Var(Ȳₙ) does not exist. This classic result is known as the law of large numbers (LLN).

Law of Large Numbers. Let Y₁, Y₂, …, Yₙ be independent, identically distributed random variables with mean μ. Then,

$\mathrm{plim}(\bar{Y}_n) = \mu$.   (C.8)

The law of large numbers means that, if we are interested in estimating the population average μ, we can get arbitrarily close to μ by choosing a sufficiently large sample. This fundamental result can be combined with basic properties of plims to show that fairly complicated estimators are consistent.

Property PLIM.1. Let θ be a parameter and define a new parameter, γ = g(θ), for some continuous function g(θ). Suppose that plim(Wₙ) = θ. Define an estimator of γ by Gₙ = g(Wₙ). Then,

$\mathrm{plim}(G_n) = \gamma$.   (C.9)

This is often stated as

$\mathrm{plim}\ g(W_n) = g(\mathrm{plim}\ W_n)$   (C.10)

for a continuous function g(θ).

The assumption that g(θ) is continuous is a technical requirement that has often been described nontechnically as "a function that can be graphed without lifting your pencil from the paper." Because all the functions we encounter in this text are continuous, we do not provide a formal definition of a continuous function. Examples of continuous functions are g(θ) = a + bθ for constants a and b, g(θ) = θ², g(θ) = 1/θ, g(θ) = √θ, g(θ) = exp(θ), and many variants on these. We will not need to mention the continuity assumption again.

As an important example of a consistent but biased estimator, consider estimating the standard deviation, σ, from a population with mean μ and variance σ². We already claimed that the sample variance $S_n^2 = (n-1)^{-1}\sum_{i=1}^n (Y_i - \bar{Y}_n)^2$ is unbiased for σ². Using the law of large numbers and some algebra, $S_n^2$ can also be shown to be consistent for σ². The natural estimator of σ = √σ² is $S_n = \sqrt{S_n^2}$ (where the square root is always the positive square root). Sₙ, which is called the sample standard deviation, is not an unbiased estimator because the expected value of the square root is not the square root of the expected value (see Section B.3). Nevertheless, by PLIM.1, plim Sₙ = √(plim Sₙ²) = √σ² = σ, so Sₙ is a consistent estimator of σ.

Here are some other useful properties of the probability limit.

Property PLIM.2. If plim(Tₙ) = α and plim(Uₙ) = β, then
(i) plim(Tₙ + Uₙ) = α + β;
(ii) plim(TₙUₙ) = αβ;
(iii) plim(Tₙ/Uₙ) = α/β, provided β ≠ 0.

These three facts about probability limits allow us to combine consistent estimators in a variety of ways to get other consistent estimators.
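The concentration described by (C.7) and (C.8) can be checked numerically. In this sketch (mine, not the text's), the estimated probability that the sample mean misses μ by more than ε shrinks as n grows:

import numpy as np

rng = np.random.default_rng(7)
mu, eps, reps = 2.0, 0.1, 2_000

for n in (10, 100, 1000):
    ybar = rng.normal(mu, 1.0, size=(reps, n)).mean(axis=1)
    # Monte Carlo estimate of P(|Ybar_n - mu| > eps), which (C.7) says -> 0
    print(n, np.mean(np.abs(ybar - mu) > eps))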
For example, let {Y₁, …, Yₙ} be a random sample of size n on annual earnings from the population of workers with a high school education, and denote the population mean by μ_Y. Let {Z₁, …, Zₙ} be a random sample on annual earnings from the population of workers with a college education, and denote the population mean by μ_Z. We wish to estimate the percentage difference in annual earnings between the two groups, which is γ = 100·(μ_Z − μ_Y)/μ_Y. (This is the percentage by which average earnings for college graduates differs from average earnings for high school graduates.) Because Ȳₙ is consistent for μ_Y and Z̄ₙ is consistent for μ_Z, it follows from PLIM.1 and part (iii) of PLIM.2 that Gₙ ≡ 100·(Z̄ₙ − Ȳₙ)/Ȳₙ is a consistent estimator of γ. Gₙ is just the percentage difference between Z̄ₙ and Ȳₙ in the sample, so it is a natural estimator. Gₙ is not an unbiased estimator of γ, but it is still a good estimator except possibly when n is small.

C.3b Asymptotic Normality

Consistency is a property of point estimators. Although it does tell us that the distribution of the estimator is collapsing around the parameter as the sample size gets large, it tells us essentially nothing about the shape of that distribution for a given sample size. For constructing interval estimators and testing hypotheses, we need a way to approximate the distribution of our estimators. Most econometric estimators have distributions that are well approximated by a normal distribution for large samples, which motivates the following definition.

Asymptotic Normality. Let {Zₙ : n = 1, 2, …} be a sequence of random variables, such that, for all numbers z,

$P(Z_n \le z) \to \Phi(z) \text{ as } n \to \infty$,   (C.11)

where Φ(z) is the standard normal cumulative distribution function. Then, Zₙ is said to have an asymptotic standard normal distribution. In this case, we often write Zₙ ~ᵃ Normal(0, 1). (The "a" above the tilde stands for "asymptotically" or "approximately.")

Property (C.11) means that the cumulative distribution function for Zₙ gets closer and closer to the cdf of the standard normal distribution as the sample size n gets large. When asymptotic normality holds, for large n we have the approximation P(Zₙ ≤ z) ≈ Φ(z). Thus, probabilities concerning Zₙ can be approximated by standard normal probabilities.

The central limit theorem (CLT) is one of the most powerful results in probability and statistics. It states that the average from a random sample for any population (with finite variance), when standardized, has an asymptotic standard normal distribution.

Central Limit Theorem. Let {Y₁, Y₂, …, Yₙ} be a random sample with mean μ and variance σ². Then,

$Z_n = \dfrac{\bar{Y}_n - \mu}{\sigma/\sqrt{n}}$   (C.12)

has an asymptotic standard normal distribution.

The variable Zₙ in (C.12) is the standardized version of Ȳₙ: we have subtracted off E(Ȳₙ) = μ and divided by sd(Ȳₙ) = σ/√n. Thus, regardless of the population distribution of Y, Zₙ has mean zero and variance one, which coincides with the mean and variance of the standard normal distribution. Remarkably, the entire distribution of Zₙ gets arbitrarily close to the standard normal distribution as n gets large.
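The CLT is striking for populations that look nothing like the normal. A quick Python check (my own) standardizes Bernoulli sample means per (C.12) and compares a tail probability with the standard normal value of about .025:

import numpy as np

rng = np.random.default_rng(8)
theta, n, reps = 0.2, 500, 20_000

mu, sigma = theta, np.sqrt(theta * (1 - theta))  # Bernoulli mean and sd
ybar = rng.binomial(1, theta, size=(reps, n)).mean(axis=1)
z = (ybar - mu) / (sigma / np.sqrt(n))           # standardized as in (C.12)

print(np.mean(z > 1.96))  # close to .025, the standard normal tail area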
We can write the standardized variable in equation (C.12) as √n(Ȳₙ − μ)/σ, which shows that we must multiply the difference between the sample mean and the population mean by the square root of the sample size in order to obtain a useful limiting distribution. Without the multiplication by √n, we would just have (Ȳₙ − μ)/σ, which converges in probability to zero. In other words, the distribution of (Ȳₙ − μ)/σ simply collapses to a single point as n → ∞, which we know cannot be a good approximation to the distribution of (Ȳₙ − μ)/σ for reasonable sample sizes. Multiplying by √n ensures that the variance of Zₙ remains constant. Practically, we often treat Ȳₙ as being approximately normally distributed with mean μ and variance σ²/n, and this gives us the correct statistical procedures because it leads to the standardized variable in equation (C.12).

Most estimators encountered in statistics and econometrics can be written as functions of sample averages, in which case we can apply the law of large numbers and the central limit theorem. When two consistent estimators have asymptotic normal distributions, we choose the estimator with the smallest asymptotic variance.

In addition to the standardized sample average in (C.12), many other statistics that depend on sample averages turn out to be asymptotically normal. An important one is obtained by replacing σ with its consistent estimator Sₙ in equation (C.12):

$\dfrac{\bar{Y}_n - \mu}{S_n/\sqrt{n}}$   (C.13)

also has an approximate standard normal distribution for large n. The exact (finite sample) distributions of (C.12) and (C.13) are definitely not the same, but the difference is often small enough to be ignored for large n.

Throughout this section, each estimator has been subscripted by n to emphasize the nature of asymptotic or large sample analysis. Continuing this convention clutters the notation without providing additional insight, once the fundamentals of asymptotic analysis are understood. Henceforth, we drop the n subscript and rely on you to remember that estimators depend on the sample size, and that properties such as consistency and asymptotic normality refer to the growth of the sample size without bound.

C.4 General Approaches to Parameter Estimation

Until this point, we have used the sample average to illustrate the finite and large sample properties of estimators. It is natural to ask: Are there general approaches to estimation that produce estimators with good properties, such as unbiasedness, consistency, and efficiency? The answer is yes. A detailed treatment of various approaches to estimation is beyond the scope of this text; here, we provide only an informal discussion. A thorough discussion is given in Larsen and Marx (1986, Chapter 5).

C.4a Method of Moments

Given a parameter θ appearing in a population distribution, there are usually many ways to obtain unbiased and consistent estimators of θ. Trying all different possibilities and comparing them on the basis of the criteria in Sections C.2 and C.3 is not practical. Fortunately, some methods have been shown to have good general properties, and, for the most part, the logic behind them is intuitively appealing.
In the previous sections, we have studied the sample average as an unbiased estimator of the population average and the sample variance as an unbiased estimator of the population variance. These estimators are examples of method of moments estimators. Generally, method of moments estimation proceeds as follows. The parameter θ is shown to be related to some expected value in the distribution of Y, usually E(Y) or E(Y²) (although more exotic choices are sometimes used). Suppose, for example, that the parameter of interest, θ, is related to the population mean as θ = g(μ) for some function g. Because the sample average Ȳ is an unbiased and consistent estimator of μ, it is natural to replace μ with Ȳ, which gives us the estimator g(Ȳ) of θ. The estimator g(Ȳ) is consistent for θ, and, if g(μ) is a linear function of μ, then g(Ȳ) is unbiased as well. What we have done is replace the population moment, μ, with its sample counterpart, Ȳ. This is where the name "method of moments" comes from.

We cover two additional method of moments estimators that will be useful for our discussion of regression analysis. Recall that the covariance between two random variables X and Y is defined as σ_XY = E[(X − μ_X)(Y − μ_Y)]. The method of moments suggests estimating σ_XY by $n^{-1}\sum_{i=1}^n (X_i - \bar{X})(Y_i - \bar{Y})$. This is a consistent estimator of σ_XY, but it turns out to be biased, for essentially the same reason that the sample variance is biased if n, rather than n − 1, is used as the divisor. The sample covariance is defined as

$S_{XY} = \dfrac{1}{n-1}\sum_{i=1}^{n}(X_i - \bar{X})(Y_i - \bar{Y})$.   (C.14)

It can be shown that this is an unbiased estimator of σ_XY. (Replacing n with n − 1 makes no difference as the sample size grows indefinitely, so this estimator is still consistent.)

As we discussed in Section B.4, the covariance between two variables is often difficult to interpret; usually, we are more interested in correlation. Because the population correlation is ρ_XY = σ_XY/(σ_X σ_Y), the method of moments suggests estimating ρ_XY as

$R_{XY} = \dfrac{S_{XY}}{S_X S_Y} = \dfrac{\sum_{i=1}^n (X_i - \bar{X})(Y_i - \bar{Y})}{\Big(\sum_{i=1}^n (X_i - \bar{X})^2\Big)^{1/2}\Big(\sum_{i=1}^n (Y_i - \bar{Y})^2\Big)^{1/2}}$,   (C.15)

which is called the sample correlation coefficient (or sample correlation, for short). Notice that we have canceled the division by n − 1 in the sample covariance and the sample standard deviations. In fact, we could divide each of these by n, and we would arrive at the same final formula.

It can be shown that the sample correlation coefficient is always in the interval [−1, 1], as it should be. Because S_XY, S_X, and S_Y are consistent for the corresponding population parameters, R_XY is a consistent estimator of the population correlation, ρ_XY. However, R_XY is a biased estimator, for two reasons. First, S_X and S_Y are biased estimators of σ_X and σ_Y, respectively. Second, R_XY is a ratio of estimators, so it would not be unbiased, even if S_X and S_Y were. For our purposes, this is not important, although the fact that no unbiased estimator of ρ_XY exists is a classical result in mathematical statistics.
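As a sketch (not the text's), equation (C.15) implemented directly in Python matches numpy's built-in correlation routine:

import numpy as np

rng = np.random.default_rng(9)
x = rng.normal(size=200)
y = 0.5 * x + rng.normal(size=200)  # hypothetical correlated data

xd, yd = x - x.mean(), y - y.mean()
r_xy = (xd * yd).sum() / np.sqrt((xd**2).sum() * (yd**2).sum())  # (C.15)

print(r_xy)
print(np.corrcoef(x, y)[0, 1])  # numpy agrees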
C.4b Maximum Likelihood

Another general approach to estimation is the method of maximum likelihood, a topic covered in many introductory statistics courses. A brief summary in the simplest case will suffice here. Let {Y₁, Y₂, …, Yₙ} be a random sample from the population distribution f(y; θ). Because of the random sampling assumption, the joint distribution of {Y₁, Y₂, …, Yₙ} is simply the product of the densities: f(y₁; θ)f(y₂; θ)···f(yₙ; θ). In the discrete case, this is P(Y₁ = y₁, Y₂ = y₂, …, Yₙ = yₙ). Now, define the likelihood function as

L(θ; Y₁, …, Yₙ) = f(Y₁; θ)f(Y₂; θ)···f(Yₙ; θ),

which is a random variable because it depends on the outcome of the random sample {Y₁, Y₂, …, Yₙ}. The maximum likelihood estimator of θ, call it W, is the value of θ that maximizes the likelihood function. (This is why we write L as a function of θ, followed by the random sample.) Clearly, this value depends on the random sample. The maximum likelihood principle says that, out of all the possible values for θ, the value that makes the likelihood of the observed data largest should be chosen. Intuitively, this is a reasonable approach to estimating θ.

Usually, it is more convenient to work with the log-likelihood function, which is obtained by taking the natural log of the likelihood function:

$\log[L(\theta; Y_1, \ldots, Y_n)] = \sum_{i=1}^{n} \log[f(Y_i; \theta)]$,   (C.16)

where we use the fact that the log of the product is the sum of the logs. Because (C.16) is the sum of independent, identically distributed random variables, analyzing estimators that come from (C.16) is relatively easy.

Maximum likelihood estimation (MLE) is usually consistent and sometimes unbiased. But so are many other estimators. The widespread appeal of MLE is that it is generally the most asymptotically efficient estimator when the population model f(y; θ) is correctly specified. In addition, the MLE is sometimes the minimum variance unbiased estimator; that is, it has the smallest variance among all unbiased estimators of θ. [See Larsen and Marx (1986, Chapter 5) for verification of these claims.]

In Chapter 17, we will need maximum likelihood to estimate the parameters of more advanced econometric models. In econometrics, we are almost always interested in the distribution of Y conditional on a set of explanatory variables, say X₁, X₂, …, X_k. Then, we replace the density in (C.16) with f(Yᵢ|Xᵢ₁, …, Xᵢ_k; θ₁, …, θ_p), where this density is allowed to depend on p parameters, θ₁, …, θ_p. Fortunately, for successful application of maximum likelihood methods, we do not need to delve much into the computational issues or the large-sample statistical theory. Wooldridge (2010, Chapter 13) covers the theory of MLE.

C.4c Least Squares

A third kind of estimator, and one that plays a major role throughout the text, is called a least squares estimator. We have already seen an example of least squares: the sample mean, Ȳ, is a least squares estimator of the population mean, μ. We already know Ȳ is a method of moments estimator. What makes it a least squares estimator? It can be shown that the value of m that makes the sum of squared deviations

$\sum_{i=1}^{n} (Y_i - m)^2$

as small as possible is m = Ȳ. Showing this is not difficult, but we omit the algebra.

For some important distributions, including the normal and the Bernoulli, the sample average Ȳ is also the maximum likelihood estimator of the population mean μ. Thus, the principles of least squares, method of moments, and maximum likelihood often result in the same estimator. In other cases, the estimators are similar but not identical.
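A compact sketch (my own, under the Bernoulli assumption just mentioned) makes both points numerically: maximizing the log-likelihood in (C.16) and minimizing the sum of squared deviations both land on the sample average:

import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(10)
y = rng.binomial(1, 0.3, size=200)  # Bernoulli(.3) random sample

# Negative Bernoulli log-likelihood: the sum in (C.16) with a minus sign.
def neg_loglik(theta):
    return -np.sum(y * np.log(theta) + (1 - y) * np.log(1 - theta))

# Sum of squared deviations: the least squares objective.
def ssd(m):
    return np.sum((y - m) ** 2)

mle = minimize_scalar(neg_loglik, bounds=(1e-6, 1 - 1e-6), method="bounded").x
ls = minimize_scalar(ssd, bounds=(0.0, 1.0), method="bounded").x

print(mle, ls, y.mean())  # all three essentially coincide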
C.5 Interval Estimation and Confidence Intervals

C.5a The Nature of Interval Estimation

A point estimate obtained from a particular sample does not, by itself, provide enough information for testing economic theories or for informing policy discussions. A point estimate may be the researcher's best guess at the population value, but, by its nature, it provides no information about how close the estimate is "likely" to be to the population parameter. As an example, suppose a researcher reports, on the basis of a random sample of workers, that job training grants increase hourly wage by 6.4%. How are we to know whether or not this is close to the effect in the population of workers who could have been trained? Because we do not know the population value, we cannot know how close an estimate is for a particular sample. However, we can make statements involving probabilities, and this is where interval estimation comes in.

We already know one way of assessing the uncertainty in an estimator: find its sampling standard deviation. Reporting the standard deviation of the estimator, along with the point estimate, provides some information on the accuracy of our estimate. However, even if the problem of the standard deviation's dependence on unknown population parameters is ignored, reporting the standard deviation along with the point estimate makes no direct statement about where the population value is likely to lie in relation to the estimate. This limitation is overcome by constructing a confidence interval.

We illustrate the concept of a confidence interval with an example. Suppose the population has a Normal(μ, 1) distribution and let {Y₁, …, Yₙ} be a random sample from this population. (We assume that the variance of the population is known and equal to unity for the sake of illustration; we then show what to do in the more realistic case that the variance is unknown.) The sample average, Ȳ, has a normal distribution with mean μ and variance 1/n: Ȳ ~ Normal(μ, 1/n). From this, we can standardize Ȳ, and, because the standardized version of Ȳ has a standard normal distribution, we have

$P\Big(-1.96 < \dfrac{\bar{Y} - \mu}{1/\sqrt{n}} < 1.96\Big) = .95$.

The event in parentheses is identical to the event Ȳ − 1.96/√n < μ < Ȳ + 1.96/√n, so

$P(\bar{Y} - 1.96/\sqrt{n} < \mu < \bar{Y} + 1.96/\sqrt{n}) = .95$.   (C.17)

Equation (C.17) is interesting because it tells us that the probability that the random interval [Ȳ − 1.96/√n, Ȳ + 1.96/√n] contains the population mean μ is .95, or 95%. This information allows us to construct an interval estimate of μ, which is obtained by plugging in the sample outcome of the average, ȳ. Thus,

$[\bar{y} - 1.96/\sqrt{n},\ \bar{y} + 1.96/\sqrt{n}]$   (C.18)

is an example of an interval estimate of μ. It is also called a 95% confidence interval. A shorthand notation for this interval is ȳ ± 1.96/√n.

The confidence interval in equation (C.18) is easy to compute, once the sample data {y₁, y₂, …, yₙ} are observed; ȳ is the only factor that depends on the data. For example, suppose that n = 16 and the average of the 16 data points is 7.3. Then, the 95% confidence interval for μ is 7.3 ± 1.96/√16 = 7.3 ± .49, which we can write in interval form as [6.81, 7.79]. By construction, ȳ = 7.3 is in the center of this interval.
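The arithmetic in this example is a one-liner; a minimal Python check of the interval just computed:

import numpy as np

n, ybar = 16, 7.3                # sample size and sample average from the text
half = 1.96 / np.sqrt(n)         # known sigma = 1, as assumed in (C.18)
print(ybar - half, ybar + half)  # [6.81, 7.79], to two decimals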
Unlike its computation, the meaning of a confidence interval is more difficult to understand. When we say that equation (C.18) is a 95% confidence interval for μ, we mean that the random interval

$[\bar{Y} - 1.96/\sqrt{n},\ \bar{Y} + 1.96/\sqrt{n}]$   (C.19)

contains μ with probability .95. In other words, before the random sample is drawn, there is a 95% chance that (C.19) contains μ. Equation (C.19) is an example of an interval estimator. It is a random interval, since the endpoints change with different samples.

A confidence interval is often interpreted as follows: "The probability that μ is in the interval (C.18) is .95." This is incorrect. Once the sample has been observed and ȳ has been computed, the limits of the confidence interval are simply numbers (6.81 and 7.79 in the example just given). The population parameter, μ, though unknown, is also just some number. Therefore, μ either is or is not in the interval (C.18) (and we will never know with certainty which is the case). Probability plays no role once the confidence interval is computed for the particular data at hand. The probabilistic interpretation comes from the fact that, for 95% of all random samples, the constructed confidence interval will contain μ.

To emphasize the meaning of a confidence interval, Table C.2 contains calculations for 20 random samples (or replications) from the Normal(2, 1) distribution with sample size n = 10. For each of the 20 samples, ȳ is obtained, and (C.18) is computed as ȳ ± 1.96/√10 = ȳ ± .62 (each rounded to two decimals). As you can see, the interval changes with each random sample. Nineteen of the 20 intervals contain the population value of μ. Only for replication number 19 is μ not in the confidence interval. In other words, 95% of the samples result in a confidence interval that contains μ. This did not have to be the case with only 20 replications, but it worked out that way for this particular simulation.

Table C.2 (Simulated Confidence Intervals from a Normal(μ, 1) Distribution with μ = 2):

Replication   ȳ       95% Interval     Contains μ?
1            1.98    (1.36, 2.60)     Yes
2            1.43    (0.81, 2.05)     Yes
3            1.65    (1.03, 2.27)     Yes
4            1.88    (1.26, 2.50)     Yes
5            2.34    (1.72, 2.96)     Yes
6            2.58    (1.96, 3.20)     Yes
7            1.58    (0.96, 2.20)     Yes
8            2.23    (1.61, 2.85)     Yes
9            1.96    (1.34, 2.58)     Yes
10           2.11    (1.49, 2.73)     Yes
11           2.15    (1.53, 2.77)     Yes
12           1.93    (1.31, 2.55)     Yes
13           2.02    (1.40, 2.64)     Yes
14           2.10    (1.48, 2.72)     Yes
15           2.18    (1.56, 2.80)     Yes
16           2.10    (1.48, 2.72)     Yes
17           1.94    (1.32, 2.56)     Yes
18           2.21    (1.59, 2.83)     Yes
19           1.16    (0.54, 1.78)     No
20           1.75    (1.13, 2.37)     Yes

C.5b Confidence Intervals for the Mean from a Normally Distributed Population

The confidence interval derived in equation (C.18) helps illustrate how to construct and interpret confidence intervals. In practice, equation (C.18) is not very useful for the mean of a normal population because it assumes that the variance is known to be unity. It is easy to extend (C.18) to the case where the standard deviation, σ, is known to be any value: the 95% confidence interval is

$[\bar{y} - 1.96\sigma/\sqrt{n},\ \bar{y} + 1.96\sigma/\sqrt{n}]$.   (C.20)

Therefore, provided σ is known, a confidence interval for μ is readily constructed.
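A coverage simulation in the spirit of Table C.2 (my own sketch; with many replications, the coverage rate settles near .95 rather than exactly 19 out of 20):

import numpy as np

rng = np.random.default_rng(11)
mu, n, reps = 2.0, 10, 10_000

ybar = rng.normal(mu, 1.0, size=(reps, n)).mean(axis=1)
half = 1.96 / np.sqrt(n)
covered = (ybar - half < mu) & (mu < ybar + half)

print(covered.mean())  # close to .95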
To allow for unknown σ, we must use an estimate. Let

$s = \Big(\dfrac{1}{n-1}\sum_{i=1}^{n}(y_i - \bar{y})^2\Big)^{1/2}$   (C.21)

denote the sample standard deviation. Then, we obtain a confidence interval that depends entirely on the observed data by replacing σ in equation (C.20) with its estimate, s. Unfortunately, this does not preserve the 95% level of confidence because s depends on the particular sample. In other words, the random interval [Ȳ ± 1.96(S/√n)] no longer contains μ with probability .95 because the constant σ has been replaced with the random variable S.

How should we proceed? Rather than using the standard normal distribution, we must rely on the t distribution. The t distribution arises from the fact that

$\dfrac{\bar{Y} - \mu}{S/\sqrt{n}} \sim t_{n-1}$,   (C.22)

where Ȳ is the sample average and S is the sample standard deviation of the random sample {Y₁, …, Yₙ}. We will not prove (C.22); a careful proof can be found in a variety of places [for example, Larsen and Marx (1986, Chapter 7)].

To construct a 95% confidence interval, let c denote the 97.5th percentile in the t_{n−1} distribution. In other words, c is the value such that 95% of the area in the t_{n−1} distribution is between −c and c: P(−c < t_{n−1} < c) = .95. (The value of c depends on the degrees of freedom n − 1, but we do not make this explicit.) The choice of c is illustrated in Figure C.4. [Figure C.4: The 97.5th percentile, c, in a t distribution; the area between −c and c is .95, and each tail has area .025.] Once c has been properly chosen, the random interval [Ȳ − c·S/√n, Ȳ + c·S/√n] contains μ with probability .95. For a particular sample, the 95% confidence interval is calculated as

$[\bar{y} - c \cdot s/\sqrt{n},\ \bar{y} + c \cdot s/\sqrt{n}]$.   (C.23)

The values of c for various degrees of freedom can be obtained from Table G.2 in Appendix G. For example, if n = 20, so that the df is n − 1 = 19, then c = 2.093. Thus, the 95% confidence interval is [ȳ ± 2.093(s/√20)], where ȳ and s are the values obtained from the sample. Even if s = σ (which is very unlikely), the confidence interval in (C.23) is wider than that in (C.20) because c > 1.96. For small degrees of freedom, (C.23) is much wider.

More generally, let c_α denote the 100(1 − α) percentile in the t_{n−1} distribution. Then, a 100(1 − α)% confidence interval is obtained as

$[\bar{y} - c_{\alpha/2}\, s/\sqrt{n},\ \bar{y} + c_{\alpha/2}\, s/\sqrt{n}]$.   (C.24)

Obtaining c_{α/2} requires choosing α and knowing the degrees of freedom n − 1; then, Table G.2 can be used. For the most part, we will concentrate on 95% confidence intervals.

There is a simple way to remember how to construct a confidence interval for the mean of a normal distribution. Recall that sd(Ȳ) = σ/√n. Thus, s/√n is the point estimate of sd(Ȳ). The associated random variable, S/√n, is sometimes called the standard error of Ȳ. Because what shows up in formulas is the point estimate s/√n, we define the standard error of ȳ as se(ȳ) = s/√n. Then, (C.24) can be written in shorthand as

$[\bar{y} \pm c_{\alpha/2} \cdot \mathrm{se}(\bar{y})]$.   (C.25)

This equation shows why the notion of the standard error of an estimate plays an important role in econometrics.
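In place of Table G.2, the percentile c_{α/2} can be pulled from any statistical library. A small sketch (mine) reproduces the c = 2.093 used in the example below and shows the percentiles approaching 1.96 as the df grow:

from scipy.stats import t

for df in (5, 19, 60, 1000):
    # 97.5th percentile of t_df: 2.093 at df = 19, approaching 1.96
    print(df, t.ppf(0.975, df))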
At this point, Example C.2 is mostly illustrative because it has some potentially serious flaws as an econometric analysis. Most importantly, it assumes that any systematic reduction in scrap rates is due to the job training grants. But many things can happen over the course of the year to change worker productivity. From this analysis, we have no way of knowing whether the fall in average scrap rates is attributable to the job training grants or if, at least partly, some external force is responsible.

C.5c A Simple Rule of Thumb for a 95% Confidence Interval

The confidence interval in (C.25) can be computed for any sample size and any confidence level. As we saw in Section B.5, the t distribution approaches the standard normal distribution as the degrees of freedom gets large. In particular, for $\alpha = .05$, $c_{\alpha/2} \to 1.96$ as $n \to \infty$, although $c_{\alpha/2}$ is always greater than 1.96 for each $n$. A rule of thumb for an approximate 95% confidence interval is

$[\bar{y} \pm 2\,se(\bar{y})].$  (C.26)

In other words, we obtain $\bar{y}$ and its standard error, and then compute $\bar{y}$ plus or minus twice its standard error to obtain the confidence interval. This is slightly too wide for very large $n$, and it is too narrow for small $n$. As we can see from Example C.2, even for an $n$ as small as 20, (C.26) is in the ballpark for a 95% confidence interval for the mean from a normal distribution. This means we can get pretty close to a 95% confidence interval without having to refer to t tables. (A quick numerical comparison appears after Table C.3.)

Table C.3  Scrap Rates for 20 Michigan Manufacturing Firms

Firm      1987    1988    Change
1         10      3       -7
2         1       1       0
3         6       5       -1
4         .45     .5      .05
5         1.25    1.54    .29
6         1.3     1.5     .2
7         1.06    .8      -.26
8         3       2       -1
9         8.18    .67     -7.51
10        1.67    1.17    -.5
11        .98     .51     -.47
12        1       .5      -.5
13        .45     .61     .16
14        5.03    6.70    1.67
15        8       4       -4
16        9       7       -2
17        18      19      1
18        .28     .2      -.08
19        7       5       -2
20        3.97    3.83    -.14
Average   4.38    3.23    -1.15
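To see how the rule of thumb in (C.26) compares with the exact t interval, the short sketch below tabulates the exact 97.5th percentile against the rule-of-thumb multiplier 2 for a few sample sizes (the sample sizes are our own choices). For small n the exact value exceeds 2 (the rule is too narrow), and for large n it falls below 2 (the rule is slightly too wide).

```python
from scipy import stats

# Exact t critical values versus the rule-of-thumb multiplier 2
for n in (10, 20, 60, 120):
    c = stats.t.ppf(.975, df=n - 1)
    print(f"n = {n:3d}: exact c = {c:.3f}, rule of thumb uses 2")
```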
C.5d Asymptotic Confidence Intervals for Nonnormal Populations

In some applications, the population is clearly nonnormal. A leading case is the Bernoulli distribution, where the random variable takes on only the values zero and one. In other cases, the nonnormal population has no standard distribution. This does not matter, provided the sample size is sufficiently large for the central limit theorem to give a good approximation for the distribution of the sample average, $\bar{Y}$. For large $n$, an approximate 95% confidence interval is

$[\bar{y} \pm 1.96\,se(\bar{y})],$  (C.27)

where the value 1.96 is the 97.5th percentile in the standard normal distribution. Mechanically, computing an approximate confidence interval does not differ from the normal case. A slight difference is that the number multiplying the standard error comes from the standard normal distribution, rather than the t distribution, because we are using asymptotics. Because the t distribution approaches the standard normal as the df increases, equation (C.25) is also perfectly legitimate as an approximate 95% interval; some prefer this to (C.27) because the former is exact for normal populations.

Example C.3 (Race Discrimination in Hiring)

The Urban Institute conducted a study in 1988 in Washington, D.C., to examine the extent of race discrimination in hiring. Five pairs of people interviewed for several jobs. In each pair, one person was black and the other person was white. They were given résumés indicating that they were virtually the same in terms of experience, education, and other factors that determine job qualification. The idea was to make individuals as similar as possible, with the exception of race. Each person in a pair interviewed for the same job, and the researchers recorded which applicant received a job offer. This is an example of a matched pairs analysis, where each trial consists of data on two people (or two firms, two cities, and so on) that are thought to be similar in many respects but different in one important characteristic.

Let $\theta_B$ denote the probability that the black person is offered a job, and let $\theta_W$ be the probability that the white person is offered a job. We are primarily interested in the difference, $\theta_B - \theta_W$. Let $B_i$ denote a Bernoulli variable equal to one if the black person gets a job offer from employer $i$, and zero otherwise. Similarly, $W_i = 1$ if the white person gets a job offer from employer $i$, and zero otherwise. Pooling across the five pairs of people, there were a total of n = 241 trials (pairs of interviews with employers). Unbiased estimators of $\theta_B$ and $\theta_W$ are $\bar{B}$ and $\bar{W}$, the fractions of interviews for which blacks and whites were offered jobs, respectively.

To put this into the framework of computing a confidence interval for a population mean, define a new variable $Y_i = B_i - W_i$. Now, $Y_i$ can take on three values: $-1$ if the black person did not get the job but the white person did, 0 if both people either did or did not get the job, and 1 if the black person got the job and the white person did not. Then, $\mu \equiv E(Y_i) = E(B_i) - E(W_i) = \theta_B - \theta_W$.

The distribution of $Y_i$ is certainly not normal—it is discrete and takes on only three values. Nevertheless, an approximate confidence interval for $\theta_B - \theta_W$ can be obtained by using large-sample methods.

The data from the Urban Institute audit study are in the file AUDIT. Using the 241 observed data points, $\bar{b} = .224$ and $\bar{w} = .357$, so $\bar{y} = .224 - .357 = -.133$. Thus, 22.4% of black applicants were offered jobs, while 35.7% of white applicants were offered jobs.
This is prima facie evidence of discrimination against blacks, but we can learn much more by computing a confidence interval for $\mu$. To compute an approximate 95% confidence interval, we need the sample standard deviation. This turns out to be $s = .482$ [using equation (C.21)]. Using (C.27), we obtain a 95% CI for $\mu = \theta_B - \theta_W$ as $-.133 \pm 1.96(.482/\sqrt{241}) = -.133 \pm .061 = [-.194, -.072]$. The approximate 99% CI is $-.133 \pm 2.58(.482/\sqrt{241}) = [-.213, -.053]$. Naturally, this contains a wider range of values than the 95% CI. But even the 99% CI does not contain the value zero. Thus, we are very confident that the population difference $\theta_B - \theta_W$ is not zero.
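Because the intervals in Example C.3 depend only on $n$, $\bar{y}$, and $s$, they can be reproduced without the AUDIT file itself. A minimal sketch (the z multipliers 1.96 and 2.58 follow the asymptotic interval (C.27); tiny rounding differences from the text are possible):

```python
import math

n, ybar, s = 241, -.133, .482
se = s / math.sqrt(n)                 # standard error of ybar, about .031

for conf, z in ((95, 1.96), (99, 2.58)):
    lo, hi = ybar - z * se, ybar + z * se
    print(f"{conf}% CI for theta_B - theta_W: [{lo:.3f}, {hi:.3f}]")
```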
Before we turn to hypothesis testing, it is useful to review the various population and sample quantities that measure the spreads in the population distributions and the sampling distributions of the estimators. These quantities appear often in statistical analysis, and extensions of them are important for the regression analysis in the main text. The quantity $\sigma$ is the (unknown) population standard deviation; it is a measure of the spread in the distribution of $Y$. When we divide $\sigma$ by $\sqrt{n}$, we obtain the sampling standard deviation of $\bar{Y}$ (the sample average). While $\sigma$ is a fixed feature of the population, $sd(\bar{Y}) = \sigma/\sqrt{n}$ shrinks to zero as $n \to \infty$: our estimator of $\mu$ gets more and more precise as the sample size grows.

The estimate of $\sigma$ for a particular sample, $s$, is called the sample standard deviation because it is obtained from the sample. (We also call the underlying random variable, $S$, which changes across different samples, the sample standard deviation.) Like $\bar{y}$ as an estimate of $\mu$, $s$ is our "best guess" at $\sigma$ given the sample at hand. The quantity $s/\sqrt{n}$ is what we call the standard error of $\bar{y}$, and it is our best estimate of $\sigma/\sqrt{n}$. Confidence intervals for the population parameter $\mu$ depend directly on $se(\bar{y}) = s/\sqrt{n}$. Because this standard error shrinks to zero as the sample size grows, a larger sample size generally means a smaller confidence interval. Thus, we see clearly that one benefit of more data is that they result in narrower confidence intervals. The notion of the standard error of an estimate, which in the vast majority of cases shrinks to zero at the rate $1/\sqrt{n}$, plays a fundamental role in hypothesis testing (as we will see in the next section) and for confidence intervals and testing in the context of multiple regression (as discussed in Chapter 4).

C.6 Hypothesis Testing

So far, we have reviewed how to evaluate point estimators, and we have seen—in the case of a population mean—how to construct and interpret confidence intervals. But sometimes the question we are interested in has a definite yes or no answer. Here are some examples: (1) Does a job training program effectively increase average worker productivity? (see Example C.2); (2) Are blacks discriminated against in hiring? (see Example C.3); (3) Do stiffer state drunk driving laws reduce the number of drunk driving arrests? Devising methods for answering such questions, using a sample of data, is known as hypothesis testing.

C.6a Fundamentals of Hypothesis Testing

To illustrate the issues involved with hypothesis testing, consider an election example. Suppose there are two candidates in an election, Candidates A and B. Candidate A is reported to have received 42% of the popular vote, while Candidate B received 58%. These are supposed to represent the true percentages in the voting population, and we treat them as such.

Candidate A is convinced that more people must have voted for him, so he would like to investigate whether the election was rigged. Knowing something about statistics, Candidate A hires a consulting agency to randomly sample 100 voters to record whether or not each person voted for him. Suppose that, for the sample collected, 53 people voted for Candidate A. This sample estimate of 53% clearly exceeds the reported population value of 42%. Should Candidate A conclude that the election was indeed a fraud?

While it appears that the votes for Candidate A were undercounted, we cannot be certain. Even if only 42% of the population voted for Candidate A, it is possible that, in a sample of 100, we observe 53 people who did vote for Candidate A. The question is: How strong is the sample evidence against the officially reported percentage of 42%?

One way to proceed is to set up a hypothesis test. Let $\theta$ denote the true proportion of the population voting for Candidate A. The hypothesis that the reported results are accurate can be stated as

$H_0\colon \theta = .42.$  (C.28)

This is an example of a null hypothesis. We always denote the null hypothesis by $H_0$. In hypothesis testing, the null hypothesis plays a role similar to that of a defendant on trial in many judicial systems: just as a defendant is presumed to be innocent until proven guilty, the null hypothesis is presumed to be true until the data strongly suggest otherwise. In the current example, Candidate A must present fairly strong evidence against (C.28) in order to win a recount.

The alternative hypothesis in the election example is that the true proportion voting for Candidate A in the election is greater than .42:

$H_1\colon \theta > .42.$  (C.29)

In order to conclude that $H_0$ is false and that $H_1$ is true, we must have evidence "beyond reasonable doubt" against $H_0$. How many votes out of 100 would be needed before we feel the evidence is strongly against $H_0$? Most would agree that observing 43 votes out of a sample of 100 is not enough to overturn the original election results; such an outcome is well within the expected sampling variation. On the other hand, we do not need to observe 100 votes for Candidate A to cast doubt on $H_0$. Whether 53 out of 100 is enough to reject $H_0$ is much less clear. The answer depends on how we quantify "beyond reasonable doubt."
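One way to get a feel for "beyond reasonable doubt" here is to compute how likely 53 or more votes out of 100 would be if $\theta$ really were .42. The sketch below (our own illustration, previewing the p-value idea of Section C.6d) shows both the exact binomial tail and its CLT approximation.

```python
import math
from scipy import stats

n, theta0, observed = 100, .42, 53

# Exact binomial tail: P(X >= 53) when X ~ Binomial(100, .42)
p_exact = stats.binom.sf(observed - 1, n, theta0)

# CLT approximation: X is roughly Normal(n*theta0, n*theta0*(1 - theta0))
sd = math.sqrt(n * theta0 * (1 - theta0))
p_approx = stats.norm.sf((observed - n * theta0) / sd)

print(f"P(X >= {observed} | theta = .42): exact {p_exact:.4f}, normal approx {p_approx:.4f}")
```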
Before we turn to the issue of quantifying uncertainty in hypothesis testing, we should head off some possible confusion. You may have noticed that the hypotheses in equations (C.28) and (C.29) do not exhaust all possibilities: it could be that $\theta$ is less than .42. For the application at hand, we are not particularly interested in that possibility; it has nothing to do with overturning the results of the election. Therefore, we can just state at the outset that we are ignoring alternatives $\theta$ with $\theta < .42$. Nevertheless, some authors prefer to state null and alternative hypotheses so that they are exhaustive, in which case our null hypothesis should be $H_0\colon \theta \le .42$. Stated in this way, the null hypothesis is a composite null hypothesis because it allows for more than one value under $H_0$. [By contrast, equation (C.28) is an example of a simple null hypothesis.] For these kinds of examples, it does not matter whether we state the null as in (C.28) or as a composite null: the most difficult value to reject if $\theta \le .42$ is $\theta = .42$. That is, if we reject the value $\theta = .42$, against $\theta > .42$, then logically we must reject any value less than .42. Therefore, our testing procedure based on (C.28) leads to the same test as if $H_0\colon \theta \le .42$. In this text, we always state a null hypothesis as a simple null hypothesis.

In hypothesis testing, we can make two kinds of mistakes. First, we can reject the null hypothesis when it is in fact true. This is called a Type I error. In the election example, a Type I error occurs if we reject $H_0$ when the true proportion of people voting for Candidate A is in fact .42. The second kind of error is failing to reject $H_0$ when it is actually false. This is called a Type II error. In the election example, a Type II error occurs if $\theta > .42$ but we fail to reject $H_0$.

After we have made the decision of whether or not to reject the null hypothesis, we have either decided correctly or we have committed an error. We will never know with certainty whether an error was committed. However, we can compute the probability of making either a Type I or a Type II error. Hypothesis testing rules are constructed to make the probability of committing a Type I error fairly small. Generally, we define the significance level (or simply the level) of a test as the probability of a Type I error; it is typically denoted by $\alpha$. Symbolically, we have

$\alpha = P(\text{Reject } H_0 \mid H_0).$  (C.30)

The right-hand side is read as: "The probability of rejecting $H_0$ given that $H_0$ is true."

Classical hypothesis testing requires that we initially specify a significance level for a test. When we specify a value for $\alpha$, we are essentially quantifying our tolerance for a Type I error. Common values for $\alpha$ are .10, .05, and .01. If $\alpha = .05$, then the researcher is willing to falsely reject $H_0$ 5% of the time, in order to detect deviations from $H_0$.

Once we have chosen the significance level, we would then like to minimize the probability of a Type II error. Alternatively, we would like to maximize the power of a test against all relevant alternatives. The power of a test is just one minus the probability of a Type II error. Mathematically,

$\pi(\theta) = P(\text{Reject } H_0 \mid \theta) = 1 - P(\text{Type II} \mid \theta),$

where $\theta$ denotes the actual value of the parameter. Naturally, we would like the power to equal unity whenever the null hypothesis is false. But this is impossible to achieve while keeping the significance level small. Instead, we choose our tests to maximize the power for a given significance level.
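The power function $\pi(\theta)$ is rarely computed by hand, but it is easy to approximate by simulation. The rough sketch below (our own construction; the sample size, true means, and seed are arbitrary choices) estimates the power of the one-sided 5% t test of $H_0\colon \mu = 0$ against $H_1\colon \mu > 0$ with n = 30 draws from a Normal($\mu$,1) population. At $\mu = 0$, the rejection rate should be near the size, .05; it rises toward one as $\mu$ moves away from zero.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n, alpha, reps = 30, .05, 5000
c = stats.t.ppf(1 - alpha, df=n - 1)      # one-sided critical value

for mu in (0.0, 0.2, 0.5):
    y = rng.normal(mu, 1.0, size=(reps, n))
    t = y.mean(axis=1) / (y.std(axis=1, ddof=1) / np.sqrt(n))
    print(f"true mu = {mu}: rejection rate = {np.mean(t > c):.3f}")
```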
C.6b Testing Hypotheses about the Mean in a Normal Population

In order to test a null hypothesis against an alternative, we need to choose a test statistic (or statistic, for short) and a critical value. The choices for the statistic and critical value are based on convenience and on the desire to maximize power given a significance level for the test. In this subsection, we review how to test hypotheses for the mean of a normal population.

A test statistic, denoted $T$, is some function of the random sample. When we compute the statistic for a particular outcome, we obtain an outcome of the test statistic, which we will denote by $t$.

Given a test statistic, we can define a rejection rule that determines when $H_0$ is rejected in favor of $H_1$. In this text, all rejection rules are based on comparing the value of a test statistic, $t$, to a critical value, $c$. The values of $t$ that result in rejection of the null hypothesis are collectively known as the rejection region. To determine the critical value, we must first decide on a significance level of the test. Then, given $\alpha$, the critical value associated with $\alpha$ is determined by the distribution of $T$, assuming that $H_0$ is true. We will write this critical value as $c$, suppressing the fact that it depends on $\alpha$.

Testing hypotheses about the mean $\mu$ from a Normal($\mu, \sigma^2$) population is straightforward. The null hypothesis is stated as

$H_0\colon \mu = \mu_0,$  (C.31)

where $\mu_0$ is a value that we specify. In the majority of applications, $\mu_0 = 0$, but the general case is no more difficult.

The rejection rule we choose depends on the nature of the alternative hypothesis. The three alternatives of interest are

$H_1\colon \mu > \mu_0,$  (C.32)

$H_1\colon \mu < \mu_0,$  (C.33)

and

$H_1\colon \mu \ne \mu_0.$  (C.34)

Equation (C.32) gives a one-sided alternative, as does (C.33). When the alternative hypothesis is (C.32), the null is effectively $H_0\colon \mu \le \mu_0$, since we reject $H_0$ only when $\mu > \mu_0$. This is appropriate when we are interested in the value of $\mu$ only when $\mu$ is at least as large as $\mu_0$. Equation (C.34) is a two-sided alternative. This is appropriate when we are interested in any departure from the null hypothesis.

Consider first the alternative in (C.32). Intuitively, we should reject $H_0$ in favor of $H_1$ when the value of the sample average, $\bar{y}$, is "sufficiently" greater than $\mu_0$. But how should we determine when $\bar{y}$ is large enough for $H_0$ to be rejected at the chosen significance level? This requires knowing the probability of rejecting the null hypothesis when it is true. Rather than working directly with $\bar{y}$, we use its standardized version, where $\sigma$ is replaced with the sample standard deviation, $s$:

$t = \sqrt{n}\,(\bar{y} - \mu_0)/s = (\bar{y} - \mu_0)/se(\bar{y}),$  (C.35)

where $se(\bar{y}) = s/\sqrt{n}$ is the standard error of $\bar{y}$. Given the sample of data, it is easy to obtain $t$. We work with $t$ because, under the null hypothesis, the random variable

$T = \sqrt{n}\,(\bar{Y} - \mu_0)/S$

has a $t_{n-1}$ distribution. Now, suppose we have settled on a 5% significance level. Then, the critical value $c$ is chosen so that $P(T > c \mid H_0) = .05$; that is, the probability of a Type I error is 5%. Once we have found $c$, the rejection rule is

$t > c,$  (C.36)

where $c$ is the $100(1-\alpha)$ percentile in a $t_{n-1}$ distribution; as a percent, the significance level is $100\cdot\alpha$%. This is an example of a one-tailed test because the rejection region is in one tail of the t distribution. For a 5% significance level, $c$ is the 95th percentile in the $t_{n-1}$ distribution; this is illustrated in Figure C.5. A different significance level leads to a different critical value.

[Figure C.5: Rejection region for a 5% significance level test against the one-sided alternative $\mu > \mu_0$; the rejection area to the right of $c$ is .05, and the area to the left of $c$ is .95.]

The statistic in equation (C.35) is often called the t statistic for testing $H_0\colon \mu = \mu_0$. The t statistic measures the distance from $\bar{y}$ to $\mu_0$ relative to the standard error of $\bar{y}$, $se(\bar{y})$.

Example C.4 (Effect of Enterprise Zones on Business Investments)

In the population of cities granted enterprise zones in a particular state [see Papke (1994) for Indiana], let $Y$ denote the percentage change in investment from the year before to the year after a city became an enterprise zone. Assume that $Y$ has a Normal($\mu, \sigma^2$) distribution. The null hypothesis that enterprise zones have no effect on business investment is $H_0\colon \mu = 0$; the alternative that they have a positive effect is $H_1\colon \mu > 0$. (We assume that they do not have a negative effect.) Suppose that we wish to test $H_0$ at the 5% level. The test statistic in this case is

$t = \frac{\bar{y}}{s/\sqrt{n}} = \frac{\bar{y}}{se(\bar{y})}.$  (C.37)

Suppose that we have a sample of 36 cities that are granted enterprise zones. Then, the critical value is c = 1.69 (see Table G.2), and we reject $H_0$ in favor of $H_1$ if $t > 1.69$. Suppose that the sample yields $\bar{y} = 8.2$ and $s = 23.9$. Then, $t \approx 2.06$, and $H_0$ is therefore rejected at the 5% level. Thus, we conclude that, at the 5% significance level, enterprise zones have an effect on average investment. The 1% critical value is 2.44, so $H_0$ is not rejected at the 1% level. The same caveat holds here as in Example C.2: we have not controlled for other factors that might affect investment in cities over time, so we cannot claim that the effect is causal.
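From the summary statistics in Example C.4, the test is a few lines of code. The sketch below also pulls the exact critical values from the t distribution rather than reading Table G.2 (they match the rounded 1.69 and 2.44 quoted above).

```python
import math
from scipy import stats

n, ybar, s = 36, 8.2, 23.9
t = ybar / (s / math.sqrt(n))               # t statistic for H0: mu = 0, about 2.06

for alpha in (.05, .01):
    c = stats.t.ppf(1 - alpha, df=n - 1)    # one-sided critical value
    verdict = "reject" if t > c else "fail to reject"
    print(f"alpha = {alpha}: t = {t:.2f}, c = {c:.2f} -> {verdict} H0")
```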
The rejection rule is similar for the one-sided alternative (C.33). A test with a significance level of $100\cdot\alpha$% rejects $H_0$ against (C.33) whenever

$t < -c;$  (C.38)

in other words, we are looking for negative values of the t statistic—which implies $\bar{y} < \mu_0$—that are sufficiently far from zero to reject $H_0$.

For two-sided alternatives, we must be careful to choose the critical value so that the significance level of the test is still $\alpha$. If $H_1$ is given by $H_1\colon \mu \ne \mu_0$, then we reject $H_0$ if $\bar{y}$ is far from $\mu_0$ in absolute value: a $\bar{y}$ much larger or much smaller than $\mu_0$ provides evidence against $H_0$ in favor of $H_1$. A $100\cdot\alpha$% level test is obtained from the rejection rule

$|t| > c,$  (C.39)

where $|t|$ is the absolute value of the t statistic in (C.35). This gives a two-tailed test. We must now be careful in choosing the critical value: $c$ is the $100(1-\alpha/2)$ percentile in the $t_{n-1}$ distribution. For example, if $\alpha = .05$, then the critical value is the 97.5th percentile in the $t_{n-1}$ distribution. This ensures that $H_0$ is rejected only 5% of the time when it is true (see Figure C.6). For example, if n = 22, then the critical value is c = 2.08, the 97.5th percentile in a $t_{21}$ distribution (see Table G.2). The absolute value of the t statistic must exceed 2.08 in order to reject $H_0$ against $H_1$ at the 5% level.

[Figure C.6: Rejection region for a 5% significance level test against the two-sided alternative $H_1\colon \mu \ne \mu_0$; the rejection regions in each tail beyond $\pm c$ have area .025, and the area between $-c$ and $c$ is .95.]
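The critical values for the one-sided and two-sided rejection rules come from the same t distribution; this small sketch tabulates both for n = 22 and $\alpha$ = .05 (the two-sided value reproduces the 2.08 used above).

```python
from scipy import stats

n, alpha = 22, .05
df = n - 1

c_one = stats.t.ppf(1 - alpha, df)        # one-sided: reject if t > c (or t < -c)
c_two = stats.t.ppf(1 - alpha / 2, df)    # two-sided: reject if |t| > c
print(f"one-sided c = {c_one:.3f}, two-sided c = {c_two:.3f}")
```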
It is important to know the proper language of hypothesis testing. Sometimes, the appropriate phrase "we fail to reject $H_0$ in favor of $H_1$ at the 5% significance level" is replaced with "we accept $H_0$ at the 5% significance level." The latter wording is incorrect. With the same set of data, there are usually many hypotheses that cannot be rejected. In the earlier election example, it would be logically inconsistent to say that $H_0\colon \theta = .42$ and $H_0\colon \theta = .43$ are both "accepted," since only one of these can be true. But it is entirely possible that neither of these hypotheses is rejected. For this reason, we always say "fail to reject $H_0$" rather than "accept $H_0$."

C.6c Asymptotic Tests for Nonnormal Populations

If the sample size is large enough to invoke the central limit theorem (see Section C.3), the mechanics of hypothesis testing for population means are the same whether or not the population distribution is normal. The theoretical justification comes from the fact that, under the null hypothesis,

$T = \sqrt{n}\,(\bar{Y} - \mu_0)/S \overset{a}{\sim} \text{Normal}(0,1).$

Therefore, with large $n$, we can compare the t statistic in (C.35) with the critical values from a standard normal distribution. Because the $t_{n-1}$ distribution converges to the standard normal distribution as $n$ gets large, the t and standard normal critical values will be very close for extremely large $n$. Because asymptotic theory is based on $n$ increasing without bound, it cannot tell us whether the standard normal or t critical values are better. For moderate values of $n$, say between 30 and 60, it is traditional to use the t distribution because we know this is correct for normal populations. For $n \ge 120$, the choice between the t and standard normal distributions is largely irrelevant because the critical values are practically the same.

Because the critical values chosen using either the standard normal or t distribution are only approximately valid for nonnormal populations, our chosen significance levels are also only approximate; thus, for nonnormal populations, our significance levels are really asymptotic significance levels. Thus, if we choose a 5% significance level, but our population is nonnormal, then the actual significance level will be larger or smaller than 5% (and we cannot know which is the case). When the sample size is large, the actual significance level will be very close to 5%. Practically speaking, the distinction is not important, so we will now drop the qualifier "asymptotic."
Example C.5 (Race Discrimination in Hiring)

In the Urban Institute study of discrimination in hiring (see Example C.3), using the data in AUDIT, we are primarily interested in testing $H_0\colon \mu = 0$ against $H_1\colon \mu < 0$, where $\mu = \theta_B - \theta_W$ is the difference in probabilities that blacks and whites receive job offers. Recall that $\mu$ is the population mean of the variable $Y = B - W$, where $B$ and $W$ are binary indicators. Using the n = 241 paired comparisons in the data file AUDIT, we obtained $\bar{y} = -.133$ and $se(\bar{y}) = .482/\sqrt{241} \approx .031$. The t statistic for testing $H_0\colon \mu = 0$ is $t = -.133/.031 \approx -4.29$. You will remember from Appendix B that the standard normal distribution is, for practical purposes, indistinguishable from the t distribution with 240 degrees of freedom. The value $-4.29$ is so far out in the left tail of the distribution that we reject $H_0$ at any reasonable significance level. In fact, the .005 (one-half of a percent) critical value for the one-sided test is about $-2.58$. A t value of $-4.29$ is very strong evidence against $H_0$ in favor of $H_1$. Hence, we conclude that there is discrimination in hiring.

C.6d Computing and Using p-Values

The traditional requirement of choosing a significance level ahead of time means that different researchers, using the same data and same procedure to test the same hypothesis, could wind up with different conclusions. Reporting the significance level at which we are carrying out the test solves this problem to some degree, but it does not completely remove the problem.

To provide more information, we can ask the following question: What is the largest significance level at which we could carry out the test and still fail to reject the null hypothesis? This value is known as the p-value of a test (sometimes called the prob-value). Compared with choosing a significance level ahead of time and obtaining a critical value, computing a p-value is somewhat more difficult. But with the advent of quick and inexpensive computing, p-values are now fairly easy to obtain.

As an illustration, consider the problem of testing $H_0\colon \mu = 0$ in a Normal($\mu, \sigma^2$) population. Our test statistic in this case is $T = \sqrt{n}\,\bar{Y}/S$, and we assume that $n$ is large enough to treat $T$ as having a standard normal distribution under $H_0$. Suppose that the observed value of $T$ for our sample is $t = 1.52$. (Note how we have skipped the step of choosing a significance level.) Now that we have seen the value $t$, we can find the largest significance level at which we would fail to reject $H_0$. This is the significance level associated with using $t$ as our critical value. Because our test statistic $T$ has a standard normal distribution under $H_0$, we have

$\text{p-value} = P(T > 1.52 \mid H_0) = 1 - \Phi(1.52) = .065,$  (C.40)

where $\Phi(\cdot)$ denotes the standard normal cdf. In other words, the p-value in this example is simply the area to the right of 1.52, the observed value of the test statistic, in a standard normal distribution. (See Figure C.7 for illustration.)

[Figure C.7: The p-value when t = 1.52 for the one-sided alternative $\mu > \mu_0$; the p-value is the area (.065) to the right of 1.52.]

Because the p-value = .065, the largest significance level at which we can carry out this test and fail to reject is 6.5%. If we carry out the test at a level below 6.5% (such as at 5%), we fail to reject $H_0$. If we carry out the test at a level larger than 6.5% (such as 10%), we reject $H_0$. With the p-value at hand, we can carry out the test at any level.

The p-value in this example has another useful interpretation: it is the probability that we observe a value of $T$ as large as 1.52 when the null hypothesis is true. If the null hypothesis is actually true, we would observe a value of $T$ as large as 1.52 due to chance only 6.5% of the time. Whether this is small enough to reject $H_0$ depends on our tolerance for a Type I error. The p-value has a similar interpretation in all other cases, as we will see.

Generally, small p-values are evidence against $H_0$, since they indicate that the outcome of the data occurs with small probability if $H_0$ is true.
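Computing (C.40) takes one line with a normal cdf. This sketch also evaluates the other two t values discussed next (2.85 and .47), so all three p-values can be checked at once; small rounding differences from the text's tables (for instance, .064 versus .065 at t = 1.52) are expected.

```python
from scipy import stats

for t in (1.52, 2.85, .47):
    p = stats.norm.sf(t)   # 1 - Phi(t): upper-tail p-value for H1: mu > mu0
    print(f"t = {t:4.2f}: one-sided p-value = {p:.3f}")
```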
In the previous example, if $t$ had been a larger value, say, $t = 2.85$, then the p-value would be $1 - \Phi(2.85) \approx .002$. This means that, if the null hypothesis were true, we would observe a value of $T$ as large as 2.85 with probability .002. How do we interpret this? Either we obtained a very unusual sample or the null hypothesis is false. Unless we have a very small tolerance for Type I error, we would reject the null hypothesis. On the other hand, a large p-value is weak evidence against $H_0$. If we had gotten $t = .47$ in the previous example, then the p-value $= 1 - \Phi(.47) = .32$. Observing a value of $T$ larger than .47 happens with probability .32, even when $H_0$ is true; this is large enough so that there is insufficient doubt about $H_0$, unless we have a very high tolerance for Type I error.

For hypothesis testing about a population mean using the t distribution, we need detailed tables in order to compute p-values. Table G.2 only allows us to put bounds on p-values. Fortunately, many statistics and econometrics packages now compute p-values routinely, and they also provide calculation of cdfs for the t and other distributions used for computing p-values.

Example C.6 (Effect of Job Training Grants on Worker Productivity)

Consider again the Holzer et al. (1993) data in Example C.2. From a policy perspective, there are two questions of interest. First, what is our best estimate of the mean change in scrap rates, $\mu$? We have already obtained this for the sample of 20 firms listed in Table C.3: the sample average of the change in scrap rates is $-1.15$. Relative to the initial average scrap rate in 1987, this represents a fall in the scrap rate of about 26.3% ($-1.15/4.38 \approx -.263$), which is a nontrivial effect.

We would also like to know whether the sample provides strong evidence for an effect in the population of manufacturing firms that could have received grants. The null hypothesis is $H_0\colon \mu = 0$, and we test this against $H_1\colon \mu < 0$, where $\mu$ is the average change in scrap rates. Under the null, the job training grants have no effect on average scrap rates. The alternative states that there is an effect. We do not care about the alternative $\mu > 0$, so the null hypothesis is effectively $H_0\colon \mu \ge 0$.

Since $\bar{y} = -1.15$ and $se(\bar{y}) = .54$, $t = -1.15/.54 = -2.13$. This is below the 5% critical value of $-1.73$ (from a $t_{19}$ distribution) but above the 1% critical value, $-2.54$. The p-value in this case is computed as

$\text{p-value} = P(T_{19} < -2.13),$  (C.41)

where $T_{19}$ represents a t-distributed random variable with 19 degrees of freedom.

[Figure C.8: The p-value when t = −2.13 with 19 degrees of freedom, for the one-sided alternative $\mu < 0$; the p-value is the area (.023) to the left of −2.13.]
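The exact probability in (C.41) is what a statistical package reports; with scipy it is one call, and the result (about .023) matches the value quoted below.

```python
from scipy import stats

t, df = -2.13, 19
p = stats.t.cdf(t, df)   # P(T_19 < -2.13), the lower-tail p-value
print(f"one-sided p-value = {p:.3f}")
```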
The inequality is reversed from (C.40) because the alternative has the form in (C.33). The probability in (C.41) is the area to the left of $-2.13$ in a $t_{19}$ distribution (see Figure C.8). Using Table G.2, the most we can say is that the p-value is between .025 and .01, but it is closer to .025 (since the 97.5th percentile is about 2.09). Using a statistical package, such as Stata, we can compute the exact p-value. It turns out to be about .023, which is reasonable evidence against $H_0$. This is certainly enough evidence to reject the null hypothesis that the training grants had no effect at the 2.5% significance level (and therefore at the 5% level).

Computing a p-value for a two-sided test is similar, but we must account for the two-sided nature of the rejection rule. For t testing about population means, the p-value is computed as

$P(|T_{n-1}| > |t|) = 2P(T_{n-1} > |t|),$  (C.42)

where $t$ is the value of the test statistic and $T_{n-1}$ is a t random variable. (For large $n$, replace $T_{n-1}$ with a standard normal random variable.) Thus, compute the absolute value of the t statistic, find the area to the right of this value in a $t_{n-1}$ distribution, and multiply the area by two.

For nonnormal populations, the exact p-value can be difficult to obtain. Nevertheless, we can find asymptotic p-values by using the same calculations. These p-values are valid for large sample sizes. For $n$ larger than, say, 120, we might as well use the standard normal distribution. Table G.1 is detailed enough to get accurate p-values, but we can also use a statistics or econometrics program.

Example C.7 (Race Discrimination in Hiring)

Using the matched pairs data from the Urban Institute (n = 241) in the AUDIT data file, we obtained $t = -4.29$. If $Z$ is a standard normal random variable, $P(Z < -4.29)$ is, for practical purposes, zero. In other words, the asymptotic p-value for this example is essentially zero. This is very strong evidence against $H_0$.

Summary of How to Use p-Values:

(i) Choose a test statistic, $T$, and decide on the nature of the alternative. This determines whether the rejection rule is $t > c$, $t < -c$, or $|t| > c$.

(ii) Use the observed value of the t statistic as the critical value and compute the corresponding significance level of the test. This is the p-value. If the rejection rule is of the form $t > c$, then p-value $= P(T > t)$. If the rejection rule is $t < -c$, then p-value $= P(T < t)$; if the rejection rule is $|t| > c$, then p-value $= P(|T| > |t|)$.

(iii) If a significance level $\alpha$ has been chosen, then we reject $H_0$ at the $100\cdot\alpha$% level if p-value $< \alpha$. If p-value $\ge \alpha$, then we fail to reject $H_0$ at the $100\cdot\alpha$% level. Therefore, it is a small p-value that leads to rejection of the null hypothesis.
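The three p-value rules in the summary above are easy to package into a single helper function (the function name `pvalue` and the `alternative` labels are our own, purely illustrative); the two-sided branch implements (C.42).

```python
from scipy import stats

def pvalue(t, df, alternative):
    """p-value for a t statistic with the given df and alternative."""
    if alternative == "greater":    # rejection rule t > c
        return stats.t.sf(t, df)
    if alternative == "less":       # rejection rule t < -c
        return stats.t.cdf(t, df)
    return 2 * stats.t.sf(abs(t), df)   # two-sided: 2*P(T > |t|), as in (C.42)

print(pvalue(-2.13, 19, "less"))        # about .023 (Example C.6)
print(pvalue(-2.13, 19, "two-sided"))   # about .046
```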
C.6e The Relationship between Confidence Intervals and Hypothesis Testing

Because constructing confidence intervals and hypothesis tests both involve probability statements, it is natural to think that they are somehow linked. It turns out that they are. After a confidence interval has been constructed, we can carry out a variety of hypothesis tests.

The confidence intervals we have discussed are all two-sided by nature. (In this text, we will have no need to construct one-sided confidence intervals.) Thus, confidence intervals can be used to test against two-sided alternatives. In the case of a population mean, the null is given by (C.31), and the alternative is (C.34). Suppose we have constructed a 95% confidence interval for $\mu$. Then, if the hypothesized value of $\mu$ under $H_0$, $\mu_0$, is not in the confidence interval, then $H_0\colon \mu = \mu_0$ is rejected against $H_1\colon \mu \ne \mu_0$ at the 5% level. If $\mu_0$ lies in this interval, then we fail to reject $H_0$ at the 5% level. Notice how any value for $\mu_0$ can be tested once a confidence interval is constructed, and since a confidence interval contains more than one value, there are many null hypotheses that will not be rejected.

Example C.8 (Training Grants and Worker Productivity)

In the Holzer et al. example, we constructed a 95% confidence interval for the mean change in scrap rate, $\mu$, as $[-2.28, -.02]$. Since zero is excluded from this interval, we reject $H_0\colon \mu = 0$ against $H_1\colon \mu \ne 0$ at the 5% level. This 95% confidence interval also means that we fail to reject $H_0\colon \mu = -2$ at the 5% level. In fact, there is a continuum of null hypotheses that are not rejected given this confidence interval.

C.6f Practical versus Statistical Significance

In the examples covered so far, we have produced three kinds of evidence concerning population parameters: point estimates, confidence intervals, and hypothesis tests. These tools for learning about population parameters are equally important. There is an understandable tendency for students to focus on confidence intervals and hypothesis tests because these are things to which we can attach confidence or significance levels. But in any study, we must also interpret the magnitudes of point estimates.

The sign and magnitude of $\bar{y}$ determine its practical significance and allow us to discuss the direction of an intervention or policy effect, and whether the estimated effect is "large" or "small." On the other hand, statistical significance of $\bar{y}$ depends on the magnitude of its t statistic. For testing $H_0\colon \mu = 0$, the t statistic is simply $t = \bar{y}/se(\bar{y})$. In other words, statistical significance depends on the ratio of $\bar{y}$ to its standard error. Consequently, a t statistic can be large because $\bar{y}$ is large or $se(\bar{y})$ is small. In applications, it is important to discuss both practical and statistical significance, being aware that an estimate can be statistically significant without being especially large in a practical sense. Whether an estimate is practically important depends on the context as well as on one's judgment, so there are no set rules for determining practical significance.

Example C.9 (Effect of Freeway Width on Commute Time)

Let $Y$ denote the change in commute time, measured in minutes, for commuters in a metropolitan area from before a freeway was widened to after the freeway was widened. Assume that $Y \sim$ Normal($\mu, \sigma^2$). The null hypothesis that the widening did not reduce average commute time is $H_0\colon \mu = 0$; the alternative that it reduced average commute time is $H_1\colon \mu < 0$. Suppose a random sample of commuters of size n = 900 is obtained to determine the effectiveness of the freeway project. The average change in commute time is computed to be $\bar{y} = -3.6$, and the sample standard deviation is $s = 32.7$; thus, $se(\bar{y}) = 32.7/\sqrt{900} = 1.09$. The t statistic is $t = -3.6/1.09 \approx -3.30$, which is very statistically significant; the p-value is about .0005. Thus, we conclude that the freeway widening had a statistically significant effect on average commute time.
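The Example C.9 numbers, including the quoted p-value of about .0005, can be verified from the summary statistics alone. A sketch, using the normal approximation that is natural for n = 900:

```python
import math
from scipy import stats

n, ybar, s = 900, -3.6, 32.7
se = s / math.sqrt(n)             # 32.7 / 30 = 1.09
t = ybar / se                     # about -3.30
p = stats.norm.cdf(t)             # lower-tail p-value, about .0005
print(f"t = {t:.2f}, p-value = {p:.4f}")
```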
If the outcome of the hypothesis test is all that were reported from the study, it would be misleading. Reporting only statistical significance masks the fact that the estimated reduction in average commute time, 3.6 minutes, seems pretty meager, although this depends to some extent on what the average commute time was prior to widening the freeway. To be up front, we should report the point estimate of $-3.6$, along with the significance test.

Finding point estimates that are statistically significant without being practically significant can occur when we are working with large samples. To discuss why this happens, it is useful to have the following definition.

Test Consistency. A consistent test rejects $H_0$ with probability approaching one as the sample size grows whenever $H_1$ is true.

Another way to say that a test is consistent is that, as the sample size tends to infinity, the power of the test gets closer and closer to unity whenever $H_1$ is true. All of the tests we cover in this text have this property. In the case of testing hypotheses about a population mean, test consistency follows because the variance of $\bar{Y}$ converges to zero as the sample size gets large. The t statistic for testing $H_0\colon \mu = 0$ is $T = \bar{Y}/(S/\sqrt{n})$. Since plim($\bar{Y}$) $= \mu$ and plim($S$) $= \sigma$, it follows that if, say, $\mu > 0$, then $T$ gets larger and larger (with high probability) as $n \to \infty$. In other words, no matter how close $\mu$ is to zero, we can be almost certain to reject $H_0\colon \mu = 0$, given a large enough sample size. This says nothing about whether $\mu$ is large in a practical sense.
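A quick simulation makes test consistency concrete: even a practically tiny true mean—we pick $\mu = .05$ here, an arbitrary choice—is rejected almost surely once n is large enough. A rough sketch:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
mu, reps = .05, 500              # a "practically insignificant" true mean

for n in (100, 1000, 10000):
    y = rng.normal(mu, 1.0, size=(reps, n))
    t = y.mean(axis=1) / (y.std(axis=1, ddof=1) / np.sqrt(n))
    reject = np.mean(t > stats.norm.ppf(.95))   # one-sided 5% test of H0: mu = 0
    print(f"n = {n:5d}: rejection rate = {reject:.2f}")
```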
C.7 Remarks on Notation

In our review of probability and statistics here and in Appendix B, we have been careful to use standard conventions to denote random variables, estimators, and test statistics. For example, we have used $W$ to indicate an estimator (random variable) and $w$ to denote a particular estimate (outcome of the random variable $W$). Distinguishing between an estimator and an estimate is important for understanding various concepts in estimation and hypothesis testing. However, making this distinction quickly becomes a burden in econometric analysis because the models are more complicated: many random variables and parameters will be involved, and being true to the usual conventions from probability and statistics requires many extra symbols.

In the main text, we use a simpler convention that is widely used in econometrics. If $\theta$ is a population parameter, the notation $\hat{\theta}$ ("theta hat") will be used to denote both an estimator and an estimate of $\theta$. This notation is useful in that it provides a simple way of attaching an estimator to the population parameter it is supposed to be estimating. Thus, if the population parameter is $\beta$, then $\hat{\beta}$ denotes an estimator or estimate of $\beta$; if the parameter is $\sigma^2$, $\hat{\sigma}^2$ is an estimator or estimate of $\sigma^2$; and so on. Sometimes, we will discuss two estimators of the same parameter, in which case we will need a different notation, such as $\tilde{\theta}$ ("theta tilde").

Although dropping the conventions from probability and statistics to indicate estimators, random variables, and test statistics puts additional responsibility on you, it is not a big deal once the difference between an estimator and an estimate is understood. If we are discussing statistical properties of $\hat{\theta}$—such as deriving whether or not it is unbiased or consistent—then we are necessarily viewing $\hat{\theta}$ as an estimator. On the other hand, if we write something like $\hat{\theta} = 1.73$, then we are clearly denoting a point estimate from a given sample of data. The confusion that can arise by using $\hat{\theta}$ to denote both should be minimal once you have a good understanding of probability and statistics.

Summary

We have discussed topics from mathematical statistics that are heavily relied upon in econometric analysis. The notion of an estimator, which is simply a rule for combining data to estimate a population parameter, is fundamental. We have covered various properties of estimators. The most important small sample properties are unbiasedness and efficiency, the latter of which depends on comparing variances when estimators are unbiased. Large sample properties concern the sequence of estimators obtained as the sample size grows, and they are also depended upon in econometrics. Any useful estimator is consistent. The central limit theorem implies that, in large samples, the sampling distribution of most estimators is approximately normal.

The sampling distribution of an estimator can be used to construct confidence intervals. We saw this for estimating the mean from a normal distribution and for computing approximate confidence intervals in nonnormal cases. Classical hypothesis testing, which requires specifying a null hypothesis, an alternative hypothesis, and a significance level, is carried out by comparing a test statistic to a critical value. Alternatively, a p-value can be computed that allows us to carry out a test at any significance level.

Key Terms

Alternative Hypothesis; Asymptotic Normality; Bias; Biased Estimator; Central Limit Theorem (CLT); Confidence Interval; Consistent Estimator; Consistent Test; Critical Value; Estimate; Estimator; Hypothesis Test; Inconsistent; Interval Estimator; Law of Large Numbers (LLN); Least Squares Estimator; Maximum Likelihood Estimator; Mean Squared Error (MSE); Method of Moments; Minimum Variance Unbiased Estimator; Null Hypothesis; One-Sided Alternative; One-Tailed Test; Population; Power of a Test; Practical Significance; Probability Limit; p-Value; Random Sample; Rejection Region; Sample Average; Sample Correlation Coefficient; Sample Covariance; Sample Standard Deviation; Sample Variance; Sampling Distribution; Sampling Standard Deviation; Sampling Variance; Significance Level; Standard Error; Statistical Significance; t Statistic; Test Statistic; Two-Sided Alternative; Two-Tailed Test; Type I Error; Type II Error; Unbiased Estimator

Problems

1. Let $Y_1, Y_2, Y_3,$ and $Y_4$ be independent, identically distributed random variables from a population with mean $\mu$ and variance $\sigma^2$. Let $\bar{Y} = \frac{1}{4}(Y_1 + Y_2 + Y_3 + Y_4)$ denote the average of these four random variables.
(i) What are the expected value and variance of $\bar{Y}$ in terms of $\mu$ and $\sigma^2$?
(ii) Now, consider a different estimator of $\mu$: $W = \frac{1}{8}Y_1 + \frac{1}{8}Y_2 + \frac{1}{4}Y_3 + \frac{1}{2}Y_4.$ This is an example of a weighted average of the $Y_i$. Show that $W$ is also an unbiased estimator of $\mu$. Find the variance of $W$.
(iii) Based on your answers to parts (i) and (ii), which estimator of $\mu$ do you prefer, $\bar{Y}$ or $W$?
2. This is a more general version of Problem C.1. Let $Y_1, Y_2, \ldots, Y_n$ be $n$ pairwise uncorrelated random variables with common mean $\mu$ and common variance $\sigma^2$. Let $\bar{Y}$ denote the sample average.
(i) Define the class of linear estimators of $\mu$ by $W_a = a_1Y_1 + a_2Y_2 + \cdots + a_nY_n,$ where the $a_i$ are constants. What restriction on the $a_i$ is needed for $W_a$ to be an unbiased estimator of $\mu$?
(ii) Find Var($W_a$).
(iii) For any numbers $a_1, a_2, \ldots, a_n$, the following inequality holds: $(a_1 + a_2 + \cdots + a_n)^2/n \le a_1^2 + a_2^2 + \cdots + a_n^2$. Use this, along with parts (i) and (ii), to show that Var($W_a$) $\ge$ Var($\bar{Y}$) whenever $W_a$ is unbiased, so that $\bar{Y}$ is the best linear unbiased estimator. [Hint: What does the inequality become when the $a_i$ satisfy the restriction from part (i)?]

3. Let $\bar{Y}$ denote the sample average from a random sample with mean $\mu$ and variance $\sigma^2$. Consider two alternative estimators of $\mu$: $W_1 = [(n-1)/n]\bar{Y}$ and $W_2 = \bar{Y}/2$.
(i) Show that $W_1$ and $W_2$ are both biased estimators of $\mu$ and find the biases. What happens to the biases as $n \to \infty$? Comment on any important differences in bias for the two estimators as the sample size gets large.
(ii) Find the probability limits of $W_1$ and $W_2$. {Hint: Use Properties PLIM.1 and PLIM.2; for $W_1$, note that plim$[(n-1)/n] = 1$.} Which estimator is consistent?
(iii) Find Var($W_1$) and Var($W_2$).
(iv) Argue that $W_1$ is a better estimator than $\bar{Y}$ if $\mu$ is "close" to zero. (Consider both bias and variance.)

4. For positive random variables $X$ and $Y$, suppose the expected value of $Y$ given $X$ is $E(Y|X) = \theta X$. The unknown parameter $\theta$ shows how the expected value of $Y$ changes with $X$.
(i) Define the random variable $Z = Y/X$. Show that $E(Z) = \theta$. [Hint: Use Property CE.2 along with the law of iterated expectations, Property CE.4. In particular, first show that $E(Z|X) = \theta$, and then use CE.4.]
(ii) Use part (i) to prove that the estimator $W_1 = n^{-1}\sum_{i=1}^{n}(Y_i/X_i)$ is unbiased for $\theta$, where $\{(X_i, Y_i)\colon i = 1, 2, \ldots, n\}$ is a random sample.
(iii) Explain why the estimator $W_2 = \bar{Y}/\bar{X}$, where the overbars denote sample averages, is not the same as $W_1$. Nevertheless, show that $W_2$ is also unbiased for $\theta$.
(iv) The following table contains data on corn yields for several counties in Iowa. The USDA predicts the number of hectares of corn in each county based on satellite photos. Researchers count the number of "pixels" of corn in the satellite picture (as opposed to, for example, the number of pixels of soybeans or of uncultivated land) and use these to predict the actual number of hectares. To develop a prediction equation to be used for counties in general, the USDA surveyed farmers in selected counties to obtain corn yields in hectares. Let $Y_i$ = corn yield in county $i$ and let $X_i$ = number of corn pixels in the satellite picture for county $i$. There are n = 17 observations for eight counties. Use this sample to compute the estimates of $\theta$ devised in parts (ii) and (iii). Are the estimates similar?

Plot   Corn Yield   Corn Pixels
1      165.76       374
2      96.32        209
3      76.08        253
4      185.35       432
5      116.43       367
6      162.08       361
7      152.04       288
8      161.75       369
9      92.88        206
10     149.94       316
11     64.75        145
12     127.07       355
13     133.55       295
14     77.70        223
15     206.39       459
16     108.33       290
17     118.17       307
5. Let $Y$ denote a Bernoulli($\theta$) random variable with $0 < \theta < 1$. Suppose we are interested in estimating the odds ratio, $\gamma = \theta/(1-\theta)$, which is the probability of success over the probability of failure. Given a random sample $\{Y_1, \ldots, Y_n\}$, we know that an unbiased and consistent estimator of $\theta$ is $\bar{Y}$, the proportion of successes in $n$ trials. A natural estimator of $\gamma$ is $G = \bar{Y}/(1-\bar{Y})$, the proportion of successes over the proportion of failures in the sample.
(i) Why is $G$ not an unbiased estimator of $\gamma$?
(ii) Use PLIM.2(iii) to show that $G$ is a consistent estimator of $\gamma$.

6. You are hired by the governor to study whether a tax on liquor has decreased average liquor consumption in your state. You are able to obtain, for a sample of individuals selected at random, the difference in liquor consumption (in ounces) for the years before and after the tax. For person $i$ who is sampled randomly from the population, $Y_i$ denotes the change in liquor consumption. Treat these as a random sample from a Normal($\mu, \sigma^2$) distribution.
(i) The null hypothesis is that there was no change in average liquor consumption. State this formally in terms of $\mu$.
(ii) The alternative is that there was a decline in liquor consumption; state the alternative in terms of $\mu$.
(iii) Now, suppose your sample size is n = 900 and you obtain the estimates $\bar{y} = -32.8$ and $s = 466.4$. Calculate the t statistic for testing $H_0$ against $H_1$; obtain the p-value for the test. (Because of the large sample size, just use the standard normal distribution, tabulated in Table G.1.) Do you reject $H_0$ at the 5% level? At the 1% level?
(iv) Would you say that the estimated fall in consumption is large in magnitude? Comment on the practical versus statistical significance of this estimate.
(v) What has been implicitly assumed in your analysis about other determinants of liquor consumption over the two-year period in order to infer causality from the tax change to liquor consumption?

7. The new management at a bakery claims that workers are now more productive than they were under old management, which is why wages have "generally increased." Let $W_i^b$ be Worker $i$'s wage under the old management and let $W_i^a$ be Worker $i$'s wage after the change. The difference is $D_i \equiv W_i^a - W_i^b$. Assume that the $D_i$ are a random sample from a Normal($\mu, \sigma^2$) distribution.
(i) Using the following data on 15 workers, construct an exact 95% confidence interval for $\mu$.
(ii) Formally state the null hypothesis that there has been no change in average wages. In particular, what is $E(D_i)$ under $H_0$? If you are hired to examine the validity of the new management's claim, what is the relevant alternative hypothesis in terms of $\mu = E(D_i)$?
(iii) Test the null hypothesis from part (ii) against the stated alternative at the 5% and 1% levels.
(iv) Obtain the p-value for the test in part (iii).

Worker   Wage Before   Wage After
1        8.30          9.25
2        9.40          9.00
3        9.00          9.25
4        10.50         10.00
5        11.40         12.00
6        8.75          9.50
7        10.00         10.25
8        9.50          9.50
9        10.80         11.50
10       12.55         13.10
11       12.00         11.50
12       8.65          9.00
13       7.75          7.75
14       11.25         11.50
15       12.65         13.00
8. The New York Times (2/5/90) reported three-point shooting performance for the top 10 three-point shooters in the NBA. The following table summarizes these data:

Player          FGA/FGM
Mark Price      429/188
Trent Tucker    833/345
Dale Ellis      1,149/472
Craig Hodges    1,016/396
Danny Ainge     1,051/406
Byron Scott     676/260
Reggie Miller   416/159
Larry Bird      1,206/455
Jon Sundvold    440/166
Brian Taylor    417/157

Note: FGA = field goals attempted and FGM = field goals made.

For a given player, the outcome of a particular shot can be modeled as a Bernoulli (zero-one) variable: if $Y_i$ is the outcome of shot $i$, then $Y_i = 1$ if the shot is made, and $Y_i = 0$ if the shot is missed. Let $\theta$ denote the probability of making any particular three-point shot attempt. The natural estimator of $\theta$ is $\bar{Y}$ = FGM/FGA.
(i) Estimate $\theta$ for Mark Price.
(ii) Find the standard deviation of the estimator $\bar{Y}$ in terms of $\theta$ and the number of shot attempts, $n$.
(iii) The asymptotic distribution of $(\bar{Y} - \theta)/se(\bar{Y})$ is standard normal, where $se(\bar{Y}) = \sqrt{\bar{Y}(1-\bar{Y})/n}$. Use this fact to test $H_0\colon \theta = .5$ against $H_1\colon \theta < .5$ for Mark Price. Use a 1% significance level.

9. Suppose that a military dictator in an unnamed country holds a plebiscite (a yes/no vote of confidence) and claims that he was supported by 65% of the voters. A human rights group suspects foul play and hires you to test the validity of the dictator's claim. You have a budget that allows you to randomly sample 200 voters from the country.
(i) Let $X$ be the number of yes votes obtained from a random sample of 200 out of the entire voting population. What is the expected value of $X$ if, in fact, 65% of all voters supported the dictator?
(ii) What is the standard deviation of $X$, again assuming that the true fraction voting "yes" in the plebiscite is .65?
(iii) Now, you collect your sample of 200, and you find that 115 people actually voted yes. Use the CLT to approximate the probability that you would find 115 or fewer yes votes from a random sample of 200 if, in fact, 65% of the entire population voted yes.
(iv) How would you explain the relevance of the number in part (iii) to someone who does not have training in statistics?

10. Before a strike prematurely ended the 1994 major league baseball season, Tony Gwynn of the San Diego Padres had 165 hits in 419 at bats, for a .394 batting average. There was discussion about whether Gwynn was a potential .400 hitter that year. This issue can be couched in terms of Gwynn's probability of getting a hit on a particular at bat, call it $\theta$. Let $Y_i$ be the Bernoulli($\theta$) indicator equal to unity if Gwynn gets a hit during his $i$th at bat, and zero otherwise. Then, $Y_1, Y_2, \ldots, Y_n$ is a random sample from a Bernoulli($\theta$) distribution, where $\theta$ is the probability of success, and n = 419.
Would you say there is strong evidence against Gwynn's being a potential .400 hitter? Explain.

11. Suppose that between their first and second years in college, 400 students are randomly selected and given a university grant to purchase a new computer. For student $i$, $y_i$ denotes the change in GPA from the first year to the second year. If the average change is $\bar{y} = .132$ with standard deviation $s = 1.27$, is the average change in GPAs statistically greater than zero?

Appendix D
Summary of Matrix Algebra

This appendix summarizes the matrix algebra concepts, including the algebra of probability, needed for the study of multiple linear regression models using matrices in Appendix E. None of this material is used in the main text.

D.1 Basic Definitions

Definition D.1 (Matrix). A matrix is a rectangular array of numbers. More precisely, an $m \times n$ matrix has $m$ rows and $n$ columns. The positive integer $m$ is called the row dimension, and $n$ is called the column dimension.

We use uppercase boldface letters to denote matrices. We can write an $m \times n$ matrix generically as
$$\mathbf{A} = [a_{ij}] = \begin{bmatrix} a_{11} & a_{12} & a_{13} & \cdots & a_{1n} \\ a_{21} & a_{22} & a_{23} & \cdots & a_{2n} \\ \vdots & & & & \vdots \\ a_{m1} & a_{m2} & a_{m3} & \cdots & a_{mn} \end{bmatrix},$$
where $a_{ij}$ represents the element in the $i$th row and the $j$th column. For example, $a_{25}$ stands for the number in the second row and the fifth column of $\mathbf{A}$. A specific example of a $2 \times 3$ matrix is
$$\mathbf{A} = \begin{bmatrix} 2 & -1 & 7 \\ -4 & 5 & 0 \end{bmatrix}, \tag{D.1}$$
where $a_{13} = 7$. The shorthand $\mathbf{A} = [a_{ij}]$ is often used to define matrix operations.

Definition D.2 (Square Matrix). A square matrix has the same number of rows and columns. The dimension of a square matrix is its number of rows and columns.

Definition D.3 (Vectors).
(i) A $1 \times m$ matrix is called a row vector (of dimension $m$) and can be written as $\mathbf{x} \equiv (x_1, x_2, \ldots, x_m)$.
(ii) An $n \times 1$ matrix is called a column vector and can be written as
$$\mathbf{y} \equiv \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}.$$

Definition D.4 (Diagonal Matrix). A square matrix $\mathbf{A}$ is a diagonal matrix when all of its off-diagonal elements are zero, that is, $a_{ij} = 0$ for all $i \neq j$. We can always write a diagonal matrix as
$$\mathbf{A} = \begin{bmatrix} a_{11} & 0 & 0 & \cdots & 0 \\ 0 & a_{22} & 0 & \cdots & 0 \\ \vdots & & & & \vdots \\ 0 & 0 & 0 & \cdots & a_{nn} \end{bmatrix}.$$

Definition D.5 (Identity and Zero Matrices).
(i) The $n \times n$ identity matrix, denoted $\mathbf{I}$, or sometimes $\mathbf{I}_n$ to emphasize its dimension, is the diagonal matrix with unity (one) in each diagonal position, and zero elsewhere:
$$\mathbf{I} \equiv \mathbf{I}_n \equiv \begin{bmatrix} 1 & 0 & 0 & \cdots & 0 \\ 0 & 1 & 0 & \cdots & 0 \\ \vdots & & & & \vdots \\ 0 & 0 & 0 & \cdots & 1 \end{bmatrix}.$$
(ii) The $m \times n$ zero matrix, denoted $\mathbf{0}$, is the $m \times n$ matrix with zero for all entries. This need not be a square matrix.

D.2 Matrix Operations

D.2a Matrix Addition

Two matrices $\mathbf{A}$ and $\mathbf{B}$, each having dimension $m \times n$, can be added element by element: $\mathbf{A} + \mathbf{B} = [a_{ij} + b_{ij}]$. More precisely,
$$\mathbf{A} + \mathbf{B} = \begin{bmatrix} a_{11} + b_{11} & a_{12} + b_{12} & \cdots & a_{1n} + b_{1n} \\ a_{21} + b_{21} & a_{22} + b_{22} & \cdots & a_{2n} + b_{2n} \\ \vdots & & & \vdots \\ a_{m1} + b_{m1} & a_{m2} + b_{m2} & \cdots & a_{mn} + b_{mn} \end{bmatrix}.$$
For example,
$$\begin{bmatrix} 2 & -1 & 7 \\ -4 & 5 & 0 \end{bmatrix} + \begin{bmatrix} 1 & 0 & -4 \\ 4 & 2 & 3 \end{bmatrix} = \begin{bmatrix} 3 & -1 & 3 \\ 0 & 7 & 3 \end{bmatrix}.$$
Matrices of different dimensions cannot be added.

D.2b Scalar Multiplication

Given any real number $\gamma$ (often called a scalar), scalar multiplication is defined as $\gamma\mathbf{A} \equiv [\gamma a_{ij}]$, or
$$\gamma\mathbf{A} = \begin{bmatrix} \gamma a_{11} & \gamma a_{12} & \cdots & \gamma a_{1n} \\ \gamma a_{21} & \gamma a_{22} & \cdots & \gamma a_{2n} \\ \vdots & & & \vdots \\ \gamma a_{m1} & \gamma a_{m2} & \cdots & \gamma a_{mn} \end{bmatrix}.$$
For example, if $\gamma = 2$ and $\mathbf{A}$ is the matrix in equation (D.1), then
$$\gamma\mathbf{A} = \begin{bmatrix} 4 & -2 & 14 \\ -8 & 10 & 0 \end{bmatrix}.$$

D.2c Matrix Multiplication

To multiply matrix $\mathbf{A}$ by matrix $\mathbf{B}$ to form the product $\mathbf{AB}$, the column dimension of $\mathbf{A}$ must equal the row dimension of $\mathbf{B}$. Therefore, let $\mathbf{A}$ be an $m \times n$ matrix and let $\mathbf{B}$ be an $n \times p$ matrix. Then matrix multiplication is defined as
$$\mathbf{AB} = \left[\sum_{k=1}^n a_{ik}b_{kj}\right].$$
In other words, the $(i,j)$th element of the new matrix $\mathbf{AB}$ is obtained by multiplying each element in the $i$th row of $\mathbf{A}$ by the corresponding element in the $j$th column of $\mathbf{B}$ and adding these $n$ products together. A schematic may help make this process more transparent:
$$\underbrace{\begin{bmatrix} a_{i1} & a_{i2} & a_{i3} & \cdots & a_{in} \end{bmatrix}}_{i\text{th row of } \mathbf{A}} \underbrace{\begin{bmatrix} b_{1j} \\ b_{2j} \\ b_{3j} \\ \vdots \\ b_{nj} \end{bmatrix}}_{j\text{th column of } \mathbf{B}} = \underbrace{\sum_{k=1}^n a_{ik}b_{kj}}_{(i,j)\text{th element of } \mathbf{AB}},$$
where, by the definition of the summation operator in Appendix A,
$$\sum_{k=1}^n a_{ik}b_{kj} = a_{i1}b_{1j} + a_{i2}b_{2j} + \cdots + a_{in}b_{nj}.$$
For example,
$$\begin{bmatrix} 2 & -1 & 0 \\ -4 & 1 & 0 \end{bmatrix} \begin{bmatrix} 0 & 1 & 6 & 0 \\ -1 & 2 & 0 & 1 \\ 3 & 0 & 0 & 0 \end{bmatrix} = \begin{bmatrix} 1 & 0 & 12 & -1 \\ -1 & -2 & -24 & 1 \end{bmatrix}.$$
We can also multiply a matrix and a vector. If $\mathbf{A}$ is an $n \times m$ matrix and $\mathbf{y}$ is an $m \times 1$ vector, then $\mathbf{Ay}$ is an $n \times 1$ vector. If $\mathbf{x}$ is a $1 \times n$ vector, then $\mathbf{xA}$ is a $1 \times m$ vector.

Matrix addition, scalar multiplication, and matrix multiplication can be combined in various ways, and these operations satisfy several rules that are familiar from basic operations on numbers. In the following list of properties, $\mathbf{A}$, $\mathbf{B}$, and $\mathbf{C}$ are matrices with appropriate dimensions for applying each operation, and $\alpha$ and $\beta$ are real numbers. Most of these properties are easy to illustrate from the definitions.

Properties of Matrix Operations. (1) $(\alpha + \beta)\mathbf{A} = \alpha\mathbf{A} + \beta\mathbf{A}$; (2) $\alpha(\mathbf{A} + \mathbf{B}) = \alpha\mathbf{A} + \alpha\mathbf{B}$; (3) $(\alpha\beta)\mathbf{A} = \alpha(\beta\mathbf{A})$; (4) $\alpha(\mathbf{AB}) = (\alpha\mathbf{A})\mathbf{B}$; (5) $\mathbf{A} + \mathbf{B} = \mathbf{B} + \mathbf{A}$; (6) $(\mathbf{A} + \mathbf{B}) + \mathbf{C} = \mathbf{A} + (\mathbf{B} + \mathbf{C})$; (7) $(\mathbf{AB})\mathbf{C} = \mathbf{A}(\mathbf{BC})$; (8) $\mathbf{A}(\mathbf{B} + \mathbf{C}) = \mathbf{AB} + \mathbf{AC}$; (9) $(\mathbf{A} + \mathbf{B})\mathbf{C} = \mathbf{AC} + \mathbf{BC}$; (10) $\mathbf{IA} = \mathbf{AI} = \mathbf{A}$; (11) $\mathbf{A} + \mathbf{0} = \mathbf{0} + \mathbf{A} = \mathbf{A}$; (12) $\mathbf{A} - \mathbf{A} = \mathbf{0}$; (13) $\mathbf{A0} = \mathbf{0A} = \mathbf{0}$; and (14) $\mathbf{AB} \neq \mathbf{BA}$, even when both products are defined.

The last property deserves further comment. If $\mathbf{A}$ is $n \times m$ and $\mathbf{B}$ is $m \times p$, then $\mathbf{AB}$ is defined, but $\mathbf{BA}$ is defined only if $n = p$ (the row dimension of $\mathbf{A}$ equals the column dimension of $\mathbf{B}$). If $\mathbf{A}$ is $m \times n$ and $\mathbf{B}$ is $n \times m$, then $\mathbf{AB}$ and $\mathbf{BA}$ are both defined, but they are not usually the same; in fact, they have different dimensions, unless $\mathbf{A}$ and $\mathbf{B}$ are both square matrices. Even when $\mathbf{A}$ and $\mathbf{B}$ are both square, $\mathbf{AB} \neq \mathbf{BA}$, except under special circumstances.
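The multiplication example above, and the failure of commutativity in property (14), are easy to reproduce numerically. A minimal NumPy sketch (the arrays mirror the matrices used in the text; the square example is an arbitrary illustration):

```python
import numpy as np

A = np.array([[2, -1, 0],
              [-4, 1, 0]])
B = np.array([[0, 1, 6, 0],
              [-1, 2, 0, 1],
              [3, 0, 0, 0]])

# The (i, j)th element of AB is the ith row of A times the jth column of B
print(A @ B)
# [[  1   0  12  -1]
#  [ -1  -2 -24   1]]

# Property (14): even for square matrices, AB and BA generally differ
C = np.array([[1, 2], [3, 4]])
D = np.array([[0, 1], [1, 0]])
print(np.array_equal(C @ D, D @ C))   # False
```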
D.2d Transpose

Definition D.6 (Transpose). Let $\mathbf{A} = [a_{ij}]$ be an $m \times n$ matrix. The transpose of $\mathbf{A}$, denoted $\mathbf{A}'$ (called $\mathbf{A}$ prime), is the $n \times m$ matrix obtained by interchanging the rows and columns of $\mathbf{A}$. We can write this as $\mathbf{A}' \equiv [a_{ji}]$.

For example,
$$\mathbf{A} = \begin{bmatrix} 2 & -1 & 7 \\ -4 & 5 & 0 \end{bmatrix}, \quad \mathbf{A}' = \begin{bmatrix} 2 & -4 \\ -1 & 5 \\ 7 & 0 \end{bmatrix}.$$

Properties of Transpose. (1) $(\mathbf{A}')' = \mathbf{A}$; (2) $(\alpha\mathbf{A})' = \alpha\mathbf{A}'$ for any scalar $\alpha$; (3) $(\mathbf{A} + \mathbf{B})' = \mathbf{A}' + \mathbf{B}'$; (4) $(\mathbf{AB})' = \mathbf{B}'\mathbf{A}'$, where $\mathbf{A}$ is $m \times n$ and $\mathbf{B}$ is $n \times k$; (5) $\mathbf{x}'\mathbf{x} = \sum_{i=1}^n x_i^2$, where $\mathbf{x}$ is an $n \times 1$ vector; and (6) if $\mathbf{A}$ is an $n \times k$ matrix with rows given by the $1 \times k$ vectors $\mathbf{a}_1, \mathbf{a}_2, \ldots, \mathbf{a}_n$, so that we can write
$$\mathbf{A} = \begin{bmatrix} \mathbf{a}_1 \\ \mathbf{a}_2 \\ \vdots \\ \mathbf{a}_n \end{bmatrix},$$
then $\mathbf{A}' = (\mathbf{a}_1', \mathbf{a}_2', \ldots, \mathbf{a}_n')$.

Definition D.7 (Symmetric Matrix). A square matrix $\mathbf{A}$ is a symmetric matrix if, and only if, $\mathbf{A}' = \mathbf{A}$.

If $\mathbf{X}$ is any $n \times k$ matrix, then $\mathbf{X}'\mathbf{X}$ is always defined and is a symmetric matrix, as can be seen by applying the first and fourth transpose properties (see Problem 3).

D.2e Partitioned Matrix Multiplication

Let $\mathbf{A}$ be an $n \times k$ matrix with rows given by the $1 \times k$ vectors $\mathbf{a}_1, \mathbf{a}_2, \ldots, \mathbf{a}_n$, and let $\mathbf{B}$ be an $n \times m$ matrix with rows given by the $1 \times m$ vectors $\mathbf{b}_1, \mathbf{b}_2, \ldots, \mathbf{b}_n$:
$$\mathbf{A} = \begin{bmatrix} \mathbf{a}_1 \\ \mathbf{a}_2 \\ \vdots \\ \mathbf{a}_n \end{bmatrix}, \quad \mathbf{B} = \begin{bmatrix} \mathbf{b}_1 \\ \mathbf{b}_2 \\ \vdots \\ \mathbf{b}_n \end{bmatrix}.$$
Then
$$\mathbf{A}'\mathbf{B} = \sum_{i=1}^n \mathbf{a}_i'\mathbf{b}_i,$$
where, for each $i$, $\mathbf{a}_i'\mathbf{b}_i$ is a $k \times m$ matrix. Therefore, $\mathbf{A}'\mathbf{B}$ can be written as the sum of $n$ matrices, each of which is $k \times m$. As a special case, we have
$$\mathbf{A}'\mathbf{A} = \sum_{i=1}^n \mathbf{a}_i'\mathbf{a}_i,$$
where $\mathbf{a}_i'\mathbf{a}_i$ is a $k \times k$ matrix for all $i$.

A more general form of partitioned matrix multiplication holds when we have matrices $\mathbf{A}$ ($m \times n$) and $\mathbf{B}$ ($n \times p$) written as
$$\mathbf{A} = \begin{pmatrix} \mathbf{A}_{11} & \mathbf{A}_{12} \\ \mathbf{A}_{21} & \mathbf{A}_{22} \end{pmatrix}, \quad \mathbf{B} = \begin{pmatrix} \mathbf{B}_{11} & \mathbf{B}_{12} \\ \mathbf{B}_{21} & \mathbf{B}_{22} \end{pmatrix},$$
where $\mathbf{A}_{11}$ is $m_1 \times n_1$, $\mathbf{A}_{12}$ is $m_1 \times n_2$, $\mathbf{A}_{21}$ is $m_2 \times n_1$, $\mathbf{A}_{22}$ is $m_2 \times n_2$, $\mathbf{B}_{11}$ is $n_1 \times p_1$, $\mathbf{B}_{12}$ is $n_1 \times p_2$, $\mathbf{B}_{21}$ is $n_2 \times p_1$, and $\mathbf{B}_{22}$ is $n_2 \times p_2$. Naturally, $m_1 + m_2 = m$, $n_1 + n_2 = n$, and $p_1 + p_2 = p$. When we form the product $\mathbf{AB}$, the expression looks just as it would when the entries are scalars:
$$\mathbf{AB} = \begin{pmatrix} \mathbf{A}_{11}\mathbf{B}_{11} + \mathbf{A}_{12}\mathbf{B}_{21} & \mathbf{A}_{11}\mathbf{B}_{12} + \mathbf{A}_{12}\mathbf{B}_{22} \\ \mathbf{A}_{21}\mathbf{B}_{11} + \mathbf{A}_{22}\mathbf{B}_{21} & \mathbf{A}_{21}\mathbf{B}_{12} + \mathbf{A}_{22}\mathbf{B}_{22} \end{pmatrix}.$$
Note that each of the matrix multiplications that form the partition on the right is well defined because the column and row dimensions are compatible for multiplication.

D.2f Trace

The trace of a matrix is a very simple operation defined only for square matrices.

Definition D.8 (Trace). For any $n \times n$ matrix $\mathbf{A}$, the trace of the matrix $\mathbf{A}$, denoted tr($\mathbf{A}$), is the sum of its diagonal elements. Mathematically,
$$\text{tr}(\mathbf{A}) = \sum_{i=1}^n a_{ii}.$$

Properties of Trace. (1) tr($\mathbf{I}_n$) = $n$; (2) tr($\mathbf{A}'$) = tr($\mathbf{A}$); (3) tr($\mathbf{A} + \mathbf{B}$) = tr($\mathbf{A}$) + tr($\mathbf{B}$); (4) tr($\alpha\mathbf{A}$) = $\alpha$ tr($\mathbf{A}$), for any scalar $\alpha$; and (5) tr($\mathbf{AB}$) = tr($\mathbf{BA}$), where $\mathbf{A}$ is $m \times n$ and $\mathbf{B}$ is $n \times m$.

D.2g Inverse

The notion of a matrix inverse is very important for square matrices.

Definition D.9 (Inverse). An $n \times n$ matrix $\mathbf{A}$ has an inverse, denoted $\mathbf{A}^{-1}$, provided that $\mathbf{A}^{-1}\mathbf{A} = \mathbf{I}_n$ and $\mathbf{A}\mathbf{A}^{-1} = \mathbf{I}_n$. In this case, $\mathbf{A}$ is said to be invertible or nonsingular. Otherwise, it is said to be noninvertible or singular.

Properties of Inverse. (1) If an inverse exists, it is unique; (2) $(\alpha\mathbf{A})^{-1} = (1/\alpha)\mathbf{A}^{-1}$, if $\alpha \neq 0$ and $\mathbf{A}$ is invertible; (3) $(\mathbf{AB})^{-1} = \mathbf{B}^{-1}\mathbf{A}^{-1}$, if $\mathbf{A}$ and $\mathbf{B}$ are both $n \times n$ and invertible; and (4) $(\mathbf{A}')^{-1} = (\mathbf{A}^{-1})'$.

We will not be concerned with the mechanics of calculating the inverse of a matrix. Any matrix algebra text contains detailed examples of such calculations.
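The transpose, trace, and inverse properties can be spot-checked numerically. A small illustrative sketch (random matrices, which are invertible with probability one) verifying $(\mathbf{AB})' = \mathbf{B}'\mathbf{A}'$, tr($\mathbf{AB}$) = tr($\mathbf{BA}$), and $(\mathbf{AB})^{-1} = \mathbf{B}^{-1}\mathbf{A}^{-1}$:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 3))
B = rng.standard_normal((3, 3))

# Property 4 of transpose: (AB)' = B'A'
print(np.allclose((A @ B).T, B.T @ A.T))             # True

# Property 5 of trace: tr(AB) = tr(BA)
print(np.isclose(np.trace(A @ B), np.trace(B @ A)))  # True

# Property 3 of inverse: (AB)^(-1) = B^(-1) A^(-1)
print(np.allclose(np.linalg.inv(A @ B),
                  np.linalg.inv(B) @ np.linalg.inv(A)))  # True
```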
D.3 Linear Independence and Rank of a Matrix

For a set of vectors having the same dimension, it is important to know whether one vector can be expressed as a linear combination of the remaining vectors.

Definition D.10 (Linear Independence). Let $\{\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_r\}$ be a set of $n \times 1$ vectors. These are linearly independent vectors if, and only if,
$$\alpha_1\mathbf{x}_1 + \alpha_2\mathbf{x}_2 + \cdots + \alpha_r\mathbf{x}_r = \mathbf{0} \tag{D.2}$$
implies that $\alpha_1 = \alpha_2 = \cdots = \alpha_r = 0$. If (D.2) holds for a set of scalars that are not all zero, then $\{\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_r\}$ is linearly dependent.

The statement that $\{\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_r\}$ is linearly dependent is equivalent to saying that at least one vector in this set can be written as a linear combination of the others.

Definition D.11 (Rank). (i) Let $\mathbf{A}$ be an $n \times m$ matrix. The rank of the matrix $\mathbf{A}$, denoted rank($\mathbf{A}$), is the maximum number of linearly independent columns of $\mathbf{A}$. (ii) If $\mathbf{A}$ is $n \times m$ and rank($\mathbf{A}$) = $m$, then $\mathbf{A}$ has full column rank.

If $\mathbf{A}$ is $n \times m$, its rank can be at most $m$. A matrix has full column rank if its columns form a linearly independent set. For example, the $3 \times 2$ matrix
$$\begin{bmatrix} 1 & 3 \\ 2 & 6 \\ 0 & 0 \end{bmatrix}$$
can have at most rank two. In fact, its rank is only one because the second column is three times the first column.

Properties of Rank. (1) rank($\mathbf{A}'$) = rank($\mathbf{A}$); (2) if $\mathbf{A}$ is $n \times k$, then rank($\mathbf{A}$) $\leq$ min($n$, $k$); and (3) if $\mathbf{A}$ is $k \times k$ and rank($\mathbf{A}$) = $k$, then $\mathbf{A}$ is invertible.

D.4 Quadratic Forms and Positive Definite Matrices

Definition D.12 (Quadratic Form). Let $\mathbf{A}$ be an $n \times n$ symmetric matrix. The quadratic form associated with the matrix $\mathbf{A}$ is the real-valued function defined for all $n \times 1$ vectors $\mathbf{x}$:
$$f(\mathbf{x}) = \mathbf{x}'\mathbf{A}\mathbf{x} = \sum_{i=1}^n a_{ii}x_i^2 + 2\sum_{i=1}^n \sum_{j>i} a_{ij}x_ix_j.$$

Definition D.13 (Positive Definite and Positive Semi-Definite). (i) A symmetric matrix $\mathbf{A}$ is said to be positive definite (p.d.) if $\mathbf{x}'\mathbf{A}\mathbf{x} > 0$ for all $n \times 1$ vectors $\mathbf{x}$ except $\mathbf{x} = \mathbf{0}$. (ii) A symmetric matrix $\mathbf{A}$ is positive semi-definite (p.s.d.) if $\mathbf{x}'\mathbf{A}\mathbf{x} \geq 0$ for all $n \times 1$ vectors $\mathbf{x}$.

If a matrix is positive definite or positive semi-definite, it is automatically assumed to be symmetric.

Properties of Positive Definite and Positive Semi-Definite Matrices. (1) A p.d. matrix has diagonal elements that are strictly positive, while a p.s.d. matrix has nonnegative diagonal elements; (2) if $\mathbf{A}$ is p.d., then $\mathbf{A}^{-1}$ exists and is p.d.; (3) if $\mathbf{X}$ is $n \times k$, then $\mathbf{X}'\mathbf{X}$ and $\mathbf{X}\mathbf{X}'$ are p.s.d.; and (4) if $\mathbf{X}$ is $n \times k$ and rank($\mathbf{X}$) = $k$, then $\mathbf{X}'\mathbf{X}$ is p.d. (and therefore nonsingular).

D.5 Idempotent Matrices

Definition D.14 (Idempotent Matrix). Let $\mathbf{A}$ be an $n \times n$ symmetric matrix. Then $\mathbf{A}$ is said to be an idempotent matrix if, and only if, $\mathbf{AA} = \mathbf{A}$.

For example,
$$\begin{bmatrix} 1 & 0 & 0 \\ 0 & 0 & 0 \\ 0 & 0 & 1 \end{bmatrix}$$
is an idempotent matrix, as direct multiplication verifies.

Properties of Idempotent Matrices. Let $\mathbf{A}$ be an $n \times n$ idempotent matrix. (1) rank($\mathbf{A}$) = tr($\mathbf{A}$); and (2) $\mathbf{A}$ is positive semi-definite.

We can construct idempotent matrices very generally. Let $\mathbf{X}$ be an $n \times k$ matrix with rank($\mathbf{X}$) = $k$. Define
$$\mathbf{P} \equiv \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'$$
$$\mathbf{M} \equiv \mathbf{I}_n - \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}' = \mathbf{I}_n - \mathbf{P}.$$
Then $\mathbf{P}$ and $\mathbf{M}$ are symmetric, idempotent matrices with rank($\mathbf{P}$) = $k$ and rank($\mathbf{M}$) = $n - k$. The ranks are most easily obtained by using Property 1: tr($\mathbf{P}$) = tr[$(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{X}$] (from Property 5 for trace) = tr($\mathbf{I}_k$) = $k$ (by Property 1 for trace). It easily follows that tr($\mathbf{M}$) = tr($\mathbf{I}_n$) $-$ tr($\mathbf{P}$) = $n - k$.
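A quick numerical check of this construction: the sketch below builds $\mathbf{P}$ and $\mathbf{M}$ from a random full-column-rank $\mathbf{X}$ and verifies symmetry, idempotency, and the rank-equals-trace property (illustrative code, not from the text):

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 8, 3
X = rng.standard_normal((n, k))          # full column rank with probability 1

XtX_inv = np.linalg.inv(X.T @ X)
P = X @ XtX_inv @ X.T                    # projection onto the column space of X
M = np.eye(n) - P                        # projection onto its orthogonal complement

print(np.allclose(P @ P, P), np.allclose(M @ M, M))   # idempotent: True True
print(np.allclose(P, P.T))                            # symmetric: True
print(np.trace(P), np.trace(M))                       # approx k and n - k (3 and 5)
```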
D.6 Differentiation of Linear and Quadratic Forms

For a given $n \times 1$ vector $\mathbf{a}$, consider the linear function defined by $f(\mathbf{x}) = \mathbf{a}'\mathbf{x}$, for all $n \times 1$ vectors $\mathbf{x}$. The derivative of $f$ with respect to $\mathbf{x}$ is the $1 \times n$ vector of partial derivatives, which is simply
$$\partial f(\mathbf{x})/\partial \mathbf{x} = \mathbf{a}'.$$
For an $n \times n$ symmetric matrix $\mathbf{A}$, define the quadratic form $g(\mathbf{x}) = \mathbf{x}'\mathbf{A}\mathbf{x}$. Then,
$$\partial g(\mathbf{x})/\partial \mathbf{x} = 2\mathbf{x}'\mathbf{A},$$
which is a $1 \times n$ vector.

D.7 Moments and Distributions of Random Vectors

In order to derive the expected value and variance of the OLS estimators using matrices, we need to define the expected value and variance of a random vector. As its name suggests, a random vector is simply a vector of random variables. We also need to define the multivariate normal distribution. These concepts are simply extensions of those covered in Appendix B.

D.7a Expected Value

Definition D.15 (Expected Value). (i) If $\mathbf{y}$ is an $n \times 1$ random vector, the expected value of $\mathbf{y}$, denoted E($\mathbf{y}$), is the vector of expected values: E($\mathbf{y}$) = [E($y_1$), E($y_2$), $\ldots$, E($y_n$)]$'$. (ii) If $\mathbf{Z}$ is an $n \times m$ random matrix, E($\mathbf{Z}$) is the $n \times m$ matrix of expected values: E($\mathbf{Z}$) = [E($z_{ij}$)].

Properties of Expected Value. (1) If $\mathbf{A}$ is an $m \times n$ matrix and $\mathbf{b}$ is an $n \times 1$ vector, where both are nonrandom, then E($\mathbf{Ay} + \mathbf{b}$) = $\mathbf{A}$E($\mathbf{y}$) + $\mathbf{b}$; and (2) if $\mathbf{A}$ is $p \times n$ and $\mathbf{B}$ is $m \times k$, where both are nonrandom, then E($\mathbf{AZB}$) = $\mathbf{A}$E($\mathbf{Z}$)$\mathbf{B}$.

D.7b Variance-Covariance Matrix

Definition D.16 (Variance-Covariance Matrix). If $\mathbf{y}$ is an $n \times 1$ random vector, its variance-covariance matrix, denoted Var($\mathbf{y}$), is defined as
$$\text{Var}(\mathbf{y}) = \begin{bmatrix} \sigma_1^2 & \sigma_{12} & \cdots & \sigma_{1n} \\ \sigma_{21} & \sigma_2^2 & \cdots & \sigma_{2n} \\ \vdots & & & \vdots \\ \sigma_{n1} & \sigma_{n2} & \cdots & \sigma_n^2 \end{bmatrix},$$
where $\sigma_j^2 = \text{Var}(y_j)$ and $\sigma_{ij} = \text{Cov}(y_i, y_j)$. In other words, the variance-covariance matrix has the variances of each element of $\mathbf{y}$ down its diagonal, with covariance terms in the off-diagonal positions. Because Cov($y_i, y_j$) = Cov($y_j, y_i$), it immediately follows that a variance-covariance matrix is symmetric.

Properties of Variance. (1) If $\mathbf{a}$ is an $n \times 1$ nonrandom vector, then Var($\mathbf{a}'\mathbf{y}$) = $\mathbf{a}'[\text{Var}(\mathbf{y})]\mathbf{a} \geq 0$; (2) if Var($\mathbf{a}'\mathbf{y}$) $> 0$ for all $\mathbf{a} \neq \mathbf{0}$, Var($\mathbf{y}$) is positive definite; (3) Var($\mathbf{y}$) = E[($\mathbf{y} - \boldsymbol{\mu}$)($\mathbf{y} - \boldsymbol{\mu}$)$'$], where $\boldsymbol{\mu}$ = E($\mathbf{y}$); (4) if the elements of $\mathbf{y}$ are uncorrelated, Var($\mathbf{y}$) is a diagonal matrix, and if, in addition, Var($y_j$) = $\sigma^2$ for $j = 1, 2, \ldots, n$, then Var($\mathbf{y}$) = $\sigma^2\mathbf{I}_n$; and (5) if $\mathbf{A}$ is an $m \times n$ nonrandom matrix and $\mathbf{b}$ is an $n \times 1$ nonrandom vector, then Var($\mathbf{Ay} + \mathbf{b}$) = $\mathbf{A}[\text{Var}(\mathbf{y})]\mathbf{A}'$.
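Property (5) of variance is the workhorse behind the OLS variance formula in Appendix E. A small simulation check (illustrative, with an arbitrary $\mathbf{A}$, $\mathbf{b}$, and covariance matrix):

```python
import numpy as np

rng = np.random.default_rng(2)
A = np.array([[1.0, 2.0, 0.0],
              [0.0, 1.0, -1.0]])
b = np.array([0.5, -0.5])

# y ~ Normal(0, Sigma) with a known covariance matrix
L = rng.standard_normal((3, 3))
Sigma = L @ L.T                                  # a valid (p.s.d.) covariance
y = rng.multivariate_normal(np.zeros(3), Sigma, size=200_000)

z = y @ A.T + b                                  # z = Ay + b for each draw
print(np.cov(z, rowvar=False))                   # sample Var(Ay + b)
print(A @ Sigma @ A.T)                           # theoretical A Var(y) A'
```

With a large number of draws, the two printed matrices agree to a couple of decimal places.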
D.7c Multivariate Normal Distribution

The normal distribution for a random variable was discussed at some length in Appendix B. We need to extend the normal distribution to random vectors. We will not provide an expression for the probability distribution function, as we do not need it. It is important to know that a multivariate normal random vector is completely characterized by its mean and its variance-covariance matrix. Therefore, if $\mathbf{y}$ is an $n \times 1$ multivariate normal random vector with mean $\boldsymbol{\mu}$ and variance-covariance matrix $\boldsymbol{\Sigma}$, we write $\mathbf{y} \sim \text{Normal}(\boldsymbol{\mu}, \boldsymbol{\Sigma})$. We now state several useful properties of the multivariate normal distribution.

Properties of the Multivariate Normal Distribution. (1) If $\mathbf{y} \sim \text{Normal}(\boldsymbol{\mu}, \boldsymbol{\Sigma})$, then each element of $\mathbf{y}$ is normally distributed; (2) if $\mathbf{y} \sim \text{Normal}(\boldsymbol{\mu}, \boldsymbol{\Sigma})$, then $y_i$ and $y_j$, any two elements of $\mathbf{y}$, are independent if, and only if, they are uncorrelated, that is, $\sigma_{ij} = 0$; (3) if $\mathbf{y} \sim \text{Normal}(\boldsymbol{\mu}, \boldsymbol{\Sigma})$, then $\mathbf{Ay} + \mathbf{b} \sim \text{Normal}(\mathbf{A}\boldsymbol{\mu} + \mathbf{b}, \mathbf{A}\boldsymbol{\Sigma}\mathbf{A}')$, where $\mathbf{A}$ and $\mathbf{b}$ are nonrandom; (4) if $\mathbf{y} \sim \text{Normal}(\mathbf{0}, \boldsymbol{\Sigma})$, then, for nonrandom matrices $\mathbf{A}$ and $\mathbf{B}$, $\mathbf{Ay}$ and $\mathbf{By}$ are independent if, and only if, $\mathbf{A}\boldsymbol{\Sigma}\mathbf{B}' = \mathbf{0}$; in particular, if $\boldsymbol{\Sigma} = \sigma^2\mathbf{I}_n$, then $\mathbf{AB}' = \mathbf{0}$ is necessary and sufficient for independence of $\mathbf{Ay}$ and $\mathbf{By}$; (5) if $\mathbf{y} \sim \text{Normal}(\mathbf{0}, \sigma^2\mathbf{I}_n)$, $\mathbf{A}$ is a $k \times n$ nonrandom matrix, and $\mathbf{B}$ is an $n \times n$ symmetric, idempotent matrix, then $\mathbf{Ay}$ and $\mathbf{y}'\mathbf{By}$ are independent if, and only if, $\mathbf{AB} = \mathbf{0}$; and (6) if $\mathbf{y} \sim \text{Normal}(\mathbf{0}, \sigma^2\mathbf{I}_n)$ and $\mathbf{A}$ and $\mathbf{B}$ are nonrandom symmetric, idempotent matrices, then $\mathbf{y}'\mathbf{Ay}$ and $\mathbf{y}'\mathbf{By}$ are independent if, and only if, $\mathbf{AB} = \mathbf{0}$.

D.7d Chi-Square Distribution

In Appendix B, we defined a chi-square random variable as the sum of squared independent standard normal random variables. In vector notation, if $\mathbf{u} \sim \text{Normal}(\mathbf{0}, \mathbf{I}_n)$, then $\mathbf{u}'\mathbf{u} \sim \chi^2_n$.

Properties of the Chi-Square Distribution. (1) If $\mathbf{u} \sim \text{Normal}(\mathbf{0}, \mathbf{I}_n)$ and $\mathbf{A}$ is an $n \times n$ symmetric, idempotent matrix with rank($\mathbf{A}$) = $q$, then $\mathbf{u}'\mathbf{Au} \sim \chi^2_q$; (2) if $\mathbf{u} \sim \text{Normal}(\mathbf{0}, \mathbf{I}_n)$ and $\mathbf{A}$ and $\mathbf{B}$ are $n \times n$ symmetric, idempotent matrices such that $\mathbf{AB} = \mathbf{0}$, then $\mathbf{u}'\mathbf{Au}$ and $\mathbf{u}'\mathbf{Bu}$ are independent, chi-square random variables; and (3) if $\mathbf{z} \sim \text{Normal}(\mathbf{0}, \mathbf{C})$, where $\mathbf{C}$ is an $m \times m$ nonsingular matrix, then $\mathbf{z}'\mathbf{C}^{-1}\mathbf{z} \sim \chi^2_m$.

D.7e t Distribution

We also defined the t distribution in Appendix B. Now we add an important property.

Property of the t Distribution. If $\mathbf{u} \sim \text{Normal}(\mathbf{0}, \mathbf{I}_n)$, $\mathbf{c}$ is an $n \times 1$ nonrandom vector, $\mathbf{A}$ is a nonrandom $n \times n$ symmetric, idempotent matrix with rank $q$, and $\mathbf{Ac} = \mathbf{0}$, then
$$\frac{\mathbf{c}'\mathbf{u}/(\mathbf{c}'\mathbf{c})^{1/2}}{(\mathbf{u}'\mathbf{Au}/q)^{1/2}} \sim t_q.$$

D.7f F Distribution

Recall that an F random variable is obtained by taking two independent chi-square random variables and finding the ratio of the two, each standardized by its degrees of freedom.

Property of the F Distribution. If $\mathbf{u} \sim \text{Normal}(\mathbf{0}, \mathbf{I}_n)$ and $\mathbf{A}$ and $\mathbf{B}$ are $n \times n$ nonrandom symmetric, idempotent matrices with rank($\mathbf{A}$) = $k_1$, rank($\mathbf{B}$) = $k_2$, and $\mathbf{AB} = \mathbf{0}$, then
$$\frac{\mathbf{u}'\mathbf{Au}/k_1}{\mathbf{u}'\mathbf{Bu}/k_2} \sim F_{k_1,k_2}.$$

Summary

This appendix contains a condensed form of the background information needed to study the classical linear model using matrices. Although the material here is self-contained, it is primarily intended as a review for readers who are familiar with matrix algebra and multivariate statistics, and it will be used extensively in Appendix E.

Key Terms

Chi-Square Random Variable; Column Vector; Diagonal Matrix; Expected Value; F Random Variable; Idempotent Matrix; Identity Matrix; Inverse; Linearly Independent Vectors; Matrix; Matrix Multiplication; Multivariate Normal Distribution; Positive Definite (p.d.); Positive Semi-Definite (p.s.d.); Quadratic Form; Random Vector; Rank of a Matrix; Row Vector; Scalar Multiplication; Square Matrix; Symmetric Matrix; t Distribution; Trace of a Matrix; Transpose; Variance-Covariance Matrix; Zero Matrix
Problems

1. (i) Find the product $\mathbf{AB}$ using
$$\mathbf{A} = \begin{bmatrix} 2 & -1 & 7 \\ -4 & 5 & 0 \end{bmatrix}, \quad \mathbf{B} = \begin{bmatrix} 0 & 1 & 6 \\ 1 & 8 & 0 \\ 3 & 0 & 0 \end{bmatrix}.$$
(ii) Does $\mathbf{BA}$ exist?

2. If $\mathbf{A}$ and $\mathbf{B}$ are $n \times n$ diagonal matrices, show that $\mathbf{AB} = \mathbf{BA}$.

3. Let $\mathbf{X}$ be any $n \times k$ matrix. Show that $\mathbf{X}'\mathbf{X}$ is a symmetric matrix.

4. (i) Use the properties of trace to argue that tr($\mathbf{A}'\mathbf{A}$) = tr($\mathbf{A}\mathbf{A}'$) for any $n \times m$ matrix $\mathbf{A}$.
(ii) For $\mathbf{A} = \begin{bmatrix} 2 & 0 & -1 \\ 0 & 3 & 0 \end{bmatrix}$, verify that tr($\mathbf{A}'\mathbf{A}$) = tr($\mathbf{A}\mathbf{A}'$).

5. (i) Use the definition of inverse to prove the following: if $\mathbf{A}$ and $\mathbf{B}$ are $n \times n$ nonsingular matrices, then $(\mathbf{AB})^{-1} = \mathbf{B}^{-1}\mathbf{A}^{-1}$.
(ii) If $\mathbf{A}$, $\mathbf{B}$, and $\mathbf{C}$ are all $n \times n$ nonsingular matrices, find $(\mathbf{ABC})^{-1}$ in terms of $\mathbf{A}^{-1}$, $\mathbf{B}^{-1}$, and $\mathbf{C}^{-1}$.

6. (i) Show that if $\mathbf{A}$ is an $n \times n$ symmetric, positive definite matrix, then $\mathbf{A}$ must have strictly positive diagonal elements.
(ii) Write down a $2 \times 2$ symmetric matrix with strictly positive diagonal elements that is not positive definite.

7. Let $\mathbf{A}$ be an $n \times n$ symmetric, positive definite matrix. Show that if $\mathbf{P}$ is any $n \times n$ nonsingular matrix, then $\mathbf{P}'\mathbf{AP}$ is positive definite.

8. Prove Property 5 of variances for vectors, using Property 3.

9. Let $\mathbf{a}$ be an $n \times 1$ nonrandom vector and let $\mathbf{u}$ be an $n \times 1$ random vector with E($\mathbf{uu}'$) = $\mathbf{I}_n$. Show that E[tr($\mathbf{auu}'\mathbf{a}'$)] = $\sum_{i=1}^n a_i^2$.

10. Take as given the properties of the chi-square distribution listed in the text. Show how those properties, along with the definition of an F random variable, imply the stated property of the F distribution (concerning ratios of quadratic forms).

11. Let $\mathbf{X}$ be an $n \times k$ matrix partitioned as $\mathbf{X} = (\mathbf{X}_1\ \mathbf{X}_2)$, where $\mathbf{X}_1$ is $n \times k_1$ and $\mathbf{X}_2$ is $n \times k_2$.
(i) Show that
$$\mathbf{X}'\mathbf{X} = \begin{pmatrix} \mathbf{X}_1'\mathbf{X}_1 & \mathbf{X}_1'\mathbf{X}_2 \\ \mathbf{X}_2'\mathbf{X}_1 & \mathbf{X}_2'\mathbf{X}_2 \end{pmatrix}.$$
What are the dimensions of each of the matrices?
(ii) Let $\mathbf{b}$ be a $k \times 1$ vector, partitioned as $\mathbf{b} = \begin{pmatrix} \mathbf{b}_1 \\ \mathbf{b}_2 \end{pmatrix}$, where $\mathbf{b}_1$ is $k_1 \times 1$ and $\mathbf{b}_2$ is $k_2 \times 1$. Show that
$$(\mathbf{X}'\mathbf{X})\mathbf{b} = \begin{pmatrix} (\mathbf{X}_1'\mathbf{X}_1)\mathbf{b}_1 + (\mathbf{X}_1'\mathbf{X}_2)\mathbf{b}_2 \\ (\mathbf{X}_2'\mathbf{X}_1)\mathbf{b}_1 + (\mathbf{X}_2'\mathbf{X}_2)\mathbf{b}_2 \end{pmatrix}.$$

Appendix E
The Linear Regression Model in Matrix Form

This appendix derives various results for ordinary least squares estimation of the multiple linear regression model using matrix notation and matrix algebra (see Appendix D for a summary). The material presented here is much more advanced than that in the text.

E.1 The Model and Ordinary Least Squares Estimation

Throughout this appendix, we use the $t$ subscript to index observations and an $n$ to denote the sample size. It is useful to write the multiple linear regression model with $k$ parameters as follows:
$$y_t = \beta_0 + \beta_1 x_{t1} + \beta_2 x_{t2} + \cdots + \beta_k x_{tk} + u_t, \quad t = 1, 2, \ldots, n, \tag{E.1}$$
where $y_t$ is the dependent variable for observation $t$, and $x_{tj}$, $j = 1, 2, \ldots, k$, are the independent variables.
As usual, $\beta_0$ is the intercept and $\beta_1, \ldots, \beta_k$ denote the slope parameters. For each $t$, define a $1 \times (k+1)$ vector, $\mathbf{x}_t = (1, x_{t1}, \ldots, x_{tk})$, and let $\boldsymbol{\beta} = (\beta_0, \beta_1, \ldots, \beta_k)'$ be the $(k+1) \times 1$ vector of all parameters. Then, we can write (E.1) as
$$y_t = \mathbf{x}_t\boldsymbol{\beta} + u_t, \quad t = 1, 2, \ldots, n. \tag{E.2}$$
[Some authors prefer to define $\mathbf{x}_t$ as a column vector, in which case $\mathbf{x}_t$ is replaced with $\mathbf{x}_t'$ in (E.2). Mathematically, it makes more sense to define it as a row vector.] We can write (E.2) in full matrix notation by appropriately defining data vectors and matrices. Let $\mathbf{y}$ denote the $n \times 1$ vector of observations on $y$: the $t$th element of $\mathbf{y}$ is $y_t$. Let $\mathbf{X}$ be the $n \times (k+1)$ matrix of observations on the explanatory variables. In other words, the $t$th row of $\mathbf{X}$ consists of the vector $\mathbf{x}_t$. Written out in detail,
$$\underset{n \times (k+1)}{\mathbf{X}} \equiv \begin{bmatrix} \mathbf{x}_1 \\ \mathbf{x}_2 \\ \vdots \\ \mathbf{x}_n \end{bmatrix} = \begin{bmatrix} 1 & x_{11} & x_{12} & \cdots & x_{1k} \\ 1 & x_{21} & x_{22} & \cdots & x_{2k} \\ \vdots & & & & \vdots \\ 1 & x_{n1} & x_{n2} & \cdots & x_{nk} \end{bmatrix}.$$
Finally, let $\mathbf{u}$ be the $n \times 1$ vector of unobservable errors, or disturbances. Then, we can write (E.2) for all $n$ observations in matrix notation:
$$\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \mathbf{u}. \tag{E.3}$$
Remember, because $\mathbf{X}$ is $n \times (k+1)$ and $\boldsymbol{\beta}$ is $(k+1) \times 1$, $\mathbf{X}\boldsymbol{\beta}$ is $n \times 1$.

Estimation of $\boldsymbol{\beta}$ proceeds by minimizing the sum of squared residuals, as in Section 3-2. Define the sum of squared residuals function for any possible $(k+1) \times 1$ parameter vector $\mathbf{b}$ as
$$\text{SSR}(\mathbf{b}) \equiv \sum_{t=1}^n (y_t - \mathbf{x}_t\mathbf{b})^2.$$
The $(k+1) \times 1$ vector of ordinary least squares estimates, $\hat{\boldsymbol{\beta}} = (\hat{\beta}_0, \hat{\beta}_1, \ldots, \hat{\beta}_k)'$, minimizes SSR($\mathbf{b}$) over all possible $(k+1) \times 1$ vectors $\mathbf{b}$. This is a problem in multivariable calculus. For $\hat{\boldsymbol{\beta}}$ to minimize the sum of squared residuals, it must solve the first order condition
$$\partial \text{SSR}(\hat{\boldsymbol{\beta}})/\partial \mathbf{b} \equiv \mathbf{0}. \tag{E.4}$$
Using the fact that the derivative of $(y_t - \mathbf{x}_t\mathbf{b})^2$ with respect to $\mathbf{b}$ is the $1 \times (k+1)$ vector $-2(y_t - \mathbf{x}_t\mathbf{b})\mathbf{x}_t$, (E.4) is equivalent to
$$\sum_{t=1}^n \mathbf{x}_t'(y_t - \mathbf{x}_t\hat{\boldsymbol{\beta}}) \equiv \mathbf{0}. \tag{E.5}$$
(We have divided by 2 and taken the transpose.) We can write this first order condition as
$$\sum_{t=1}^n (y_t - \hat{\beta}_0 - \hat{\beta}_1 x_{t1} - \cdots - \hat{\beta}_k x_{tk}) = 0$$
$$\sum_{t=1}^n x_{t1}(y_t - \hat{\beta}_0 - \hat{\beta}_1 x_{t1} - \cdots - \hat{\beta}_k x_{tk}) = 0$$
$$\vdots$$
$$\sum_{t=1}^n x_{tk}(y_t - \hat{\beta}_0 - \hat{\beta}_1 x_{t1} - \cdots - \hat{\beta}_k x_{tk}) = 0,$$
which is identical to the first order conditions in equation (3.13). We want to write these in matrix form to make them easier to manipulate. Using the formula for partitioned multiplication in Appendix D, we see that (E.5) is equivalent to
$$\mathbf{X}'(\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}}) = \mathbf{0} \tag{E.6}$$
or
$$(\mathbf{X}'\mathbf{X})\hat{\boldsymbol{\beta}} = \mathbf{X}'\mathbf{y}. \tag{E.7}$$
It can be shown that (E.7) always has at least one solution. Multiple solutions do not help us, as we are looking for a unique set of OLS estimates given our data set. Assuming that the $(k+1) \times (k+1)$ symmetric matrix $\mathbf{X}'\mathbf{X}$ is nonsingular, we can premultiply both sides of (E.7) by $(\mathbf{X}'\mathbf{X})^{-1}$ to solve for the OLS estimator $\hat{\boldsymbol{\beta}}$:
$$\hat{\boldsymbol{\beta}} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y}. \tag{E.8}$$
This is the critical formula for matrix analysis of the multiple linear regression model. The assumption that $\mathbf{X}'\mathbf{X}$ is invertible is equivalent to the assumption that rank($\mathbf{X}$) = $k + 1$, which means that the columns of $\mathbf{X}$ must be linearly independent. This is the matrix version of MLR.3 in Chapter 3.
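Equation (E.8) translates directly into code. The sketch below builds $\mathbf{X}$ with a column of ones, solves the normal equations (E.7), and compares the result to a library least-squares routine (simulated data; all names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
n, k = 100, 2
beta_true = np.array([1.0, 0.5, -2.0])           # (beta_0, beta_1, beta_2)

x = rng.standard_normal((n, k))
X = np.column_stack([np.ones(n), x])             # first column of ones: intercept
y = X @ beta_true + rng.standard_normal(n)       # y = X beta + u

# beta_hat = (X'X)^(-1) X'y, computed by solving (X'X) beta_hat = X'y
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_hat)

# Same answer from a numerically stabler least-squares solver
print(np.linalg.lstsq(X, y, rcond=None)[0])
```

In practice, solving the linear system (E.7) directly, as above, is preferred to explicitly inverting $\mathbf{X}'\mathbf{X}$.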
Before we continue, (E.8) warrants a word of warning. It is tempting to simplify the formula for $\hat{\boldsymbol{\beta}}$ as follows:
$$\hat{\boldsymbol{\beta}} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y} = \mathbf{X}^{-1}(\mathbf{X}')^{-1}\mathbf{X}'\mathbf{y} = \mathbf{X}^{-1}\mathbf{y}.$$
The flaw in this reasoning is that $\mathbf{X}$ is usually not a square matrix, so it cannot be inverted. In other words, we cannot write $(\mathbf{X}'\mathbf{X})^{-1} = \mathbf{X}^{-1}(\mathbf{X}')^{-1}$ unless $n = k + 1$, a case that virtually never arises in practice.

The $n \times 1$ vectors of OLS fitted values and residuals are given by $\hat{\mathbf{y}} = \mathbf{X}\hat{\boldsymbol{\beta}}$ and $\hat{\mathbf{u}} = \mathbf{y} - \hat{\mathbf{y}} = \mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}}$, respectively. From (E.6) and the definition of $\hat{\mathbf{u}}$, we can see that the first order condition for $\hat{\boldsymbol{\beta}}$ is the same as
$$\mathbf{X}'\hat{\mathbf{u}} = \mathbf{0}. \tag{E.9}$$
Because the first column of $\mathbf{X}$ consists entirely of ones, (E.9) implies that the OLS residuals always sum to zero when an intercept is included in the equation, and that the sample covariance between each independent variable and the OLS residuals is zero. (We discussed both of these properties in Chapter 3.)

The sum of squared residuals can be written as
$$\text{SSR} = \sum_{t=1}^n \hat{u}_t^2 = \hat{\mathbf{u}}'\hat{\mathbf{u}} = (\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}})'(\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}}). \tag{E.10}$$
All of the algebraic properties from Chapter 3 can be derived using matrix algebra. For example, we can show that the total sum of squares is equal to the explained sum of squares plus the sum of squared residuals [see (3.27)]. The use of matrices does not provide a simpler proof than summation notation, so we do not provide another derivation.

The matrix approach to multiple regression can be used as the basis for a geometrical interpretation of regression. This involves mathematical concepts that are even more advanced than those we covered in Appendix D. [See Goldberger (1991) or Greene (1997).]

E.1a The Frisch-Waugh Theorem

In Section 3-2, we described a "partialling out" interpretation of the ordinary least squares estimates. We can establish the partialling out interpretation very generally using matrix notation. Partition the $n \times (k+1)$ matrix $\mathbf{X}$ as $\mathbf{X} = (\mathbf{X}_1\ \mathbf{X}_2)$, where $\mathbf{X}_1$ is $n \times (k_1 + 1)$ and includes the intercept (although that is not required for the result to hold), and $\mathbf{X}_2$ is $n \times k_2$. We still assume that $\mathbf{X}$ has rank $k + 1$, which means $\mathbf{X}_1$ has rank $k_1 + 1$ and $\mathbf{X}_2$ has rank $k_2$.

Consider the OLS estimates $\hat{\boldsymbol{\beta}}_1$ and $\hat{\boldsymbol{\beta}}_2$ from the "long" regression, $\mathbf{y}$ on $\mathbf{X}_1$, $\mathbf{X}_2$. As we know, the vector of multiple regression coefficients on $\mathbf{X}_2$, $\hat{\boldsymbol{\beta}}_2$, generally differs from the estimate $\tilde{\boldsymbol{\beta}}_2$ from the "short" regression, $\mathbf{y}$ on $\mathbf{X}_2$. One way to describe the difference is to understand that we can obtain $\hat{\boldsymbol{\beta}}_2$ from a shorter regression, but first we must "partial out" $\mathbf{X}_1$ from $\mathbf{X}_2$. Consider the following two-step method:
(i) Regress (each column of) $\mathbf{X}_2$ on $\mathbf{X}_1$ and obtain the matrix of residuals, say $\ddot{\mathbf{X}}_2$. We can write $\ddot{\mathbf{X}}_2$ as
$$\ddot{\mathbf{X}}_2 = [\mathbf{I}_n - \mathbf{X}_1(\mathbf{X}_1'\mathbf{X}_1)^{-1}\mathbf{X}_1']\mathbf{X}_2 = (\mathbf{I}_n - \mathbf{P}_1)\mathbf{X}_2 = \mathbf{M}_1\mathbf{X}_2,$$
where $\mathbf{P}_1 = \mathbf{X}_1(\mathbf{X}_1'\mathbf{X}_1)^{-1}\mathbf{X}_1'$ and $\mathbf{M}_1 = \mathbf{I}_n - \mathbf{P}_1$ are $n \times n$ symmetric, idempotent matrices.
(ii) Regress $\mathbf{y}$ on $\ddot{\mathbf{X}}_2$ and call the $k_2 \times 1$ vector of coefficients $\ddot{\boldsymbol{\beta}}_2$.
The Frisch-Waugh (FW) theorem states that
$$\ddot{\boldsymbol{\beta}}_2 = \hat{\boldsymbol{\beta}}_2.$$
Importantly, the FW theorem generally says nothing about equality of the estimates from the long regression, $\hat{\boldsymbol{\beta}}_2$, and those from the short regression, $\tilde{\boldsymbol{\beta}}_2$. Usually, $\tilde{\boldsymbol{\beta}}_2 \neq \hat{\boldsymbol{\beta}}_2$. However, if $\mathbf{X}_1'\mathbf{X}_2 = \mathbf{0}$, then $\ddot{\mathbf{X}}_2 = \mathbf{M}_1\mathbf{X}_2 = \mathbf{X}_2$, in which case $\tilde{\boldsymbol{\beta}}_2 = \ddot{\boldsymbol{\beta}}_2$; then $\tilde{\boldsymbol{\beta}}_2 = \hat{\boldsymbol{\beta}}_2$ follows from FW.

It is also worth noting that we obtain $\hat{\boldsymbol{\beta}}_2$ if we also partial $\mathbf{X}_1$ out of $\mathbf{y}$. In other words, let $\ddot{\mathbf{y}}$ be the residuals from regressing $\mathbf{y}$ on $\mathbf{X}_1$, so that $\ddot{\mathbf{y}} = \mathbf{M}_1\mathbf{y}$. Then $\hat{\boldsymbol{\beta}}_2$ is also obtained from the regression $\ddot{\mathbf{y}}$ on $\ddot{\mathbf{X}}_2$. It is important to understand that it is not enough to only partial out $\mathbf{X}_1$ from $\mathbf{y}$: the important step is partialling out $\mathbf{X}_1$ from $\mathbf{X}_2$. Problem 6 at the end of this appendix asks you to derive the FW theorem and to investigate some related issues.

Another useful algebraic result is that when we regress $\ddot{\mathbf{y}}$ on $\ddot{\mathbf{X}}_2$ and save the residuals, say $\ddot{\mathbf{u}}$, these are identical to the OLS residuals from the original (long) regression:
$$\ddot{\mathbf{u}} = \ddot{\mathbf{y}} - \ddot{\mathbf{X}}_2\hat{\boldsymbol{\beta}}_2 = \hat{\mathbf{u}} = \mathbf{y} - \mathbf{X}_1\hat{\boldsymbol{\beta}}_1 - \mathbf{X}_2\hat{\boldsymbol{\beta}}_2,$$
where we have used the FW result $\ddot{\boldsymbol{\beta}}_2 = \hat{\boldsymbol{\beta}}_2$. We do not obtain the original OLS residuals if we regress $\mathbf{y}$ (rather than $\ddot{\mathbf{y}}$) on $\ddot{\mathbf{X}}_2$, but we do obtain $\hat{\boldsymbol{\beta}}_2$.

Before the advent of powerful computers, the Frisch-Waugh result was sometimes used as a computational device. Today, the result is more of theoretical interest, and it is very helpful in understanding the mechanics of OLS. For example, recall that in Chapter 10 we used the FW theorem to establish that adding a time trend to a multiple regression is algebraically equivalent to first linearly detrending all of the explanatory variables before running the regression. The FW theorem also can be used in Chapter 14 to establish that the fixed effects estimator, which we introduced as being obtained from OLS on time-demeaned data, can also be obtained from the "long" dummy variable regression.
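The partialling-out result is easy to confirm numerically. A minimal sketch (simulated data; $\mathbf{M}_1$ is formed explicitly only for illustration, which is fine at this small scale):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 200
X1 = np.column_stack([np.ones(n), rng.standard_normal(n)])   # intercept + one regressor
X2 = rng.standard_normal((n, 2))
y = X1 @ np.array([1.0, 2.0]) + X2 @ np.array([0.5, -1.0]) + rng.standard_normal(n)

# Long regression: y on (X1, X2); keep the coefficients on X2
X = np.hstack([X1, X2])
b_long = np.linalg.solve(X.T @ X, X.T @ y)
beta2_hat = b_long[2:]

# Partial X1 out of X2, then regress y on the residuals
M1 = np.eye(n) - X1 @ np.linalg.solve(X1.T @ X1, X1.T)
X2dd = M1 @ X2
beta2_fw = np.linalg.solve(X2dd.T @ X2dd, X2dd.T @ y)

print(np.allclose(beta2_hat, beta2_fw))   # True: the FW theorem in action
```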
E.2 Finite Sample Properties of OLS

Deriving the expected value and variance of the OLS estimator $\hat{\boldsymbol{\beta}}$ is facilitated by matrix algebra, but we must show some care in stating the assumptions.

Assumption E.1 (Linear in Parameters). The model can be written as in (E.3), where $\mathbf{y}$ is an observed $n \times 1$ vector, $\mathbf{X}$ is an $n \times (k+1)$ observed matrix, and $\mathbf{u}$ is an $n \times 1$ vector of unobserved errors, or disturbances.

Assumption E.2 (No Perfect Collinearity). The matrix $\mathbf{X}$ has rank $(k+1)$.

This is a careful statement of the assumption that rules out linear dependencies among the explanatory variables. Under Assumption E.2, $\mathbf{X}'\mathbf{X}$ is nonsingular, so $\hat{\boldsymbol{\beta}}$ is unique and can be written as in (E.8).

Assumption E.3 (Zero Conditional Mean). Conditional on the entire matrix $\mathbf{X}$, each error $u_t$ has zero mean: E($u_t|\mathbf{X}$) = 0, $t = 1, 2, \ldots, n$.

In vector form, Assumption E.3 can be written as
$$\text{E}(\mathbf{u}|\mathbf{X}) = \mathbf{0}. \tag{E.11}$$
This assumption is implied by MLR.4 under the random sampling assumption, MLR.2. In time series applications, Assumption E.3 imposes strict exogeneity on the explanatory variables, something discussed at length in Chapter 10. This rules out explanatory variables whose future values are correlated with $u_t$; in particular, it eliminates lagged dependent variables. Under Assumption E.3, we can condition on the $x_{tj}$ when we compute the expected value of $\hat{\boldsymbol{\beta}}$.

Theorem E.1 (Unbiasedness of OLS). Under Assumptions E.1, E.2, and E.3, the OLS estimator $\hat{\boldsymbol{\beta}}$ is unbiased for $\boldsymbol{\beta}$.

PROOF: Use Assumptions E.1 and E.2 and simple algebra to write
$$\hat{\boldsymbol{\beta}} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'(\mathbf{X}\boldsymbol{\beta} + \mathbf{u}) = (\mathbf{X}'\mathbf{X})^{-1}(\mathbf{X}'\mathbf{X})\boldsymbol{\beta} + (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{u} = \boldsymbol{\beta} + (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{u}, \tag{E.12}$$
where we use the fact that $(\mathbf{X}'\mathbf{X})^{-1}(\mathbf{X}'\mathbf{X}) = \mathbf{I}_{k+1}$. Taking the expectation conditional on $\mathbf{X}$ gives
$$\text{E}(\hat{\boldsymbol{\beta}}|\mathbf{X}) = \boldsymbol{\beta} + (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\text{E}(\mathbf{u}|\mathbf{X}) = \boldsymbol{\beta} + (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{0} = \boldsymbol{\beta},$$
because E($\mathbf{u}|\mathbf{X}$) = $\mathbf{0}$ under Assumption E.3. This argument clearly does not depend on the value of $\boldsymbol{\beta}$, so we have shown that $\hat{\boldsymbol{\beta}}$ is unbiased.

To obtain the simplest form of the variance-covariance matrix of $\hat{\boldsymbol{\beta}}$, we impose the assumptions of homoskedasticity and no serial correlation.

Assumption E.4 (Homoskedasticity and No Serial Correlation). (i) Var($u_t|\mathbf{X}$) = $\sigma^2$, $t = 1, 2, \ldots, n$; (ii) Cov($u_t, u_s|\mathbf{X}$) = 0, for all $t \neq s$. In matrix form, we can write these two assumptions as
$$\text{Var}(\mathbf{u}|\mathbf{X}) = \sigma^2\mathbf{I}_n, \tag{E.13}$$
where $\mathbf{I}_n$ is the $n \times n$ identity matrix.

Part (i) of Assumption E.4 is the homoskedasticity assumption: the variance of $u_t$ cannot depend on any element of $\mathbf{X}$, and the variance must be constant across observations, $t$. Part (ii) is the no serial correlation assumption: the errors cannot be correlated across observations. Under random sampling, and in any other cross-sectional sampling schemes with independent observations, part (ii) of Assumption E.4 automatically holds. For time series applications, part (ii) rules out correlation in the errors over time (both conditional on $\mathbf{X}$ and unconditionally). Because of (E.13), we often say that $\mathbf{u}$ has a scalar variance-covariance matrix when Assumption E.4 holds. We can now derive the variance-covariance matrix of the OLS estimator.

Theorem E.2 (Variance-Covariance Matrix of the OLS Estimator). Under Assumptions E.1 through E.4,
$$\text{Var}(\hat{\boldsymbol{\beta}}|\mathbf{X}) = \sigma^2(\mathbf{X}'\mathbf{X})^{-1}. \tag{E.14}$$

PROOF: From the last formula in equation (E.12), we have
$$\text{Var}(\hat{\boldsymbol{\beta}}|\mathbf{X}) = \text{Var}[(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{u}|\mathbf{X}] = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'[\text{Var}(\mathbf{u}|\mathbf{X})]\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}.$$
Now, we use Assumption E.4 to get
$$\text{Var}(\hat{\boldsymbol{\beta}}|\mathbf{X}) = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'(\sigma^2\mathbf{I}_n)\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1} = \sigma^2(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1} = \sigma^2(\mathbf{X}'\mathbf{X})^{-1}.$$

Formula (E.14) means that the variance of $\hat{\beta}_j$ (conditional on $\mathbf{X}$) is obtained by multiplying $\sigma^2$ by the $j$th diagonal element of $(\mathbf{X}'\mathbf{X})^{-1}$. For the slope coefficients, we gave an interpretable formula in equation (3.51). Equation (E.14) also tells us how to obtain the covariance between any two OLS estimates: multiply $\sigma^2$ by the appropriate off-diagonal element of $(\mathbf{X}'\mathbf{X})^{-1}$. In Chapter 4, we showed how to avoid explicitly finding covariances for obtaining confidence intervals and hypothesis tests by appropriately rewriting the model.

The Gauss-Markov Theorem, in its full generality, can be proven.

Theorem E.3 (Gauss-Markov Theorem). Under Assumptions E.1 through E.4, $\hat{\boldsymbol{\beta}}$ is the best linear unbiased estimator.

PROOF: Any other linear estimator of $\boldsymbol{\beta}$ can be written as
$$\tilde{\boldsymbol{\beta}} = \mathbf{A}'\mathbf{y}, \tag{E.15}$$
where $\mathbf{A}$ is an $n \times (k+1)$ matrix. In order for $\tilde{\boldsymbol{\beta}}$ to be unbiased conditional on $\mathbf{X}$, $\mathbf{A}$ can consist of nonrandom numbers and functions of $\mathbf{X}$. (For example, $\mathbf{A}$ cannot be a function of $\mathbf{y}$.) To see what further restrictions on $\mathbf{A}$ are needed, write
$$\tilde{\boldsymbol{\beta}} = \mathbf{A}'(\mathbf{X}\boldsymbol{\beta} + \mathbf{u}) = (\mathbf{A}'\mathbf{X})\boldsymbol{\beta} + \mathbf{A}'\mathbf{u}. \tag{E.16}$$
Then,
$$\text{E}(\tilde{\boldsymbol{\beta}}|\mathbf{X}) = \mathbf{A}'\mathbf{X}\boldsymbol{\beta} + \text{E}(\mathbf{A}'\mathbf{u}|\mathbf{X}) = \mathbf{A}'\mathbf{X}\boldsymbol{\beta} + \mathbf{A}'\text{E}(\mathbf{u}|\mathbf{X}) \quad \text{(because } \mathbf{A} \text{ is a function of } \mathbf{X}\text{)}$$
$$= \mathbf{A}'\mathbf{X}\boldsymbol{\beta} \quad \text{(because E}(\mathbf{u}|\mathbf{X}) = \mathbf{0}\text{)}.$$
For $\tilde{\boldsymbol{\beta}}$ to be an unbiased estimator of $\boldsymbol{\beta}$, it must be true that E($\tilde{\boldsymbol{\beta}}|\mathbf{X}$) = $\boldsymbol{\beta}$ for all $(k+1) \times 1$ vectors $\boldsymbol{\beta}$, that is,
$$\mathbf{A}'\mathbf{X}\boldsymbol{\beta} = \boldsymbol{\beta} \text{ for all } (k+1) \times 1 \text{ vectors } \boldsymbol{\beta}. \tag{E.17}$$
Because $\mathbf{A}'\mathbf{X}$ is a $(k+1) \times (k+1)$ matrix, (E.17) holds if, and only if, $\mathbf{A}'\mathbf{X} = \mathbf{I}_{k+1}$. Equations (E.15) and (E.17) characterize the class of linear, unbiased estimators for $\boldsymbol{\beta}$.

Next, from (E.16), we have Var($\tilde{\boldsymbol{\beta}}|\mathbf{X}$) = $\mathbf{A}'[\text{Var}(\mathbf{u}|\mathbf{X})]\mathbf{A} = \sigma^2\mathbf{A}'\mathbf{A}$, by Assumption E.4. Therefore,
$$\text{Var}(\tilde{\boldsymbol{\beta}}|\mathbf{X}) - \text{Var}(\hat{\boldsymbol{\beta}}|\mathbf{X}) = \sigma^2[\mathbf{A}'\mathbf{A} - (\mathbf{X}'\mathbf{X})^{-1}]$$
$$= \sigma^2[\mathbf{A}'\mathbf{A} - \mathbf{A}'\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{A}] \quad \text{(because } \mathbf{A}'\mathbf{X} = \mathbf{I}_{k+1}\text{)}$$
$$= \sigma^2\mathbf{A}'[\mathbf{I}_n - \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}']\mathbf{A} \equiv \sigma^2\mathbf{A}'\mathbf{M}\mathbf{A},$$
where $\mathbf{M} \equiv \mathbf{I}_n - \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'$. Because $\mathbf{M}$ is symmetric and idempotent, $\mathbf{A}'\mathbf{M}\mathbf{A}$ is positive semi-definite for any $n \times (k+1)$ matrix $\mathbf{A}$. This establishes that the OLS estimator $\hat{\boldsymbol{\beta}}$ is BLUE. Why is this important? Let $\mathbf{c}$ be any $(k+1) \times 1$ vector and consider the linear combination $\mathbf{c}'\boldsymbol{\beta} = c_0\beta_0 + c_1\beta_1 + \cdots + c_k\beta_k$, which is a scalar. The unbiased estimators of $\mathbf{c}'\boldsymbol{\beta}$ are $\mathbf{c}'\hat{\boldsymbol{\beta}}$ and $\mathbf{c}'\tilde{\boldsymbol{\beta}}$. But
$$\text{Var}(\mathbf{c}'\tilde{\boldsymbol{\beta}}|\mathbf{X}) - \text{Var}(\mathbf{c}'\hat{\boldsymbol{\beta}}|\mathbf{X}) = \mathbf{c}'[\text{Var}(\tilde{\boldsymbol{\beta}}|\mathbf{X}) - \text{Var}(\hat{\boldsymbol{\beta}}|\mathbf{X})]\mathbf{c} \geq 0,$$
because [Var($\tilde{\boldsymbol{\beta}}|\mathbf{X}$) $-$ Var($\hat{\boldsymbol{\beta}}|\mathbf{X}$)] is p.s.d. Therefore, when it is used for estimating any linear combination of $\boldsymbol{\beta}$, OLS yields the smallest variance. In particular, Var($\hat{\beta}_j|\mathbf{X}$) $\leq$ Var($\tilde{\beta}_j|\mathbf{X}$) for any other linear, unbiased estimator of $\beta_j$.

The unbiased estimator of the error variance $\sigma^2$ can be written as
$$\hat{\sigma}^2 = \hat{\mathbf{u}}'\hat{\mathbf{u}}/(n - k - 1),$$
which is the same as equation (3.56).

Theorem E.4 (Unbiasedness of $\hat{\sigma}^2$). Under Assumptions E.1 through E.4, $\hat{\sigma}^2$ is unbiased for $\sigma^2$: E($\hat{\sigma}^2|\mathbf{X}$) = $\sigma^2$ for all $\sigma^2 > 0$.

PROOF: Write $\hat{\mathbf{u}} = \mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}} = \mathbf{y} - \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y} = \mathbf{M}\mathbf{y} = \mathbf{M}\mathbf{u}$, where $\mathbf{M} = \mathbf{I}_n - \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'$, and the last equality follows because $\mathbf{M}\mathbf{X} = \mathbf{0}$. Because $\mathbf{M}$ is symmetric and idempotent,
$$\hat{\mathbf{u}}'\hat{\mathbf{u}} = \mathbf{u}'\mathbf{M}'\mathbf{M}\mathbf{u} = \mathbf{u}'\mathbf{M}\mathbf{u}.$$
Because $\mathbf{u}'\mathbf{M}\mathbf{u}$ is a scalar, it equals its trace. Therefore,
$$\text{E}(\mathbf{u}'\mathbf{M}\mathbf{u}|\mathbf{X}) = \text{E}[\text{tr}(\mathbf{u}'\mathbf{M}\mathbf{u})|\mathbf{X}] = \text{E}[\text{tr}(\mathbf{M}\mathbf{u}\mathbf{u}')|\mathbf{X}] = \text{tr}[\mathbf{M}\,\text{E}(\mathbf{u}\mathbf{u}'|\mathbf{X})] = \text{tr}(\mathbf{M}\sigma^2\mathbf{I}_n) = \sigma^2\text{tr}(\mathbf{M}) = \sigma^2(n - k - 1).$$
The last equality follows from tr($\mathbf{M}$) = tr($\mathbf{I}_n$) $-$ tr[$\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'$] = $n$ $-$ tr[$(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{X}$] = $n$ $-$ tr($\mathbf{I}_{k+1}$) = $n - (k + 1) = n - k - 1$. Therefore, E($\hat{\sigma}^2|\mathbf{X}$) = E($\mathbf{u}'\mathbf{M}\mathbf{u}|\mathbf{X}$)/($n - k - 1$) = $\sigma^2$.

E.3 Statistical Inference

When we add the final classical linear model assumption, $\hat{\boldsymbol{\beta}}$ has a multivariate normal distribution, which leads to the t and F distributions for the standard test statistics covered in Chapter 4.

Assumption E.5 (Normality of Errors). Conditional on $\mathbf{X}$, the $u_t$ are independent and identically distributed as Normal(0, $\sigma^2$). Equivalently, $\mathbf{u}$ given $\mathbf{X}$ is distributed as multivariate normal with mean zero and variance-covariance matrix $\sigma^2\mathbf{I}_n$: $\mathbf{u} \sim \text{Normal}(\mathbf{0}, \sigma^2\mathbf{I}_n)$.

Under Assumption E.5, each $u_t$ is independent of the explanatory variables for all $t$. In a time series setting, this is essentially the strict exogeneity assumption.
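Theorems E.1 and E.4 can be illustrated by simulation: holding $\mathbf{X}$ fixed, redraw $\mathbf{u}$ many times and average the estimates. A minimal sketch (illustrative parameter values; not from the text):

```python
import numpy as np

rng = np.random.default_rng(5)
n, k = 50, 2
sigma2 = 4.0
X = np.column_stack([np.ones(n), rng.standard_normal((n, k))])
beta = np.array([1.0, -0.5, 2.0])

b_draws, s2_draws = [], []
for _ in range(20_000):
    u = rng.normal(scale=np.sqrt(sigma2), size=n)
    y = X @ beta + u
    b = np.linalg.solve(X.T @ X, X.T @ y)
    resid = y - X @ b
    b_draws.append(b)
    s2_draws.append(resid @ resid / (n - k - 1))   # sigma^2-hat with df = n - k - 1

print(np.mean(b_draws, axis=0))   # close to beta (Theorem E.1)
print(np.mean(s2_draws))          # close to sigma2 = 4.0 (Theorem E.4)
```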
Theorem E.5 (Normality of $\hat{\boldsymbol{\beta}}$). Under the classical linear model Assumptions E.1 through E.5, $\hat{\boldsymbol{\beta}}$ conditional on $\mathbf{X}$ is distributed as multivariate normal with mean $\boldsymbol{\beta}$ and variance-covariance matrix $\sigma^2(\mathbf{X}'\mathbf{X})^{-1}$.

Theorem E.5 is the basis for statistical inference involving $\boldsymbol{\beta}$. In fact, along with the properties of the chi-square, t, and F distributions that we summarized in Appendix D, we can use Theorem E.5 to establish that t statistics have a t distribution under Assumptions E.1 through E.5 (under the null hypothesis), and likewise for F statistics. We illustrate with a proof for the t statistics.

Theorem E.6 (Distribution of t Statistic). Under Assumptions E.1 through E.5,
$$(\hat{\beta}_j - \beta_j)/\text{se}(\hat{\beta}_j) \sim t_{n-k-1}, \quad j = 0, 1, \ldots, k.$$

PROOF: The proof requires several steps; the following statements are initially conditional on $\mathbf{X}$. First, by Theorem E.5, $(\hat{\beta}_j - \beta_j)/\text{sd}(\hat{\beta}_j) \sim \text{Normal}(0, 1)$, where sd($\hat{\beta}_j$) = $\sigma\sqrt{c_{jj}}$ and $c_{jj}$ is the $j$th diagonal element of $(\mathbf{X}'\mathbf{X})^{-1}$. Next, under Assumptions E.1 through E.5, conditional on $\mathbf{X}$,
$$(n - k - 1)\hat{\sigma}^2/\sigma^2 \sim \chi^2_{n-k-1}. \tag{E.18}$$
This follows because $(n - k - 1)\hat{\sigma}^2/\sigma^2 = (\mathbf{u}/\sigma)'\mathbf{M}(\mathbf{u}/\sigma)$, where $\mathbf{M}$ is the $n \times n$ symmetric, idempotent matrix defined in Theorem E.4. But $\mathbf{u}/\sigma \sim \text{Normal}(\mathbf{0}, \mathbf{I}_n)$ by Assumption E.5. It follows from Property 1 for the chi-square distribution in Appendix D that $(\mathbf{u}/\sigma)'\mathbf{M}(\mathbf{u}/\sigma) \sim \chi^2_{n-k-1}$ (because $\mathbf{M}$ has rank $n - k - 1$).

We also need to show that $\hat{\boldsymbol{\beta}}$ and $\hat{\sigma}^2$ are independent. But $\hat{\boldsymbol{\beta}} = \boldsymbol{\beta} + (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{u}$, and $\hat{\sigma}^2 = \mathbf{u}'\mathbf{M}\mathbf{u}/(n - k - 1)$. Now, $[(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}']\mathbf{M} = \mathbf{0}$ because $\mathbf{X}'\mathbf{M} = \mathbf{0}$. It follows from Property 5 of the multivariate normal distribution in Appendix D that $\hat{\boldsymbol{\beta}}$ and $\mathbf{M}\mathbf{u}$ are independent. Because $\hat{\sigma}^2$ is a function of $\mathbf{M}\mathbf{u}$, $\hat{\boldsymbol{\beta}}$ and $\hat{\sigma}^2$ are also independent.

Finally, we can write
$$(\hat{\beta}_j - \beta_j)/\text{se}(\hat{\beta}_j) = [(\hat{\beta}_j - \beta_j)/\text{sd}(\hat{\beta}_j)]/(\hat{\sigma}^2/\sigma^2)^{1/2},$$
which is the ratio of a standard normal random variable and the square root of a $\chi^2_{n-k-1}/(n - k - 1)$ random variable. We just showed that these are independent, so, by definition of a t random variable, $(\hat{\beta}_j - \beta_j)/\text{se}(\hat{\beta}_j)$ has the $t_{n-k-1}$ distribution. Because this distribution does not depend on $\mathbf{X}$, it is the unconditional distribution of $(\hat{\beta}_j - \beta_j)/\text{se}(\hat{\beta}_j)$ as well.

From this theorem, we can plug in any hypothesized value for $\beta_j$ and use the t statistic for testing hypotheses, as usual.

Under Assumptions E.1 through E.5, we can compute what is known as the Cramer-Rao lower bound for the variance-covariance matrix of unbiased estimators of $\boldsymbol{\beta}$ (again conditional on $\mathbf{X}$) [see Greene (1997, Chapter 4)]. This can be shown to be $\sigma^2(\mathbf{X}'\mathbf{X})^{-1}$, which is exactly the variance-covariance matrix of the OLS estimator. This implies that $\hat{\boldsymbol{\beta}}$ is the minimum variance unbiased estimator of $\boldsymbol{\beta}$ (conditional on $\mathbf{X}$): Var($\tilde{\boldsymbol{\beta}}|\mathbf{X}$) $-$ Var($\hat{\boldsymbol{\beta}}|\mathbf{X}$) is positive semi-definite for any other unbiased estimator $\tilde{\boldsymbol{\beta}}$; we no longer have to restrict our attention to estimators linear in $\mathbf{y}$.

It is easy to show that the OLS estimator is in fact the maximum likelihood estimator of $\boldsymbol{\beta}$ under Assumption E.5. For each $t$, the distribution of $y_t$ given $\mathbf{X}$ is Normal($\mathbf{x}_t\boldsymbol{\beta}$, $\sigma^2$). Because the $y_t$ are independent conditional on $\mathbf{X}$, the likelihood function for the sample is obtained from the product of the densities:
$$\prod_{t=1}^n (2\pi\sigma^2)^{-1/2}\exp[-(y_t - \mathbf{x}_t\boldsymbol{\beta})^2/(2\sigma^2)],$$
where $\Pi$ denotes product.
logarithm a n t51 321122log12ps22 2 1yt 2 xtb2 212s22 4 For obtaining b this is the same as minimizing g n t511yt 2 xtb2 2the division by 2s2 does not affect the optimizationwhich is just the problem that OLS solves The estimator of s2 that we have used SSR1n 2 k2 turns out not to be the MLE of s2 the MLE is SSRn which is a biased estimator Be cause the unbiased estimator of s2 results in t and F statistics with exact t and F distributions under the null it is always used instead of the MLE That the OLS estimator is the MLE under Assumption E5 implies an interesting robustness property of the MLE based on the normal distribution The reasoning is simple We know that the OLS estimator is unbiased under Assumptions E1 to E3 normality of the errors is used nowhere in the proof and neither is Assumption E4 As the next section shows the OLS estimator is also consis tent without normality provided the law of large numbers holds as is widely true These statistical properties of the OLS estimator imply that the MLE based on the normal loglikelihood function is robust to distributional specification the distribution can be almost anything and yet we still obtain a consistent and under E1 to E3 unbiased estimator As discussed in Section 173 a maximum likelihood estimator obtained without assuming the distribution is correct is often called a quasi maximum likelihood estimator QMLE Generally consistency of the MLE relies on having a correct distribution in order to con clude that it is consistent for the parameters We have just seen that the normal distribution is a no table exception There are some other distributions that share this property including the Poisson distributionas discussed in Section 173 Wooldridge 2010 Chapter 18 discusses some other useful examples E4 Some Asymptotic Analysis The matrix approach to the multiple regression model can also make derivations of asymptotic prop erties more concise In fact we can give general proofs of the claims in Chapter 11 We begin by proving the consistency result of Theorem 111 Recall that these assumptions con tain as a special case the assumptions for crosssectional analysis under random sampling Proof of Theorem 111 As in Problem E1 and using Assumption TS1r we write the OLS estimator as b 5 a a n t51 xtr xtb 21 a a n t51 xtr ytb 5 a a n t51 xtr xtb 21 a a n t51 xtr1xtb 1 ut2 b 5 b 1 a a n t51 xtrxtb 21 a a n t51 xtrutb E19 5 b 1 an21 a n t51 xtrxtb 21 an21 a n t51 xtrutb Copyright 2016 Cengage Learning All Rights Reserved May not be copied scanned or duplicated in whole or in part Due to electronic rights some third party content may be suppressed from the eBook andor eChapters Editorial review has deemed that any suppressed content does not materially affect the overall learning experience Cengage Learning reserves the right to remove additional content at any time if subsequent rights restrictions require it Appendix E The Linear Regression Model in Matrix Form 729 Now by the law of large numbers n21 a n t51 xtrxt S p A and n21 a n t51 xtrut S p 0 E20 where A 5 E1xtr xt2 is a 1k 1 12 3 1k 1 12 nonsingular matrix under Assumption TS2r and we have used the fact that E1xtrut2 5 0 under Assumption TS3r Now we must use a matrix version of Property PLIM1 in Appendix C Namely because A is nonsingular an21 a n t51 xrt xtb 21 S p A21 E21 Wooldridge 2010 Chapter 3 contains a discussion of these kinds of convergence results It now follows from E19 E20 and E21 that plim1b 2 5 b 1 A21 0 5 b This completes the proof Next we sketch a proof of the asymptotic 
Proof of Theorem 11.2. From equation (E.19), we can write
$$\sqrt{n}(\hat{\boldsymbol{\beta}} - \boldsymbol{\beta}) = \left(n^{-1}\sum_{t=1}^n \mathbf{x}_t'\mathbf{x}_t\right)^{-1}\left(n^{-1/2}\sum_{t=1}^n \mathbf{x}_t'u_t\right) = \mathbf{A}^{-1}\left(n^{-1/2}\sum_{t=1}^n \mathbf{x}_t'u_t\right) + o_p(1), \tag{E.22}$$
where the term $o_p(1)$ is a remainder term that converges in probability to zero. This term is equal to $[(n^{-1}\sum_{t=1}^n \mathbf{x}_t'\mathbf{x}_t)^{-1} - \mathbf{A}^{-1}](n^{-1/2}\sum_{t=1}^n \mathbf{x}_t'u_t)$. The term in brackets converges in probability to zero (by the same argument used in the proof of Theorem 11.1), while $(n^{-1/2}\sum_{t=1}^n \mathbf{x}_t'u_t)$ is bounded in probability because it converges to a multivariate normal distribution by the central limit theorem. A well-known result in asymptotic theory is that the product of such terms converges in probability to zero. Further, $\sqrt{n}(\hat{\boldsymbol{\beta}} - \boldsymbol{\beta})$ inherits its asymptotic distribution from $\mathbf{A}^{-1}(n^{-1/2}\sum_{t=1}^n \mathbf{x}_t'u_t)$. [See Wooldridge (2010, Chapter 3) for more details on the convergence results used in this proof.]

By the central limit theorem, $n^{-1/2}\sum_{t=1}^n \mathbf{x}_t'u_t$ has an asymptotic normal distribution with mean zero and, say, $(k+1) \times (k+1)$ variance-covariance matrix $\mathbf{B}$. Then, $\sqrt{n}(\hat{\boldsymbol{\beta}} - \boldsymbol{\beta})$ has an asymptotic multivariate normal distribution with mean zero and variance-covariance matrix $\mathbf{A}^{-1}\mathbf{B}\mathbf{A}^{-1}$. We now show that, under Assumptions TS.4$'$ and TS.5$'$, $\mathbf{B} = \sigma^2\mathbf{A}$. (The general expression is useful because it underlies heteroskedasticity-robust and serial correlation-robust standard errors for OLS, of the kind discussed in Chapter 12.)

First, under Assumption TS.5$'$, $\mathbf{x}_t'u_t$ and $\mathbf{x}_s'u_s$ are uncorrelated for $t \neq s$. Why? Suppose $s < t$ for concreteness. Then, by the law of iterated expectations,
$$\text{E}(\mathbf{x}_t'u_tu_s\mathbf{x}_s) = \text{E}[\text{E}(u_tu_s|\mathbf{x}_t, \mathbf{x}_s)\mathbf{x}_t'\mathbf{x}_s] = \text{E}[0 \cdot \mathbf{x}_t'\mathbf{x}_s] = \mathbf{0}.$$
The zero covariances imply that the variance of the sum is the sum of the variances. But Var($\mathbf{x}_t'u_t$) = E($\mathbf{x}_t'u_tu_t\mathbf{x}_t$) = E($u_t^2\mathbf{x}_t'\mathbf{x}_t$). By the law of iterated expectations,
$$\text{E}(u_t^2\mathbf{x}_t'\mathbf{x}_t) = \text{E}[\text{E}(u_t^2\mathbf{x}_t'\mathbf{x}_t|\mathbf{x}_t)] = \text{E}[\text{E}(u_t^2|\mathbf{x}_t)\mathbf{x}_t'\mathbf{x}_t] = \text{E}(\sigma^2\mathbf{x}_t'\mathbf{x}_t) = \sigma^2\text{E}(\mathbf{x}_t'\mathbf{x}_t) = \sigma^2\mathbf{A},$$
where we use E($u_t^2|\mathbf{x}_t$) = $\sigma^2$ under Assumptions TS.3$'$ and TS.4$'$. This shows that $\mathbf{B} = \sigma^2\mathbf{A}$, and so, under Assumptions TS.1$'$ to TS.5$'$, we have
$$\sqrt{n}(\hat{\boldsymbol{\beta}} - \boldsymbol{\beta}) \overset{a}{\sim} \text{Normal}(\mathbf{0}, \sigma^2\mathbf{A}^{-1}). \tag{E.23}$$
This completes the proof.

From equation (E.23), we treat $\hat{\boldsymbol{\beta}}$ as if it is approximately normally distributed with mean $\boldsymbol{\beta}$ and variance-covariance matrix $\sigma^2\mathbf{A}^{-1}/n$. The division by the sample size $n$ is expected here: the approximation to the variance-covariance matrix of $\hat{\boldsymbol{\beta}}$ shrinks to zero at the rate $1/n$. When we replace $\sigma^2$ with its consistent estimator $\hat{\sigma}^2 = \text{SSR}/(n - k - 1)$ and replace $\mathbf{A}$ with its consistent estimator $n^{-1}\sum_{t=1}^n \mathbf{x}_t'\mathbf{x}_t = \mathbf{X}'\mathbf{X}/n$, we obtain an estimator for the asymptotic variance of $\hat{\boldsymbol{\beta}}$:
$$\widehat{\text{Avar}}(\hat{\boldsymbol{\beta}}) = \hat{\sigma}^2(\mathbf{X}'\mathbf{X})^{-1}. \tag{E.24}$$
Notice how the two divisions by $n$ cancel, and the right-hand side of (E.24) is just the usual way we estimate the variance matrix of the OLS estimator under the Gauss-Markov assumptions. To summarize, we have shown that, under Assumptions TS.1$'$ to TS.5$'$ (which contain MLR.1 to MLR.5 as special cases), the usual standard errors and t statistics are asymptotically valid. It is perfectly legitimate to use the usual t distribution to obtain critical values and p-values for testing a single hypothesis. Interestingly, in the general setup of Chapter 11, assuming normality of the errors (say, $u_t$ given $\mathbf{x}_t, u_{t-1}, \mathbf{x}_{t-1}, \ldots, u_1, \mathbf{x}_1$ is distributed as Normal(0, $\sigma^2$)) does not necessarily help, as the t statistics would not generally have exact t distributions under this kind of normality assumption.
When we do not assume strict exogeneity of the explanatory variables, exact distributional results are difficult, if not impossible, to obtain.

If we modify the argument above, we can derive a heteroskedasticity-robust variance-covariance matrix. The key is that we must estimate E($u_t^2\mathbf{x}_t'\mathbf{x}_t$) separately, because this matrix no longer equals $\sigma^2$E($\mathbf{x}_t'\mathbf{x}_t$). But, if the $\hat{u}_t$ are the OLS residuals, a consistent estimator is
$$(n - k - 1)^{-1}\sum_{t=1}^n \hat{u}_t^2\mathbf{x}_t'\mathbf{x}_t, \tag{E.25}$$
where the division by $n - k - 1$, rather than $n$, is a degrees of freedom adjustment that typically helps the finite sample properties of the estimator. When we use the expression in equation (E.25), we obtain
$$\widehat{\text{Avar}}(\hat{\boldsymbol{\beta}}) = [n/(n - k - 1)](\mathbf{X}'\mathbf{X})^{-1}\left(\sum_{t=1}^n \hat{u}_t^2\mathbf{x}_t'\mathbf{x}_t\right)(\mathbf{X}'\mathbf{X})^{-1}. \tag{E.26}$$
The square roots of the diagonal elements of this matrix are the same heteroskedasticity-robust standard errors we obtained in Section 8-2 for the pure cross-sectional case. A matrix extension of the serial correlation- (and heteroskedasticity-) robust standard errors we obtained in Section 12-5 is also available, but the matrix that must replace (E.25) is complicated because of the serial correlation. [See, for example, Hamilton (1994, Section 10.5).]

E.4a Wald Statistics for Testing Multiple Hypotheses

Similar arguments can be used to obtain the asymptotic distribution of the Wald statistic for testing multiple hypotheses. Let $\mathbf{R}$ be a $q \times (k+1)$ matrix, with $q \leq (k+1)$. Assume that the $q$ restrictions on the $(k+1) \times 1$ vector of parameters, $\boldsymbol{\beta}$, can be expressed as H0: $\mathbf{R}\boldsymbol{\beta} = \mathbf{r}$, where $\mathbf{r}$ is a $q \times 1$ vector of known constants. Under Assumptions TS.1$'$ to TS.5$'$, it can be shown that, under H0,
$$[\sqrt{n}(\mathbf{R}\hat{\boldsymbol{\beta}} - \mathbf{r})]'(\sigma^2\mathbf{R}\mathbf{A}^{-1}\mathbf{R}')^{-1}[\sqrt{n}(\mathbf{R}\hat{\boldsymbol{\beta}} - \mathbf{r})] \overset{a}{\sim} \chi^2_q, \tag{E.27}$$
where $\mathbf{A} = \text{E}(\mathbf{x}_t'\mathbf{x}_t)$, as in the proofs of Theorems 11.1 and 11.2. The intuition behind equation (E.27) is simple. Because $\sqrt{n}(\hat{\boldsymbol{\beta}} - \boldsymbol{\beta})$ is roughly distributed as Normal($\mathbf{0}$, $\sigma^2\mathbf{A}^{-1}$), $\mathbf{R}[\sqrt{n}(\hat{\boldsymbol{\beta}} - \boldsymbol{\beta})] = \sqrt{n}\mathbf{R}(\hat{\boldsymbol{\beta}} - \boldsymbol{\beta})$ is approximately Normal($\mathbf{0}$, $\sigma^2\mathbf{R}\mathbf{A}^{-1}\mathbf{R}'$), by Property 3 of the multivariate normal distribution in Appendix D. Under H0, $\mathbf{R}\boldsymbol{\beta} = \mathbf{r}$, so $\sqrt{n}(\mathbf{R}\hat{\boldsymbol{\beta}} - \mathbf{r}) \overset{a}{\sim} \text{Normal}(\mathbf{0}, \sigma^2\mathbf{R}\mathbf{A}^{-1}\mathbf{R}')$ under H0. By Property 3 of the chi-square distribution, $\mathbf{z}'(\sigma^2\mathbf{R}\mathbf{A}^{-1}\mathbf{R}')^{-1}\mathbf{z} \sim \chi^2_q$ if $\mathbf{z} \sim \text{Normal}(\mathbf{0}, \sigma^2\mathbf{R}\mathbf{A}^{-1}\mathbf{R}')$. To obtain the final result formally, we need to use an asymptotic version of this property, which can be found in Wooldridge (2010, Chapter 3).
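Equation (E.26) is the familiar "sandwich" estimator. A minimal sketch of computing heteroskedasticity-robust standard errors alongside the usual ones (simulated heteroskedastic data; the names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(6)
n, k = 500, 2
X = np.column_stack([np.ones(n), rng.standard_normal((n, k))])
u = rng.standard_normal(n) * np.exp(0.5 * X[:, 1])   # error variance depends on x1
y = X @ np.array([1.0, 0.5, -1.0]) + u

b = np.linalg.solve(X.T @ X, X.T @ y)
uhat = y - X @ b
df = n - k - 1

XtX_inv = np.linalg.inv(X.T @ X)

# Usual variance estimate, equation (E.24)
V_usual = (uhat @ uhat / df) * XtX_inv

# Robust sandwich, equation (E.26): middle term is the sum of uhat_t^2 x_t' x_t
meat = (X * uhat[:, None] ** 2).T @ X
V_robust = (n / df) * XtX_inv @ meat @ XtX_inv

print(np.sqrt(np.diag(V_usual)))    # usual standard errors
print(np.sqrt(np.diag(V_robust)))   # heteroskedasticity-robust standard errors
```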
TS1 to TS6 in Chapter 10 Wq has an exact Fq n2k21 distri bution Under Assumptions TS1r to TS5r we only have the asymptotic result in E26 Neverthe less it is appropriate and common to treat the usual F statistic as having an approximate Fq n2k21 distribution A Wald statistic that is robust to heteroskedasticity of unknown form is obtained by using the matrix in E26 in place of s 21XrX2 21 and similarly for a test statistic robust to both heteroskedas ticity and serial correlation The robust versions of the test statistics cannot be computed via sums of squared residuals or Rsquareds from the restricted and unrestricted regressions Summary This appendix has provided a brief treatment of the linear regression model using matrix notation This material is included for more advanced classes that use matrix algebra but it is not needed to read the text In effect this appendix proves some of the results that we either stated without proof proved only in special cases or proved through a more cumbersome method of proof Other topicssuch as asymptotic properties instrumental variables estimation and panel data modelscan be given concise treatments us ing matrices Advanced texts in econometrics including Davidson and MacKinnon 1993 Greene 1997 Hayashi 2000 and Wooldridge 2010 can be consulted for details Key Terms Problems 1 Let xt be the 1 3 1k 1 12 vector of explanatory variables for observation t Show that the OLS estima tor b can be written as b 5 a a n t51 xtrxtb 21 a a n t51 xtrytb Dividing each summation by n shows that b is a function of sample averages 2 Let b be the 1k 1 12 3 1 vector of OLS estimates i Show that for any 1k 1 12 3 1 vector b we can write the sum of squared residuals as SSR1b2 5 u ru 1 1b 2 b2 rXrX1b 2 b2 Hint Write 1y 2 Xb2 r1y 2 Xb2 5 3u 1 X1b 2 b2 4r3u 1 X1b 2 b2 4 and use the fact that Xru 5 0 First Order Condition FrischWaugh FW theorem Matrix Notation Minimum Variance Unbiased Estimator Scalar VarianceCovariance Matrix VarianceCovariance Matrix of the OLS Estimator Wald Statistic QuasiMaximum Likelihood Estimator QMLE Copyright 2016 Cengage Learning All Rights Reserved May not be copied scanned or duplicated in whole or in part Due to electronic rights some third party content may be suppressed from the eBook andor eChapters Editorial review has deemed that any suppressed content does not materially affect the overall learning experience Cengage Learning reserves the right to remove additional content at any time if subsequent rights restrictions require it 732 Appendices ii Explain how the expression for SSRb in part i proves that b uniquely minimizes SSRb over all possible values of b assuming X has rank k 1 1 3 Let b be the OLS estimate from the regression of y on X Let A be a 1k 1 12 3 1k 1 12 nonsingular matrix and define zt xtA t 5 1 p n Therefore zt is 1 3 1k 1 12 and is a nonsingular linear com bination of xt Let Z be the n 3 1k 1 12 matrix with rows zt Let b denote the OLS estimate from a regression of y on Z i Show that b 5 A21b ii Let yt be the fitted values from the original regression and let yt be the fitted values from regress ing y on Z Show that y t 5 yt for all t 5 1 2 p n How do the residuals from the two regres sions compare iii Show that the estimated variance matrix for b is s 2A211XrX2 21A21r where s 2 is the usual vari ance estimate from regressing y on X iv Let the b j be the OLS estimates from regressing yt on 1 xt1 p xtk and let the b j be the OLS es timates from the regression of yt on 1 a1xt1 p akxtk where ai 2 0 j 5 1 p k Use the results 
Problems

1 Let $\mathbf{x}_t$ be the $1 \times (k + 1)$ vector of explanatory variables for observation $t$. Show that the OLS estimator $\hat{\boldsymbol\beta}$ can be written as

$$\hat{\boldsymbol\beta} = \left(\sum_{t=1}^{n}\mathbf{x}_t'\mathbf{x}_t\right)^{-1}\left(\sum_{t=1}^{n}\mathbf{x}_t'y_t\right).$$

Dividing each summation by $n$ shows that $\hat{\boldsymbol\beta}$ is a function of sample averages.

2 Let $\hat{\boldsymbol\beta}$ be the $(k + 1) \times 1$ vector of OLS estimates.
(i) Show that for any $(k + 1) \times 1$ vector $\mathbf{b}$, we can write the sum of squared residuals as $\mathrm{SSR}(\mathbf{b}) = \hat{\mathbf{u}}'\hat{\mathbf{u}} + (\hat{\boldsymbol\beta} - \mathbf{b})'\mathbf{X}'\mathbf{X}(\hat{\boldsymbol\beta} - \mathbf{b})$. [Hint: Write $(\mathbf{y} - \mathbf{X}\mathbf{b})'(\mathbf{y} - \mathbf{X}\mathbf{b}) = [\hat{\mathbf{u}} + \mathbf{X}(\hat{\boldsymbol\beta} - \mathbf{b})]'[\hat{\mathbf{u}} + \mathbf{X}(\hat{\boldsymbol\beta} - \mathbf{b})]$ and use the fact that $\mathbf{X}'\hat{\mathbf{u}} = \mathbf{0}$.]
(ii) Explain how the expression for $\mathrm{SSR}(\mathbf{b})$ in part (i) proves that $\hat{\boldsymbol\beta}$ uniquely minimizes $\mathrm{SSR}(\mathbf{b})$ over all possible values of $\mathbf{b}$, assuming $\mathbf{X}$ has rank $k + 1$.

3 Let $\hat{\boldsymbol\beta}$ be the OLS estimate from the regression of $\mathbf{y}$ on $\mathbf{X}$. Let $\mathbf{A}$ be a $(k + 1) \times (k + 1)$ nonsingular matrix and define $\mathbf{z}_t \equiv \mathbf{x}_t\mathbf{A}$, $t = 1, \ldots, n$. Therefore, $\mathbf{z}_t$ is $1 \times (k + 1)$ and is a nonsingular linear combination of $\mathbf{x}_t$. Let $\mathbf{Z}$ be the $n \times (k + 1)$ matrix with rows $\mathbf{z}_t$. Let $\tilde{\boldsymbol\beta}$ denote the OLS estimate from a regression of $\mathbf{y}$ on $\mathbf{Z}$.
(i) Show that $\tilde{\boldsymbol\beta} = \mathbf{A}^{-1}\hat{\boldsymbol\beta}$.
(ii) Let $\hat{y}_t$ be the fitted values from the original regression and let $\tilde{y}_t$ be the fitted values from regressing $\mathbf{y}$ on $\mathbf{Z}$. Show that $\tilde{y}_t = \hat{y}_t$, for all $t = 1, 2, \ldots, n$. How do the residuals from the two regressions compare?
(iii) Show that the estimated variance matrix for $\tilde{\boldsymbol\beta}$ is $\hat{\sigma}^2\mathbf{A}^{-1}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{A}^{-1\prime}$, where $\hat{\sigma}^2$ is the usual variance estimate from regressing $\mathbf{y}$ on $\mathbf{X}$.
(iv) Let the $\hat{\beta}_j$ be the OLS estimates from regressing $y_t$ on $1, x_{t1}, \ldots, x_{tk}$, and let the $\tilde{\beta}_j$ be the OLS estimates from the regression of $y_t$ on $1, a_1x_{t1}, \ldots, a_kx_{tk}$, where $a_j \neq 0$, $j = 1, \ldots, k$. Use the results from part (i) to find the relationship between the $\tilde{\beta}_j$ and the $\hat{\beta}_j$.
(v) Assuming the setup of part (iv), use part (iii) to show that $\mathrm{se}(\tilde{\beta}_j) = \mathrm{se}(\hat{\beta}_j)/|a_j|$.
(vi) Assuming the setup of part (iv), show that the absolute values of the t statistics for $\tilde{\beta}_j$ and $\hat{\beta}_j$ are identical.

4 Assume that the model $\mathbf{y} = \mathbf{X}\boldsymbol\beta + \mathbf{u}$ satisfies the Gauss-Markov assumptions, let $\mathbf{G}$ be a $(k + 1) \times (k + 1)$ nonsingular, nonrandom matrix, and define $\boldsymbol\delta = \mathbf{G}\boldsymbol\beta$, so that $\boldsymbol\delta$ is also a $(k + 1) \times 1$ vector. Let $\hat{\boldsymbol\beta}$ be the $(k + 1) \times 1$ vector of OLS estimators and define $\hat{\boldsymbol\delta} = \mathbf{G}\hat{\boldsymbol\beta}$ as the OLS estimator of $\boldsymbol\delta$.
(i) Show that $E(\hat{\boldsymbol\delta}|\mathbf{X}) = \boldsymbol\delta$.
(ii) Find $\mathrm{Var}(\hat{\boldsymbol\delta}|\mathbf{X})$ in terms of $\sigma^2$, $\mathbf{X}$, and $\mathbf{G}$.
(iii) Use Problem E.3 to verify that $\hat{\boldsymbol\delta}$ and the appropriate estimate of $\mathrm{Var}(\hat{\boldsymbol\delta}|\mathbf{X})$ are obtained from the regression of $\mathbf{y}$ on $\mathbf{X}\mathbf{G}^{-1}$.
(iv) Now, let $\mathbf{c}$ be a $(k + 1) \times 1$ vector with at least one nonzero entry. For concreteness, assume that $c_k \neq 0$. Define $\theta = \mathbf{c}'\boldsymbol\beta$, so that $\theta$ is a scalar. Define $\delta_j = \beta_j$, $j = 0, 1, \ldots, k - 1$, and $\delta_k = \theta$. Show how to define a $(k + 1) \times (k + 1)$ nonsingular matrix $\mathbf{G}$ so that $\boldsymbol\delta = \mathbf{G}\boldsymbol\beta$. (Hint: Each of the first $k$ rows of $\mathbf{G}$ should contain $k$ zeros and a one. What is the last row?)
(v) Show that, for the choice of $\mathbf{G}$ in part (iv),

$$\mathbf{G}^{-1} = \begin{bmatrix} 1 & 0 & \cdots & 0 & 0 \\ 0 & 1 & \cdots & 0 & 0 \\ \vdots & & \ddots & & \vdots \\ 0 & 0 & \cdots & 1 & 0 \\ -c_0/c_k & -c_1/c_k & \cdots & -c_{k-1}/c_k & 1/c_k \end{bmatrix}.$$

Use this expression for $\mathbf{G}^{-1}$ and part (iii) to conclude that $\hat\theta$ and its standard error are obtained as the coefficient on $x_{tk}/c_k$ in the regression of $y_t$ on $[1 - (c_0/c_k)x_{tk}], [x_{t1} - (c_1/c_k)x_{tk}], \ldots, [x_{t,k-1} - (c_{k-1}/c_k)x_{tk}], x_{tk}/c_k$, $t = 1, \ldots, n$. This regression is exactly the one obtained by writing $\beta_k$ in terms of $\theta$ and $\beta_0, \beta_1, \ldots, \beta_{k-1}$, plugging the result into the original model, and rearranging. Therefore, we can formally justify the trick we use throughout the text for obtaining the standard error of a linear combination of parameters.

5 Assume that the model $\mathbf{y} = \mathbf{X}\boldsymbol\beta + \mathbf{u}$ satisfies the Gauss-Markov assumptions, and let $\mathbf{Z} = \mathbf{G}(\mathbf{X})$ be an $n \times (k + 1)$ matrix function of $\mathbf{X}$; assume that $\mathbf{Z}'\mathbf{X}$ [a $(k + 1) \times (k + 1)$ matrix] is nonsingular. Define a new estimator of $\boldsymbol\beta$ by $\tilde{\boldsymbol\beta} = (\mathbf{Z}'\mathbf{X})^{-1}\mathbf{Z}'\mathbf{y}$.
(i) Show that $E(\tilde{\boldsymbol\beta}|\mathbf{X}) = \boldsymbol\beta$, so that $\tilde{\boldsymbol\beta}$ is also unbiased conditional on $\mathbf{X}$.
(ii) Find $\mathrm{Var}(\tilde{\boldsymbol\beta}|\mathbf{X})$. Make sure this is a symmetric, $(k + 1) \times (k + 1)$ matrix that depends on $\mathbf{Z}$, $\mathbf{X}$, and $\sigma^2$.
(iii) Which estimator do you prefer, $\hat{\boldsymbol\beta}$ or $\tilde{\boldsymbol\beta}$? Explain.

6 Consider the setup of the Frisch-Waugh Theorem, and define $\ddot{\mathbf{X}}_2 \equiv \mathbf{M}_1\mathbf{X}_2$ and $\ddot{\mathbf{y}} \equiv \mathbf{M}_1\mathbf{y}$. (A numeric illustration follows this problem.)
(i) Using partitioned matrices, show that the first order conditions $(\mathbf{X}'\mathbf{X})\hat{\boldsymbol\beta} = \mathbf{X}'\mathbf{y}$ can be written as
$$\mathbf{X}_1'\mathbf{X}_1\hat{\boldsymbol\beta}_1 + \mathbf{X}_1'\mathbf{X}_2\hat{\boldsymbol\beta}_2 = \mathbf{X}_1'\mathbf{y}$$
$$\mathbf{X}_2'\mathbf{X}_1\hat{\boldsymbol\beta}_1 + \mathbf{X}_2'\mathbf{X}_2\hat{\boldsymbol\beta}_2 = \mathbf{X}_2'\mathbf{y}.$$
(ii) Multiply the first set of equations by $\mathbf{X}_2'\mathbf{X}_1(\mathbf{X}_1'\mathbf{X}_1)^{-1}$ and subtract the result from the second set of equations to show that $(\mathbf{X}_2'\mathbf{M}_1\mathbf{X}_2)\hat{\boldsymbol\beta}_2 = \mathbf{X}_2'\mathbf{M}_1\mathbf{y}$, where $\mathbf{M}_1 \equiv \mathbf{I}_n - \mathbf{X}_1(\mathbf{X}_1'\mathbf{X}_1)^{-1}\mathbf{X}_1'$. Conclude that $\hat{\boldsymbol\beta}_2 = (\ddot{\mathbf{X}}_2'\ddot{\mathbf{X}}_2)^{-1}\ddot{\mathbf{X}}_2'\ddot{\mathbf{y}}$.
(iii) Use part (ii) to show that $\hat{\boldsymbol\beta}_2 = (\ddot{\mathbf{X}}_2'\ddot{\mathbf{X}}_2)^{-1}\ddot{\mathbf{X}}_2'\mathbf{y}$.
(iv) Use the fact that $\mathbf{M}_1\mathbf{X}_1 = \mathbf{0}$ to show that the residuals $\ddot{\mathbf{u}}$ from the regression of $\ddot{\mathbf{y}}$ on $\ddot{\mathbf{X}}_2$ are identical to the residuals $\hat{\mathbf{u}}$ from the regression of $\mathbf{y}$ on $(\mathbf{X}_1, \mathbf{X}_2)$. [Hint: By definition and the FW theorem, $\ddot{\mathbf{u}} = \ddot{\mathbf{y}} - \ddot{\mathbf{X}}_2\hat{\boldsymbol\beta}_2 = \mathbf{M}_1(\mathbf{y} - \mathbf{X}_2\hat{\boldsymbol\beta}_2) = \mathbf{M}_1(\mathbf{y} - \mathbf{X}_1\hat{\boldsymbol\beta}_1 - \mathbf{X}_2\hat{\boldsymbol\beta}_2)$. Now you do the rest.]
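The sketch below, which is an illustration added here rather than part of the text, verifies the Frisch-Waugh result of Problem 6 numerically: the coefficients on X2 from the full regression equal those from regressing the partialled-out y (or y itself) on the partialled-out X2. The simulated data are an assumption.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 300
X1 = np.column_stack([np.ones(n), rng.normal(size=n)])   # constant and one regressor
X2 = rng.normal(size=(n, 2)) + 0.5 * X1[:, [1]]          # correlated with X1
X = np.hstack([X1, X2])
y = X @ np.array([1.0, 0.5, -1.0, 2.0]) + rng.normal(size=n)

# Coefficients on X2 from the full regression of y on (X1, X2)
b_full = np.linalg.lstsq(X, y, rcond=None)[0][2:]

# FW theorem, part (ii): partial X1 out of X2 and y with M1, then regress
M1 = np.eye(n) - X1 @ np.linalg.inv(X1.T @ X1) @ X1.T    # residual-making matrix
b_fw = np.linalg.lstsq(M1 @ X2, M1 @ y, rcond=None)[0]
assert np.allclose(b_full, b_fw)

# Part (iii): regressing y itself (not M1 @ y) on M1 @ X2 gives the same answer,
# because M1 is symmetric and idempotent
b_fw2 = np.linalg.lstsq(M1 @ X2, y, rcond=None)[0]
assert np.allclose(b_full, b_fw2)
```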
7 Suppose that the linear model, written in matrix notation, $\mathbf{y} = \mathbf{X}\boldsymbol\beta + \mathbf{u}$, satisfies Assumptions E.1, E.2, and E.3. Partition the model as $\mathbf{y} = \mathbf{X}_1\boldsymbol\beta_1 + \mathbf{X}_2\boldsymbol\beta_2 + \mathbf{u}$, where $\mathbf{X}_1$ is $n \times (k_1 + 1)$ and $\mathbf{X}_2$ is $n \times k_2$.
(i) Consider the following proposal for estimating $\boldsymbol\beta_2$. First, regress $\mathbf{y}$ on $\mathbf{X}_1$ and obtain the residuals, say $\tilde{\mathbf{y}}$. Then, regress $\tilde{\mathbf{y}}$ on $\mathbf{X}_2$ to get $\tilde{\boldsymbol\beta}_2$. Show that $\tilde{\boldsymbol\beta}_2$ is generally biased and show what the bias is. You should find $E(\tilde{\boldsymbol\beta}_2|\mathbf{X})$ in terms of $\boldsymbol\beta_2$, $\mathbf{X}_2$, and the residual-making matrix $\mathbf{M}_1$.
(ii) As a special case, write $\mathbf{y} = \mathbf{X}_1\boldsymbol\beta_1 + \beta_k\mathbf{x}_k + \mathbf{u}$, where $\mathbf{x}_k$ is an $n \times 1$ vector on the variable $x_{tk}$. Show that $E(\tilde\beta_k|\mathbf{X}) = \left(\mathrm{SSR}_k\big/\sum_{t=1}^n x_{tk}^2\right)\beta_k$, where $\mathrm{SSR}_k$ is the sum of squared residuals from regressing $x_{tk}$ on $1, x_{t1}, x_{t2}, \ldots, x_{t,k-1}$. How come the factor multiplying $\beta_k$ is never greater than one?
(iii) Suppose you know $\boldsymbol\beta_1$. Show that the regression of $(\mathbf{y} - \mathbf{X}_1\boldsymbol\beta_1)$ on $\mathbf{X}_2$ produces an unbiased estimator of $\boldsymbol\beta_2$ (conditional on $\mathbf{X}$).

Appendix F Answers to Chapter Questions

Chapter 2

Question 2.1 When student ability, motivation, age, and other factors in u are not related to attendance, equation (2.6) would hold. This seems unlikely to be the case.

Question 2.2 About $11.05. To see this, from the average wages measured in 1976 and 2003 dollars, we can get the CPI deflator as 19.06/5.90 ≈ 3.23. When we multiply 3.42 by 3.23, we obtain about 11.05.

Question 2.3 54.65, as can be seen by plugging shareA = 60 into equation (2.28). This is not unreasonable: if Candidate A spends 60% of the total money spent, he or she is predicted to receive almost 55% of the vote.

Question 2.4 The equation will be salaryhun-hat = 9,631.91 + 185.01 roe, as is easily seen by multiplying equation (2.39) by 10.

Question 2.5 Equation (2.58) can be written as $\mathrm{Var}(\hat\beta_0) = (\sigma^2 n^{-1})\left(\sum_{i=1}^n x_i^2\right)\big/\left(\sum_{i=1}^n (x_i - \bar{x})^2\right)$, where the term multiplying $\sigma^2 n^{-1}$ is greater than or equal to one, but it is equal to one if, and only if, $\bar{x} = 0$. In this case, the variance is as small as it can possibly be: $\mathrm{Var}(\hat\beta_0) = \sigma^2/n$.

Chapter 3

Question 3.1 Just a few factors include age and gender distribution, size of the police force (or, more generally, resources devoted to crime fighting), population, and general historical factors. These factors certainly might be correlated with prbconv and avgsen, which means equation (3.5) would not hold. For example, the size of the police force is possibly correlated with both prbconv and avgsen, as some cities put more effort into crime prevention and law enforcement. We should try to bring as many of these factors into the equation as possible.

Question 3.2 We use the third property of OLS concerning predicted values and residuals: when we plug the average values of all independent variables into the OLS regression line, we obtain the average value of the dependent variable. So colGPA-hat = 1.29 + .453 hsGPA + .0094 ACT = 1.29 + .453(3.4) + .0094(24.2) ≈ 3.06. You can check the average of colGPA in GPA1 to verify this to the second decimal place.
Question 3.3 No. The variable shareA is not an exact linear function of expendA and expendB, even though it is an exact nonlinear function: shareA = 100·[expendA/(expendA + expendB)]. Therefore, it is legitimate to have expendA, expendB, and shareA as explanatory variables.

Question 3.4 As we discussed in Section 3.4, if we are interested in the effect of $x_1$ on y, correlation among the other explanatory variables ($x_2$, $x_3$, and so on) does not affect $\mathrm{Var}(\hat\beta_1)$. These variables are included as controls, and we do not have to worry about collinearity among the control variables. Of course, we are controlling for them primarily because we think they are correlated with attendance, but this is necessary to perform a ceteris paribus analysis.

Chapter 4

Question 4.1 Under these assumptions, the Gauss-Markov assumptions are satisfied: u is independent of the explanatory variables, so $E(u|x_1, \ldots, x_k) = E(u)$ and $\mathrm{Var}(u|x_1, \ldots, x_k) = \mathrm{Var}(u)$. Further, it is easily seen that $E(u) = 0$. Therefore, MLR.4 and MLR.5 hold. The classical linear model assumptions are not satisfied, because u is not normally distributed (which is a violation of MLR.6).

Question 4.2 $H_0\colon \beta_1 = 0$, $H_1\colon \beta_1 < 0$.

Question 4.3 Because $\hat\beta_1 = -.56 < 0$ and we are testing against $H_1\colon \beta_1 < 0$, the one-sided p-value is one-half of the two-sided p-value, or .043.

Question 4.4 $H_0\colon \beta_5 = \beta_6 = \beta_7 = \beta_8 = 0$; k = 8 and q = 4. The restricted version of the model is score = $\beta_0$ + $\beta_1$classize + $\beta_2$expend + $\beta_3$tchcomp + $\beta_4$enroll + u.

Question 4.5 The F statistic for testing exclusion of ACT is $[(.291 - .183)/(1 - .291)](680 - 3) \approx 103.13$. Therefore, the absolute value of the t statistic is about 10.16. The t statistic on ACT is negative, because $\hat\beta_{ACT}$ is negative, so $t_{ACT} = -10.16$.

Question 4.6 Not by much. The F test for joint significance of droprate and gradrate is easily computed from the R-squareds in the table: $F = [(.361 - .353)/(1 - .361)](402/2) \approx 2.52$. The 10% critical value is obtained from Table G.3a as 2.30, while the 5% critical value from Table G.3b is 3.00. The p-value is about .082. Thus, droprate and gradrate are jointly significant at the 10% level, but not at the 5% level. In any case, controlling for these variables has a minor effect on the bs coefficient.
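The following short sketch, added here as an illustration, reproduces the R-squared form of the F statistic from the answer to Question 4.6; the numbers are those quoted in that answer, and scipy is an assumed substitute for the statistical tables.

```python
from scipy import stats

# R-squared form of the F statistic, with the Question 4.6 numbers:
# restricted R^2 = .353, unrestricted R^2 = .361, q = 2 restrictions,
# denominator degrees of freedom = 402
r2_r, r2_ur, q, df_denom = 0.353, 0.361, 2, 402
F = ((r2_ur - r2_r) / (1 - r2_ur)) * (df_denom / q)
p_value = stats.f.sf(F, q, df_denom)
print(F, p_value)   # roughly 2.52 and .082, matching the answer above
```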
Chapter 5

Question 5.1 This requires some assumptions. It seems reasonable to assume that $\beta_2 > 0$ (score depends positively on priGPA) and Cov(skipped, priGPA) < 0 (skipped and priGPA are negatively correlated). This means that $\beta_2\delta_1 < 0$, which means that plim $\tilde\beta_1 < \beta_1$. Because $\beta_1$ is thought to be negative (or at least nonpositive), a simple regression is likely to overestimate the importance of skipping classes.

Question 5.2 $\hat\beta_j \pm 1.96\,\mathrm{se}(\hat\beta_j)$ is the asymptotic 95% confidence interval. Or, we can replace 1.96 with 2.

Chapter 6

Question 6.1 Because fincdol = 1,000·faminc, the coefficient on fincdol will be the coefficient on faminc divided by 1,000, or .0927/1,000 = .0000927. The standard error also drops by a factor of 1,000, so the t statistic does not change, nor do any of the other OLS statistics. For readability, it is better to measure family income in thousands of dollars.

Question 6.2 We can do this generally. The equation is $\log(y) = \beta_0 + \beta_1\log(x_1) + \beta_2 x_2 + \ldots$, where $x_2$ is a proportion rather than a percentage. Then, ceteris paribus, $\Delta\log(y) = \beta_2\Delta x_2$, so $100\cdot\Delta\log(y) = \beta_2(100\cdot\Delta x_2)$, or $\%\Delta y \approx \beta_2(100\cdot\Delta x_2)$. Now, because $\Delta x_2$ is the change in the proportion, $100\cdot\Delta x_2$ is a percentage point change. In particular, if $\Delta x_2 = .01$, then $100\cdot\Delta x_2 = 1$, which corresponds to a one percentage point change. But then $\beta_2$ is the percentage change in y when $100\cdot\Delta x_2 = 1$.

Question 6.3 The new model would be stndfnl = $\beta_0$ + $\beta_1$atndrte + $\beta_2$priGPA + $\beta_3$ACT + $\beta_4$priGPA² + $\beta_5$ACT² + $\beta_6$priGPA·atndrte + $\beta_7$ACT·atndrte + u. Therefore, the partial effect of atndrte on stndfnl is $\beta_1 + \beta_6$priGPA + $\beta_7$ACT. This is what we multiply by $\Delta$atndrte to obtain the ceteris paribus change in stndfnl.

Question 6.4 From equation (6.21), $\bar{R}^2 = 1 - \hat\sigma^2/[\mathrm{SST}/(n - 1)]$. For a given sample and a given dependent variable, SST/(n − 1) is fixed. When we use different sets of explanatory variables, only $\hat\sigma^2$ changes. As $\hat\sigma^2$ decreases, $\bar{R}^2$ increases. If we make $\hat\sigma$, and therefore $\hat\sigma^2$, as small as possible, we are making $\bar{R}^2$ as large as possible.

Question 6.5 One possibility is to collect data on annual earnings for a sample of actors, along with profitability of the movies in which they each appeared. In a simple regression analysis, we could relate earnings to profitability. But we should probably control for other factors that may affect salary, such as age, gender, and the kinds of movies in which the actors performed. Methods for including qualitative factors in regression models are considered in Chapter 7.
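A minimal sketch of the partial effect calculation in Question 6.3 follows. It is not from the text: the coefficient values and evaluation point below are made-up placeholders, chosen only to show how the interaction terms enter the derivative.

```python
# Partial effect of atndrte in the Question 6.3 model:
#   d stndfnl / d atndrte = b1 + b6*priGPA + b7*ACT
# Placeholder values (assumptions, not estimates from the text):
b1, b6, b7 = -0.0067, 0.0056, -0.0006
priGPA, ACT = 2.59, 22.5            # illustrative evaluation point
partial_effect = b1 + b6 * priGPA + b7 * ACT
print(partial_effect)               # multiply by a change in atndrte for the
                                    # ceteris paribus change in stndfnl
```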
Chapter 7

Question 7.1 No, because it would not be clear when party is one and when it is zero. A better name would be something like Dem, which is one for Democratic candidates and zero for Republicans. Or Rep, which is one for Republicans and zero for Democrats.

Question 7.2 With outfield as the base group, we would include the dummy variables frstbase, scndbase, thrdbase, shrtstop, and catcher.

Question 7.3 The null in this case is $H_0\colon \delta_1 = \delta_2 = \delta_3 = \delta_4 = 0$, so that there are four restrictions. As usual, we would use an F test (where q = 4 and k depends on the number of other explanatory variables).

Question 7.4 Because tenure appears as a quadratic, we should allow separate quadratics for men and women. That is, we would add the explanatory variables female·tenure and female·tenure².

Question 7.5 We plug pcnv = 0, avgsen = 0, tottime = 0, ptime86 = 0, qemp86 = 4, black = 1, and hispan = 0 into equation (7.31): arr86-hat = .380 − .038(4) + .170 = .398, or almost .4. It is hard to know whether this is "reasonable." For someone with no prior convictions who was employed throughout the year, this estimate might seem high, but remember that the population consists of men who were already arrested at least once prior to 1986.

Chapter 8

Question 8.1 This statement is clearly false. For example, in equation (8.7), the usual standard error for black is .147, while the heteroskedasticity-robust standard error is .118.

Question 8.2 The F test would be obtained by regressing $\hat{u}^2$ on marrmale, marrfem, and singfem (singmale is the base group). With n = 526 and three independent variables in this regression, the df are 3 and 522.

Question 8.3 Certainly the outcome of the statistical test suggests some cause for concern. A t statistic of 2.96 is very significant, and it implies that there is heteroskedasticity in the wealth equation. As a practical matter, we know that the WLS standard error (.063) is substantially below the heteroskedasticity-robust standard error for OLS (.104), and so the heteroskedasticity seems to be practically important. (Plus, the nonrobust OLS standard error, .061, is too optimistic. Therefore, even if we simply adjust the OLS standard error for heteroskedasticity of unknown form, there are nontrivial implications.)

Question 8.4 The 1% critical value in the F distribution with (2, ∞) df is 4.61. An F statistic of 11.15 is well above the 1% critical value, and so we strongly reject the null hypothesis that the transformed errors, $u_i/\sqrt{h_i}$, are homoskedastic. (In fact, the p-value is less than .00002, which is obtained from the $F_{2,804}$ distribution.) This means that our model for Var(u|x) is inadequate for fully eliminating the heteroskedasticity in u.
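The helper below sketches the regression-based heteroskedasticity F test described in the answer to Question 8.2: regress the squared OLS residuals on the regressors and test their joint significance. It is an illustration added here, not code from the text, and it assumes X already contains a constant column.

```python
import numpy as np
from scipy import stats

def hetero_f_test(X, y):
    """F test for heteroskedasticity: regress squared OLS residuals on the
    columns of X (X must include a constant) and test the slopes jointly."""
    n, kp1 = X.shape
    uhat = y - X @ np.linalg.lstsq(X, y, rcond=None)[0]
    u2 = uhat ** 2
    fitted = X @ np.linalg.lstsq(X, u2, rcond=None)[0]
    r2 = 1 - np.sum((u2 - fitted) ** 2) / np.sum((u2 - u2.mean()) ** 2)
    q = kp1 - 1                                  # number of slope regressors
    F = (r2 / q) / ((1 - r2) / (n - kp1))
    return F, stats.f.sf(F, q, n - kp1)
```

With the setup of Question 8.2 (n = 526, a constant, and three indicator regressors), the test returned here has 3 and 522 degrees of freedom, matching the answer above.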
seriesthen zt and zt21 can be highly correlated For example the correlation between unemt and unemt21 in PHILLIPS is 75 Question 104 No because a linear time trend with a1 0 becomes more and more negative as t gets large Since gfr cannot be negative a linear time trend with a negative trend coefficient cannot represent gfr in all future time periods Question 105 The intercept for March is b0 1 d2 Seasonal dummy variables are strictly exog enous because they follow a deterministic pattern For example the months do not change based upon whether either the explanatory variables or the dependent variables change Chapter 11 Question 111 i No because E1yt2 5 d0 1 d1t depends on t ii Yes because yt 2 E1yt2 5 et is an iid sequence Question 112 We plug inf e t 5 1122inft21 1 1122inft22 into inft 2 inf e t 5 b11unemt 2 m02 1 et and rearrange inft 2 1122 1inft21 1 inft222 5 b0 1 b1unemt 1 et where b0 5 2b1m0 as before Therefore we would regress yt on unemt where yt 5 inft 2 1122 1inft21 1 inft222 Note that we lose the first two observations in constructing yt Question 113 No because ut and ut21 are correlated In particular Cov1utut212 5 E3 1et 1 a1et212 1et21 1 a1et222 4 5 a1E1e2 t212 5 a1s2 e 2 0 if a1 2 0 If the errors are serially correlated the model cannot be dynamically complete Chapter 12 Question 121 We use equation 124 Now only adjacent terms are correlated In particular the covariance between xtut and xt11ut11 is xt xt11Cov1utut112 5 xt xt11as2 e Therefore the formula is Var1b 12 5 SST22 x a a n t51 x2 tVar1ut2 1 2 a n21 t51 xt xt11E1utut112 b 5 s2SSTx 1 12SST2 x2 a n21 t51 as2 e xt xt11 5 s2SSTx 1 as2 e12SST2 x2 a n21 t51 xt xt11 where s2 5 Var1ut2 5 s2 e 1 a2 1s2 e 5 s2 e11 1 a2 12 Unless xt and xt11 are uncorrelated in the sample the second term is nonzero whenever a1 2 0 Notice that if xt and xt11 are positively correlated and a 0 the true variance is actually smaller than the usual variance When the equation is in levels as opposed to being differenced the typical case is a 0 with positive correlation between xt and xt11 Question 122 r 6 196se1r 2 where se1r 2 is the standard error reported in the regression Or we could use the heteroskedasticityrobust standard error Showing that this is asymptotically valid is complicated because the OLS residuals depend on b j but it can be done Copyright 2016 Cengage Learning All Rights Reserved May not be copied scanned or duplicated in whole or in part Due to electronic rights some third party content may be suppressed from the eBook andor eChapters Editorial review has deemed that any suppressed content does not materially affect the overall learning experience Cengage Learning reserves the right to remove additional content at any time if subsequent rights restrictions require it Appendix F Answers to Chapter Questions 739 Question 123 The model we have in mind is ut 5 r1ut21 1 r4ut24 1 et and we want to test H0 r1 5 0 r4 5 0 against the alternative that H0 is false We would run the regression of u t on u t21 and u t24 to obtain the usual F statistic for joint significance of the two lags We are testing two restrictions Question 124 We would probably estimate the equation using first differences as r 5 92 is close enough to 1 to raise questions about the levels regression See Chapter 18 for more discussion Question 125 Because there is only one explanatory variable the White test is easy to com pute Simply regress u 2 t on returnt21 and return2 t21 with an intercept as always and compute the F test for joint significance of returnt21 and return2 
t21 If these are jointly significant at a small enough significance level we reject the null of homoskedasticity Chapter 13 Question 131 Yes assuming that we have controlled for all relevant factors The coefficient on black is 1076 and with a standard error of 174 it is not statistically different from 1 The 95 confidence interval is from about 735 to 1417 Question 132 The coefficient on highearn shows that in the absence of any change in the earnings cap high earners spend much more timeon the order of 292 on average because exp12562 2 1 292on workers compensation Question 133 First E1vi12 5 E1ai 1 ui12 5 E1ai2 1 E1vi12 5 0 Similarly E1vi22 5 0 Therefore the covariance between vi1 and vi2 is simply E1vi1vi22 5 E3 1ai 1 ui12 1ai 1 ui22 4 5 E1a2 i 2 1 E1aiui12 1 E1aiui22 1 E1ui1ui22 5 E1a2 i 2 because all of the covariance terms are zero by assumption But E1a2 i 2 5 Var1ai2 because E1ai2 5 0 This causes positive serial correlation across time in the errors within each i which biases the usual OLS standard errors in a pooled OLS regression Question 134 Because Dadmn 5 admn90 2 admn85 is the difference in binary indicators it can be 21 if and only if admn90 5 0 and admn85 5 1 In other words Washington state had an ad ministrative per se law in 1985 but it was repealed by 1990 Question 135 No just as it does not cause bias and inconsistency in a time series regression with strictly exogenous explanatory variables There are two reasons it is a concern First serial cor relation in the errors in any equation generally biases the usual OLS standard errors and test statistics Second it means that pooled OLS is not as efficient as estimators that account for the serial correla tion as in Chapter 12 Chapter 14 Question 141 Whether we use first differencing or the within transformation we will have trouble estimating the coefficient on kidsit For example using the within transformation if kidsit does not vary for family i then kidsit 5 kidsit 2 kidsi 5 0 for t 5 123 As long as some families have variation in kidsit then we can compute the fixed effects estimator but the kids coefficient could be very imprecisely estimated This is a form of multicollinearity in fixed effects estimation or first differencing estimation Question 142 If a firm did not receive a grant in the first year it may or may not receive a grant in the second year But if a firm did receive a grant in the first year it could not get a grant in Copyright 2016 Cengage Learning All Rights Reserved May not be copied scanned or duplicated in whole or in part Due to electronic rights some third party content may be suppressed from the eBook andor eChapters Editorial review has deemed that any suppressed content does not materially affect the overall learning experience Cengage Learning reserves the right to remove additional content at any time if subsequent rights restrictions require it Appendices 740 the second year That is if grant21 5 1 then grant 5 0 This induces a negative correlation between grant and grant21 We can verify this by computing a regression of grant on grant21 using the data in JTRAIN for 1989 Using all firms in the sample we get grant 5 248 2 248 grant21 10352 10722 n 5 157 R2 5 070 The coefficient on grant21 must be the negative of the intercept because grant 5 0 when grant21 5 1 Question 143 It suggests that the unobserved effect ai is positively correlated with unionii Re member pooled OLS leaves ai in the error term while fixed effects removes ai By definition ai has a positive effect on logwage By the standard 
Chapter 14

Question 14.1 Whether we use first differencing or the within transformation, we will have trouble estimating the coefficient on kids$_{it}$. For example, using the within transformation, if kids$_{it}$ does not vary for family i, then the time-demeaned variable kids$_{it}$ − $\overline{\text{kids}}_i$ = 0 for t = 1, 2, 3. As long as some families have variation in kids$_{it}$, then we can compute the fixed effects estimator, but the kids coefficient could be very imprecisely estimated. This is a form of multicollinearity in fixed effects estimation (or first-differencing estimation).

Question 14.2 If a firm did not receive a grant in the first year, it may or may not receive a grant in the second year. But if a firm did receive a grant in the first year, it could not get a grant in the second year. That is, if grant$_{-1}$ = 1, then grant = 0. This induces a negative correlation between grant and grant$_{-1}$. We can verify this by computing a regression of grant on grant$_{-1}$, using the data in JTRAIN for 1989. Using all firms in the sample, we get

grant-hat = .248 − .248 grant$_{-1}$
            (.035)  (.072)
n = 157, R² = .070.

The coefficient on grant$_{-1}$ must be the negative of the intercept, because grant-hat = 0 when grant$_{-1}$ = 1.

Question 14.3 It suggests that the unobserved effect $a_i$ is positively correlated with union$_{it}$. Remember, pooled OLS leaves $a_i$ in the error term, while fixed effects removes $a_i$. By definition, $a_i$ has a positive effect on log(wage). By the standard omitted variables analysis (see Chapter 3), OLS has an upward bias when the explanatory variable (union) is positively correlated with the omitted variable ($a_i$). Thus, belonging to a union appears to be positively related to time-constant, unobserved factors that affect wage.

Question 14.4 Not if all sisters within a family have the same mother and father. Then, because the parents' race variables would not change by sister, they would be differenced away in equation (14.13).

Chapter 15

Question 15.1 Probably not. In the simple equation (15.18), years of education is part of the error term. If some men who were assigned low draft lottery numbers obtained additional schooling, then lottery number and education are negatively correlated, which violates the first requirement for an instrumental variable in equation (15.4).

Question 15.2 (i) For equation (15.27), we require that high school peer group effects carry over to college. Namely, for a given SAT score, a student who went to a high school where smoking marijuana was more popular would smoke more marijuana in college. Even if the identification condition, equation (15.27), holds, the link might be weak.
(ii) We have to assume that the percentage of students using marijuana at a student's high school is not correlated with unobserved factors that affect college grade point average. Although we are somewhat controlling for high school quality by including SAT in the equation, this might not be enough. Perhaps high schools that did a better job of preparing students for college also had fewer students smoking marijuana. Or, marijuana usage could be correlated with average income levels. These are, of course, empirical questions that we may or may not be able to answer.

Question 15.3 Although prevalence of the NRA and subscribers to gun magazines are probably correlated with the presence of gun control legislation, it is not obvious that they are uncorrelated with unobserved factors that affect the violent crime rate. In fact, we might argue that a population interested in guns is a reflection of high crime rates, and controlling for economic and demographic variables is not sufficient to capture this. It would be hard to argue persuasively that these are truly exogenous in the violent crime equation.

Question 15.4 As usual, there are two requirements. First, it should be the case that growth in government spending is systematically related to the party of the president, after netting out the investment rate and growth in the labor force. In other words, the instrument must be partially correlated with the endogenous explanatory variable. While we might think that government spending grows more slowly under Republican presidents, this certainly has not always been true in the United States and would have to be tested using the t statistic on REP$_{t-1}$ in the reduced form $\mathrm{gGOV}_t = \pi_0 + \pi_1\mathrm{REP}_{t-1} + \pi_2\mathrm{INVRAT}_t + \pi_3\mathrm{gLAB}_t + v_t$. We must assume that the party of the president has no separate effect on gGDP. This would be violated if, for example, monetary policy differs systematically by presidential party and has a separate effect on GDP growth.
Chapter 16

Question 16.1 Probably not. It is because firms choose price and advertising expenditures jointly that we are not interested in the experiment where, say, advertising changes exogenously and we want to know the effect on price. Instead, we would model price and advertising each as a function of demand and cost variables. This is what falls out of the economic theory.

Question 16.2 We must assume two things. First, money supply growth should appear in equation (16.22), so that it is partially correlated with inf. Second, we must assume that money supply growth does not appear in equation (16.23). (If we think we must include money supply growth in equation (16.23), then we are still short an instrument for inf.) Of course, the assumption that money supply growth is exogenous can also be questioned.

Question 16.3 Use the Hausman test from Chapter 15. In particular, let $\hat{v}_2$ be the OLS residuals from the reduced form regression of open on log(pcinc) and log(land). Then, use an OLS regression of inf on open, log(pcinc), and $\hat{v}_2$ and compute the t statistic for significance of $\hat{v}_2$. If $\hat{v}_2$ is significant, the 2SLS and OLS estimates are statistically different.

Question 16.4 The demand equation looks like log(fish$_t$) = $\beta_0$ + $\beta_1$log(prcfish$_t$) + $\beta_2$log(inc$_t$) + $\beta_3$log(prcchick$_t$) + $\beta_4$log(prcbeef$_t$) + $u_{t1}$, where logarithms are used so that all elasticities are constant. By assumption, the demand function contains no seasonality, so the equation does not contain monthly dummy variables (say, feb$_t$, mar$_t$, ..., dec$_t$, with January as the base month). Also by assumption, the supply of fish is seasonal, which means that the supply function does depend on at least some of the monthly dummy variables. Even without solving the reduced form for log(prcfish$_t$), we conclude that it depends on the monthly dummy variables. Since these are exogenous, they can be used as instruments for log(prcfish$_t$) in the demand equation. Therefore, we can estimate the demand-for-fish equation using monthly dummies as the IVs for log(prcfish$_t$). Identification requires that at least one monthly dummy variable appears with a nonzero coefficient in the reduced form for log(prcfish$_t$).

Chapter 17

Question 17.1 $H_0\colon \beta_4 = \beta_5 = \beta_6 = 0$, so that there are three restrictions and therefore three df in the LR or Wald test.

Question 17.2 We need the partial derivative of $\Phi(\hat\beta_0 + \hat\beta_1\mathrm{nwifeinc} + \hat\beta_2\mathrm{educ} + \hat\beta_3\mathrm{exper} + \hat\beta_4\mathrm{exper}^2 + \ldots)$ with respect to exper, which is $\phi(\cdot)(\hat\beta_3 + 2\hat\beta_4\mathrm{exper})$, where $\phi(\cdot)$ is evaluated at the given values and the initial level of experience. Therefore, we need to evaluate the standard normal probability density at .270 − .012(20.13) + .131(12.3) + .123(10) − .0019(10²) − .053(42.5) − .868(0) + .036(1) ≈ .463, where we plug in the initial level of experience (10). But $\phi(.463) = (2\pi)^{-1/2}\exp(-.463^2/2) \approx .358$. Next, we multiply this by $\hat\beta_3 + 2\hat\beta_4\mathrm{exper}$, which is evaluated at exper = 10. The partial effect using the calculus approximation is .358[.123 − 2(.0019)(10)] ≈ .030. In other words, at the given values of the explanatory variables and starting at exper = 10, the next year of experience increases the probability of labor force participation by about .03.
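The calculation in Question 17.2 is mechanical enough that a short script, added here as an illustration, can reproduce it; the coefficients and evaluation point are the ones quoted in that answer, and scipy's standard normal density stands in for $\phi(\cdot)$.

```python
from scipy import stats

# Probit coefficients and evaluation point from Question 17.2
b0, b_nwifeinc, b_educ, b_exper, b_expersq = 0.270, -0.012, 0.131, 0.123, -0.0019
b_age, b_kidslt6, b_kidsge6 = -0.053, -0.868, 0.036
nwifeinc, educ, exper, age, kidslt6, kidsge6 = 20.13, 12.3, 10.0, 42.5, 0.0, 1.0

index = (b0 + b_nwifeinc * nwifeinc + b_educ * educ + b_exper * exper
         + b_expersq * exper**2 + b_age * age
         + b_kidslt6 * kidslt6 + b_kidsge6 * kidsge6)       # about .463
partial_effect = stats.norm.pdf(index) * (b_exper + 2 * b_expersq * exper)
print(index, partial_effect)                                # about .463 and .030
```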
Question 17.3 No. The number of extramarital affairs is a nonnegative integer, which presumably takes on zero or small numbers for a substantial fraction of the population. It is not realistic to use a Tobit model, which, while allowing a pileup at zero, treats y as being continuously distributed over positive values. Formally, assuming that $y = \max(0, y^*)$, where $y^*$ is normally distributed, is at odds with the discreteness of the number of extramarital affairs when y > 0.

Question 17.4 The adjusted standard errors are the usual Poisson MLE standard errors multiplied by $\hat\sigma = \sqrt{2} \approx 1.41$, so the adjusted standard errors will be about 41% higher. The quasi-LR statistic is the usual LR statistic divided by $\hat\sigma^2 = 2$, so it will be one-half of the usual LR statistic.

Question 17.5 By assumption, $\mathrm{mvp}_i = \beta_0 + \mathbf{x}_i\boldsymbol\beta + u_i$, where, as usual, $\mathbf{x}_i\boldsymbol\beta$ denotes a linear function of the exogenous variables. Now, observed wage is the largest of the minimum wage and the marginal value product, so wage$_i$ = max(minwage$_i$, mvp$_i$), which is very similar to equation (17.34), except that the max operator has replaced the min operator.

Chapter 18

Question 18.1 We can plug these values directly into equation (18.1) and take expectations. First, because $z_s = 0$ for all $s < 0$, $y_{-1} = \alpha + u_{-1}$. Then, $z_0 = 1$, so $y_0 = \alpha + \delta_0 + u_0$. For $h \geq 1$, $y_h = \alpha + \delta_{h-1} + \delta_h + u_h$. Because the errors have zero expected values, $E(y_{-1}) = \alpha$, $E(y_0) = \alpha + \delta_0$, and $E(y_h) = \alpha + \delta_{h-1} + \delta_h$, for all $h \geq 1$. As $h \to \infty$, $\delta_h \to 0$. It follows that $E(y_h) \to \alpha$ as $h \to \infty$; that is, the expected value of $y_h$ returns to the expected value before the increase in z, at time zero. This makes sense: although the increase in z lasted for two periods, it is still a temporary increase.

Question 18.2 Under the described setup, $\Delta y_t$ and $\Delta x_t$ are i.i.d. sequences that are independent of one another. In particular, $\Delta y_t$ and $\Delta x_t$ are uncorrelated. If $\hat\gamma_1$ is the slope coefficient from regressing $\Delta y_t$ on $\Delta x_t$, $t = 1, 2, \ldots, n$, then plim $\hat\gamma_1 = 0$. This is as it should be, as we are regressing one I(0) process on another I(0) process, and they are uncorrelated. We write the equation $\Delta y_t = \gamma_0 + \gamma_1\Delta x_t + e_t$, where $\gamma_0 = \gamma_1 = 0$. Because $\{e_t\}$ is independent of $\{\Delta x_t\}$, the strict exogeneity assumption holds. Moreover, $\{e_t\}$ is serially uncorrelated and homoskedastic. By Theorem 11.2 in Chapter 11, the t statistic for $\hat\gamma_1$ has an approximate standard normal distribution. (If $e_t$ is normally distributed, the classical linear model assumptions hold, and the t statistic has an exact t distribution.)

Question 18.3 Write $x_t = x_{t-1} + a_t$, where $\{a_t\}$ is I(0). By assumption, there is a linear combination, say $s_t = y_t - \beta x_t$, which is I(0). Now, $y_t - \beta x_{t-1} = y_t - \beta(x_t - a_t) = s_t + \beta a_t$. Because $s_t$ and $a_t$ are I(0) by assumption, so is $s_t + \beta a_t$.

Question 18.4 Just use the sum of squared residuals form of the F test (and assume homoskedasticity). The restricted SSR is obtained by regressing $\Delta\mathrm{hy6}_t - \Delta\mathrm{hy3}_{t-1} + (\mathrm{hy6}_{t-1} - \mathrm{hy3}_{t-2})$ on a constant. Notice that $\alpha_0$ is the only parameter to estimate in $\Delta\mathrm{hy6}_t = \alpha_0 + \gamma_0\Delta\mathrm{hy3}_{t-1} + \delta(\mathrm{hy6}_{t-1} - \mathrm{hy3}_{t-2})$ when the restrictions are imposed. The unrestricted sum of squared residuals is obtained from equation (18.39).

Question 18.5 We are fitting two equations: $\hat{y}_t = \hat\alpha + \hat\beta t$ and $\hat{y}_t = \hat\gamma + \hat\delta\,\mathrm{year}_t$. We can obtain the relationship between the parameters by noting that year$_t$ = t + 49. Plugging this into the second equation gives $\hat{y}_t = \hat\gamma + \hat\delta(t + 49) = (\hat\gamma + 49\hat\delta) + \hat\delta t$. Matching the slope and intercept with the first equation gives $\hat\delta = \hat\beta$ (so that the slopes on t and year$_t$ are identical) and $\hat\alpha = \hat\gamma + 49\hat\delta$. Generally, when we use year rather than t, the intercept will change, but the slope will not. (You can verify this by using one of the time series data sets, such as HSEINV or INVEN.) Whether we use t or some measure of year does not change fitted values, and, naturally, it does not change forecasts of future values. The intercept simply adjusts appropriately to different ways of including a trend in the regression.
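The following sketch, added as an illustration of Question 18.5 and not taken from the text, fits a trend model with t and with year = t + 49 on simulated data and checks that the slopes coincide while the intercepts differ by 49 times the slope.

```python
import numpy as np

rng = np.random.default_rng(4)
t = np.arange(1, 51)             # t = 1, ..., 50
year = t + 49                    # relabeled time index, as in Question 18.5
y = 2.0 + 0.3 * t + rng.normal(size=t.size)

ones = np.ones_like(t, dtype=float)
coef_t = np.linalg.lstsq(np.column_stack([ones, t]), y, rcond=None)[0]
coef_yr = np.linalg.lstsq(np.column_stack([ones, year]), y, rcond=None)[0]

print(coef_t[1], coef_yr[1])                   # identical slopes (delta = beta)
print(coef_t[0], coef_yr[0] + 49 * coef_yr[1])  # alpha = gamma + 49*delta
```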
Appendix G Statistical Tables

Table G.1 Cumulative Areas under the Standard Normal Distribution

Rows give z to the first decimal place, from −3.0 to 3.0 in steps of .1; columns give the second decimal place, .00 to .09. Representative rows:

z      .00    .01    .02    .03    .04    .05    .06    .07    .08    .09
−3.0  .0013  .0013  .0013  .0012  .0012  .0011  .0011  .0011  .0010  .0010
−2.0  .0228  .0222  .0217  .0212  .0207  .0202  .0197  .0192  .0188  .0183
−1.0  .1587  .1562  .1539  .1515  .1492  .1469  .1446  .1423  .1401  .1379
 0.0  .5000  .5040  .5080  .5120  .5160  .5199  .5239  .5279  .5319  .5359
 1.0  .8413  .8438  .8461  .8485  .8508  .8531  .8554  .8577  .8599  .8621
 2.0  .9772  .9778  .9783  .9788  .9793  .9798  .9803  .9808  .9812  .9817
 3.0  .9987  .9987  .9987  .9988  .9988  .9989  .9989  .9989  .9990  .9990

(The remaining rows, for the intermediate values of z, follow the same layout.)
Examples: If Z ~ Normal(0, 1), then P(Z ≤ −1.32) = .0934 and P(Z ≤ 1.84) = .9671.
Source: This table was generated using the Stata function normal.

Table G.2 Critical Values of the t Distribution

Significance level, 1-tailed: .10, .05, .025, .01, .005; 2-tailed: .20, .10, .05, .02, .01. Rows are degrees of freedom.

df      .10     .05    .025     .01    .005
1     3.078   6.314  12.706  31.821  63.657
2     1.886   2.920   4.303   6.965   9.925
3     1.638   2.353   3.182   4.541   5.841
4     1.533   2.132   2.776   3.747   4.604
5     1.476   2.015   2.571   3.365   4.032
6     1.440   1.943   2.447   3.143   3.707
7     1.415   1.895   2.365   2.998   3.499
8     1.397   1.860   2.306   2.896   3.355
9     1.383   1.833   2.262   2.821   3.250
10    1.372   1.812   2.228   2.764   3.169
11    1.363   1.796   2.201   2.718   3.106
12    1.356   1.782   2.179   2.681   3.055
13    1.350   1.771   2.160   2.650   3.012
14    1.345   1.761   2.145   2.624   2.977
15    1.341   1.753   2.131   2.602   2.947
16    1.337   1.746   2.120   2.583   2.921
17    1.333   1.740   2.110   2.567   2.898
18    1.330   1.734   2.101   2.552   2.878
19    1.328   1.729   2.093   2.539   2.861
20    1.325   1.725   2.086   2.528   2.845
21    1.323   1.721   2.080   2.518   2.831
22    1.321   1.717   2.074   2.508   2.819
23    1.319   1.714   2.069   2.500   2.807
24    1.318   1.711   2.064   2.492   2.797
25    1.316   1.708   2.060   2.485   2.787
26    1.315   1.706   2.056   2.479   2.779
27    1.314   1.703   2.052   2.473   2.771
28    1.313   1.701   2.048   2.467   2.763
29    1.311   1.699   2.045   2.462   2.756
30    1.310   1.697   2.042   2.457   2.750
40    1.303   1.684   2.021   2.423   2.704
60    1.296   1.671   2.000   2.390   2.660
90    1.291   1.662   1.987   2.368   2.632
120   1.289   1.658   1.980   2.358   2.617
∞     1.282   1.645   1.960   2.326   2.576
Examples: The 1% critical value for a one-tailed test with 25 df is 2.485. The 5% critical value for a two-tailed test with large (>120) df is 1.96.
Source: This table was generated using the Stata function invttail.

Table G.3a 10% Critical Values of the F Distribution

Columns give the numerator degrees of freedom (1 to 10); rows give the denominator degrees of freedom (10 to 30, then 40, 60, 90, 120, ∞). Representative rows:

Denom. df    1     2     3     4     5     6     7     8     9    10
10         3.29  2.92  2.73  2.61  2.52  2.46  2.41  2.38  2.35  2.32
20         2.97  2.59  2.38  2.25  2.16  2.09  2.04  2.00  1.96  1.94
30         2.88  2.49  2.28  2.14  2.05  1.98  1.93  1.88  1.85  1.82
40         2.84  2.44  2.23  2.09  2.00  1.93  1.87  1.83  1.79  1.76
60         2.79  2.39  2.18  2.04  1.95  1.87  1.82  1.77  1.74  1.71
120        2.75  2.35  2.13  1.99  1.90  1.82  1.77  1.72  1.68  1.65
∞          2.71  2.30  2.08  1.94  1.85  1.77  1.72  1.67  1.63  1.60

Example: The 10% critical value for numerator df = 2 and denominator df = 40 is 2.44.
Source: This table was generated using the Stata function invFtail.

Table G.3b 5% Critical Values of the F Distribution

Same layout as Table G.3a. Representative rows:

Denom. df    1     2     3     4     5     6     7     8     9    10
10         4.96  4.10  3.71  3.48  3.33  3.22  3.14  3.07  3.02  2.98
20         4.35  3.49  3.10  2.87  2.71  2.60  2.51  2.45  2.39  2.35
30         4.17  3.32  2.92  2.69  2.53  2.42  2.33  2.27  2.21  2.16
40         4.08  3.23  2.84  2.61  2.45  2.34  2.25  2.18  2.12  2.08
60         4.00  3.15  2.76  2.53  2.37  2.25  2.17  2.10  2.04  1.99
120        3.92  3.07  2.68  2.45  2.29  2.17  2.09  2.02  1.96  1.91
∞          3.84  3.00  2.60  2.37  2.21  2.10  2.01  1.94  1.88  1.83
Example: The 5% critical value for numerator df = 4 and large denominator df (∞) is 2.37.
Source: This table was generated using the Stata function invFtail.

Table G.3c 1% Critical Values of the F Distribution

Same layout as Table G.3a. Representative rows:

Denom. df     1     2     3     4     5     6     7     8     9    10
10         10.04  7.56  6.55  5.99  5.64  5.39  5.20  5.06  4.94  4.85
20          8.10  5.85  4.94  4.43  4.10  3.87  3.70  3.56  3.46  3.37
30          7.56  5.39  4.51  4.02  3.70  3.47  3.30  3.17  3.07  2.98
40          7.31  5.18  4.31  3.83  3.51  3.29  3.12  2.99  2.89  2.80
60          7.08  4.98  4.13  3.65  3.34  3.12  2.95  2.82  2.72  2.63
120         6.85  4.79  3.95  3.48  3.17  2.96  2.79  2.66  2.56  2.47
∞           6.63  4.61  3.78  3.32  3.02  2.80  2.64  2.51  2.41  2.32

Example: The 1% critical value for numerator df = 3 and denominator df = 60 is 4.13.
Source: This table was generated using the Stata function invFtail.

Table G.4 Critical Values of the Chi-Square Distribution

Rows are degrees of freedom; columns are significance levels.

df     .10    .05    .01
1     2.71   3.84   6.63
2     4.61   5.99   9.21
3     6.25   7.81  11.34
4     7.78   9.49  13.28
5     9.24  11.07  15.09
6    10.64  12.59  16.81
7    12.02  14.07  18.48
8    13.36  15.51  20.09
9    14.68  16.92  21.67
10   15.99  18.31  23.21
11   17.28  19.68  24.72
12   18.55  21.03  26.22
13   19.81  22.36  27.69
14   21.06  23.68  29.14
15   22.31  25.00  30.58
16   23.54  26.30  32.00
17   24.77  27.59  33.41
18   25.99  28.87  34.81
19   27.20  30.14  36.19
20   28.41  31.41  37.57
21   29.62  32.67  38.93
22   30.81  33.92  40.29
23   32.01  35.17  41.64
24   33.20  36.42  42.98
25   34.38  37.65  44.31
26   35.56  38.89  45.64
27   36.74  40.11  46.96
28   37.92  41.34  48.28
29   39.09  42.56  49.59
30   40.26  43.77  50.89

Example: The 5% critical value with df = 8 is 15.51.
Source: This table was generated using the Stata function invchi2tail.
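As an aside added here, the worked examples in Tables G.1 to G.4 can be reproduced with scipy in place of the Stata functions cited in the table sources; the calls below are standard scipy.stats routines.

```python
from scipy import stats

# Reproducing the worked examples from Tables G.1-G.4
print(stats.norm.cdf(-1.32))          # .0934, Table G.1 example
print(stats.t.ppf(1 - 0.01, 25))      # 2.485, Table G.2 example
print(stats.f.ppf(1 - 0.10, 2, 40))   # 2.44, Table G.3a example
print(stats.chi2.ppf(1 - 0.05, 8))    # 15.51, Table G.4 example
```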
References

Angrist, J. D. (1990), "Lifetime Earnings and the Vietnam Era Draft Lottery: Evidence from Social Security Administrative Records," American Economic Review 80, 313–336.
Angrist, J. D., and A. B. Krueger (1991), "Does Compulsory School Attendance Affect Schooling and Earnings?" Quarterly Journal of Economics 106, 979–1014.
Ashenfelter, O., and A. B. Krueger (1994), "Estimates of the Economic Return to Schooling from a New Sample of Twins," American Economic Review 84, 1157–1173.
Averett, S., and S. Korenman (1996), "The Economic Reality of the Beauty Myth," Journal of Human Resources 31, 304–330.
Ayres, I., and S. D. Levitt (1998), "Measuring Positive Externalities from Unobservable Victim Precaution: An Empirical Analysis of Lojack," Quarterly Journal of Economics 108, 43–77.
Banerjee, A., J. Dolado, J. W. Galbraith, and D. F. Hendry (1993), Co-Integration, Error-Correction, and the Econometric Analysis of Non-Stationary Data. Oxford: Oxford University Press.
Bartik, T. J. (1991), "The Effects of Property Taxes and Other Local Public Policies on the Intrametropolitan Pattern of Business Location," in Industry Location and Public Policy, ed. H. W. Herzog and A. M. Schlottmann, 57–80. Knoxville: University of Tennessee Press.
Becker, G. S. (1968), "Crime and Punishment: An Economic Approach," Journal of Political Economy 76, 169–217.
Belsley, D., E. Kuh, and R. Welsch (1980), Regression Diagnostics: Identifying Influential Data and Sources of Collinearity. New York: Wiley.
Berk, R. A. (1990), "A Primer on Robust Regression," in Modern Methods of Data Analysis, ed. J. Fox and J. S. Long, 292–324. Newbury Park, CA: Sage Publications.
Betts, J. R. (1995), "Does School Quality Matter? Evidence from the National Longitudinal Survey of Youth," Review of Economics and Statistics 77, 231–250.
Biddle, J. E., and D. S. Hamermesh (1990), "Sleep and the Allocation of Time," Journal of Political Economy 98, 922–943.
Biddle, J. E., and D. S. Hamermesh (1998), "Beauty, Productivity, and Discrimination: Lawyers' Looks and Lucre," Journal of Labor Economics 16, 172–201.
Blackburn, M., and D. Neumark (1992), "Unobserved Ability, Efficiency Wages, and Interindustry Wage Differentials," Quarterly Journal of Economics 107, 1421–1436.
Blinder, A. S., and M. W. Watson (2014), "Presidents and the U.S. Economy: An Econometric Exploration," National Bureau of Economic Research Working Paper No. 20324.
Blomström, M., R. E. Lipsey, and M. Zejan (1996), "Is Fixed Investment the Key to Economic Growth?" Quarterly Journal of Economics 111, 269–276.
Blundell, R., A. Duncan, and K. Pendakur (1998), "Semiparametric Estimation and Consumer Demand," Journal of Applied Econometrics 13, 435–461.
Bollerslev, T., R. Y. Chou, and K. F. Kroner (1992), "ARCH Modeling in Finance: A Review of the Theory and Empirical Evidence," Journal of Econometrics 52, 5–59.
Bollerslev, T., R. F. Engle, and D. B. Nelson (1994), "ARCH Models," in Handbook of Econometrics, volume 4, chapter 49, ed. R. F. Engle and D. L. McFadden, 2959–3038. Amsterdam: North-Holland.
Bound, J., D. A. Jaeger, and R. M. Baker (1995), "Problems with Instrumental Variables Estimation When the Correlation between the Instruments and Endogenous Explanatory Variables Is Weak," Journal of the American Statistical Association 90, 443–450.
Breusch, T. S., and A. R. Pagan (1979), "A Simple Test for Heteroskedasticity and Random Coefficient Variation," Econometrica 47, 987–1007.
Cameron, A. C., and P. K. Trivedi (1998), Regression Analysis of Count Data. Cambridge: Cambridge University Press.
Campbell, J. Y., and N. G. Mankiw (1990), "Permanent Income, Current Income, and Consumption," Journal of Business and Economic Statistics 8, 265–279.
Card, D. (1995), "Using Geographic Variation in College Proximity to Estimate the Return to Schooling," in Aspects of Labour Market Behavior: Essays in Honour of John Vanderkamp, ed. L. N. Christophides, E. K. Grant, and R. Swidinsky, 201–222. Toronto: University of Toronto Press.
Card, D., and A. Krueger (1992), "Does School Quality Matter? Returns to Education and the Characteristics of Public Schools in the United States," Journal of Political Economy 100, 1–40.
Castillo-Freeman, A. J., and R. B. Freeman (1992), "When the Minimum Wage Really Bites: The Effect of the U.S.-Level Minimum on Puerto Rico," in Immigration and the Work Force, ed. G. J. Borjas and R. B. Freeman, 177–211. Chicago: University of Chicago Press.
Clark, K. B. (1984), "Unionization and Firm Performance: The Impact on Profits, Growth, and Productivity," American Economic Review 74, 893–919.
Cloninger, D. O. (1991), "Lethal Police Response as a Crime Deterrent: 57-City Study Suggests a Decrease in Certain Crimes," American Journal of Economics and Sociology 50, 59–69.
Cloninger, D. O., and L. C. Sartorius (1979), "Crime Rates, Clearance Rates and Enforcement Effort: The Case of Houston, Texas," American Journal of Economics and Sociology 38, 389–402.
Cochrane, J. H. (1997), "Where Is the Market Going? Uncertain Facts and Novel Theories," Economic Perspectives 21, Federal Reserve Bank of Chicago, 3–37.
Cornwell, C., and W. N. Trumbull (1994), "Estimating the Economic Model of Crime Using Panel Data," Review of Economics and Statistics 76, 360–366.
Craig, B. R., W. E. Jackson III, and J. B. Thomson (2007), "Small Firm Finance, Credit Rationing, and the Impact of SBA-Guaranteed Lending on Local Economic Growth," Journal of Small Business Management 45, 116–132.
Currie, J. (1995), Welfare and the Well-Being of Children. Chur, Switzerland: Harwood Academic Publishers.
Currie, J., and N. Cole (1993), "Welfare and Child Health: The Link between AFDC Participation and Birth Weight," American Economic Review 83, 971–983.
Currie, J., and D. Thomas (1995), "Does Head Start Make a Difference?" American Economic Review 85, 341–364.
Davidson, R., and J. G. MacKinnon (1981), "Several Tests of Model Specification in the Presence of Alternative Hypotheses," Econometrica 49, 781–793.
Davidson, R., and J. G. MacKinnon (1993), Estimation and Inference in Econometrics. New York: Oxford University Press.
De Long, J. B., and L. H. Summers (1991), "Equipment Investment and Economic Growth," Quarterly Journal of Economics 106, 445–502.
Dickey, D. A., and W. A. Fuller (1979), "Distributions of the Estimators for Autoregressive Time Series with a Unit Root," Journal of the American Statistical Association 74, 427–431.
Diebold, F. X. (2001), Elements of Forecasting, 2nd ed. Cincinnati: South-Western.
Downes, T. A., and S. M. Greenstein (1996), "Understanding the Supply Decisions of Nonprofits: Modeling the Location of Private Schools," Rand Journal of Economics 27, 365–390.
Draper, N., and H. Smith (1981), Applied Regression Analysis, 2nd ed. New York: Wiley.
Duan, N. (1983), "Smearing Estimate: A Nonparametric Retransformation Method," Journal of the American Statistical Association 78, 605–610.
Durbin, J. (1970), "Testing for Serial Correlation in Least Squares Regressions When Some of the Regressors Are Lagged Dependent Variables," Econometrica 38, 410–421.
Durbin, J., and G. S. Watson (1950), "Testing for Serial Correlation in Least Squares Regressions I," Biometrika 37, 409–428.
Eicker, F. (1967), "Limit Theorems for Regressions with Unequal and Dependent Errors," in Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability 1, 59–82. Berkeley: University of California Press.
Eide, E. (1994), Economics of Crime: Deterrence and the Rational Offender. Amsterdam: North-Holland.
Engle, R. F. (1982), "Autoregressive Conditional Heteroskedasticity with Estimates of the Variance of United Kingdom Inflation," Econometrica 50, 987–1007.
Engle, R. F., and C. W. J. Granger (1987), "Cointegration and Error Correction: Representation, Estimation, and Testing," Econometrica 55, 251–276.
Evans, W. N., and R. M. Schwab (1995), "Finishing High School and Starting College: Do Catholic Schools Make a Difference?" Quarterly Journal of Economics 110, 941–974.
Fair, R. C. (1996), "Econometrics and Presidential Elections," Journal of Economic Perspectives 10, 89–102.
Franses, P. H., and R. Paap (2001), Quantitative Models in Marketing Research. Cambridge: Cambridge University Press.
Freeman, D. G. (2007), "Drunk Driving Legislation and Traffic Fatalities: New Evidence on BAC 08 Laws," Contemporary Economic Policy 25, 293–308.
Friedman, B. M., and K. N. Kuttner (1992), "Money, Income, Prices, and Interest Rates," American Economic Review 82, 472–492.
Geronimus, A. T., and S. Korenman (1992), "The Socioeconomic Consequences of Teen Childbearing Reconsidered," Quarterly Journal of Economics 107, 1187–1214.
Goldberger, A. S. (1991), A Course in Econometrics. Cambridge, MA: Harvard University Press.
Graddy, K. (1995), "Testing for Imperfect Competition at the Fulton Fish Market," Rand Journal of Economics 26, 75–92.
Graddy, K. (1997), "Do Fast-Food Chains Price Discriminate on the Race and Income Characteristics of an Area?" Journal of Business and Economic Statistics 15, 391–401.
Granger, C. W. J., and P. Newbold (1974), "Spurious Regressions in Econometrics," Journal of Econometrics 2, 111–120.
Public Schools Journal of Econom ic Literature 24 11411177 Harvey A 1990 The Econometric Analysis of Economic Time Series 2nd ed Cambridge MA MIT Press Hausman J A 1978 Specification Tests in Econometrics Econometrica 46 12511271 Hausman J A and D A Wise 1977 Social Experimen tation Truncated Distributions and Efficient Estimation Econometrica 45 319339 Hayasyi F 2000 Econometrics Princeton NJ Princeton University Press Heckman J J 1976 The Common Structure of Statisti cal Models of Truncation Sample Selection and Limited Dependent Variables and a Simple Estimator for Such Models Annals of Economic and Social Measurement 5 475492 Herrnstein R J and C Murray 1994 The Bell Curve Intel ligence and Class Structure in American Life New York Free Press Hersch J and L S Stratton 1997 Housework Fixed Ef fects and Wages of Married Workers Journal of Human Resources 32 285307 Hines J R 1996 Altered States Taxes and the Location of Foreign Direct Investment in America American Eco nomic Review 86 10761094 Holzer H 1991 The Spatial Mismatch Hypothesis What Has the Evidence Shown Urban Studies 28 105122 Holzer H R Block M Cheatham and J Knott 1993 Are Training Subsidies Effective The Michigan Experience Industrial and Labor Relations Review 46 625636 Horowitz J 2001 The Bootstrap in Handbook of Econo metrics volume 5 chapter 52 ed E Leamer and J L Heckman 31593228 Amsterdam North Holland Hoxby C M 1994 Do Private Schools Provide Compe tition for Public Schools National Bureau of Economic Research Working Paper Number 4978 Huber P J 1967 The Behavior of Maximum Likelihood Estimates under Nonstandard Conditions Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability 1 221233 Berkeley University of Cali fornia Press Hunter W C and M B Walker 1996 The Cultural Affinity Hypothesis and Mortgage Lending Decisions Journal of Real Estate Finance and Economics 13 5770 Hylleberg S 1992 Modelling Seasonality Oxford Oxford University Press Kane T J and C E Rouse 1995 LaborMarket Returns to Two and FourYear Colleges American Economic Review 85 600614 Kiefer N M and T J Vogelsang 2005 A New Asymp totic Theory for HeteroskedasticityAutocorrelation Robust Tests Econometric Theory 21 11301164 Kiel K A and K T McClain 1995 House Prices during Siting Decision Stages The Case of an Incinerator from Rumor through Operation Journal of Environmental Eco nomics and Management 28 241255 Kleck G and E B Patterson 1993 The Impact of Gun Control and Gun Ownership Levels on Violence Rates Journal of Quantitative Criminology 9 249287 Koenker R 1981 A Note on Studentizing a Test for Heter oskedasticity Journal of Econometrics 17 107112 Koenker R 2005 Quantile Regression Cambridge Cam bridge University Press Korenman S and D Neumark 1991 Does Marriage Re ally Make Men More Productive Journal of Human Re sources 26 282307 Korenman S and D Neumark 1992 Marriage Motherhood and Wages Journal of Human Resources 27 233255 Krueger A B 1993 How Computers Have Changed the Wage Structure Evidence from Microdata 19841989 Quarterly Journal of Economics 108 3360 Krupp C M and P S Pollard 1996 Market Respons es to Antidumping Laws Some Evidence from the US Chemical Industry Canadian Journal of Economics 29 199227 Kwiatkowski D P C B Phillips P Schmidt and Y Shin 1992 Testing the Null Hypothesis of Stationarity against the Alternative of a Unit Root How Sure Are We That Eco nomic Time Series Have a Unit Root Journal of Econo metrics 54 159178 Lalonde R J 1986 Evaluating the Econometric Evalua tions of Training Programs 
with Experimental Data Amer ican Economic Review 76 604620 Copyright 2016 Cengage Learning All Rights Reserved May not be copied scanned or duplicated in whole or in part Due to electronic rights some third party content may be suppressed from the eBook andor eChapters Editorial review has deemed that any suppressed content does not materially affect the overall learning experience Cengage Learning reserves the right to remove additional content at any time if subsequent rights restrictions require it 753 Larsen R J and M L Marx 1986 An Introduction to Math ematical Statistics and Its Applications 2nd ed Englewood Cliffs NJ PrenticeHall Leamer E E 1983 Lets Take the Con Out of Economet rics American Economic Review 73 3143 Levine P B A B Trainor and D J Zimmerman 1996 The Effect of Medicaid Abortion Funding Restrictions on Abor tions Pregnancies and Births Journal of Health Econom ics 15 555578 Levine P B and D J Zimmerman 1995 The Benefit of Additional HighSchool Math and Science Classes for Young Men and Women Journal of Business and Econom ics Statistics 13 137149 Levitt S D 1994 Using Repeat Challengers to Estimate the Effect of Campaign Spending on Election Outcomes in the US House Journal of Political Economy 102 777798 Levitt S D 1996 The Effect of Prison Population Size on Crime Rates Evidence from Prison Overcrowding Legisla tion Quarterly Journal of Economics 111 319351 Little R J A and D B Rubin 2002 Statistical Analysis with Missing Data 2nd ed Wiley New York Low S A and L R McPheters 1983 Wage Differentials and the Risk of Death An Empirical Analysis Economic Inquiry 21 271280 Lynch L M 1992 Private Sector Training and the Earn ings of Young Workers American Economic Review 82 299312 MacKinnon J G and H White 1985 Some Heteroskedas ticity Consistent Covariance Matrix Estimators with Im proved Finite Sample Properties Journal of Econometrics 29 305325 Maloney M T and R E McCormick 1993 An Examina tion of the Role that Intercollegiate Athletic Participation Plays in Academic Achievement Athletes Feats in the Classroom Journal of Human Resources 28 555570 Mankiw N G 1994 Macroeconomics 2nd ed New York Worth Mark S T T J McGuire and L E Papke 2000 The In fluence of Taxes on Employment and Population Growth Evidence from the Washington DC Metropolitan Area National Tax Journal 53 105123 McCarthy P S 1994 Relaxed Speed Limits and Highway Safety New Evidence from California Economics Letters 46 173179 McClain K T and J M Wooldridge 1995 A Simple Test for the Consistency of Dynamic Linear Regression in Rational Distributed Lag Models Economics Letters 48 235240 McCormick R E and M Tinsley 1987 Athletics versus Academics Evidence from SAT Scores Journal of Politi cal Economy 95 11031116 McFadden D L 1974 Conditional Logit Analysis of Qual itative Choice Behavior in Frontiers in Econometrics ed P Zarembka 105142 New York Academic Press Meyer B D 1995 Natural and QuasiExperiments in Eco nomics Journal of Business and Economic Statistics 13 151161 Meyer B D W K Viscusi and D L Durbin 1995 Work ers Compensation and Injury Duration Evidence from a Natural Experiment American Economic Review 85 322340 Mizon G E and J F Richard 1986 The Encompassing Principle and Its Application to Testing Nonnested Hypoth eses Econometrica 54 657678 Mroz T A 1987 The Sensitivity of an Empirical Model of Married Womens Hours of Work to Economic and Statisti cal Assumptions Econometrica 55 765799 Mullahy J and P R Portney 1990 Air Pollution Cigarette Smoking and the Production of Respiratory Health Jour nal of 
Health Economics 9 193205 Mullahy J and J L Sindelar 1994 Do Drinkers Know When to Say When An Empirical Analysis of Drunk Driv ing Economic Inquiry 32 383394 Netzer D 1992 Differences in Reliance on User Charges by American State and Local Governments Public Fi nance Quarterly 20 499511 Neumark D 1996 Sex Discrimination in Restaurant Hir ing An Audit Study Quarterly Journal of Economics 111 915941 Neumark D and W Wascher 1995 Minimum Wage Ef fects on Employment and School Enrollment Journal of Business and Economic Statistics 13 199206 Newey W K and K D West 1987 A Simple Posi tive SemiDefinite Heteroskedasticity and Autocorrela tion Consistent Covariance Matrix Econometrica 55 703708 Papke L E 1987 Subnational Taxation and Capital Mobil ity Estimates of TaxPrice Elasticities National Tax Jour nal 40 191203 Papke L E 1994 Tax Policy and Urban Development Evi dence from the Indiana Enterprise Zone Program Journal of Public Economics 54 3749 Papke L E 1995 Participation in and Contributions to 401k Pension Plans Evidence from Plan Data Journal of Human Resources 30 311325 Papke L E 1999 Are 401k Plans Replacing Other Em ployerProvided Pensions Evidence from Panel Data Journal of Human Resources 34 346368 Papke L E 2005 The Effects of Spending on Test Pass Rates Evidence from Michigan Journal of Public Eco nomics 89 821839 Papke L E and J M Wooldridge 1996 Econometric Meth ods for Fractional Response Variables with an Application to 401k Plan Participation Rates Journal of Applied Econometrics 11 619632 Park R 1966 Estimation with Heteroskedastic Error Terms Econometrica 34 888 References Copyright 2016 Cengage Learning All Rights Reserved May not be copied scanned or duplicated in whole or in part Due to electronic rights some third party content may be suppressed from the eBook andor eChapters Editorial review has deemed that any suppressed content does not materially affect the overall learning experience Cengage Learning reserves the right to remove additional content at any time if subsequent rights restrictions require it 754 References Peek J 1982 Interest Rates Income Taxes and Anticipated Inflation American Economic Review 72 980991 Pindyck R S and D L Rubinfeld 1992 Microeconomics 2nd ed New York Macmillan Ram R 1986 Government Size and Economic Growth A New Framework and Some Evidence from CrossSection and TimeSeries Data American Economic Review 76 191203 Ramanathan R 1995 Introductory Econometrics with Ap plications 3rd ed Fort Worth Dryden Press Ramey V 1991 Nonconvex Costs and the Behavior of In ventories Journal of Political Economy 99 306334 Ramsey J B 1969 Tests for Specification Errors in Clas sical Linear LeastSquares Analysis Journal of the Royal Statistical Association Series B 71 350371 Romer D 1993 Openness and Inflation Theory and Evi dence Quarterly Journal of Economics 108 869903 Rose N L 1985 The Incidence of Regulatory Rents in the Motor Carrier Industry Rand Journal of Economics 16 299318 Rose N L and A Shepard 1997 Firm Diversification and CEO Compensation Managerial Ability or Executive En trenchment Rand Journal of Economics 28 489514 Rouse C E 1998 Private School Vouchers and Student Achievement An Evaluation of the Milwaukee Parental Choice Program Quarterly Journal of Economics 113 553602 Sander W 1992 The Effect of Womens Schooling on Fer tility Economic Letters 40 229233 Savin N E and K J White 1977 The DurbinWatson Test for Serial Correlation with Extreme Sample Sizes or Many Regressors Econometrica 45 19891996 Shea J 1993 The InputOutput Approach to 
Instrument Selection Journal of Business and Economic Statistics 11 145155 Shughart W F and R D Tollison 1984 The Random Character of Merger Activity Rand Journal of Economics 15 500509 Solon G 1985 The Minimum Wage and Teenage Em ployment A Reanalysis with Attention to Serial Corre lation and Seasonality Journal of Human Resources 20 292297 Staiger D and J H Stock 1997 Instrumental Variables Regression with Weak Instruments Econometrica 65 557586 Stigler S M 1986 The History of Statistics Cambridge MA Harvard University Press Stock J H and M W Watson 1989 Interpreting the Evi dence on MoneyIncome Causality Journal of Economet rics 40 161181 Stock J H and M W Watson 1993 A Simple Estimator of Cointegrating Vectors in Higher Order Integrated Systems Econometrica 61 783820 Stock J H and M Yogo 2005 Asymptotic Distributions of Instrumental Variables Statistics with Many Instruments in Identification and Inference for Econometric Models Essays in Honor of Thomas Rothenberg ed D W K An drews and J H Stock 109120 Cambridge Cambridge University Press Stock J W and M W Watson 2008 Heteroskedasticity Robust Standard Errors for Fixed Effects Panel Data Re gression Econometrica 76 155174 Sydsaeter K and P J Hammond 1995 Mathematics for Economic Analysis Englewood Cliffs NJ Prentice Hall Terza J V 2002 Alcohol Abuse and Employment A Sec ond Look Journal of Applied Econometrics 17 393404 Tucker I B 2004 A Reexamination of the Effect of Big time Football and Basketball Success on Graduation Rates and Alumni Giving Rates Economics of Education Review 23 655661 Vella F and M Verbeek 1998 Whose Wages Do Unions Raise A Dynamic Model of Unionism and Wage Rate De termination for Young Men Journal of Applied Economet rics 13 163183 Wald A 1940 The Fitting of Straight Lines If Both Vari ables Are Subject to Error Annals of Mathematical Statis tics 11 284300 Wallis K F 1972 Testing for FourthOrder Autocorrela tion in Quarterly Regression Equations Econometrica 40 617636 White H 1980 A HeteroskedasticityConsistent Covari ance Matrix Estimator and a Direct Test for Heteroskedas ticity Econometrica 48 817838 White H 1984 Asymptotic Theory for Econometricians Orlando Academic Press White M J 1986 Property Taxes and Firm Location Evi dence from Proposition 13 in Studies in State and Local Public Finance ed H S Rosen 83112 Chicago Univer sity of Chicago Press Whittington L A J Alm and H E Peters 1990 Fertility and the Personal Exemption Implicit Pronatalist Policy in the United States American Economic Review 80 545556 Wooldridge J M 1989 A Computationally Simple Heter oskedasticity and Serial CorrelationRobust Standard Error for the Linear Regression Model Economics Letters 31 239243 Wooldridge J M 1991a A Note on Computing RSquared and Adjusted RSquared for Trending and Seasonal Data Economics Letters 36 4954 Wooldridge J M 1991b On the Application of Robust RegressionBased Diagnostics to Models of Conditional Means and Conditional Variances Journal of Economet rics 47 546 Wooldridge J M 1994a A Simple Specification Test for the Predictive Ability of Transformation Models Review of Economics and Statistics 76 5965 Copyright 2016 Cengage Learning All Rights Reserved May not be copied scanned or duplicated in whole or in part Due to electronic rights some third party content may be suppressed from the eBook andor eChapters Editorial review has deemed that any suppressed content does not materially affect the overall learning experience Cengage Learning reserves the right to remove additional content at any time if 
Glossary

A

Adjusted R-Squared: A goodness-of-fit measure in multiple regression analysis that penalizes additional explanatory variables by using a degrees of freedom adjustment in estimating the error variance. (A formula is given after this block of entries.)
Alternative Hypothesis: The hypothesis against which the null hypothesis is tested.
AR(1) Serial Correlation: The errors in a time series regression model follow an AR(1) model.
Asymptotic Bias: See inconsistency.
Asymptotic Confidence Interval: A confidence interval that is approximately valid in large sample sizes.
Asymptotic Normality: The sampling distribution of a properly normalized estimator converges to the standard normal distribution.
Asymptotic Properties: Properties of estimators and test statistics that apply when the sample size grows without bound.
Asymptotic Standard Error: A standard error that is valid in large samples.
Asymptotic t Statistic: A t statistic that has an approximate standard normal distribution in large samples.
Asymptotic Variance: The square of the value by which we must divide an estimator in order to obtain an asymptotic standard normal distribution.
Asymptotically Efficient: For consistent estimators with asymptotically normal distributions, the estimator with the smallest asymptotic variance.
Asymptotically Uncorrelated: A time series process in which the correlation between random variables at two points in time tends to zero as the time interval between them increases. See also weakly dependent.
Attenuation Bias: Bias in an estimator that is always toward zero; thus, the expected value of an estimator with attenuation bias is less in magnitude than the absolute value of the parameter.
Augmented Dickey-Fuller Test: A test for a unit root that includes lagged changes of the variable as regressors.
Autocorrelation: See serial correlation.
Autoregressive Conditional Heteroskedasticity (ARCH): A model of dynamic heteroskedasticity where the variance of the error term, given past information, depends linearly on the past squared errors.
Autoregressive Process of Order One [AR(1)]: A time series model whose current value depends linearly on its most recent value plus an unpredictable disturbance.
Auxiliary Regression: A regression used to compute a test statistic (such as the test statistics for heteroskedasticity and serial correlation) or any other regression that does not estimate the model of primary interest.
Average: The sum of n numbers divided by n.
Average Marginal Effect: See average partial effect.
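For the adjusted R-squared entry above, the formula can be written in the notation used throughout the text, with SSR the sum of squared residuals, SST the total sum of squares, n the sample size, and k the number of slope parameters:

\bar{R}^2 = 1 - \frac{SSR/(n-k-1)}{SST/(n-1)}

The degrees-of-freedom adjustment is what allows the adjusted R-squared to fall when an irrelevant regressor is added.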
Average Partial Effect: For nonconstant partial effects, the partial effect averaged across the specified population.
Average Treatment Effect: A treatment, or policy, effect averaged across the population.

B

Balanced Panel: A panel data set where all years (or periods) of data are available for all cross-sectional units.
Base Group: The group represented by the overall intercept in a multiple regression model that includes dummy explanatory variables.
Base Period: For index numbers, such as price or production indices, the period against which all other time periods are measured.
Base Value: The value assigned to the base period for constructing an index number; usually the base value is 1 or 100.
Benchmark Group: See base group.
Bernoulli (or Binary) Random Variable: A random variable that takes on the values zero or one.
Best Linear Unbiased Estimator (BLUE): Among all linear unbiased estimators, the one with the smallest variance. OLS is BLUE, conditional on the sample values of the explanatory variables, under the Gauss-Markov assumptions.
Beta Coefficients: See standardized coefficients.
Bias: The difference between the expected value of an estimator and the population value that the estimator is supposed to be estimating.
Biased Estimator: An estimator whose expectation, or sampling mean, is different from the population value it is supposed to be estimating.
Biased Towards Zero: A description of an estimator whose expectation in absolute value is less than the absolute value of the population parameter.
Binary Response Model: A model for a binary (dummy) dependent variable.
Binary Variable: See dummy variable.
Binomial Distribution: The probability distribution of the number of successes out of n independent Bernoulli trials, where each trial has the same probability of success.
Bivariate Regression Model: See simple linear regression model.
BLUE: See best linear unbiased estimator.
Bootstrap: A resampling method that draws random samples, with replacement, from the original data set.
Bootstrap Standard Error: A standard error obtained as the sample standard deviation of an estimate across all bootstrap samples.
Breusch-Godfrey Test: An asymptotically justified test for AR(p) serial correlation, with AR(1) being the most popular; the test allows for lagged dependent variables as well as other regressors that are not strictly exogenous.
Breusch-Pagan Test: A test for heteroskedasticity where the squared OLS residuals are regressed on the explanatory variables in the model. (A code sketch is given after this block of entries.)

C

Causal Effect: A ceteris paribus change in one variable that has an effect on another variable.
Censored Normal Regression Model: The special case of the censored regression model where the underlying population model satisfies the classical linear model assumptions.
Censored Regression Model: A multiple regression model where the dependent variable has been censored above or below some known threshold.
Central Limit Theorem (CLT): A key result from probability theory which implies that the sum of independent random variables, or even weakly dependent random variables, when standardized by its standard deviation, has a distribution that tends to standard normal as the sample size grows.
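To make the Breusch-Pagan entry concrete, here is a minimal sketch in Python using statsmodels; the simulated data and all variable names are hypothetical illustrations, not part of the text:

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan

# Simulate a regression with heteroskedastic errors (hypothetical data)
rng = np.random.default_rng(0)
x1, x2 = rng.normal(size=200), rng.normal(size=200)
u = rng.normal(size=200) * (1 + np.abs(x1))   # error variance depends on x1
y = 1 + 0.5 * x1 - 0.3 * x2 + u

X = sm.add_constant(np.column_stack([x1, x2]))
res = sm.OLS(y, X).fit()

# The test regresses the squared OLS residuals on the explanatory variables;
# het_breuschpagan returns the LM statistic, its p-value, and an F version
lm_stat, lm_pval, f_stat, f_pval = het_breuschpagan(res.resid, X)
print(lm_stat, lm_pval)
```

A small p-value is evidence of heteroskedasticity.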
Ceteris Paribus: All other relevant factors are held fixed.
Chi-Square Distribution: A probability distribution obtained by adding the squares of independent standard normal random variables. The number of terms in the sum equals the degrees of freedom in the distribution.
Chi-Square Random Variable: A random variable with a chi-square distribution.
Chow Statistic: An F statistic for testing the equality of regression parameters across different groups (say, men and women) or time periods (say, before and after a policy change).
Classical Errors-in-Variables (CEV): A measurement error model where the observed measure equals the actual variable plus an independent, or at least an uncorrelated, measurement error.
Classical Linear Model: The multiple linear regression model under the full set of classical linear model assumptions.
Classical Linear Model (CLM) Assumptions: The ideal set of assumptions for multiple regression analysis: for cross-sectional analysis, Assumptions MLR.1 through MLR.6, and for time series analysis, Assumptions TS.1 through TS.6. The assumptions include linearity in the parameters, no perfect collinearity, the zero conditional mean assumption, homoskedasticity, no serial correlation, and normality of the errors.
Cluster Effect: An unobserved effect that is common to all units, usually people, in the cluster.
Cluster Sample: A sample of natural clusters or groups that usually consist of people.
Clustering: The act of computing standard errors and test statistics that are robust to cluster correlation, either due to cluster sampling or to time series correlation in panel data.
Cochrane-Orcutt (CO) Estimation: A method of estimating a multiple linear regression model with AR(1) errors and strictly exogenous explanatory variables; unlike Prais-Winsten, Cochrane-Orcutt does not use the equation for the first time period.
Coefficient of Determination: See R-squared.
Cointegration: The notion that a linear combination of two series, each of which is integrated of order one, is integrated of order zero.
Column Vector: A vector of numbers arranged as a column.
Composite Error Term: In a panel data model, the sum of the time-constant unobserved effect and the idiosyncratic error.
Conditional Distribution: The probability distribution of one random variable, given the values of one or more other random variables.
Conditional Expectation: The expected or average value of one random variable, called the dependent or explained variable, that depends on the values of one or more other variables, called the independent or explanatory variables.
Conditional Forecast: A forecast that assumes the future values of some explanatory variables are known with certainty.
Conditional Median: The median of a response variable conditional on some explanatory variables.
Conditional Variance: The variance of one random variable, given one or more other random variables.
Confidence Interval (CI): A rule used to construct a random interval so that a certain percentage of all data sets, determined by the confidence level, yields an interval that contains the population value.
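As a worked illustration of the confidence interval entry, the 95% interval for a population regression coefficient takes the familiar form, where c is the 97.5th percentile in the relevant t distribution:

\hat{\beta}_j \pm c \cdot \operatorname{se}(\hat{\beta}_j)

Across repeated samples, 95% of intervals constructed this way contain the true coefficient.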
Confidence Level: The percentage of samples in which we want our confidence interval to contain the population value; 95% is the most common confidence level, but 90% and 99% are also used.
Consistency: An estimator converges in probability to the correct population value as the sample size grows.
Consistent Estimator: An estimator that converges in probability to the population parameter as the sample size grows without bound.
Consistent Test: A test where, under the alternative hypothesis, the probability of rejecting the null hypothesis converges to one as the sample size grows without bound.
Constant Elasticity Model: A model where the elasticity of the dependent variable, with respect to an explanatory variable, is constant; in multiple regression, both variables appear in logarithmic form.
Contemporaneously Homoskedastic: In a time series or panel data application, the variance of the error term, conditional on the regressors in the same time period, is constant.
Contemporaneously Exogenous: In a time series or panel data application, a regressor is contemporaneously exogenous if it is uncorrelated with the error term in the same time period, although it may be correlated with the errors in other time periods.
Continuous Random Variable: A random variable that takes on any particular value with probability zero.
Control Group: In program evaluation, the group that does not participate in the program.
Control Variable: See explanatory variable.
Corner Solution Response: A nonnegative dependent variable that is roughly continuous over strictly positive values but takes on the value zero with some regularity.
Correlated Random Effects: An approach to panel data analysis where the correlation between the unobserved effect and the explanatory variables is modeled, usually as a linear relationship.
Correlation Coefficient: A measure of linear dependence between two random variables that does not depend on units of measurement and is bounded between −1 and 1.
Count Variable: A variable that takes on nonnegative integer values.
Covariance: A measure of linear dependence between two random variables.
Covariance Stationary: A time series process with constant mean and variance where the covariance between any two random variables in the sequence depends only on the distance between them.
Covariate: See explanatory variable.
Critical Value: In hypothesis testing, the value against which a test statistic is compared to determine whether or not the null hypothesis is rejected.
Cross-Sectional Data Set: A data set collected by sampling a population at a given point in time.
Cumulative Distribution Function (cdf): A function that gives the probability of a random variable being less than or equal to any specified real number.
Cumulative Effect: At any point in time, the change in a response variable after a permanent increase in an explanatory variable, usually in the context of distributed lag models.

D

Data Censoring: A situation that arises when we do not always observe the outcome on the dependent variable because at an upper (or lower) threshold we only know that the outcome was above (or below) the threshold. See also censored regression model.
Data Frequency: The interval at which time series data are collected. Yearly, quarterly, and monthly are the most common data frequencies.
Data Mining: The practice of using the same data set to estimate numerous models in a search to find the best model.
Davidson-MacKinnon Test: A test that is used for testing a model against a nonnested alternative; it can be implemented as a t test on the fitted values from the competing model.
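A minimal Python sketch of the Davidson-MacKinnon idea described above (the simulated data and names are hypothetical): fit the competing model, then add its fitted values to the first model and examine their t statistic.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
x = rng.uniform(1, 10, size=300)
y = 2 + 0.8 * np.log(x) + rng.normal(size=300)

# Model 1: y on x.  Model 2 (nonnested competitor): y on log(x).
X1 = sm.add_constant(x)
X2 = sm.add_constant(np.log(x))
yhat2 = sm.OLS(y, X2).fit().fittedvalues

# Davidson-MacKinnon: augment Model 1 with Model 2's fitted values;
# a significant t statistic on yhat2 is evidence against Model 1.
aug = sm.add_constant(np.column_stack([x, yhat2]))
print(sm.OLS(y, aug).fit().tvalues[-1])
```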
Degrees of Freedom (df): In multiple regression analysis, the number of observations minus the number of estimated parameters.
Denominator Degrees of Freedom: In an F test, the degrees of freedom in the unrestricted model.
Dependent Variable: The variable to be explained in a multiple regression model (and a variety of other models).
Derivative: The slope of a smooth function, as defined using calculus.
Descriptive Statistic: A statistic used to summarize a set of numbers; the sample average, sample median, and sample standard deviation are the most common.
Deseasonalizing: The removing of the seasonal components from a monthly or quarterly time series.
Detrending: The practice of removing the trend from a time series.
Diagonal Matrix: A matrix with zeros for all off-diagonal entries.
Dickey-Fuller Distribution: The limiting distribution of the t statistic in testing the null hypothesis of a unit root.
Dickey-Fuller (DF) Test: A t test of the unit root null hypothesis in an AR(1) model. See also augmented Dickey-Fuller test. (A code sketch is given after this block of entries.)
Difference in Slopes: A description of a model where some slope parameters may differ by group or time period.
Difference-in-Differences Estimator: An estimator that arises in policy analysis with data for two time periods. One version of the estimator applies to independently pooled cross sections, and another to panel data sets.
Difference-Stationary Process: A time series sequence that is I(0) in its first differences.
Diminishing Marginal Effect: The marginal effect of an explanatory variable becomes smaller as the value of the explanatory variable increases.
Discrete Random Variable: A random variable that takes on at most a finite or countably infinite number of values.
Distributed Lag Model: A time series model that relates the dependent variable to current and past values of an explanatory variable.
Disturbance: See error term.
Downward Bias: The expected value of an estimator is below the population value of the parameter.
Dummy Dependent Variable: See binary response model.
Dummy Variable: A variable that takes on the value zero or one.
Dummy Variable Regression: In a panel data setting, the regression that includes a dummy variable for each cross-sectional unit, along with the remaining explanatory variables. It produces the fixed effects estimator.
Dummy Variable Trap: The mistake of including too many dummy variables among the independent variables; it occurs when an overall intercept is in the model and a dummy variable is included for each group.
Duration Analysis: An application of the censored regression model where the dependent variable is time elapsed until a certain event occurs, such as the time before an unemployed person becomes reemployed.
Durbin-Watson (DW) Statistic: A statistic used to test for first order serial correlation in the errors of a time series regression model under the classical linear model assumptions.
Dynamically Complete Model: A time series model where no further lags of either the dependent variable or the explanatory variables help to explain the mean of the dependent variable.
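To illustrate the Dickey-Fuller entries, a minimal sketch using the adfuller function from statsmodels on a simulated random walk (data and names hypothetical):

```python
import numpy as np
from statsmodels.tsa.stattools import adfuller

rng = np.random.default_rng(2)
y = np.cumsum(rng.normal(size=500))  # a random walk, which has a unit root

# adfuller augments the DF regression with lagged changes of the series;
# the null hypothesis is that y contains a unit root
adf_stat, pval, usedlag, nobs, crit, icbest = adfuller(y, regression="c")
print(adf_stat, pval, crit)
```

Failing to reject here is the expected outcome, since the simulated series is I(1).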
E

Econometric Model: An equation relating the dependent variable to a set of explanatory variables and unobserved disturbances, where unknown population parameters determine the ceteris paribus effect of each explanatory variable.
Economic Model: A relationship derived from economic theory or less formal economic reasoning.
Economic Significance: See practical significance.
Elasticity: The percentage change in one variable given a 1% ceteris paribus increase in another variable. (A formula is given after this block of entries.)
Empirical Analysis: A study that uses data in a formal econometric analysis to test a theory, estimate a relationship, or determine the effectiveness of a policy.
Endogeneity: A term used to describe the presence of an endogenous explanatory variable.
Endogenous Explanatory Variable: An explanatory variable in a multiple regression model that is correlated with the error term, either because of an omitted variable, measurement error, or simultaneity.
Endogenous Sample Selection: Nonrandom sample selection where the selection is related to the dependent variable, either directly or through the error term in the equation.
Endogenous Variables: In simultaneous equations models, variables that are determined by the equations in the system.
Engle-Granger Test: A test of the null hypothesis that two time series are not cointegrated; the statistic is obtained as the Dickey-Fuller statistic using OLS residuals.
Engle-Granger Two-Step Procedure: A two-step method for estimating error correction models, whereby the cointegrating parameter is estimated in the first stage and the error correction parameters are estimated in the second.
Error Correction Model: A time series model in first differences that also contains an error correction term, which works to bring two I(1) series back into long-run equilibrium.
Error Term: The variable in a simple or multiple regression equation that contains unobserved factors which affect the dependent variable. The error term may also include measurement errors in the observed dependent or independent variables.
Error Variance: The variance of the error term in a multiple regression model.
Errors-in-Variables: A situation where either the dependent variable or some independent variables are measured with error.
Estimate: The numerical value taken on by an estimator for a particular sample of data.
Estimator: A rule for combining data to produce a numerical value for a population parameter; the form of the rule does not depend on the particular sample obtained.
Event Study: An econometric analysis of the effects of an event, such as a change in government regulation or economic policy, on an outcome variable.
Excluding a Relevant Variable: In multiple regression analysis, leaving out a variable that has a nonzero partial effect on the dependent variable.
Exclusion Restrictions: Restrictions which state that certain variables are excluded from the model (or have zero population coefficients).
Exogenous Explanatory Variable: An explanatory variable that is uncorrelated with the error term.
Exogenous Sample Selection: A sample selection that either depends on exogenous explanatory variables or is independent of the error term in the equation of interest.
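For the elasticity and constant elasticity entries, the usual expression is:

\text{elasticity of } y \text{ with respect to } x = \frac{dy}{dx}\cdot\frac{x}{y}

In the log-log model \log(y) = \beta_0 + \beta_1\log(x) + u, the slope \beta_1 is itself this elasticity, which is why that model is called a constant elasticity model.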
Exogenous Variable: Any variable that is uncorrelated with the error term in the model of interest.
Expected Value: A measure of central tendency in the distribution of a random variable, including an estimator.
Experiment: In probability, a general term used to denote an event whose outcome is uncertain. In econometric analysis, it denotes a situation where data are collected by randomly assigning individuals to control and treatment groups.
Experimental Data: Data that have been obtained by running a controlled experiment.
Experimental Group: See treatment group.
Explained Sum of Squares (SSE): The total sample variation of the fitted values in a multiple regression model.
Explained Variable: See dependent variable.
Explanatory Variable: In regression analysis, a variable that is used to explain variation in the dependent variable.
Exponential Function: A mathematical function defined for all values that has an increasing slope but a constant proportionate change.
Exponential Smoothing: A simple method of forecasting a variable that involves a weighting of all previous outcomes on that variable.
Exponential Trend: A trend with a constant growth rate.

F

F Distribution: The probability distribution obtained by forming the ratio of two independent chi-square random variables, where each has been divided by its degrees of freedom.
F Random Variable: A random variable with an F distribution.
F Statistic: A statistic used to test multiple hypotheses about the parameters in a multiple regression model. (A formula is given after this block of entries.)
Feasible GLS (FGLS) Estimator: A GLS procedure where variance or correlation parameters are unknown and therefore must first be estimated. See also generalized least squares estimator.
Finite Distributed Lag (FDL) Model: A dynamic model where one or more explanatory variables are allowed to have lagged effects on the dependent variable.
First Difference: A transformation on a time series constructed by taking the difference of adjacent time periods, where the earlier time period is subtracted from the later time period.
First-Differenced (FD) Equation: In time series or panel data models, an equation where the dependent and independent variables have all been first differenced.
First-Differenced (FD) Estimator: In a panel data setting, the pooled OLS estimator applied to first differences of the data across time.
First Order Autocorrelation: For a time series process ordered chronologically, the correlation coefficient between pairs of adjacent observations.
First Order Conditions: The set of linear equations used to solve for the OLS estimates.
Fitted Values: The estimated values of the dependent variable when the values of the independent variables for each observation are plugged into the OLS regression line.
Fixed Effect: See unobserved effect.
Fixed Effects Estimator: For the unobserved effects panel data model, the estimator obtained by applying pooled OLS to a time-demeaned equation.
Fixed Effects Model: An unobserved effects panel data model where the unobserved effects are allowed to be arbitrarily correlated with the explanatory variables in each time period.
Fixed Effects Transformation: For panel data, the time-demeaned data.
Forecast Error: The difference between the actual outcome and the forecast of the outcome.
Forecast Interval: In forecasting, a confidence interval for a yet unrealized future value of a time series variable. See also prediction interval.
Frisch-Waugh Theorem: The general algebraic result that provides multiple regression analysis with its "partialling out" interpretation.
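For the F statistic entry, the form used to test q exclusion restrictions compares the restricted (r) and unrestricted (ur) sums of squared residuals:

F = \frac{(SSR_r - SSR_{ur})/q}{SSR_{ur}/(n-k-1)}

where n-k-1 is the denominator degrees of freedom from the unrestricted model.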
Functional Form Misspecification: A problem that occurs when a model has omitted functions of the explanatory variables (such as quadratics) or uses the wrong functions of either the dependent variable or some explanatory variables.

G

Gauss-Markov Assumptions: The set of assumptions (Assumptions MLR.1 through MLR.5 or TS.1 through TS.5) under which OLS is BLUE.
Gauss-Markov Theorem: The theorem that states that, under the five Gauss-Markov assumptions (for cross-sectional or time series models), the OLS estimator is BLUE (conditional on the sample values of the explanatory variables).
Generalized Least Squares (GLS) Estimator: An estimator that accounts for a known structure of the error variance (heteroskedasticity), serial correlation pattern in the errors, or both, via a transformation of the original model.
Geometric (or Koyck) Distributed Lag: An infinite distributed lag model where the lag coefficients decline at a geometric rate.
Goodness-of-Fit Measure: A statistic that summarizes how well a set of explanatory variables explains a dependent or response variable.
Granger Causality: A limited notion of causality where past values of one series, x_t, are useful for predicting future values of another series, y_t, after past values of y_t have been controlled for.
Growth Rate: The proportionate change in a time series from the previous period. It may be approximated as the difference in logs or reported in percentage form.

H

Heckit Method: An econometric procedure used to correct for sample selection bias due to incidental truncation or some other form of nonrandomly missing data.
Heterogeneity Bias: The bias in OLS due to omitted heterogeneity (or omitted variables).
Heteroskedasticity: The variance of the error term, given the explanatory variables, is not constant.
Heteroskedasticity of Unknown Form: Heteroskedasticity that may depend on the explanatory variables in an unknown, arbitrary fashion.
Heteroskedasticity-Robust F Statistic: An F-type statistic that is asymptotically robust to heteroskedasticity of unknown form.
Heteroskedasticity-Robust LM Statistic: An LM statistic that is robust to heteroskedasticity of unknown form.
Heteroskedasticity-Robust Standard Error: A standard error that is asymptotically robust to heteroskedasticity of unknown form. (A code sketch is given after this block of entries.)
Heteroskedasticity-Robust t Statistic: A t statistic that is asymptotically robust to heteroskedasticity of unknown form.
Highly Persistent: A time series process where outcomes in the distant future are highly correlated with current outcomes.
Homoskedasticity: The errors in a regression model have constant variance conditional on the explanatory variables.
Hypothesis Test: A statistical test of the null, or maintained, hypothesis against an alternative hypothesis.

I

Idempotent Matrix: A square matrix where multiplication of the matrix by itself equals itself.
Identification: A population parameter, or set of parameters, can be consistently estimated.
Identified Equation: An equation whose parameters can be consistently estimated, especially in models with endogenous explanatory variables.
Identity Matrix: A square matrix where all diagonal elements are one and all off-diagonal elements are zero.
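A minimal statsmodels sketch of the heteroskedasticity-robust standard error entry (hypothetical data; cov_type="HC1" requests one common variant of the robust covariance matrix):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
x = rng.normal(size=400)
u = rng.normal(size=400) * (1 + np.abs(x))  # error variance depends on x
y = 1 + 2 * x + u

X = sm.add_constant(x)
usual = sm.OLS(y, X).fit()                  # usual (nonrobust) standard errors
robust = sm.OLS(y, X).fit(cov_type="HC1")   # heteroskedasticity-robust
print(usual.bse, robust.bse)
```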
Idiosyncratic Error: In panel data models, the error that changes over time as well as across units (say, individuals, firms, or cities).
Impact Elasticity: In a distributed lag model, the immediate percentage change in the dependent variable given a 1% increase in the independent variable.
Impact Multiplier: See impact propensity.
Impact Propensity: In a distributed lag model, the immediate change in the dependent variable given a one-unit increase in the independent variable.
Incidental Truncation: A sample selection problem whereby one variable, usually the dependent variable, is only observed for certain outcomes of another variable.
Inclusion of an Irrelevant Variable: Including an explanatory variable in a regression model even though it has a zero population parameter when estimating an equation by OLS.
Inconsistency: The difference between the probability limit of an estimator and the parameter value.
Inconsistent: Describes an estimator that does not converge in probability to the correct population parameter as the sample size grows.
Independent Random Variables: Random variables whose joint distribution is the product of the marginal distributions.
Independent Variable: See explanatory variable.
Independently Pooled Cross Section: A data set obtained by pooling independent random samples from different points in time.
Index Number: A statistic that aggregates information on economic activity, such as production or prices.
Infinite Distributed Lag (IDL) Model: A distributed lag model where a change in the explanatory variable can have an impact on the dependent variable into the indefinite future.
Influential Observations: See outliers.
Information Set: In forecasting, the set of variables that we can observe prior to forming our forecast.
In-Sample Criteria: Criteria for choosing forecasting models that are based on goodness-of-fit within the sample used to obtain the parameter estimates.
Instrument: See instrumental variable.
Instrument Exogeneity: In instrumental variables estimation, the requirement that an instrumental variable is uncorrelated with the error term.
Instrument Relevance: In instrumental variables estimation, the requirement that an instrumental variable helps to partially explain variation in the endogenous explanatory variable.
Instrumental Variable (IV): In an equation with an endogenous explanatory variable, an IV is a variable that does not appear in the equation, is uncorrelated with the error in the equation, and is partially correlated with the endogenous explanatory variable.
Instrumental Variables (IV) Estimator: An estimator in a linear model used when instrumental variables are available for one or more endogenous explanatory variables. (A formula is given after this block of entries.)
Integrated of Order One [I(1)]: A time series process that needs to be first-differenced in order to produce an I(0) process.
Integrated of Order Zero [I(0)]: A stationary, weakly dependent time series process that, when used in regression analysis, satisfies the law of large numbers and the central limit theorem.
Interaction Effect: In multiple regression, the partial effect of one explanatory variable depends on the value of a different explanatory variable.
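For the IV estimator entry, in the simple regression case with instrument z for the endogenous regressor x, the slope estimator is:

\hat{\beta}_1 = \frac{\sum_{i=1}^{n}(z_i - \bar{z})(y_i - \bar{y})}{\sum_{i=1}^{n}(z_i - \bar{z})(x_i - \bar{x})}

Instrument relevance keeps the denominator away from zero in large samples; instrument exogeneity delivers consistency.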
Interaction Term: An independent variable in a regression model that is the product of two explanatory variables.
Intercept: In the equation of a line, the value of the y variable when the x variable is zero.
Intercept Parameter: The parameter in a multiple linear regression model that gives the expected value of the dependent variable when all the independent variables equal zero.
Intercept Shift: The intercept in a regression model differs by group or time period.
Internet: A global computer network that can be used to access information and download databases.
Interval Estimator: A rule that uses data to obtain lower and upper bounds for a population parameter. See also confidence interval.
Inverse: For an n × n matrix, its inverse (if it exists) is the n × n matrix for which pre- and post-multiplication by the original matrix yields the identity matrix.
Inverse Mills Ratio: A term that can be added to a multiple regression model to remove sample selection bias. (A formula is given after this block of entries.)

J

Joint Distribution: The probability distribution determining the probabilities of outcomes involving two or more random variables.
Joint Hypotheses Test: A test involving more than one restriction on the parameters in a model.
Jointly Insignificant: Failure to reject, using an F test at a specified significance level, that all coefficients for a group of explanatory variables are zero.
Jointly Statistically Significant: The null hypothesis that two or more explanatory variables have zero population coefficients is rejected at the chosen significance level.
Just Identified Equation: For models with endogenous explanatory variables, an equation that is identified but would not be identified with one fewer instrumental variable.

K

Kurtosis: A measure of the thickness of the tails of a distribution based on the fourth moment of the standardized random variable; the measure is usually compared to the value for the standard normal distribution, which is three.

L

Lag Distribution: In a finite or infinite distributed lag model, the lag coefficients graphed as a function of the lag length.
Lagged Dependent Variable: An explanatory variable that is equal to the dependent variable from an earlier time period.
Lagged Endogenous Variable: In a simultaneous equations model, a lagged value of one of the endogenous variables.
Lagrange Multiplier (LM) Statistic: A test statistic with large-sample justification that can be used to test for omitted variables, heteroskedasticity, and serial correlation, among other model specification problems.
Large Sample Properties: See asymptotic properties.
Latent Variable Model: A model where the observed dependent variable is assumed to be a function of an underlying latent, or unobserved, variable.
Law of Iterated Expectations: A result from probability that relates unconditional and conditional expectations.
Law of Large Numbers (LLN): A theorem that says that the average from a random sample converges in probability to the population average; the LLN also holds for stationary and weakly dependent time series.
Leads and Lags Estimator: An estimator of a cointegrating parameter in a regression with I(1) variables, where the current, some past, and some future first differences in the explanatory variable are included as regressors.
Least Absolute Deviations (LAD): A method for estimating the parameters of a multiple regression model based on minimizing the sum of the absolute values of the residuals.
Least Squares Estimator: An estimator that minimizes a sum of squared residuals.
Level-Level Model: A regression model where the dependent variable and the independent variables are in level (or original) form.
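For the inverse Mills ratio entry, with \phi and \Phi the standard normal pdf and cdf, the term added in the second step of the Heckit method is:

\lambda(z) = \frac{\phi(z)}{\Phi(z)}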
Level-Log Model: A regression model where the dependent variable is in level form and (at least some of) the independent variables are in logarithmic form.
Likelihood Ratio Statistic: A statistic that can be used to test single or multiple hypotheses when the constrained and unconstrained models have been estimated by maximum likelihood. The statistic is twice the difference in the unconstrained and constrained log-likelihoods.
Limited Dependent Variable (LDV): A dependent or response variable whose range is restricted in some important way.
Linear Function: A function where the change in the dependent variable, given a one-unit change in an independent variable, is constant.
Linear Probability Model (LPM): A binary response model where the response probability is linear in its parameters.
Linear Time Trend: A trend that is a linear function of time.
Linear Unbiased Estimator: In multiple regression analysis, an unbiased estimator that is a linear function of the outcomes on the dependent variable.
Linearly Independent Vectors: A set of vectors such that no vector can be written as a linear combination of the others in the set.
Log Function: A mathematical function, defined only for strictly positive arguments, with a positive but decreasing slope.
Logarithmic Function: A mathematical function defined for positive arguments that has a positive, but diminishing, slope.
Logit Model: A model for binary response where the response probability is the logit function evaluated at a linear function of the explanatory variables.
Log-Level Model: A regression model where the dependent variable is in logarithmic form and the independent variables are in level (or original) form.
Log-Likelihood Function: The sum of the log-likelihoods, where the log-likelihood for each observation is the log of the density of the dependent variable given the explanatory variables; the log-likelihood function is viewed as a function of the parameters to be estimated.
Log-Log Model: A regression model where the dependent variable and (at least some of) the explanatory variables are in logarithmic form. (A summary of the four functional forms is given after this block of entries.)
Longitudinal Data: See panel data.
Long-Run Elasticity: The long-run propensity in a distributed lag model with the dependent and independent variables in logarithmic form; thus, the long-run elasticity is the eventual percentage increase in the explained variable given a permanent 1% increase in the explanatory variable.
Long-Run Multiplier: See long-run propensity.
Long-Run Propensity (LRP): In a distributed lag model, the eventual change in the dependent variable given a permanent one-unit increase in the independent variable.
Loss Function: A function that measures the loss when a forecast differs from the actual outcome; the most common examples are absolute value loss and squared loss.

M

Marginal Effect: The effect on the dependent variable that results from changing an independent variable by a small amount.
Martingale: A time series process whose expected value, given all past outcomes on the series, simply equals the most recent value.
Martingale Difference Sequence: The first difference of a martingale. It is unpredictable (or has a zero mean), given past values of the sequence.
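The level-level, level-log, log-level, and log-log entries above differ only in where logarithms appear; a compact recap of how the slope β1 is interpreted in each case, following the text's conventions:

Model        Equation                              Interpretation of β1
level-level  y = β0 + β1·x + u                     Δy = β1·Δx
level-log    y = β0 + β1·log(x) + u                Δy ≈ (β1/100)·(%Δx)
log-level    log(y) = β0 + β1·x + u                %Δy ≈ (100·β1)·Δx
log-log      log(y) = β0 + β1·log(x) + u           %Δy ≈ β1·(%Δx)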
Matched Pair Sample: A sample where each observation is matched with another, as in a sample consisting of a husband and wife, or a set of two siblings.
Matrix: An array of numbers.
Matrix Multiplication: An algorithm for multiplying together two conformable matrices.
Matrix Notation: A convenient mathematical notation, grounded in matrix algebra, for expressing and manipulating the multiple regression model.
Maximum Likelihood Estimation (MLE): A broadly applicable estimation method where the parameter estimates are chosen to maximize the log-likelihood function.
Maximum Likelihood Estimator: An estimator that maximizes the log of the likelihood function.
Mean: See expected value.
Mean Absolute Error (MAE): A performance measure in forecasting, computed as the average of the absolute values of the forecast errors.
Mean Independent: The key requirement in multiple regression analysis which says the unobserved error has a mean that does not change across subsets of the population defined by different values of the explanatory variables.
Mean Squared Error (MSE): The expected squared distance that an estimator is from the population value; it equals the variance plus the square of any bias. (A formula is given after this block of entries.)
Measurement Error: The difference between an observed variable and the variable that belongs in a multiple regression equation.
Median: In a probability distribution, it is the value where there is a 50% chance of being below the value and a 50% chance of being above it. In a sample of numbers, it is the middle value after the numbers have been ordered.
Method of Moments Estimator: An estimator obtained by using the sample analog of population moments; ordinary least squares and two stage least squares are both method of moments estimators.
Micronumerosity: A term introduced by Arthur Goldberger to describe properties of econometric estimators with small sample sizes.
Minimum Variance Unbiased Estimator: An estimator with the smallest variance in the class of all unbiased estimators.
Missing at Random: In multiple regression analysis, a missing data mechanism where the reason data are missing may be correlated with the explanatory variables but is independent of the error term.
Missing Completely at Random (MCAR): In multiple regression analysis, a missing data mechanism where the reason data are missing is statistically independent of the values of the explanatory variables as well as the unobserved error.
Missing Data: A data problem that occurs when we do not observe values on some variables for certain observations (individuals, cities, time periods, and so on) in the sample.
Misspecification Analysis: The process of determining likely biases that can arise from omitted variables, measurement error, simultaneity, and other kinds of model misspecification.
Moving Average Process of Order One [MA(1)]: A time series process generated as a linear function of the current value and one lagged value of a zero-mean, constant variance, uncorrelated stochastic process.
Multicollinearity: A term that refers to correlation among the independent variables in a multiple regression model; it is usually invoked when some correlations are large, but an actual magnitude is not well defined.
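For the MSE entry, the variance-bias decomposition reads:

\text{MSE}(\hat{\theta}) = E\big[(\hat{\theta}-\theta)^2\big] = \text{Var}(\hat{\theta}) + \big[\text{Bias}(\hat{\theta})\big]^2

so an estimator can trade a little bias for a large reduction in variance and still have smaller MSE.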
Multiple Hypotheses Test: A test of a null hypothesis involving more than one restriction on the parameters.
Multiple Linear Regression (MLR) Model: A model linear in its parameters, where the dependent variable is a function of independent variables plus an error term. (The population equation is given after this block of entries.)
Multiple Regression Analysis: A type of analysis that is used to describe estimation of, and inference in, the multiple linear regression model.
Multiple Restrictions: More than one restriction on the parameters in an econometric model.
Multiple-Step-Ahead Forecast: A time series forecast of more than one period into the future.
Multiplicative Measurement Error: Measurement error where the observed variable is the product of the true unobserved variable and a positive measurement error.
Multivariate Normal Distribution: A distribution for multiple random variables where each linear combination of the random variables has a univariate (one-dimensional) normal distribution.

N

n-R-Squared Statistic: See Lagrange multiplier statistic.
Natural Experiment: A situation where the economic environment (sometimes summarized by an explanatory variable) exogenously changes, perhaps inadvertently, due to a policy or institutional change.
Natural Logarithm: See logarithmic function.
Nominal Variable: A variable measured in nominal or current dollars.
Nonexperimental Data: Data that have not been obtained through a controlled experiment.
Nonlinear Function: A function whose slope is not constant.
Nonnested Models: Two (or more) models where no model can be written as a special case of the other by imposing restrictions on the parameters.
Nonrandom Sample: A sample obtained other than by sampling randomly from the population of interest.
Nonstationary Process: A time series process whose joint distributions are not constant across different epochs.
Normal Distribution: A probability distribution commonly used in statistics and econometrics for modeling a population. Its probability distribution function has a bell shape.
Normality Assumption: The classical linear model assumption which states that the error (or dependent variable) has a normal distribution, conditional on the explanatory variables.
Null Hypothesis: In classical hypothesis testing, we take this hypothesis as true and require the data to provide substantial evidence against it.
Numerator Degrees of Freedom: In an F test, the number of restrictions being tested.

O

Observational Data: See nonexperimental data.
OLS: See ordinary least squares.
OLS Intercept Estimate: The intercept in an OLS regression line.
OLS Regression Line: The equation relating the predicted value of the dependent variable to the independent variables, where the parameter estimates have been obtained by OLS.
OLS Slope Estimate: A slope in an OLS regression line.
Omitted Variable Bias: The bias that arises in the OLS estimators when a relevant variable is omitted from the regression.
Omitted Variables: One or more variables, which we would like to control for, have been omitted in estimating a regression model.
One-Sided Alternative: An alternative hypothesis that states that the parameter is greater than (or less than) the value hypothesized under the null.
One-Step-Ahead Forecast: A time series forecast one period into the future.
One-Tailed Test: A hypothesis test against a one-sided alternative.
Online Databases: Databases that can be accessed via a computer network.
Online Search Services: Computer software that allows the Internet, or databases on the Internet, to be searched by topic, name, title, or keywords.
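For the multiple linear regression entry, the population model in the text's notation is:

y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k + u

where u is the error term and \beta_1, \ldots, \beta_k are the slope parameters.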
Order Condition: a necessary condition for identifying the parameters in a model with one or more endogenous explanatory variables: the total number of exogenous variables must be at least as great as the total number of explanatory variables.
Ordinal Variable: a variable where the ordering of the values conveys information but the magnitude of the values does not.
Ordinary Least Squares (OLS): a method for estimating the parameters of a multiple linear regression model; the ordinary least squares estimates are obtained by minimizing the sum of squared residuals.
Outliers: observations in a data set that are substantially different from the bulk of the data, perhaps because of errors, or because some data are generated by a different model than most of the other data.
Out-of-Sample Criteria: criteria used for choosing forecasting models that are based on a part of the sample that was not used in obtaining parameter estimates.
Over Controlling: in a multiple regression model, including explanatory variables that should not be held fixed when studying the ceteris paribus effect of one or more other explanatory variables; this can occur when variables that are themselves outcomes of an intervention or a policy are included among the regressors.
Overall Significance of a Regression: a test of the joint significance of all explanatory variables appearing in a multiple regression equation.
Overdispersion: in modeling a count variable, the variance is larger than the mean.
Overidentified Equation: in models with endogenous explanatory variables, an equation where the number of instrumental variables is strictly greater than the number of endogenous explanatory variables.
Overidentifying Restrictions: the extra moment conditions that come from having more instrumental variables than endogenous explanatory variables in a linear model.
Overspecifying a Model: see inclusion of an irrelevant variable.

P
p-Value: the smallest significance level at which the null hypothesis can be rejected; equivalently, the largest significance level at which the null hypothesis cannot be rejected.
Pairwise Uncorrelated Random Variables: a set of two or more random variables where each pair is uncorrelated.
Panel Data: a data set constructed from repeated cross sections over time. With a balanced panel, the same units appear in each time period; with an unbalanced panel, some units do not appear in each time period, often due to attrition.
Parameter: an unknown value that describes a population relationship.
Parsimonious Model: a model with as few parameters as possible for capturing any desired features.
Partial Derivative: for a smooth function of more than one variable, the slope of the function in one direction.
Partial Effect: the effect of an explanatory variable on the dependent variable, holding other factors in the regression model fixed.
Partial Effect at the Average (PEA): in models with nonconstant partial effects, the partial effect evaluated at the average values of the explanatory variables.
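As a sketch of the OLS definition above, the estimates that minimize the sum of squared residuals solve the normal equations; the data below are simulated for illustration:

```python
import numpy as np

def ols(X, y):
    """OLS coefficients: minimize ||y - X b||^2 by solving (X'X) b = X'y."""
    return np.linalg.solve(X.T @ X, X.T @ y)

rng = np.random.default_rng(2)
n = 200
x = rng.standard_normal(n)
y = 1.0 + 0.5 * x + rng.standard_normal(n)
X = np.column_stack([np.ones(n), x])   # a column of ones gives the intercept
print(ols(X, y))                       # approximately [1.0, 0.5]
```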
Percent Correctly Predicted: in a binary response model, the percentage of times the prediction of zero or one coincides with the actual outcome.
Percentage Change: the proportionate change in a variable, multiplied by 100.
Percentage Point Change: the change in a variable that is measured as a percentage.
Perfect Collinearity: in multiple regression, one independent variable is an exact linear function of one or more other independent variables.
Plug-In Solution to the Omitted Variables Problem: a proxy variable is substituted for an unobserved omitted variable in an OLS regression.
Point Forecast: the forecasted value of a future outcome.
Poisson Distribution: a probability distribution for count variables.
Poisson Regression Model: a model for a count dependent variable where the dependent variable, conditional on the explanatory variables, is nominally assumed to have a Poisson distribution.
Policy Analysis: an empirical analysis that uses econometric methods to evaluate the effects of a certain policy.
Pooled Cross Section: a data configuration where independent cross sections, usually collected at different points in time, are combined to produce a single data set.
Pooled OLS Estimation: OLS estimation with independently pooled cross sections, panel data, or cluster samples, where the observations are pooled across time (or group) as well as across the cross-sectional units.
Population: a well-defined group (of people, firms, cities, and so on) that is the focus of a statistical or econometric analysis.
Population Model: a model, especially a multiple linear regression model, that describes a population.
Population R-Squared: in the population, the fraction of the variation in the dependent variable that is explained by the explanatory variables.
Population Regression Function: see conditional expectation.
Positive Definite: a symmetric matrix such that all quadratic forms, except the trivial one that must be zero, are strictly positive.
Positive Semi-Definite: a symmetric matrix such that all quadratic forms are nonnegative.
Power of a Test: the probability of rejecting the null hypothesis when it is false; the power depends on the values of the population parameters under the alternative.
Practical Significance: the practical or economic importance of an estimate, which is measured by its sign and magnitude, as opposed to its statistical significance.
Prais-Winsten (PW) Estimation: a method of estimating a multiple linear regression model with AR(1) errors and strictly exogenous explanatory variables; unlike Cochrane-Orcutt, Prais-Winsten uses the equation for the first time period in estimation.
Predetermined Variable: in a simultaneous equations model, either a lagged endogenous variable or a lagged exogenous variable.
Predicted Variable: see dependent variable.
Prediction: the estimate of an outcome obtained by plugging specific values of the explanatory variables into an estimated model, usually a multiple regression model.
Prediction Error: the difference between the actual outcome and a prediction of that outcome.
Prediction Interval: a confidence interval for an unknown outcome on a dependent variable in a multiple regression model.
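To illustrate the p-value definition above, a two-sided p-value for a t statistic can be computed from the t distribution's survival function; the statistic and degrees of freedom below are illustrative numbers only:

```python
from scipy import stats

# Two-sided p-value for H0: beta_j = 0, given a t statistic and its df.
t_stat, df = 2.13, 86                      # illustrative values, not from the text
p_value = 2 * stats.t.sf(abs(t_stat), df)  # area in both tails beyond |t|
print(round(p_value, 3))                   # about 0.036: reject at 5%, not at 1%
```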
Predictor Variable: see explanatory variable.
Probability Density Function (pdf): a function that, for discrete random variables, gives the probability that the random variable takes on each value; for continuous random variables, the area under the pdf gives the probability of various events.
Probability Limit: the value to which an estimator converges as the sample size grows without bound.
Probit Model: a model for binary responses where the response probability is the standard normal cdf evaluated at a linear function of the explanatory variables.
Program Evaluation: an analysis of a particular private or public program using econometric methods to obtain the causal effect of the program.
Proportionate Change: the change in a variable relative to its initial value; mathematically, the change divided by the initial value.
Proxy Variable: an observed variable that is related, but not identical, to an unobserved explanatory variable in multiple regression analysis.
Pseudo R-Squared: any number of goodness-of-fit measures for limited dependent variable models.

Q
Quadratic Form: a mathematical function where the vector argument both pre- and post-multiplies a square, symmetric matrix.
Quadratic Functions: functions that contain squares of one or more explanatory variables; they capture diminishing or increasing effects on the dependent variable.
Qualitative Variable: a variable describing a nonquantitative feature of an individual, a firm, a city, and so on.
Quasi-Demeaned Data: in random effects estimation for panel data, the original data in each time period minus a fraction of the time average; these calculations are done for each cross-sectional observation.
Quasi-Differenced Data: in estimating a regression model with AR(1) serial correlation, the difference between the current time period and a multiple of the previous time period, where the multiple is the parameter in the AR(1) model.
Quasi-Experiment: see natural experiment.
Quasi-Likelihood Ratio Statistic: a modification of the likelihood ratio statistic that accounts for possible distributional misspecification, as in a Poisson regression model.
Quasi-Maximum Likelihood Estimation (QMLE): maximum likelihood estimation where the log-likelihood function may not correspond to the actual conditional distribution of the dependent variable.

R
R-Bar Squared: see adjusted R-squared.
R-Squared: in a multiple regression model, the proportion of the total sample variation in the dependent variable that is explained by the independent variables.
R-Squared Form of the F Statistic: the F statistic for testing exclusion restrictions, expressed in terms of the R-squareds from the restricted and unrestricted models.
Random Coefficient (Slope) Model: a multiple regression model where the slope parameters are allowed to depend on unobserved unit-specific variables.
Random Effects Estimator: a feasible GLS estimator in the unobserved effects model where the unobserved effect is assumed to be uncorrelated with the explanatory variables in each time period.
Random Effects Model: the unobserved effects panel data model where the unobserved effect is assumed to be uncorrelated with the explanatory variables in each time period.
Random Sample: a sample obtained by sampling randomly from the specified population.
Random Sampling: a sampling scheme whereby each observation is drawn at random from the population. In particular, no unit is more likely to be selected than any other unit, and each draw is independent of all other draws.
Random Variable: a variable whose outcome is uncertain.
Random Vector: a vector consisting of random variables.
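A short sketch of the R-squared definition above, computed as 1 - SSR/SST on simulated data (illustrative values only):

```python
import numpy as np

def r_squared(y, y_hat):
    """R^2 = 1 - SSR/SST: share of the sample variation in y explained by the fit."""
    ssr = np.sum((y - y_hat) ** 2)        # sum of squared residuals
    sst = np.sum((y - y.mean()) ** 2)     # total sum of squares
    return 1 - ssr / sst

rng = np.random.default_rng(3)
x = rng.standard_normal(300)
y = 1 + 2 * x + rng.standard_normal(300)
X = np.column_stack([np.ones_like(x), x])
b, *_ = np.linalg.lstsq(X, y, rcond=None)
print(r_squared(y, X @ b))                # roughly 0.8: Var(2x) = 4 of total 5
```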
Random Walk: a time series process where next period's value is obtained as this period's value plus an independent (or at least an uncorrelated) error term.
Random Walk with Drift: a random walk that has a constant (or drift) added in each period.
Rank Condition: a sufficient condition for identification of a model with one or more endogenous explanatory variables.
Rank of a Matrix: the number of linearly independent columns in a matrix.
Rational Distributed Lag (RDL) Model: a type of infinite distributed lag model where the lag distribution depends on relatively few parameters.
Real Variable: a monetary value measured in terms of a base period.
Reduced Form Equation: a linear equation where an endogenous variable is a function of exogenous variables and unobserved errors.
Reduced Form Error: the error term appearing in a reduced form equation.
Reduced Form Parameters: the parameters appearing in a reduced form equation.
Regressand: see dependent variable.
Regression Specification Error Test (RESET): a general test for functional form in a multiple regression model; it is an F test of the joint significance of the squares, cubes, and perhaps higher powers of the fitted values from the initial OLS estimation.
Regression through the Origin: regression analysis where the intercept is set to zero; the slopes are obtained by minimizing the sum of squared residuals, as usual.
Regressor: see explanatory variable.
Rejection Region: the set of values of a test statistic that leads to rejecting the null hypothesis.
Rejection Rule: in hypothesis testing, the rule that determines when the null hypothesis is rejected in favor of the alternative hypothesis.
Relative Change: see proportionate change.
Resampling Method: a technique for approximating standard errors (and distributions of test statistics) whereby a series of samples are obtained from the original data set and estimates are computed for each subsample.
Residual: the difference between the actual value and the fitted (or predicted) value; there is a residual for each observation in the sample used to obtain an OLS regression line.
Residual Analysis: a type of analysis that studies the sign and size of residuals for particular observations after a multiple regression model has been estimated.
Residual Sum of Squares: see sum of squared residuals.
Response Probability: in a binary response model, the probability that the dependent variable takes on the value one, conditional on explanatory variables.
Response Variable: see dependent variable.
Restricted Model: in hypothesis testing, the model obtained after imposing all of the restrictions required under the null.
Retrospective Data: data collected based on past, rather than current, information.
Root Mean Squared Error (RMSE): another name for the standard error of the regression in multiple regression analysis.
Row Vector: a vector of numbers arranged as a row.

S
Sample Average: the sum of n numbers divided by n; a measure of central tendency.
Sample Correlation: for outcomes on two random variables, the sample covariance divided by the product of the sample standard deviations.
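A minimal simulation of the random walk with drift defined above (drift and sample size are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(4)
n, drift = 200, 0.5
e = rng.standard_normal(n)       # uncorrelated shocks
y = np.cumsum(drift + e)         # y_t = drift + y_{t-1} + e_t, starting from y_0 = 0

# A random walk is highly persistent: the effect of each shock never dies out,
# and Var(y_t) grows with t, so the process is nonstationary.
print(y[:5])
```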
Sample Correlation Coefficient: an estimate of the population correlation coefficient from a sample of data.
Sample Covariance: an unbiased estimator of the population covariance between two random variables.
Sample Regression Function (SRF): see OLS regression line.
Sample Selection Bias: bias in the OLS estimator that is induced by using data that arise from endogenous sample selection.
Sample Standard Deviation: a consistent estimator of the population standard deviation.
Sample Variance: an unbiased, consistent estimator of the population variance.
Sampling Distribution: the probability distribution of an estimator over all possible sample outcomes.
Sampling Standard Deviation: the standard deviation of an estimator; that is, the standard deviation of a sampling distribution.
Sampling Variance: the variance in the sampling distribution of an estimator; it measures the spread in the sampling distribution.
Scalar Multiplication: the algorithm for multiplying a scalar (number) by a vector or matrix.
Scalar Variance-Covariance Matrix: a variance-covariance matrix where all off-diagonal terms are zero and the diagonal terms are the same positive constant.
Score Statistic: see Lagrange multiplier statistic.
Seasonal Dummy Variables: a set of dummy variables used to denote the quarters or months of the year.
Seasonality: a feature of monthly or quarterly time series where the average value differs systematically by season of the year.
Seasonally Adjusted: monthly or quarterly time series data where some statistical procedure, possibly regression on seasonal dummy variables, has been used to remove the seasonal component.
Selected Sample: a sample of data obtained not by random sampling but by selecting on the basis of some observed or unobserved characteristic.
Self-Selection: deciding on an action based on the likely benefits, or costs, of taking that action.
Semi-Elasticity: the percentage change in the dependent variable given a one-unit increase in an independent variable.
Sensitivity Analysis: the process of checking whether the estimated effects and statistical significance of key explanatory variables are sensitive to inclusion of other explanatory variables, functional form, dropping of potentially outlying observations, or different methods of estimation.
Sequentially Exogenous: a feature of an explanatory variable in time series (or panel data) models where the error term in the current time period has a zero mean conditional on all current and past explanatory variables; a weaker version is stated in terms of zero correlations.
Serial Correlation: in a time series or panel data model, correlation between the errors in different time periods.
Serial Correlation-Robust Standard Error: a standard error for an estimator that is (asymptotically) valid whether or not the errors in the model are serially correlated.
Serially Uncorrelated: the errors in a time series or panel data model are pairwise uncorrelated across time.
Short-Run Elasticity: the impact propensity in a distributed lag model when the dependent and independent variables are in logarithmic form.
Significance Level: the probability of a Type I error in hypothesis testing.
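A simplified sketch of a test for AR(1) serial correlation (a stripped-down version of the idea behind the entries above, assuming strictly exogenous regressors; textbook versions typically include an intercept in the auxiliary regression):

```python
import numpy as np

def ar1_test(resid):
    """t statistic for rho = 0 in u_t = rho * u_{t-1} + e_t: a regression of the
    OLS residuals on their own lag, through the origin. A large |t| suggests
    AR(1) serial correlation in the errors."""
    u, u_lag = resid[1:], resid[:-1]
    rho = (u_lag @ u) / (u_lag @ u_lag)   # OLS slope with no intercept
    e = u - rho * u_lag
    sigma2 = (e @ e) / (len(u) - 1)       # error variance estimate (one df used)
    se = np.sqrt(sigma2 / (u_lag @ u_lag))
    return rho, rho / se
```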
Simple Linear Regression Model: a model where the dependent variable is a linear function of a single independent variable, plus an error term.
Simultaneity: a term that means at least one explanatory variable in a multiple linear regression model is determined jointly with the dependent variable.
Simultaneity Bias: the bias that arises from using OLS to estimate an equation in a simultaneous equations model.
Simultaneous Equations Model (SEM): a model that jointly determines two or more endogenous variables, where each endogenous variable can be a function of other endogenous variables as well as of exogenous variables and an error term.
Skewness: a measure of how far a distribution is from being symmetric, based on the third moment of the standardized random variable.
Slope: in the equation of a line, the change in the y variable when the x variable increases by one.
Slope Parameter: the coefficient on an independent variable in a multiple regression model.
Smearing Estimate: a retransformation method particularly useful for predicting the level of a response variable when a linear model has been estimated for the natural log of the response variable.
Spreadsheet: computer software used for entering and manipulating data.
Spurious Correlation: a correlation between two variables that is not due to causality, but perhaps to the dependence of the two variables on another unobserved factor.
Spurious Regression Problem: a problem that arises when regression analysis indicates a relationship between two or more unrelated time series processes simply because each has a trend, is an integrated time series (such as a random walk), or both.
Square Matrix: a matrix with the same number of rows as columns.
Stable AR(1) Process: an AR(1) process where the parameter on the lag is less than one in absolute value. The correlation between two random variables in the sequence declines to zero at a geometric rate as the distance between the random variables increases, so a stable AR(1) process is weakly dependent.
Standard Deviation: a common measure of spread in the distribution of a random variable.
Standard Deviation of β̂j: a common measure of spread in the sampling distribution of β̂j.
Standard Error: generically, an estimate of the standard deviation of an estimator.
Standard Error of β̂j: an estimate of the standard deviation in the sampling distribution of β̂j.
Standard Error of the Estimate: see standard error of the regression.
Standard Error of the Regression (SER): in multiple regression analysis, the estimate of the standard deviation of the population error, obtained as the square root of the sum of squared residuals over the degrees of freedom.
Standard Normal Distribution: the normal distribution with mean zero and variance one.
Standardized Coefficients: regression coefficients that measure the standard deviation change in the dependent variable given a one standard deviation increase in an independent variable.
Standardized Random Variable: a random variable transformed by subtracting off its expected value and dividing the result by its standard deviation; the new random variable has mean zero and standard deviation one.
Static Model: a time series model where only contemporaneous explanatory variables affect the dependent variable.
Stationary Process: a time series process where the marginal and all joint distributions are invariant across time.
Statistical Inference: the act of testing hypotheses about population parameters.
Statistical Significance: the importance of an estimate as measured by the size of a test statistic, usually a t statistic.
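The spurious regression problem defined above is easy to reproduce by regressing one random walk on another, independent one (a minimal sketch with illustrative settings):

```python
import numpy as np

rng = np.random.default_rng(5)
n = 500
x = np.cumsum(rng.standard_normal(n))   # two independent random walks:
y = np.cumsum(rng.standard_normal(n))   # no true relationship between them

X = np.column_stack([np.ones(n), x])
b, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ b
r2 = 1 - (resid @ resid) / np.sum((y - y.mean()) ** 2)
# The slope is often sizable and the R-squared nontrivial, purely because
# both series are integrated; conventional t statistics are invalid here.
print(b[1], r2)
```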
Statistically Different from Zero: see statistically significant.
Statistically Insignificant: failure to reject the null hypothesis that a population parameter is equal to zero, at the chosen significance level.
Statistically Significant: rejecting the null hypothesis that a parameter is equal to zero, against the specified alternative, at the chosen significance level.
Stochastic Process: a sequence of random variables indexed by time.
Stratified Sampling: a nonrandom sampling scheme whereby the population is first divided into several nonoverlapping, exhaustive strata, and then random samples are taken from within each stratum.
Strict Exogeneity: an assumption that holds in a time series or panel data model when the explanatory variables are strictly exogenous.
Strictly Exogenous: a feature of explanatory variables in a time series or panel data model where the error term at any time period has zero expectation, conditional on the explanatory variables in all time periods; a less restrictive version is stated in terms of zero correlations.
Strongly Dependent: see highly persistent.
Structural Equation: an equation derived from economic theory or from less formal economic reasoning.
Structural Error: the error term in a structural equation, which could be one equation in a simultaneous equations model.
Structural Parameters: the parameters appearing in a structural equation.
Studentized Residuals: the residuals computed by excluding each observation, in turn, from the estimation, divided by the estimated standard deviation of the error.
Sum of Squared Residuals (SSR): in multiple regression analysis, the sum of the squared OLS residuals across all observations.
Summation Operator: a notation, denoted by Σ, used to define the summing of a set of numbers.
Symmetric Distribution: a probability distribution characterized by a probability density function that is symmetric around its median value, which must also be the mean value (whenever the mean exists).
Symmetric Matrix: a (square) matrix that equals its transpose.

T
t Distribution: the distribution of the ratio of a standard normal random variable and the square root of an independent chi-square random variable, where the chi-square random variable is first divided by its df.
t Ratio: see t statistic.
t Statistic: the statistic used to test a single hypothesis about the parameters in an econometric model.
Test Statistic: a rule used for testing hypotheses, where each sample outcome produces a numerical value.
Text Editor: computer software that can be used to edit text files.
Text (ASCII) File: a universal file format that can be transported across numerous computer platforms.
Time-Demeaned Data: panel data where, for each cross-sectional unit, the average over time is subtracted from the data in each time period.
Time Series Data: data collected over time on one or more variables.
Time Series Process: see stochastic process.
Time Trend: a function of time that is the expected value of a trending time series process.
Tobit Model: a model for a dependent variable that takes on the value zero with positive probability but is roughly continuously distributed over strictly positive values; see also corner solution response.
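A small sketch of the time-demeaning (fixed effects, or within) transformation defined above, on made-up panel data:

```python
import numpy as np

def time_demean(x, unit_ids):
    """Within transformation: subtract each cross-sectional unit's time average.
    x is a 1-D array of observations; unit_ids labels the unit for each row."""
    out = x.astype(float)
    for u in np.unique(unit_ids):
        mask = unit_ids == u
        out[mask] -= x[mask].mean()
    return out

# Two units observed in three periods each (illustrative numbers):
x = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])
ids = np.array([1, 1, 1, 2, 2, 2])
print(time_demean(x, ids))   # [-1, 0, 1, -1, 0, 1]: the unit-level effect is removed
```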
Top Coding: a form of data censoring where the value of a variable is not reported when it is above a given threshold; we only know that it is at least as large as the threshold.
Total Sum of Squares (SST): the total sample variation in a dependent variable about its sample average.
Trace of a Matrix: for a square matrix, the sum of its diagonal elements.
Transpose: for any matrix, the new matrix obtained by interchanging its rows and columns.
Treatment Group: in program evaluation, the group that participates in the program.
Trending Process: a time series process whose expected value is an increasing or a decreasing function of time.
Trend-Stationary Process: a process that is stationary once a time trend has been removed; it is usually implicit that the detrended series is weakly dependent.
True Model: the actual population model relating the dependent variable to the relevant independent variables, plus a disturbance, where the zero conditional mean assumption holds.
Truncated Normal Regression Model: the special case of the truncated regression model where the underlying population model satisfies the classical linear model assumptions.
Truncated Regression Model: a linear regression model for cross-sectional data in which the sampling scheme entirely excludes, on the basis of outcomes on the dependent variable, part of the population.
Two-Sided Alternative: an alternative where the population parameter can be either less than or greater than the value stated under the null hypothesis.
Two Stage Least Squares (2SLS) Estimator: an instrumental variables estimator where the IV for an endogenous explanatory variable is obtained as the fitted value from regressing the endogenous explanatory variable on all exogenous variables.
Two-Tailed Test: a test against a two-sided alternative.
Type I Error: a rejection of the null hypothesis when it is true.
Type II Error: the failure to reject the null hypothesis when it is false.

U
Unbalanced Panel: a panel data set where certain years (or periods) of data are missing for some cross-sectional units.
Unbiased Estimator: an estimator whose expected value (or mean of its sampling distribution) equals the population value, regardless of the population value.
Uncentered R-Squared: the R-squared computed without subtracting the sample average of the dependent variable when obtaining the total sum of squares (SST).
Unconditional Forecast: a forecast that does not rely on knowing, or assuming values for, future explanatory variables.
Uncorrelated Random Variables: random variables that are not linearly related.
Underspecifying a Model: see excluding a relevant variable.
Unidentified Equation: an equation with one or more endogenous explanatory variables where sufficient instrumental variables do not exist to identify the parameters.
Unit Root Process: a highly persistent time series process where the current value equals last period's value plus a weakly dependent disturbance.
Unobserved Effect: in a panel data model, an unobserved variable in the error term that does not change over time; for cluster samples, an unobserved variable that is common to all units in the cluster.
Unobserved Effects Model: a model for panel data or cluster samples where the error term contains an unobserved effect.
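The 2SLS definition above can be sketched in two explicit stages on simulated data (all parameter values are illustrative; note that the naive second-stage standard errors are not valid, so only coefficients are shown):

```python
import numpy as np

def tsls(y, x_endog, z, exog):
    """Two stage least squares with one endogenous regressor.
    Stage 1: regress x on all exogenous variables (instrument z plus exog).
    Stage 2: regress y on the stage-1 fitted values and exog."""
    Z = np.column_stack([exog, z])
    x_hat = Z @ np.linalg.lstsq(Z, x_endog, rcond=None)[0]
    X2 = np.column_stack([exog, x_hat])
    return np.linalg.lstsq(X2, y, rcond=None)[0]

rng = np.random.default_rng(6)
n = 2000
z = rng.standard_normal(n)                        # instrument: relevant, exogenous
u = rng.standard_normal(n)
x = 0.8 * z + 0.5 * u + rng.standard_normal(n)    # endogenous: correlated with u
y = 1.0 + 2.0 * x + u
ones = np.ones((n, 1))
print(tsls(y, x, z, ones))   # approximately [1.0, 2.0]; plain OLS on x is biased
```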
Unobserved Heterogeneity: see unobserved effect.
Unrestricted Model: in hypothesis testing, the model that has no restrictions placed on its parameters.
Upward Bias: the expected value of an estimator is greater than the population parameter value.

V
Variance: a measure of spread in the distribution of a random variable.
Variance-Covariance Matrix: for a random vector, the positive semi-definite matrix defined by putting the variances down the diagonal and the covariances in the appropriate off-diagonal entries.
Variance-Covariance Matrix of the OLS Estimator: the matrix of sampling variances and covariances for the vector of OLS coefficients.
Variance Inflation Factor: in multiple regression analysis under the Gauss-Markov assumptions, the term in the sampling variance affected by correlation among the explanatory variables.
Variance of the Prediction Error: the variance in the error that arises when predicting a future value of the dependent variable based on an estimated multiple regression equation.
Vector Autoregressive (VAR) Model: a model for two or more time series where each variable is modeled as a linear function of past values of all variables, plus disturbances that have zero means given all past values of the observed variables.

W
Wald Statistic: a general test statistic for testing hypotheses in a variety of econometric settings; typically, the Wald statistic has an asymptotic chi-square distribution.
Weak Instruments: instrumental variables that are only slightly correlated with the relevant endogenous explanatory variable or variables.
Weakly Dependent: a term that describes a time series process where some measure of dependence between random variables at two points in time, such as correlation, diminishes as the interval between the two points in time increases.
Weighted Least Squares (WLS) Estimator: an estimator used to adjust for a known form of heteroskedasticity, where each squared residual is weighted by the inverse of the (estimated) variance of the error.
White Test: a test for heteroskedasticity that involves regressing the squared OLS residuals on the OLS fitted values and on the squares of the fitted values; in its most general form, the squared OLS residuals are regressed on the explanatory variables, the squares of the explanatory variables, and all the nonredundant interactions of the explanatory variables.
Within Estimator: see fixed effects estimator.
Within Transformation: see fixed effects transformation.

Y
Year Dummy Variables: for data sets with a time series component, dummy (binary) variables equal to one in the relevant year and zero in all other years.

Z
Zero Conditional Mean Assumption: a key assumption used in multiple regression analysis that states that, given any values of the explanatory variables, the expected value of the error equals zero (see Assumptions MLR.4, TS.3, and TS.3′ in the text).
Zero Matrix: a matrix where all entries are zero.
Zero-One Variable: see dummy variable.
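Closing the glossary, a minimal sketch of the variance inflation factor idea described above (computed as 1/(1 - R_j^2) from regressing one regressor on the others; the function name is our own):

```python
import numpy as np

def vif(X, j):
    """Variance inflation factor for column j of the regressor matrix X
    (X should not contain a constant column): VIF_j = 1 / (1 - R_j^2),
    where R_j^2 comes from regressing x_j on the other regressors plus
    an intercept. Large VIFs signal strong multicollinearity."""
    y = X[:, j]
    others = np.delete(X, j, axis=1)
    Z = np.column_stack([np.ones(len(y)), others])
    resid = y - Z @ np.linalg.lstsq(Z, y, rcond=None)[0]
    r2 = 1 - (resid @ resid) / np.sum((y - y.mean()) ** 2)
    return 1 / (1 - r2)
```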
Index

Numbers
2SLS: see two stage least squares
401(k) plans: asymptotic normality 155–156; comparison of simple and multiple regression estimates 70; statistical vs. practical significance 121; WLS estimation 259

A
ability and wage: causality 12; excluding ability from model 78–83; IV for ability 481; mean independence 23; proxy variable for ability 279–285
adaptive expectations 353, 355
adjusted R-squareds 181–184, 374
AFDC participation 231
age: financial wealth and 257–259, 263; smoking and 261–262
aggregate consumption function 511–514
air pollution and housing prices: beta coefficients 175–176; logarithmic forms 171–173; quadratic functions 175–177; t test 118
alcohol drinking 230
alternative hypotheses: defined 694; one-sided 110–114, 695; two-sided 114–115, 695
antidumping filings and chemical imports: AR(3) serial correlation 381; dummy variables 327–328; forecasting 596, 597, 598; PW estimation 384; seasonality 336–338
apples, eco-labeled 180–181
AR(1) models: consistency example 350–351; testing for, after 2SLS estimation 486
AR(1) serial correlation: correcting for 381–387; testing for 376–381
AR(2) models: EMH example 352; forecasting example 352, 397
ARCH model 393–394
AR(q) serial correlation: correcting for 386–387; testing for 379–380
arrests: asymptotic normality 155–156; average sentence length and 249; goodness-of-fit 72; heteroskedasticity-robust LM statistic 249; linear probability model 227–228; normality assumption and 107; Poisson regression 545–546
ASCII files 609
assumptions: classical linear model (CLM) 106; establishing unbiasedness of OLS 73–77, 317–320; homoskedasticity 45–48, 82–83, 89, 363; matrix notation 723–726; for multiple linear regressions 73–77, 82, 89, 152; normality 105–108, 322; for simple linear regressions 40–45, 45–48; for time series regressions 317–323, 348–354, 363; zero mean and zero correlation 152
asymptotically uncorrelated sequences 346–348
asymptotic bias, deriving 153–154
asymptotic confidence interval 157
asymptotic efficiency of OLS 161–162
asymptotic normality of estimators in general 683–684
asymptotic normality of OLS: for multiple linear regressions 156–158; for time series regressions 351–354
asymptotic properties: see large sample properties
asymptotics, OLS: see OLS asymptotics
asymptotic sample properties of estimators 681–684
asymptotic standard errors 157
asymptotic t statistics 157
asymptotic variance 156
time series data, applying 2SLS to 485–486
attenuation bias 291, 292
attrition 441
augmented Dickey-Fuller test 576
autocorrelation 320–322; see also serial correlation
autoregressive conditional heteroskedasticity (ARCH) model 393–394
autoregressive model of order two (AR(2)): see AR(2) models
autoregressive process of order one (AR(1)) 347
auxiliary regression 159
average, using summation operator 629
average marginal effect (AME) 286, 532
average partial effect (APE) 286, 532, 540
average treatment effect 410

B
balanced panel 420
baseball players' salaries: nonnested models 183; testing exclusion restrictions 127–132
base group 208
base period and value 326
base value 326
beer: price and demand 185–186; taxes and traffic fatalities 184
benchmark group 208
Bernoulli random variables 646–647
best linear unbiased estimator (BLUE) 89
beta coefficients 169–170
bias: attenuation 291, 292; heterogeneity 413; omitted variable 78–83; simultaneity in OLS 503–504
biased estimators 677–678
biased toward zero 80
binary random variable 646
binary response models: see logit and probit models
binary variables: defined 206; random 646–647; see also qualitative information
binomial distribution 651
birth weight: AFDC participation and 231; asymptotic standard error 158; data scaling 166–168; F statistic 133–134; IV estimation 470–471
bivariate linear regression model: see simple regression model
BLUE (best linear unbiased estimator) 89
bootstrap standard error 204
Breusch-Godfrey test 381
Breusch-Pagan test for heteroskedasticity 251

C
calculus, differential 640–642
campus crimes, t test 116–117
causality 10–14
cdf (cumulative distribution functions) 648–649
censored regression models 547–552
Center for Research in Security Prices (CRSP) 608
central limit theorem 684
CEO salaries: in multiple regressions: motivation for multiple regression 63–64, nonnested models 183–184, predicting 192, 193–194, writing in population form 74; returns on equity and: fitted values and residuals 32, goodness-of-fit 35, OLS estimates 29–30; sales and, constant elasticity model 39
ceteris paribus 10–14, 66, 67–68
chemical firms, nonnested models 183
chemical imports: see antidumping filings and chemical imports
chi-square distribution: critical values table 749; discussions 669, 717
Chow tests: differences across groups 223–224; heteroskedasticity and 247–248; for panel data 423–424; for structural change across time 406
cigarettes: see smoking
city crimes: law enforcement and 13; panel data 9–10; see also crimes
classical errors-in-variables (CEV) 290
classical linear model (CLM) assumptions 106
clear-up rate, distributed lag estimation 416–417
clusters 449–450: effect 449; sample 449
Cochrane-Orcutt (CO) estimation 383, 391, 395
coefficient of determination: see R-squareds
cointegration 580–584
college admission, omitting unobservables 285
college GPA: beta coefficients 169–170; fitted values and intercept 68; gender and 221–224; goodness-of-fit 71; heteroskedasticity-robust F statistic 247–248; interaction effect 178–179; interpreting equations 66; with measurement error 292; partial effect 67; population regression function 23; predicted 187–188, 189; with single dummy variable 209–210; t test 115
colleges, junior vs. four-year 124–127
collinearity, perfect 74–76
column vectors 709
commute time and freeway width 702–703
compact discs, demand for 732
complete cases estimator 293
composite error 413: term 441
Compustat 608
computer ownership: college GPA and 209–210; determinants of 267
computers, grants to buy: reducing error variance 185–186; R-squared size 180–181
computer usage and wages: with interacting terms 218; proxy variable in 282–283
conceptual framework 615
conditional distributions: features 652–658; overview 649, 651–653
conditional expectations 661–665
conditional forecasts 587
conditional median 300–302
conditional variances 665
confidence intervals: 95%, rule of thumb for 691; asymptotic 157; asymptotic, for nonnormal populations 692–693; hypothesis testing and 701–702; interval estimation and 687–693; main discussions 122–123, 687–688; for mean from normally distributed population 689–691; for predictions 186–189
consistency of estimators in general 681–683
consistency of OLS: in multiple regressions 150–154; sampling selection and 553–554; in time series regressions 348–351, 372
consistent tests 703
constant dollars 326
constant elasticity model 39, 75, 638
constant terms 21
consumer price index (CPI) 323
consumption: see under family income
contemporaneously exogenous variables 318
continuous random variables 648–649
control group 210
control variables 21; see also independent variables
corner solution response 525
corrected R-squareds 181–184
correlated random effects 445–447
correlation 22–23: coefficients 659–660
count variables 543–547
county crimes, multiyear panel data 422–423
covariances 658–659: stationary processes 345–346
covariates 21
CPI (consumer price index) 323
crimes: on campuses, t test 116–117; in cities, law enforcement and 13; in cities, panel data 9–10; clear-up rate 416–417; in counties, multiyear panel data 422–423; earlier data, use of 283–284; econometric model of 4–5; economic model of 3, 160, 275–277; functional form misspecification 275–277; housing prices and, beta coefficients 175–176; LM statistic 160; prison population and, SEM 515–516; unemployment and, two-period panel data 412–417; see also arrests
criminologists 607
critical values: discussions 110, 695; tables of 743–749
crop yields and fertilizers: causality 11, 12; simple equation 21–22
cross-sectional analysis 612
cross-sectional data: Gauss-Markov assumptions and 82, 354; main discussion 5–7; time series data vs. 312–313; see also panel data; pooled cross sections; regression analysis
CRSP (Center for Research in Security Prices) 608
cumulative areas under standard normal distribution 743–744
cumulative distribution functions (cdf) 648–649
cumulative effect 316
current dollars 326
cyclical unemployment 353

D
data: collection 608–611; economic, types of 5–12; experimental vs. nonexperimental 2; frequency 7
data issues: measurement error 287–292; missing data 293–294; multicollinearity 83–86, 293–294; nonrandom samples 294–295; outliers and influential observations 296–300; random slopes 285–287; unobserved explanatory variables 279–285; see also misspecification
data mining 613
data scaling, effects on OLS statistics 166–170
Davidson-MacKinnon test 278
deficits: see interest rates
degrees of freedom (df): chi-square distributions with n 669; for fixed effects estimator 436; for OLS estimators 88
dependent variables: defined 21; measurement error in 289–292; see also regression analysis; specific event studies
derivatives 635
descriptive statistics 629
deseasonalizing data 337
detrending 334–335
diagonal matrices 710
Dickey-Fuller distribution 575
Dickey-Fuller (DF) test 575–578: augmented 576
difference-in-differences estimator 408, 410
difference in slopes 218–224
difference-stationary processes 358
differencing: panel data with more than two periods 420–425; two-period 412–417; serial correlation and 387–388
differential calculus 640–642
diminishing marginal effects 635
discrete random variables 646–647
disturbance terms 4, 21, 63
disturbance variances 45
downward bias 80
drug usage 230
drunk driving laws and fatalities 419
dummy variables: defined 206; regression 438–439; trap 208; see also qualitative information; year dummy variables
duration analysis 549–551
Durbin-Watson test 378–379, 381
dynamically complete models 360–363

E
Engle-Granger test 581–582
earnings of veterans, IV estimation 469
EconLit 606, 607
econometric analysis, in projects 611–614
econometric models 4–5; see also economic models
economic growth and government policies 7
economic models 2–5
economic significance: see practical significance
economic vs. statistical significance 120–124, 702–703
economists, types of 606–607
education: birth weight and 133–134; fertility and: 2SLS 487, with discrete dependent variables 231–232, independent cross sections 404–405; gender wage gap and 405–406; IV for 463, 473–474; logarithmic equation 639; return to: 2SLS 477, differencing 448, fixed effects estimation 438, independent cross sections 405–406, IQ and 281–282, IV estimation 467–469, testing for endogeneity 482, testing overidentifying restrictions 482; wages and, see under wages; return to education over time 405–406; smoking and 261–262; women and 225–227; see also under women in labor force
efficiency: asymptotic 161–162; of estimators in general 679–680; of OLS with serially correlated errors 373–374
efficient markets hypothesis (EMH): asymptotic analysis example 352–353; heteroskedasticity and 393
elasticity 39, 637–638
elections: see voting outcomes
EMH: see efficient markets hypothesis (EMH)
empirical analysis: data collection 608–611; econometric analysis 611–614; literature review 607–608; posing question 605–607; sample projects 621–625; steps in 2–5; writing paper 614–621
employment and unemployment: arrests and 227–228; crimes and 412–417; enterprise zones and 422; estimating average rate 675; forecasting 589, 591–592, 594; inflation and, see under inflation; in Puerto Rico, logarithmic form 323–324; time series data 7–8; women and, see women in labor force; see also wages
endogenous explanatory variables: defined 76, 274; in logit and probit models 536; sample selection and 557; testing for 481–482; see also instrumental variables; simultaneous equations models; two stage least squares
endogenous sample selection 294
Engle-Granger two-step procedure 586
enrollment, t test 116–117
enterprise zones: business investments and 696–697; unemployment and 422
error correction models 584–586
errors-in-variables problem 479–481, 512
error terms 4, 21, 63
error variances: adding regressors to reduce 185–186; defined 45, 83; estimating 48–50
estimated GLS: see feasible GLS
estimation and estimators: advantages of multiple over simple regression 60–64; asymptotic sample properties of 681–684; changing independent variables simultaneously 68; defined 675; difference-in-differences 408–410; finite sample properties of 675–680; LAD 300–302; language of 90–91; method of moments approach 25–26; misspecifying models 78–83; sampling distributions of OLS estimators 105–108; see also first differencing; fixed effects; instrumental variables; logit and probit models; OLS (ordinary least squares); random effects; Tobit model
event studies 325, 327–328
Excel 610
excluding relevant variables 78–83
exclusion restrictions 127: for 2SLS 475; general linear 136–137; Lagrange multiplier (LM) statistic 158–160; overall significance of regressions 135; for SEM 510–511; testing 127–132
exogenous explanatory variables 76
exogenous sample selection 294, 553
expectations augmented Phillips curve 353–354, 377, 378
expectations hypothesis 14
expected values 652–654, 716
experience, wage and: causality 12; interpreting equations 67; motivation for multiple regression 61; omitted variable bias 81; partial effect 642; quadratic functions 173–175, 636; women and 225–227
experimental data 2
experimental group 210
experiments, defined 645
explained sum of squares (SSE) 34, 70
explained variables: defined 21; see also dependent variables
explanatory variables 21; see also independent variables
exponential function 639
exponential smoothing 587
exponential trends 330–331

F
family income: birth weight and: asymptotic standard error 158, data scaling 166–168; college GPA and 292; consumption and: motivation for multiple regression 62, 63, perfect collinearity and 75; see also savings
farmers and pesticide usage 185
F distribution: critical values table 746–748; discussions 670, 671, 717
FDL (finite distributed lag) models 314–316, 350, 416–417
feasible GLS: with heteroskedasticity and AR(1) serial correlations 395; main discussion 258–263; OLS vs. 385–386
Federal Bureau of Investigation 608
fertility rate: education and 487; forecasting 597; over time 404–405; tax exemption and: with binary variables 324–325, cointegration 582–583, FDL model 314–316, first differences 363–364, serial correlation 362, trends 333
fertility studies, with discrete dependent variable 231–232
fertilizers: land quality and 23; soybean yields and: causality 11, 12, simple equation 21–22
final exam scores: interaction effect 178–179; skipping classes and 464–465
financial wealth: nonrandom sampling 294–295; WLS estimation 257–259, 263
finite distributed lag (FDL) models 314–316, 350
finite sample properties: of estimators 675–680; of OLS in matrix form 723–726
firm sales: see sales
first-differenced equations 414
first-differenced estimator 414
first differencing: defined 414; fixed effects vs. 439–440; I(1) time series and 358; panel data, pitfalls in 423–424
first order autocorrelation 359
first order conditions 27, 65, 642, 721
fitted values: in multiple regressions 68–69; in simple regressions 27, 32; see also OLS (ordinary least squares)
fixed effects: defined 413; dummy variable regression 438–439; estimation 435–441; first differencing vs. 439–440; random effects vs. 444–445; transformation 435; with unbalanced panels 440–441
forecast error 586
forecasting: multiple-step-ahead 592–594; one-step-ahead 588; overview and definitions 586–587; trending, seasonal, and integrated processes 594–598; types of models used for 587–588
forecast intervals 588
free throw shooting 651–652
freeway width and commute time 702–703
frequency, data 7
frequency distributions, 401(k) plans 155
F statistics: defined 129; heteroskedasticity-robust 247–248; see also F tests
F tests: F and t statistics 132–133; functional form misspecification and 275–279; general linear restrictions 136–137; LM tests and 160; overall significance of regressions 135; p-values for 134–135; reporting regression results 137–138; R-squared form 133–134; testing exclusion restrictions 127–132; see also Chow tests; F statistics
functional forms: in multiple regressions: with interaction terms 177–179, logarithmic 171–173, misspecification 275–279, quadratic 173–177; in simple regressions 36–40; in time series regressions 323–324

G
Gaussian distribution 665
Gauss-Markov assumptions: for multiple linear regressions 73–77, 82; for simple linear regressions 40–44, 45–48; for time series regressions 319–322
Gauss-Markov Theorem: for multiple linear regressions 89–90; for OLS in matrix form 725–726
GDL (geometric distributed lag) 571–572
GDP: see gross domestic product (GDP)
gender: oversampling 295; wage gap 405–406
gender gap: independent cross sections 405–406; panel data 405–406
generalized least squares (GLS) estimators: for AR(1) models 383–387; with heteroskedasticity and AR(1) serial correlations 395; when heteroskedasticity function must be estimated 258–263; when heteroskedasticity is known up to a multiplicative constant 255–256
geometric distributed lag (GDL) 571–572
GLS estimators: see generalized least squares (GLS) estimators
Goldberger, Arthur 85
goodness-of-fit: change in unit of measurement and 37; in multiple regressions 70–71; overemphasizing 184–185; percent correctly predicted 227, 530; in simple regressions 35–36; in time series regressions 374; see also predictions; R-squareds
Google Scholar 606
government policies, economic growth and 6, 8–9
GPA: see college GPA
Granger, Clive W. J. 150
Granger causality 590
gross domestic product (GDP): data frequency for 7; government policies and 6; high persistence 355–357; in real terms 326; seasonal adjustment of 336; unit root test 578
growth rate 331
gun control laws 230

H
HAC standard errors 389
Hartford School District 190
Hausman test 262, 444
Head Start participation 230
Heckit method 556
heterogeneity bias 413
heteroskedasticity: 2SLS with 484–485; consequences of, for OLS 243–244; defined 45; HAC standard errors 389; heteroskedasticity-robust procedures 244–249; linear probability model and 265–267; robust F statistic 247; robust LM statistic 248; robust t statistic 246; for simple linear regressions 45–48; testing for 249–254; for time series regressions 363; in time series regressions 391–395; of unknown form 244; in wage equation 46; see also weighted least squares estimation
highly persistent time series: deciding whether I(0) or I(1) 359–360; description of 354–363; transformations on 358–360
histogram, 401(k) plan participation 155
homoskedasticity: for IV estimation 466–467; for multiple linear regressions 82–83, 89; for OLS in matrix form 724; for time series regressions 319–322, 351–352
hourly wages: see wages
housing prices and expenditures: general linear restrictions 136–137; heteroskedasticity: BP test 251–252, White test 252–254; incinerators and: inconsistency in OLS 153, pooled cross sections 407–411; income and 631; inflation 572–574; investment and: computing R-squared 334–335, spurious relationship 332–333; over controlling 185; with qualitative information 211; RESET 278–279; savings and 502
hypotheses: about single linear combination of parameters 124–127; after 2SLS estimation 479; expectations 14; language of classical testing 120; in logit and probit models 529–530; multiple linear restrictions, see F tests; residual analysis 190; stating in empirical analysis 4; see also hypothesis testing
hypothesis testing: about mean in normal population 695–696; asymptotic tests for nonnormal populations 698; computing and using p-values 698–700; confidence intervals and 701–702; in matrix form, Wald statistics for 730–731; overview and fundamentals 693–695; practical vs. statistical significance 702–703

I
I(0) and I(1) processes 359–360
idempotent matrices 715
identification: defined 465; in systems with three or more equations 510–511; in systems with two equations 504–510
identified equation 505
identity matrices 710
idiosyncratic error 413
IDL (infinite distributed lag) models 569–574
IIP (index of industrial production) 326–327
impact propensity/multiplier 315
incidental truncation 553, 554–558
incinerators and housing prices: inconsistency in OLS 153; pooled cross sections 407–411
including irrelevant variables 77–78
income: family, see family income; housing expenditure and 631; PIH 513–514; savings and, see under savings; see also wages
inconsistency in OLS, deriving 153–154
inconsistent estimators 681
independence, joint distributions and 649–651
independently pooled cross sections: across time 403–407; defined 402; see also pooled cross sections
independent variables: changing simultaneously 68; defined 21; measurement error in 289–291; in misspecified models 78–83; random 650; simple vs. multiple regression 61–64; see also regression analysis; specific event studies
index numbers 324–327
industrial production, index of (IIP) 326–327
infant mortality rates, outliers 299–300
inference: in multiple regressions: confidence intervals 122–124, statistical, with IV estimator 466–469; in time series regressions 322–323, 373–374
infinite distributed lag models 569–574
inflation: from 1948 to 2003 313; openness and 508, 509–510; random walk model for 355; unemployment and: expectations augmented Phillips curve 353–354, forecasting 589, static Phillips curve 314, 322–323; unit root test 577
influential observations 296–300
information set 587
in-sample criteria 591
instrumental variables: computing R-squared after estimation 471; in multiple regressions 471–475; overview and definitions 462, 463, 465; properties with poor instrumental variable 469–471; in simple regressions 462–471; solutions to errors-in-variables problems 479–481; statistical inference 466–469
integrated of order zero/one processes 358–360
integrated processes, forecasting 594–598
interaction effect 177–179
interaction terms 217–218
intercept parameter 21
intercepts: change in unit of measurement and 36–37; defined 21, 630; in regressions on a constant 51; in regressions through origin 50–51; see also OLS estimators; regression analysis
intercept shifts 207
interest rates: differencing 387–388; inference under CLM assumptions 323; T-bill, see T-bill rates
internet services 606
interval estimation 674, 687–688
inverse Mills ratio 538
inverse of matrix 713
IQ: ability and 279–283, 284–285; nonrandom sampling 294–295
irrelevant variables, including 77–78
IV: see instrumental variables
J
JEL: see Journal of Economic Literature (JEL)
job training: sample model as self-selection problem 3; worker productivity and: program evaluation 229, as self-selection problem 230
joint distributions: features of 652–658; independence and 649–651
joint hypotheses tests 127
jointly statistically significant/insignificant 130
joint probability 649
Journal of Economic Literature (JEL) 606
junior colleges vs. universities 124–127
just identified equations 511

K
Koyck distributed lag 571–572
kurtosis 658

L
labor economists 605, 607
labor force: see employment and unemployment; women in labor force
labor supply and demand 500–501
labor supply function 639
LAD (least absolute deviations) estimation 300–302
lag distribution 315
lagged dependent variables: as proxy variables 283–284; serial correlation and 374–375
lagged endogenous variables 591–592
lagged explanatory variables 316
Lagrange multiplier (LM) statistics: heteroskedasticity-robust 248–249; main discussion 158–160; see also heteroskedasticity
land quality and fertilizers 23
large sample properties 681–683
latent variable models 526
law enforcement: city crime levels and, causality 13; murder rates and, SEM 501–502
law of iterated expectations 664
law of large numbers 682
law school rankings: as dummy variables 216–217; residual analysis 190
leads and lags estimators 584
least absolute deviations (LAD) estimation 300–302
least squares estimator 686
likelihood ratio statistic 529
limited dependent variables: censored and truncated regression models 547–552; corner solution response, see Tobit model; count response, Poisson regression for 543–547; overview 524–525; sample selection corrections 554–558
linear functions 630–631
linear independence 714
linear in parameters assumption: for OLS in matrix form 723–724; for simple linear regressions 40, 44; for time series regressions 317–318
linearity and weak dependence assumption 348–349
linear probability model (LPM): heteroskedasticity and 265–266; main discussion 224–229; see also limited dependent variables
linear regression model 40, 64
linear relationship among independent variables 83–86
linear time trends 330
literature review 607–608
loan approval rates: F and t statistics 150; multicollinearity 85; program evaluation 230
logarithms: in multiple regressions 171–173; natural, overview 736–739; predicting y when log(y) is dependent 191–193; qualitative information and 211–212; real dollars and 327; in simple regressions 37–39; in time series regressions 323–324
log function 636
logit and probit models: interpreting estimates 530–536; maximum likelihood estimation of 528–529; specifying 525–528; testing multiple hypotheses 529–530
log-likelihood functions 529
longitudinal data: see panel data
long-run elasticity 324
long-run multiplier: see long-run propensity (LRP)
long-run propensity (LRP) 316
loss functions 586
LRP (long-run propensity) 316
lunch program and math performance 44–45
materially affect the overall learning experience Cengage Learning reserves the right to remove additional content at any time if subsequent rights restrictions require it Index 780 M macroeconomists 606 MAE mean absolute error 591 marginal effect 630 marital status See qualitative information martingale difference sequence 574 martingale functions 587 matched pair samples 449 mathematical statistics See statistics math performance and lunch program 4445 matrices See also OLS in matrix form addition 710 basic definitions 709710 differentiation of linear and quadratic forms 715 idempotent 715 linear independence and rank of 714 moments and distributions of random vectors 716717 multiplication 711712 operations 710713 quadratic forms and positive definite 714715 matrix notation 721 maximum likelihood estimation 528529 685686 MCAR missing completely at random 293 mean using summation operator 629630 mean absolute error MAE 591 mean independence 23 mean squared error MSE 680 measurement error IV solutions t0 479481 men return to education 468 properties of OLS under 287292 measures of association 658 measures of central tendency 655657 measures of variability 656 median 630 655 method of moments approach 2526 685 micronumerosity 85 military personnel survey oversampling in 295 minimum variance unbiased estimators 106 686 727 minimum wages causality 13 employmentunemployment and AR1 serial correlation testing for 377378 detrending 334335 logarithmic form 323324 SCrobust standard error 391 in Puerto Rico effects of 78 minorities and loans See loan approval rates missing at random 294 missing completely at random MCAR 293 missing data 293294 misspecification in empirical projects 613 functional form 275279 unbiasedness and 7883 variances 8687 motherhood teenage 448449 moving average process of order one MA1 346 MSE mean squared error 680 multicollinearity 2SLS and 477 among explanatory variables 293 main discussion 8386 multiple hypotheses tests 127 multiple linear regression MLR model 63 multiple regression analysis See also data issues estimation and estimators heteroskedasticity hypotheses OLS ordinary least squares predictions Rsquareds adding regressors to reduce error variance 185186 advantages over simple regression 6064 confidence intervals 122124 interpreting equations 67 null hypothesis 108 omitted variable bias 7883 over controlling 184185 multiple regressions See also qualitative information beta coefficients 169 hypotheses with more than one parameter 124127 misspecified functional forms 275 motivation for multiple regression 61 62 nonrandom sampling 294295 normality assumption and 107 productivity and 360 quadratic functions 173177 with qualitative information of baseball players race and 220221 computer usage and 218 with different slopes 218221 education and 218220 gender and 207211 212214 218221 with interacting terms 218 law school rankings and 216217 with logy dependent variable 213214 marital status and 219220 with multiple dummy variables 212213 with ordinal variables 215217 physical attractiveness and 216217 random effects model 443444 random slope model 285 reporting results 137138 t test 110 with unobservables general approach 284285 with unobservables using proxy 279285 working individuals in 1976 6 Copyright 2016 Cengage Learning All Rights Reserved May not be copied scanned or duplicated in whole or in part Due to electronic rights some third party content may be suppressed from the eBook andor eChapters Editorial review has deemed that any suppressed content does not 
materially affect the overall learning experience Cengage Learning reserves the right to remove additional content at any time if subsequent rights restrictions require it Index 781 multiple restrictions 127 multiplestepahead forecast 587 592594 multiplicative measurement error 289 multivariate normal distribution 716717 municipal bond interest rates 214215 murder rates SEM 501502 static Phillips curve 314 N natural experiments 410 469 natural logarithms 736739 See also logarithms netted out 69 nominal dollars 326 nominal vs real 326 nonexperimental data 2 nonlinear functions 634640 nonlinearities incorporating in simple regressions 3739 nonnested models choosing between 182184 functional form misspecification and 278279 nonrandom samples 294295 553 nonstationary time series processes 345346 no perfect collinearity assumption form 723 for multiple linear regressions 7476 77 for time series regressions 318 349 normal distribution 665669 normality assumption for multiple linear regressions 105108 for time series regressions 322 normality of errors assumption 726 normality of estimators in general asymptotic 683684 normality of OLS asymptotic in multiple regressions 154160 in time series regressions 351354 normal sampling distributions for multiple linear regressions 107108 for time series regressions 322323 no serial correlation assumption See also serial correlation for OLS in matrix form 724725 for time series regressions 320322 351352 nRsquared statistic 159 null hypothesis 108110 694 See also hypotheses numerator degrees of freedom 129 O observational data 2 OLS ordinary least squares cointegration and 583584 comparison of simple and multiple regression estimates 6970 consistency See consistency of OLS logit and probit vs 533535 in multiple regressions algebraic properties 6472 computational properties 6466 6472 effects of data scaling 166170 fitted values and residuals 68 goodnessoffit 7071 interpreting equations 6566 Lagrange multiplier LM statistic 158160 measurement error and 287292 normality 154160 partialling out 69 regression through origin 73 statistical properties 7381 Poisson vs 545 546547 in simple regressions algebraic properties 3234 defined 27 deriving estimates 2432 statistical properties 4550 units of measurement changing 3637 simultaneity bias in 503504 in time series regressions correcting for serial correlation 383386 FGLS vs 385386 finite sample properties 317323 normality 351354 SCrobust standard errors 388391 with serially correlated errors properties of 373375 Tobit vs 540542 OLS and Tobit estimates 540542 OLS asymptotics in matrix form 728731 in multiple regressions consistency 150154 efficiency 161162 overview 149150 in time series regressions consistency 348354 OLS estimators See also heteroskedasticity defined 40 in multiple regressions efficiency of 8990 variances of 8189 sampling distributions of 105108 in simple regressions expected value of 7381 unbiasedness of 4045 77 variances of 4548 in time series regressions sampling distributions of 322323 unbiasedness of 317323 variances of 320322 Copyright 2016 Cengage Learning All Rights Reserved May not be copied scanned or duplicated in whole or in part Due to electronic rights some third party content may be suppressed from the eBook andor eChapters Editorial review has deemed that any suppressed content does not materially affect the overall learning experience Cengage Learning reserves the right to remove additional content at any time if subsequent rights restrictions require it Index 782 OLS in matrix form 
asymptotic analysis 728731 finite sample properties 723726 overview 720722 statistical inference 726728 Wald statistics for testing multiple hypotheses 730731 OLS intercept estimates defined 6566 OLS regression line See also OLS ordinary least squares defined 28 in multiple regressions 65 OLS slope estimates defined 65 omitted variable bias See also instrumental variables general discussions 7883 using proxy variables 279285 onesided alternatives 695 onestepahead forecasts 586 588 onetailed tests 110 696 See also t tests online databases 609 online search services 607608 order condition 479 507 ordinal variables 214217 outliers guarding against 300302 main discussion 296300 outofsample criteria 591 overall significance of regressions 135 over controlling 184185 overdispersion 545 overidentified equations 511 overidentifying restrictions testing 482485 overspecifying the model 78 P pairwise uncorrelated random variables 660661 panel data applying 2SLS to 487488 applying methods to other structures 448450 correlated random effects 445447 differencing with more than two periods 420425 fixed effects 435441 independently pooled cross sections vs 403 organizing 417 overview 910 pitfalls in first differencing 424 random effects 441445 simultaneous equations models with 514516 twoperiod analysis 417419 twoperiod policy analysis with 417419 unbalanced 440441 Panel Study of Income Dynamics 608 parameters defined 4 674 estimation general approach to 684686 partial derivatives 641 partial effect 66 6768 partial effect at average PEA 531532 partialling out 69 partitioned matrix multiplication 712713 pdf probability density functions 647 percentage point change 634 percentages 633634 change 633 percent correctly predicted 227 530 perfect collinearity 7476 permanent income hypothesis 513514 pesticide usage over controlling 185 physical attractiveness and wages 215216 pizzas expected revenue 654 plugin solution to the omitted variables problem 280 point estimates 674 point forecasts 588 poisson distribution 544 545 poisson regression model 543547 policy analysis with pooled cross sections 407412 with qualitative information 210 229231 with twoperiod panel data 417419 pooled cross sections See also independently pooled cross sections applying 2SLS to 487488 overview 8 policy analysis with 407412 population defined 674 population model defined 73 population regression function PRF 23 population Rsquareds 181 positive definite and semidefinite matrices defined 715 poverty rate in absence of suitable proxies 285 excluding from model 80 power of test 694 practical significance 120 practical vs statistical significance 120124 702703 PraisWinsten PW estimation 383384 386 390 predetermined variables 592 predicted variables 21 See also dependent variables prediction error 188 predictions confidence intervals for 186189 with heteroskedasticity 264266 residual analysis 190 for y when logy is dependent 191193 predictor variables 23 See also dependent variables price index 326327 Copyright 2016 Cengage Learning All Rights Reserved May not be copied scanned or duplicated in whole or in part Due to electronic rights some third party content may be suppressed from the eBook andor eChapters Editorial review has deemed that any suppressed content does not materially affect the overall learning experience Cengage Learning reserves the right to remove additional content at any time if subsequent rights restrictions require it Index 783 prisons population and crime rates 515516 recidivism 549551 probability See also 
conditional distributions joint distributions features of distributions 652658 independence 649651 joint 649 normal and related distributions 665669 overview 645 random variables and their distributions 645649 probability density function pdf 647 probability limits 681683 probit model See logit and probit models productivity See worker productivity program evaluation 210 229231 projects See empirical analysis property taxes and housing prices 8 proportions 733734 proxy variables 279285 pseudo Rsquareds 531 public finance study researchers 606 Puerto Rico employment in detrending 334335 logarithmic form 323324 time series data 78 pvalues computing and using 698700 for F tests 134135 for t tests 118120 Q QMLE quasimaximum likelihood estimation 728 quadratic form for matrices 714715 716 quadratic function 634636 quadratic time trends 331 qualitative information See also linear probability model LPM in multiple regressions allowing for different slopes 218221 binary dependent variable 224229 describing 205206 discrete dependent variables 231232 interactions among dummy variables 217 with logy dependent variable 211212 multiple dummy independent variables 212217 ordinal variables 214217 overview 205 policy analysis and program evaluation 229231 proxy variables 282283 single dummy independent variable 206212 testing for differences in regression functions across groups 221224 in time series regressions main discussion 324329 seasonal 336338 quantile regression 302 quasidemeaned data 442 quasidifferenced data 382 390 quasiexperiment 410 quasi natural experiments 410 469 quasilikelihood ratio statistic 546 quasimaximum likelihood estimation QMLE 545 728 R R2 j 8386 race arrests and 229 baseball player salaries and 220221 discrimination in hiring asymptotic confidence interval 692693 hypothesis testing 698 pvalue 701 random coefficient model 285287 random effects correlated 445447 estimator 442 fixed effects vs 444445 main discussion 441445 random sampling assumption for multiple linear regressions 74 for simple linear regressions 4041 42 44 crosssectional data and 57 defined 675 random slope model 285287 random variables 645649 random vectors 716 random walks 354 rank condition 479 497 506507 rank of matrix 714 rational distributed lag models 572574 RD and sales confidence intervals 123124 nonnested models 182184 outliers 296298 RDL rational distributed lag models 572574 real dollars 326 recidivism duration analysis 549551 reduced form equations 473 504 reduced form error 504 reduced form parameters 504 regressands 21 See also dependent variables Copyright 2016 Cengage Learning All Rights Reserved May not be copied scanned or duplicated in whole or in part Due to electronic rights some third party content may be suppressed from the eBook andor eChapters Editorial review has deemed that any suppressed content does not materially affect the overall learning experience Cengage Learning reserves the right to remove additional content at any time if subsequent rights restrictions require it Index 784 regression analysis 5051 See also multiple regression analysis simple regression model time series data regression specification error test RESET 277278 regression through origin 5052 regressors 21 185186 See also independent variables rejection region 695 rejection rule 110 See also t tests relative change 633 relative efficiency 679680 relevant variables excluding 7883 reporting multiple regression results 137138 resampling method 203 rescaling 166168 RESET regression specification error test 277 
residual analysis 190 residuals See also OLS ordinary least squares in multiple regressions 68 297298 in simple regressions 27 32 48 studentized 297298 residual sum of squares SSR See sum of squared residuals response probability 225 525 response variables 21 See also dependent variables REST regression specification error test 277278 restricted model 128129 See also F tests retrospective data 2 returns on equity and CEO salaries fitted values and residuals 32 goodnessoffit 35 OLS Estimates 2930 RMSE root mean squared error 50 88 591 robust regression 302 rooms and housing prices beta coefficients 175176 interaction effect 177179 quadratic functions 175177 residual analysis 190 root mean squared error RMSE 50 88 591 row vectors 709 Rsquareds See also predictions adjusted 181184 374 after IV estimation 471 change in unit of measurement and 37 in fixed effects estimation 437 438439 for F statistic 133134 in multiple regressions main discussion 7073 for probit and logit models 531 for PW estimation 383384 in regressions through origin 5051 73 in simple regressions 3536 size of 180181 in time series regressions 374 trending dependent variables and 334335 uncentered 214 S salaries See CEO salaries income wages sales CEO salaries and constant elasticity model 39 nonnested models 183184 motivation for multiple regression 6364 RD and See RD and sales sales tax increase 634 sample average 675 sample correlation coefficient 685 sample covariance 685 sample regression function SRF 28 65 sample selection corrections 553558 sample standard deviation 683 sample variation in the explanatory variable assumption 42 44 sampling nonrandom 293300 sampling distributions defined 676 of OLS estimators 105108 sampling standard deviation 693 sampling variances of estimators in general 678679 of OLS estimators for multiple linear regressions 82 83 for simple linear regressions 4748 savings housing expenditures and 502 income and heteroskedasticity 254256 scatterplot 25 measurement error in 289 with nonrandom sample 294295 scalar multiplication 710 scalar variancecovariance matrices 724 scatterplots RD and sales 297298 savings and income 25 wage and education 27 school lunch program and math performance 4445 school size and student performance 113114 score statistic 158160 scrap rates and job training 2SLS 487 confidence interval 700701 confidence interval and hypothesis testing 702 fixed effects estimation 436437 measurement error in 289 program evaluation 229 pvalue 700701 statistical vs practical significance 121122 Copyright 2016 Cengage Learning All Rights Reserved May not be copied scanned or duplicated in whole or in part Due to electronic rights some third party content may be suppressed from the eBook andor eChapters Editorial review has deemed that any suppressed content does not materially affect the overall learning experience Cengage Learning reserves the right to remove additional content at any time if subsequent rights restrictions require it Index 785 twoperiod panel data 418 unbalanced panel data 441 seasonal dummy variables 337 seasonality forecasting 594598 serial correlation and 381 of time series 336338 seasonally adjusted patterns 336 selected samples 553 selfselection problems 230 SEM See simultaneous equations models semielasticity 39 639 sensitivity analysis 613 sequential exogeneity 363 serial correlation correcting for 381387 differencing and 387389 heteroskedasticity and 395 lagged dependent variables and 374375 no serial correlation assumption 320322 351354 properties of OLS with 373375 
testing for 376381 serial correlationrobust standard errors 388391 serially uncorrelation 360 shortrun elasticity 324 significance level 110 simple linear regression model 20 simple regression model 2024 See also OLS ordinary least squares incorporating nonlinearities in 3739 IV estimation 462471 multiple regression vs 6063 regression on a constant 51 regression through origin 5051 simultaneity bias 504 simultaneous equations models bias in OLS 503504 identifying and estimating structural equations 504510 overview and nature of 449503 with panel data 514516 systems with more than two equations 510511 with time series 511514 skewness 658 sleeping vs working tradeoff 415416 slopes See also OLS estimators regression analysis change in unit of measurement and 3637 39 defined 21 630 parameter 21 qualitative information and 218221 random 285287 in regressions on a constant 51 in regressions through origin 5051 smearing estimates 191 smoking birth weight and asymptotic standard error 158 data scaling 166170 cigarette taxes and consumption 411412 demand for cigarettes 261262 IV estimation 470 measurement error 292 Social Sciences Citation Index 606 soybean yields and fertilizers causality 11 12 simple equation 2122 specification search 613 spreadsheets 610 spurious regression 332333 578580 square matrices 709710 SRF sample regression function 28 65 SSE explained sum of squares 34 7071 SSR residual sum of squares See sum of squared residuals SST total sum of squares 34 7071 SSTj total sample variation in xj 83 stable AR1 processes 347 standard deviation of bˆj 8990 defined 45 657 estimating 49 properties of 657 standard error of the regression SER 50 88 standard errors asymptotic 157 of bˆj 88 heteroskedasticityrobust 246247 of OLS estimators 8789 of bˆ1 50 serial correlationrobust 388391 standardized coefficients 169170 standardized random variables 657658 standardized test scores beta coefficients 169 collinearity 7475 interaction effect 178179 motivation for multiple regression 61 62 omitted variable bias 80 81 omitting unobservables 285 residual analysis 190 standard normal distribution 666668 743744 static models 314 350 static Phillips curve 314 322323 377 378 386 stationary time series processes 345346 statistical inference with IV estimator 466469 for OLS in matrix form 726728 Copyright 2016 Cengage Learning All Rights Reserved May not be copied scanned or duplicated in whole or in part Due to electronic rights some third party content may be suppressed from the eBook andor eChapters Editorial review has deemed that any suppressed content does not materially affect the overall learning experience Cengage Learning reserves the right to remove additional content at any time if subsequent rights restrictions require it Index 786 statistical significance defined 115 economicpractical significance vs 120124 economicpractical significance vs 702 joint 130 statistical tables 743749 statistics See also hypothesis testing asymptotic properties of estimators 681684 finite sample properties of estimators 675680 interval estimation and confidence intervals 687693 notation 703 overview and definitions 674675 parameter estimation general approaches to 684686 stepwise regression 614 stochastic process 313 345 stock prices and trucking regulations 325 stock returns 393 394 See also efficient markets hypothesis EMH stratified sampling 295 strict exogeneity assumption 414420 570 strictly exogenous variables correcting for 381387 serial correlation testing for 376381 strict stationarity 345 strongly dependent 
time series See highly persistent time series structural equations definitions 471 500 501 504 identifying and estimating 504510 structural error 501 structural parameters 504 student enrollment t test 116117 studentized residuals 298 student performance See also college GPA final exam scores standardized test scores in math lunch program and 4445 school expenditures and 85 school size and 113114 style hints for empirical papers 619621 summation operator 628630 sum of squared residuals See also OLS ordinary least squares in multiple regressions 7071 in simple regressions 34 supply shock 353 Survey of Consumer Finances 608 symmetric matrices 712 systematic part defined 24 system estimation methods 511 T tables statistical 743749 tax exemption See under fertility rate Tbill rates cointegration 580584 error correction model 585 inflation deficits See under interest rates random walk characterization of 355 356 unit root test 576 t distribution critical values table 745 discussions 108110 660670 717 for standardized estimators 108110 teachers salarypension tradeoff 137138 teenage motherhood 448449 tenure See also wages interpreting equations 67 motivation for multiple regression 6364 testing overidentifying restrictions 482485 test scores as indicators of ability 481 test statistic 695 text editor 609 text files and editors 608609 theorems asymptotic efficiency of OLS 162 for time series regressions 351354 consistency of OLS for multiple linear regressions 150154 for time series regressions 348351 GaussMarkov for multiple linear regressions 8990 for time series regressions 320322 normal sampling distributions 107108 for OLS in matrix form GaussMarkov 725726 statistical inference 726728 unbiasedness 726 variancecovariance matrix of OLS estimator 724725 sampling variances of OLS estimators for simple linear regressions 4748 for time series regressions 320322 unbiased estimation of s2 for multiple linear regressions 8889 for time series regressions 321 unbiasedness of OLS for multiple linear regressions 77 for time series regressions 317320 theoretical framework 615 three stage least squares 511 timedemeaned data 435 Copyright 2016 Cengage Learning All Rights Reserved May not be copied scanned or duplicated in whole or in part Due to electronic rights some third party content may be suppressed from the eBook andor eChapters Editorial review has deemed that any suppressed content does not materially affect the overall learning experience Cengage Learning reserves the right to remove additional content at any time if subsequent rights restrictions require it Index 787 time series data absence of serial correlation 360363 applying 2SLS to 485486 cointegration 580584 dynamically complete models 360363 error correction models 584586 examples of models 313316 functional forms 323324 heteroskedasticity in 391395 highly persistent See highly persistent time series homoskedasticity assumption for 363364 infinite distributed lag models 569574 nature of 312313 OLS See under OLS ordinary least squares OLS estimators overview 78 in panel data 910 in pooled cross sections 89 with qualitative information See under qualitative information seasonality 336338 simultaneous equations models with 511514 spurious regression 578580 stationary and nonstationary 345346 unit roots testing for 574579 weakly dependent 346348 time trends See trends timevarying error 413 tobit model interpreting estimates 537542 overview 536537 specification issues in 543 top coding 548 total sample variation in xj 83 total sum of squares SST 
34 7071 trace of matrix 713 traffic fatalities beer taxes and 184 training grants See also job training program evaluation 229 single dummy variable 210211 transpose of matrix 712 treatment group 210 trends characterizing trending time series 329332 detrending 334335 forecasting 594598 high persistence vs 352 Rsquared and trending dependent variable 334335 seasonality and 337338 time 329 using trending variables 332333 trendstationary processes 348 trucking regulations and stock prices 325 true model defined 74 truncated normal regression model 551 truncated regression models 548 551552 t statistics See also t tests asymptotic 157 defined 109 696 F statistic and 132133 heteroskedasticityrobust 246247 t tests See also t statistics for AR1 serial correlation 376378 null hypothesis 108110 onesided alternatives 110114 other hypotheses about bj 116118 overview 108110 pvalues for 118120 twosided alternatives 114115 twoperiod panel data analysis 417419 policy analysis with 417419 twosided alternatives 695696 two stage least squares applied to pooled cross sections and panel data 487488 applied to time series data 485486 with heteroskedasticity 485486 multiple endogenous explanatory variables 478479 for SEM 508510 511 single endogenous explanatory variable 475477 tesing multiple hypotheses after estimation 479 testing for endogeneity 481482 twotailed tests 115 697 See also t tests Type III error 694 U u unobserved term CEV assumption and 292 foregoing specifying models with 284285 general discussions 45 2123 in time series regressions 319 using proxy variables for 279285 unanticipated inflation 353 unbalanced panels 440441 unbiased estimation of s² for multiple linear regressions 8889 for simple linear regressions 49 for time series regressions 321 unbiasedness in general 677678 of OLS in matrix form 724 Copyright 2016 Cengage Learning All Rights Reserved May not be copied scanned or duplicated in whole or in part Due to electronic rights some third party content may be suppressed from the eBook andor eChapters Editorial review has deemed that any suppressed content does not materially affect the overall learning experience Cengage Learning reserves the right to remove additional content at any time if subsequent rights restrictions require it Index 788 unbiasedness continued in multiple regressions 77 for simple linear regressions 4344 in simple regressions 4044 in time series regressions 317323 373375 of sˆ ² 726 uncentered Rsquareds 214 unconditional forecasts 587 uncorrelated random variables 660 underspecifying the model 7883 unemployment See employment and unemployment unidentified equations 511 unit roots forecasting processes with 597598 testing for 574579 gross domestic product GDP 578 inflation 577 process 355 358 units of measurement effects of changing 3637 166168 universities vs junior colleges 124127 unobserved effectsheterogeneity 413 435 See also fixed effects unobserved terms See u unobserved term unrestricted model 128129 See also F tests unsystematic part defined 24 upward bias 80 81 utility maximization 2 V variables See also dependent variables independent variables specific types dummy 206 See also qualitative information in multiple regressions 6164 seasonal dummy 337 in simple regressions 2021 variancecovariance matrices 716 724725 variance inflation factor VIF 86 variance of prediction error 188 variances conditional 665 of OLS estimators in multiple regressions 8189 in simple regressions 4550 in time series regressions 320322 overview and properties of 656657 660661 of 
prediction error 189 VAR model 589 597598 vector autoregressive model 589 597598 vectors defined 709 veterans earnings of 469 voting outcomes campaign expenditures and deriving OLS estimate 31 economic performance and 328329 perfect collinearity 7576 W wages causality 1314 education and 2SLS 488 conditional expectation 661665 heteroskedasticity 4647 independent cross sections 405406 nonlinear relationship 3739 OLS estimates 3031 partial effect 641 rounded averages 33 scatterplot 27 simple equation 22 experience and See under experience with heteroskedasticityrobust standard errors 246247 labor supply and demand 500501 labor supply function 639 multiple regressions See also qualitative information homoskedasticity 8283 Wald teststatistics 529530 537 730731 weak instruments 471 weakly dependent time series 346348 wealth See financial wealth weighted least squares estimation linear probability model 265267 overview 254 prediction and prediction intervals 264265 for time series regressions 390 393394 when assumed heteroskedasticity function is wrong 262264 when heteroskedasticity function must be estimated 258263 when heteroskedasticity is known up to a multiplicative constant 254259 White test for heteroskedasticity 252254 within estimators 435 See also fixed effects within transformation 435 women in labor force heteroskedasticity 265267 LPM logit and probit estimates 533535 return to education 2SLS 477 IV estimation 467 testing for endogeneity 482 testing overidentifying restrictions 482 sample selection correction 556557 Copyright 2016 Cengage Learning All Rights Reserved May not be copied scanned or duplicated in whole or in part Due to electronic rights some third party content may be suppressed from the eBook andor eChapters Editorial review has deemed that any suppressed content does not materially affect the overall learning experience Cengage Learning reserves the right to remove additional content at any time if subsequent rights restrictions require it Index 789 womens fertility See fertility rate worker compensation laws and weeks out of work 411 worker productivity job training and program evaluation 229 sample model 4 in US trend in 331 wages and 360 working vs sleeping tradeoff 415416 working women See women in labor force writing empirical papers 614621 conceptual or theoretical framework 615 conclusions 618619 data description 617618 econometric models and estimation methods 615617 introduction 614615 results section 618 style hints 619621 Y year dummy variables in fixed effects model 436438 pooling independent cross sections across time 403407 in random effects model 443444 Z zero conditional mean assumption homoskedasticity vs 45 for multiple linear regressions 6263 7677 for OLS in matrix form 724 for simple linear regressions 2324 42 44 for time series regressions 318319 349 zero mean and zero correlation assumption 152 zeroone variables 206 See also qualitative information Copyright 2016 Cengage Learning All Rights Reserved May not be copied scanned or duplicated in whole or in part Due to electronic rights some third party content may be suppressed from the eBook andor eChapters Editorial review has deemed that any suppressed content does not materially affect the overall learning experience Cengage Learning reserves the right to remove additional content at any time if subsequent rights restrictions require it Copyright 2016 Cengage Learning All Rights Reserved May not be copied scanned or duplicated in whole or in part Due to electronic rights some third party content 
Preview text
Brief Contents

Chapter 1 The Nature of Econometrics and Economic Data 1

Part 1 Regression Analysis with Cross-Sectional Data 19
Chapter 2 The Simple Regression Model 20
Chapter 3 Multiple Regression Analysis: Estimation 60
Chapter 4 Multiple Regression Analysis: Inference 105
Chapter 5 Multiple Regression Analysis: OLS Asymptotics 149
Chapter 6 Multiple Regression Analysis: Further Issues 166
Chapter 7 Multiple Regression Analysis with Qualitative Information: Binary (or Dummy) Variables 205
Chapter 8 Heteroskedasticity 243
Chapter 9 More on Specification and Data Issues 274

Part 2 Regression Analysis with Time Series Data 311
Chapter 10 Basic Regression Analysis with Time Series Data 312
Chapter 11 Further Issues in Using OLS with Time Series Data 344
Chapter 12 Serial Correlation and Heteroskedasticity in Time Series Regressions 372

Part 3 Advanced Topics 401
Chapter 13 Pooling Cross Sections across Time: Simple Panel Data Methods 402
Chapter 14 Advanced Panel Data Methods 434
Chapter 15 Instrumental Variables Estimation and Two Stage Least Squares 461
Chapter 16 Simultaneous Equations Models 499
Chapter 17 Limited Dependent Variable Models and Sample Selection Corrections 524
Chapter 18 Advanced Time Series Topics 568
Chapter 19 Carrying Out an Empirical Project 605

Appendices
Appendix A Basic Mathematical Tools 628
Appendix B Fundamentals of Probability 645
Appendix C Fundamentals of Mathematical Statistics 674
Appendix D Summary of Matrix Algebra 709
Appendix E The Linear Regression Model in Matrix Form 720
Appendix F Answers to Chapter Questions 734
Appendix G Statistical Tables 743
References 750
Glossary 756
Index 771

Contents

Preface xii
About the Author xxi

Chapter 1 The Nature of Econometrics and Economic Data 1
1.1 What Is Econometrics? 1
1.2 Steps in Empirical Economic Analysis 2
1.3 The Structure of Economic Data 5: 1.3a Cross-Sectional Data 5; 1.3b Time Series Data 7; 1.3c Pooled Cross Sections 8; 1.3d Panel or Longitudinal Data 9; 1.3e A Comment on Data Structures 10
1.4 Causality and the Notion of Ceteris Paribus in Econometric Analysis 10
Summary 14; Key Terms 14; Problems 15; Computer Exercises 15

Part 1 Regression Analysis with Cross-Sectional Data 19

Chapter 2 The Simple Regression Model 20
2.1 Definition of the Simple Regression Model 20
2.2 Deriving the Ordinary Least Squares Estimates 24: 2.2a A Note on Terminology 31
2.3 Properties of OLS on Any Sample of Data 32: 2.3a Fitted Values and Residuals 32; 2.3b Algebraic Properties of OLS Statistics 32; 2.3c Goodness-of-Fit 35
2.4 Units of Measurement and Functional Form 36: 2.4a The Effects of Changing Units of Measurement on OLS Statistics 36; 2.4b Incorporating Nonlinearities in Simple Regression 37; 2.4c The Meaning of Linear Regression 40
2.5 Expected Values and Variances of the OLS Estimators 40: 2.5a Unbiasedness of OLS 40; 2.5b Variances of the OLS Estimators 45; 2.5c Estimating the Error Variance 48
2.6 Regression through the Origin and Regression on a Constant 50
Summary 51; Key Terms 52; Problems 53; Computer Exercises 56; Appendix 2A 59

Chapter 3 Multiple Regression Analysis: Estimation 60
3.1 Motivation for Multiple Regression 61: 3.1a The Model with Two Independent Variables 61; 3.1b The Model with k Independent Variables 63
3.2 Mechanics and Interpretation of Ordinary Least Squares 64: 3.2a Obtaining the OLS Estimates 64; 3.2b Interpreting the OLS Regression Equation 65; 3.2c On the Meaning of Holding Other Factors Fixed in Multiple Regression 67; 3.2d Changing More Than One Independent Variable Simultaneously 68; 3.2e OLS Fitted Values and Residuals 68; 3.2f A Partialling Out Interpretation of Multiple Regression 69; 3.2g Comparison of Simple and Multiple Regression Estimates 69; 3.2h Goodness-of-Fit 70; 3.2i Regression through the Origin 73
3.3 The Expected Value of the OLS Estimators 73: 3.3a Including Irrelevant Variables in a Regression Model 77; 3.3b Omitted Variable Bias: The Simple Case 78; 3.3c Omitted Variable Bias: More General Cases 81
3.4 The Variance of the OLS Estimators 81: 3.4a The Components of the OLS Variances: Multicollinearity 83; 3.4b Variances in Misspecified Models 86; 3.4c Estimating s2: Standard Errors of the OLS Estimators 87
3.5 Efficiency of OLS: The Gauss-Markov Theorem 89
3.6 Some Comments on the Language of Multiple Regression Analysis 90
Summary 91; Key Terms 93; Problems 93; Computer Exercises 97; Appendix 3A 101

Chapter 4 Multiple Regression Analysis: Inference 105
4.1 Sampling Distributions of the OLS Estimators 105
4.2 Testing Hypotheses about a Single Population Parameter: The t Test 108: 4.2a Testing against One-Sided Alternatives 110; 4.2b Two-Sided Alternatives 114; 4.2c Testing Other Hypotheses about bj 116; 4.2d Computing p-Values for t Tests 118; 4.2e A Reminder on the Language of Classical Hypothesis Testing 120; 4.2f Economic, or Practical, versus Statistical Significance 120
4.3 Confidence Intervals 122
4.4 Testing Hypotheses about a Single Linear Combination of the Parameters 124
4.5 Testing Multiple Linear Restrictions: The F Test 127: 4.5a Testing Exclusion Restrictions 127; 4.5b Relationship between F and t Statistics 132; 4.5c The R-Squared Form of the F Statistic 133; 4.5d Computing p-Values for F Tests 134; 4.5e The F Statistic for Overall Significance of a Regression 135; 4.5f Testing General Linear Restrictions 136
4.6 Reporting Regression Results 137
Summary 139; Key Terms 140; Problems 141; Computer Exercises 146

Chapter 5 Multiple Regression Analysis: OLS Asymptotics 149
5.1 Consistency 150: 5.1a Deriving the Inconsistency in OLS 153
5.2 Asymptotic Normality and Large Sample Inference 154: 5.2a Other Large Sample Tests: The Lagrange Multiplier Statistic 158
5.3 Asymptotic Efficiency of OLS 161
Summary 162; Key Terms 162; Problems 162; Computer Exercises 163; Appendix 5A 165

Chapter 6 Multiple Regression Analysis: Further Issues 166
6.1 Effects of Data Scaling on OLS Statistics 166: 6.1a Beta Coefficients 169
6.2 More on Functional Form 171: 6.2a More on Using Logarithmic Functional Forms 171; 6.2b Models with Quadratics 173; 6.2c Models with Interaction Terms 177; 6.2d Computing Average Partial Effects 179
6.3 More on Goodness-of-Fit and Selection of Regressors 180: 6.3a Adjusted R-Squared 181; 6.3b Using Adjusted R-Squared to Choose between Nonnested Models 182; 6.3c Controlling for Too Many Factors in Regression Analysis 184; 6.3d Adding Regressors to Reduce the Error Variance 185
6.4 Prediction and Residual Analysis 186: 6.4a Confidence Intervals for Predictions 186; 6.4b Residual Analysis 190; 6.4c Predicting y When log(y) Is the Dependent Variable 190; 6.4d Predicting y When the Dependent Variable Is log(y) 192
Summary 194; Key Terms 196; Problems 196; Computer Exercises 199; Appendix 6A 203

Chapter 7 Multiple Regression Analysis with Qualitative Information: Binary (or Dummy) Variables 205
7.1 Describing Qualitative Information 205
7.2 A Single Dummy Independent Variable 206: 7.2a Interpreting Coefficients on Dummy Explanatory Variables When the Dependent Variable Is log(y) 211
7.3 Using Dummy Variables for Multiple Categories 212: 7.3a Incorporating Ordinal Information by Using Dummy Variables 214
7.4 Interactions Involving Dummy Variables 217: 7.4a Interactions among Dummy Variables 217; 7.4b Allowing for Different Slopes 218; 7.4c Testing for Differences in Regression Functions across Groups 221
7.5 A Binary Dependent Variable: The Linear Probability Model 224
7.6 More on Policy Analysis and Program Evaluation 229
7.7 Interpreting Regression Results with Discrete Dependent Variables 231
Summary 232; Key Terms 233; Problems 233; Computer Exercises 237

Chapter 8 Heteroskedasticity 243
8.1 Consequences of Heteroskedasticity for OLS 243
8.2 Heteroskedasticity-Robust Inference after OLS Estimation 244: 8.2a Computing Heteroskedasticity-Robust LM Tests 248
8.3 Testing for Heteroskedasticity 250: 8.3a The White Test for Heteroskedasticity 252
8.4 Weighted Least Squares Estimation 254: 8.4a The Heteroskedasticity Is Known up to a Multiplicative Constant 254; 8.4b The Heteroskedasticity Function Must Be Estimated: Feasible GLS 259; 8.4c What If the Assumed Heteroskedasticity Function Is Wrong? 262; 8.4d Prediction and Prediction Intervals with Heteroskedasticity 264
8.5 The Linear Probability Model Revisited 265
Summary 267; Key Terms 268; Problems 268; Computer Exercises 270

Chapter 9 More on Specification and Data Issues 274
9.1 Functional Form Misspecification 275: 9.1a RESET as a General Test for Functional Form Misspecification 277; 9.1b Tests against Nonnested Alternatives 278
9.2 Using Proxy Variables for Unobserved Explanatory Variables 279: 9.2a Using Lagged Dependent Variables as Proxy Variables 283; 9.2b A Different Slant on Multiple Regression 284
9.3 Models with Random Slopes 285
9.4 Properties of OLS under Measurement Error 287: 9.4a Measurement Error in the Dependent Variable 287; 9.4b Measurement Error in an Explanatory Variable 289
9.5 Missing Data, Nonrandom Samples, and Outlying Observations 293: 9.5a Missing Data 293; 9.5b Nonrandom Samples 294; 9.5c Outliers and Influential Observations 296
9.6 Least Absolute Deviations Estimation 300
Summary 302; Key Terms 303; Problems 303; Computer Exercises 307

Part 2 Regression Analysis with Time Series Data 311

Chapter 10 Basic Regression Analysis with Time Series Data 312
10.1 The Nature of Time Series Data 312
10.2 Examples of Time Series Regression Models 313: 10.2a Static Models 314; 10.2b Finite Distributed Lag Models 314; 10.2c A Convention about the Time Index 316
10.3 Finite Sample Properties of OLS under Classical Assumptions 317: 10.3a Unbiasedness of OLS 317; 10.3b The Variances of the OLS Estimators and the Gauss-Markov Theorem 320; 10.3c Inference under the Classical Linear Model Assumptions 322
10.4 Functional Form, Dummy Variables, and Index Numbers 323
10.5 Trends and Seasonality 329: 10.5a Characterizing Trending Time Series 329; 10.5b Using Trending Variables in Regression Analysis 332; 10.5c A Detrending Interpretation of Regressions with a Time Trend 334; 10.5d Computing R-Squared When the Dependent Variable Is Trending 335; 10.5e Seasonality 336
Summary 338; Key Terms 339; Problems 339; Computer Exercises 341

Chapter 11 Further Issues in Using OLS with Time Series Data 344
11.1 Stationary and Weakly Dependent Time Series 345: 11.1a Stationary and Nonstationary Time Series 345; 11.1b Weakly Dependent Time Series 346
11.2 Asymptotic Properties of OLS 348
11.3 Using Highly Persistent Time Series in Regression Analysis 354: 11.3a Highly Persistent Time Series 354; 11.3b Transformations on Highly Persistent Time Series 358; 11.3c Deciding Whether a Time Series Is I(1) 359
11.4 Dynamically Complete Models and the Absence of Serial Correlation 360
11.5 The Homoskedasticity Assumption for Time Series Models 363
Summary 364; Key Terms 365; Problems 365; Computer Exercises 368

Chapter 12 Serial Correlation and Heteroskedasticity in Time Series Regressions 372
12.1 Properties of OLS with Serially Correlated Errors 373: 12.1a Unbiasedness and Consistency 373; 12.1b Efficiency and Inference 373; 12.1c Goodness of Fit 374; 12.1d Serial Correlation in the Presence of Lagged Dependent Variables 374
12.2 Testing for Serial Correlation 376: 12.2a A t Test for AR(1) Serial Correlation with Strictly Exogenous Regressors 376; 12.2b The Durbin-Watson Test under Classical Assumptions 378; 12.2c Testing for AR(1) Serial Correlation without Strictly Exogenous Regressors 379; 12.2d Testing for Higher Order Serial Correlation 380
12.3 Correcting for Serial Correlation with Strictly Exogenous Regressors 381: 12.3a Obtaining the Best Linear Unbiased Estimator in the AR(1) Model 382; 12.3b Feasible GLS Estimation with AR(1) Errors 383; 12.3c Comparing OLS and FGLS 385; 12.3d Correcting for Higher Order Serial Correlation 386
12.4 Differencing and Serial Correlation 387
12.5 Serial Correlation-Robust Inference after OLS 388
12.6 Heteroskedasticity in Time Series Regressions 391: 12.6a Heteroskedasticity-Robust Statistics 392; 12.6b Testing for Heteroskedasticity 392; 12.6c Autoregressive Conditional Heteroskedasticity 393; 12.6d Heteroskedasticity and Serial Correlation in Regression Models 395
Summary 396; Key Terms 396; Problems 396; Computer Exercises 397

Part 3 Advanced Topics 401

Chapter 13 Pooling Cross Sections across Time: Simple Panel Data Methods 402
13.1 Pooling Independent Cross Sections across Time 403: 13.1a The Chow Test for Structural Change across Time 407
13.2 Policy Analysis with Pooled Cross Sections 407
13.3 Two-Period Panel Data Analysis 412: 13.3a Organizing Panel Data 417
13.4 Policy Analysis with Two-Period Panel Data 417
13.5 Differencing with More Than Two Time Periods 420: 13.5a Potential Pitfalls in First Differencing Panel Data 424
Summary 424; Key Terms 425; Problems 425; Computer Exercises 426; Appendix 13A 432

Chapter 14 Advanced Panel Data Methods 434
14.1 Fixed Effects Estimation 435: 14.1a The Dummy Variable Regression 438; 14.1b Fixed Effects or First Differencing? 439; 14.1c Fixed Effects with Unbalanced Panels 440
14.2 Random Effects Models 441: 14.2a Random Effects or Fixed Effects? 444
14.3 The Correlated Random Effects Approach 445: 14.3a Unbalanced Panels 447
14.4 Applying Panel Data Methods to Other Data Structures 448
Summary 450; Key Terms 451; Problems 451; Computer Exercises 453; Appendix 14A 457

Chapter 15 Instrumental Variables Estimation and Two Stage Least Squares 461
15.1 Motivation: Omitted Variables in a Simple Regression Model 462: 15.1a Statistical Inference with the IV Estimator 466; 15.1b Properties of IV with a Poor Instrumental Variable 469; 15.1c Computing R-Squared after IV Estimation 471
15.2 IV Estimation of the Multiple Regression Model 471
15.3 Two Stage Least Squares 475: 15.3a A Single Endogenous Explanatory Variable 475; 15.3b Multicollinearity and 2SLS 477; 15.3c Detecting Weak Instruments 478; 15.3d Multiple Endogenous Explanatory Variables 478; 15.3e Testing Multiple Hypotheses after 2SLS Estimation 479
15.4 IV Solutions to Errors-in-Variables Problems 479
15.5 Testing for Endogeneity and Testing Overidentifying Restrictions 481: 15.5a Testing for Endogeneity 481; 15.5b Testing Overidentification Restrictions 482
15.6 2SLS with Heteroskedasticity 484
15.7 Applying 2SLS to Time Series Equations 485
15.8 Applying 2SLS to Pooled Cross Sections and Panel Data 487
Summary 488; Key Terms 489; Problems 489; Computer Exercises 492; Appendix 15A 496

Chapter 16 Simultaneous Equations Models 499
16.1 The Nature of Simultaneous Equations Models 500
16.2 Simultaneity Bias in OLS 503
16.3 Identifying and Estimating a Structural Equation 504: 16.3a Identification in a Two-Equation System 505; 16.3b Estimation by 2SLS 508
16.4 Systems with More Than Two Equations 510: 16.4a Identification in Systems with Three or More Equations 510; 16.4b Estimation 511
16.5 Simultaneous Equations Models with Time Series 511
16.6 Simultaneous Equations Models with Panel Data 514
Summary 516; Key Terms 517; Problems 517; Computer Exercises 519

Chapter 17 Limited Dependent Variable Models and Sample Selection Corrections 524
17.1 Logit and Probit Models for Binary Response 525: 17.1a Specifying Logit and Probit Models 525; 17.1b Maximum Likelihood Estimation of Logit and Probit Models 528; 17.1c Testing Multiple Hypotheses 529; 17.1d Interpreting the Logit and Probit Estimates 530
17.2 The Tobit Model for Corner Solution Responses 536: 17.2a Interpreting the Tobit Estimates 537; 17.2b Specification Issues in Tobit Models 543
17.3 The Poisson Regression Model 543
17.4 Censored and Truncated Regression Models 547: 17.4a Censored Regression Models 548; 17.4b Truncated Regression Models 551
17.5 Sample Selection Corrections 553: 17.5a When Is OLS on the Selected Sample Consistent? 553; 17.5b Incidental Truncation 554
Summary 558; Key Terms 558; Problems 559; Computer Exercises 560; Appendix 17A 565; Appendix 17B 566

Chapter 18 Advanced Time Series Topics 568
18.1 Infinite Distributed Lag Models 569: 18.1a The Geometric (or Koyck) Distributed Lag 571; 18.1b Rational Distributed Lag Models 572
18.2 Testing for Unit Roots 574
18.3 Spurious Regression 578
18.4 Cointegration and Error Correction Models 580: 18.4a Cointegration 580; 18.4b Error Correction Models 584
18.5 Forecasting 586: 18.5a Types of Regression Models Used for Forecasting 587; 18.5b One-Step-Ahead Forecasting 588; 18.5c Comparing One-Step-Ahead Forecasts 591; 18.5d Multiple-Step-Ahead Forecasts 592; 18.5e Forecasting Trending, Seasonal, and Integrated Processes 594
Summary 598; Key Terms 599; Problems 600; Computer Exercises 601

Chapter 19 Carrying Out an Empirical Project 605
19.1 Posing a Question 605
19.2 Literature Review 607
19.3 Data Collection 608: 19.3a Deciding on the Appropriate Data Set 608; 19.3b Entering and Storing Your Data 609; 19.3c Inspecting, Cleaning, and Summarizing Your Data 610
19.4 Econometric Analysis 611
19.5 Writing an Empirical Paper 614: 19.5a Introduction 614; 19.5b Conceptual (or Theoretical) Framework 615; 19.5c Econometric Models and Estimation Methods 615; 19.5d The Data 617; 19.5e Results 618; 19.5f Conclusions 618; 19.5g Style Hints 619
Summary 621; Key Terms 621; Sample Empirical Projects 621; List of Journals 626; Data Sources 627

Appendix A Basic Mathematical Tools 628
A.1 The Summation Operator and Descriptive Statistics 628
A.2 Properties of Linear Functions 630
A.3 Proportions and Percentages 633
A.4 Some Special Functions and Their Properties 634: A.4a Quadratic Functions 634; A.4b The Natural Logarithm 636; A.4c The Exponential Function 639
A.5 Differential Calculus 640
Summary 642; Key Terms 642; Problems 643

Appendix B Fundamentals of Probability 645
B.1 Random Variables and Their Probability Distributions 645: B.1a Discrete Random Variables 646; B.1b Continuous Random Variables 648
B.2 Joint Distributions, Conditional Distributions, and Independence 649: B.2a Joint Distributions and Independence 649; B.2b Conditional Distributions 651
B.3 Features of Probability Distributions 652: B.3a A Measure of Central Tendency: The Expected Value 652; B.3b Properties of Expected Values 653; B.3c Another Measure of Central Tendency: The Median 655; B.3d Measures of Variability: Variance and Standard Deviation 656; B.3e Variance 656; B.3f Standard Deviation 657; B.3g Standardizing a Random Variable 657; B.3h Skewness and Kurtosis 658
B.4 Features of Joint and Conditional Distributions 658: B.4a Measures of Association: Covariance and Correlation 658; B.4b Covariance 658; B.4c Correlation Coefficient 659; B.4d Variance of Sums of Random Variables 660; B.4e Conditional Expectation 661; B.4f Properties of Conditional Expectation 663; B.4g Conditional Variance 665
B.5 The Normal and Related Distributions 665: B.5a The Normal Distribution 665; B.5b The Standard Normal Distribution 666; B.5c Additional Properties of the Normal Distribution 668; B.5d The Chi-Square Distribution 669; B.5e The t Distribution 669; B.5f The F Distribution 670
Summary 672; Key Terms 672; Problems 672

Appendix C Fundamentals of Mathematical Statistics 674
C.1 Populations, Parameters, and Random Sampling 674: C.1a Sampling 674
C.2 Finite Sample Properties of Estimators 675: C.2a Estimators and Estimates 675; C.2b Unbiasedness 676; C.2d The Sampling Variance of Estimators 678; C.2e Efficiency 679
C.3 Asymptotic or Large Sample Properties of Estimators 681: C.3a Consistency 681; C.3b Asymptotic Normality 683
C.4 General Approaches to Parameter Estimation 684: C.4a Method of Moments 685; C.4b Maximum Likelihood 685; C.4c Least Squares 686
C.5 Interval Estimation and Confidence Intervals 687: C.5a The Nature of Interval Estimation 687; C.5b Confidence Intervals for the Mean from a Normally Distributed Population 689; C.5c A Simple Rule of Thumb for a 95% Confidence Interval 691; C.5d Asymptotic Confidence Intervals for Nonnormal Populations 692
C.6 Hypothesis Testing 693: C.6a Fundamentals of Hypothesis Testing 693; C.6b Testing Hypotheses about the Mean in a Normal Population 695; C.6c Asymptotic Tests for Nonnormal Populations 698; C.6d Computing and Using p-Values 698; C.6e The Relationship between Confidence Intervals and Hypothesis Testing 701; C.6f Practical versus Statistical Significance 702
C.7 Remarks on Notation 703
Summary 703; Key Terms 704; Problems 704

Appendix D Summary of Matrix Algebra 709
D.1 Basic Definitions 709
D.2 Matrix Operations 710: D.2a Matrix Addition 710; D.2b Scalar Multiplication 710; D.2c Matrix Multiplication 711; D.2d Transpose 712; D.2e Partitioned Matrix Multiplication 712; D.2f Trace 713; D.2g Inverse 713
D.3 Linear Independence and Rank of a Matrix 714
D.4 Quadratic Forms and Positive Definite Matrices 714
D.5 Idempotent Matrices 715
D.6 Differentiation of Linear and Quadratic Forms 715
D.7 Moments and Distributions of Random Vectors 716: D.7a Expected Value 716; D.7b Variance-Covariance Matrix 716; D.7c Multivariate Normal Distribution 716; D.7d Chi-Square Distribution 717; D.7e t Distribution 717; D.7f F Distribution 717
Summary 717; Key Terms 717; Problems 718

Appendix E The Linear Regression Model in Matrix Form 720
Appendix E The Linear Regression Model in Matrix Form
E.1 The Model and Ordinary Least Squares Estimation
E.1a The Frisch-Waugh Theorem
E.2 Finite Sample Properties of OLS
E.3 Statistical Inference
E.4 Some Asymptotic Analysis
E.4a Wald Statistics for Testing Multiple Hypotheses
Summary / Key Terms / Problems

Appendix F Answers to Chapter Questions
Appendix G Statistical Tables
References
Glossary
Index

Preface

My motivation for writing the first edition of Introductory Econometrics: A Modern Approach was that I saw a fairly wide gap between how econometrics is taught to undergraduates and how empirical researchers think about and apply econometric methods. I became convinced that teaching introductory econometrics from the perspective of professional users of econometrics would actually simplify the presentation, in addition to making the subject much more interesting. Based on the positive reactions to earlier editions, it appears that my hunch was correct. Many instructors, having a variety of backgrounds and interests and teaching students with different levels of preparation, have embraced the modern approach to econometrics espoused in this text.

The emphasis in this edition is still on applying econometrics to real-world problems. Each econometric method is motivated by a particular issue facing researchers analyzing nonexperimental data. The focus in the main text is on understanding and interpreting the assumptions in light of actual empirical applications; the mathematics required is no more than college algebra and basic probability and statistics.

Organized for Today's Econometrics Instructor

The sixth edition preserves the overall organization of the fifth. The most noticeable feature that distinguishes this text from most others is the separation of topics by the kind of data being analyzed. This is a clear departure from the traditional approach, which presents a linear model, lists all assumptions that may be needed at some future point in the analysis, and then proves or asserts results without clearly connecting them to the assumptions. My approach is first to treat, in Part 1, multiple regression analysis with cross-sectional data, under the assumption of random sampling. This setting is natural to students because they are familiar with random sampling from a population in their introductory statistics courses. Importantly, it allows us to distinguish assumptions made about the underlying population regression model (assumptions that can be given economic or behavioral content) from assumptions about how the data were sampled. Discussions about the consequences of nonrandom sampling can be treated in an intuitive fashion after the students have a good grasp of the multiple regression model estimated using random samples.

An important feature of a modern approach is that the explanatory variables, along with the dependent variable, are treated as outcomes of random variables. For the social sciences, allowing random explanatory variables is much more realistic than the traditional assumption of nonrandom explanatory variables.
As a nontrivial benefit, the population model/random sampling approach reduces the number of assumptions that students must absorb and understand. Ironically, the classical approach to regression analysis, which treats the explanatory variables as fixed in repeated samples and is still pervasive in introductory texts, literally applies to data collected in an experimental setting. In addition, the contortions required to state and explain assumptions can be confusing to students.

My focus on the population model emphasizes that the fundamental assumptions underlying regression analysis, such as the zero mean assumption on the unobservable error term, are properly stated conditional on the explanatory variables. This leads to a clear understanding of the kinds of problems, such as heteroskedasticity (nonconstant variance), that can invalidate standard inference procedures. By focusing on the population, I am also able to dispel several misconceptions that arise in econometrics texts at all levels. For example, I explain why the usual R-squared is still valid as a goodness-of-fit measure in the presence of heteroskedasticity (Chapter 8) or serially correlated errors (Chapter 12); I provide a simple demonstration that tests for functional form should not be viewed as general tests of omitted variables (Chapter 9); and I explain why one should always include in a regression model extra control variables that are uncorrelated with the explanatory variable of interest, which is often a key policy variable (Chapter 6).

Because the assumptions for cross-sectional analysis are relatively straightforward yet realistic, students can get involved early with serious cross-sectional applications without having to worry about the thorny issues of trends, seasonality, serial correlation, high persistence, and spurious regression that are ubiquitous in time series regression models. Initially, I figured that my treatment of regression with cross-sectional data followed by regression with time series data would find favor with instructors whose own research interests are in applied microeconomics, and that appears to be the case. It has been gratifying that adopters of the text with an applied time series bent have been equally enthusiastic about the structure of the text. By postponing the econometric analysis of time series data, I am able to put proper focus on the potential pitfalls in analyzing time series data that do not arise with cross-sectional data. In effect, time series econometrics finally gets the serious treatment it deserves in an introductory text.

As in the earlier editions, I have consciously chosen topics that are important for reading journal articles and for conducting basic empirical research. Within each topic, I have deliberately omitted many tests and estimation procedures that, while traditionally included in textbooks, have not withstood the empirical test of time. Likewise, I have emphasized more recent topics that have clearly demonstrated their usefulness, such as obtaining test statistics that are robust to heteroskedasticity or serial correlation of unknown form,
using multiple years of data for policy analysis, or solving the omitted variable problem by instrumental variables methods. I appear to have made fairly good choices, as I have received only a handful of suggestions for adding or deleting material.

I take a systematic approach throughout the text, by which I mean that each topic is presented by building on the previous material in a logical fashion, and assumptions are introduced only as they are needed to obtain a conclusion. For example, empirical researchers who use econometrics in their research understand that not all of the Gauss-Markov assumptions are needed to show that the ordinary least squares (OLS) estimators are unbiased. Yet the vast majority of econometrics texts introduce a complete set of assumptions (many of which are redundant or, in some cases, even logically conflicting) before proving the unbiasedness of OLS. Similarly, the normality assumption is often included among the assumptions that are needed for the Gauss-Markov Theorem, even though it is fairly well known that normality plays no role in showing that the OLS estimators are the best linear unbiased estimators.

My systematic approach is illustrated by the order of assumptions that I use for multiple regression in Part 1. This structure results in a natural progression for briefly summarizing the role of each assumption:

MLR.1: Introduce the population model and interpret the population parameters (which we hope to estimate).
MLR.2: Introduce random sampling from the population and describe the data that we use to estimate the population parameters.
MLR.3: Add the assumption on the explanatory variables that allows us to compute the estimates from our sample; this is the so-called no perfect collinearity assumption.
MLR.4: Assume that, in the population, the mean of the unobservable error does not depend on the values of the explanatory variables; this is the mean independence assumption combined with a zero population mean for the error, and it is the key assumption that delivers unbiasedness of OLS.
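As a concrete reference (a standard statement consistent with the notation used in the text, not a quotation of it), the population model named in MLR.1 and the error assumption in MLR.4 can be written as

    y = β0 + β1x1 + β2x2 + … + βkxk + u,    E(u | x1, x2, …, xk) = E(u) = 0.

Here the βj are the population parameters of MLR.1, and the conditional-mean restriction is the formal content of MLR.4.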
After introducing Assumptions MLR.1 to MLR.3, one can discuss the algebraic properties of ordinary least squares, that is, the properties of OLS for a particular set of data. By adding Assumption MLR.4, we can show that OLS is unbiased and consistent. Assumption MLR.5 (homoskedasticity) is added for the Gauss-Markov Theorem and for the usual OLS variance formulas to be valid. Assumption MLR.6 (normality), which is not introduced until Chapter 4, is added to round out the classical linear model assumptions. The six assumptions are used to obtain exact statistical inference and to conclude that the OLS estimators have the smallest variances among all unbiased estimators.

I use parallel approaches when I turn to the study of large-sample properties and when I treat regression for time series data in Part 2. The careful presentation and discussion of assumptions makes it relatively easy to transition to Part 3, which covers advanced topics that include using pooled cross-sectional data, exploiting panel data structures, and applying instrumental variables methods. Generally, I have strived to provide a unified view of econometrics, where all estimators and test statistics are obtained using just a few intuitively reasonable principles of estimation and testing (which, of course, also have rigorous justification). For example, regression-based tests for heteroskedasticity and serial correlation are easy for students to grasp because they already have a solid understanding of regression. This is in contrast to treatments that give a set of disjointed recipes for outdated econometric testing procedures.

Throughout the text, I emphasize ceteris paribus relationships, which is why, after one chapter on the simple regression model, I move to multiple regression analysis. The multiple regression setting motivates students to think about serious applications early. I also give prominence to policy analysis with all kinds of data structures. Practical topics, such as using proxy variables to obtain ceteris paribus effects and interpreting partial effects in models with interaction terms, are covered in a simple fashion.

New to This Edition

I have added new exercises to almost every chapter, including the appendices. Most of the new computer exercises use new data sets, including a data set on student performance and attending a Catholic high school, and a time series data set on presidential approval ratings and gasoline prices. I have also added some harder problems that require derivations.

There are several changes to the text worth noting. Chapter 2 contains a more extensive discussion about the relationship between the simple regression coefficient and the correlation coefficient. Chapter 3 clarifies issues with comparing R-squareds from models when data are missing on some variables, thereby reducing sample sizes available for regressions with more explanatory variables.

Chapter 6 introduces the notion of an average partial effect (APE) for models linear in the parameters but including nonlinear functions, primarily quadratics and interaction terms. The notion of an APE, which was implicit in previous editions, has become an important concept in empirical work; understanding how to compute and interpret APEs in the context of OLS is a valuable skill. For more advanced classes, the introduction in Chapter 6 eases the way to the discussion of APEs in the nonlinear models studied in Chapter 17, which also includes an expanded discussion of APEs (including now showing APEs in tables alongside coefficients in logit, probit, and Tobit applications).

In Chapter 8, I refine some of the discussion involving the issue of heteroskedasticity, including an expanded discussion of Chow tests and a more precise description of weighted least squares when the weights must be estimated. Chapter 9, which contains some optional, slightly more advanced topics, defines terms that appear often in the large literature on missing data. A common practice in empirical work is to create indicator variables for missing data and to include them in a multiple regression analysis. Chapter 9 discusses how this method can be implemented and when it will produce unbiased and consistent estimators.
The treatment of unobserved effects panel data models in Chapter 14 has been expanded to include more of a discussion of unbalanced panel data sets, including how the fixed effects, random effects, and correlated random effects approaches still can be applied. Another important addition is a much more detailed discussion on applying fixed effects and random effects methods to cluster samples. I also include discussion of some subtle issues that can arise in using clustered standard errors when the data have been obtained from a random sampling scheme. Chapter 15 now has a more detailed discussion of the problem of weak instrumental variables, so that students can access the basics without having to track down more advanced sources.

Targeted at Undergraduates, Adaptable for Master's Students

The text is designed for undergraduate economics majors who have taken college algebra and one semester of introductory probability and statistics. (Appendices A, B, and C contain the requisite background material.) A one-semester or one-quarter econometrics course would not be expected to cover all, or even any, of the more advanced material in Part 3. A typical introductory course includes Chapters 1 through 8, which cover the basics of simple and multiple regression for cross-sectional data. Provided the emphasis is on intuition and interpreting the empirical examples, the material from the first eight chapters should be accessible to undergraduates in most economics departments. Most instructors will also want to cover at least parts of the chapters on regression analysis with time series data, Chapters 10 and 12, in varying degrees of depth. In the one-semester course that I teach at Michigan State, I cover Chapter 10 fairly carefully, give an overview of the material in Chapter 11, and cover the material on serial correlation in Chapter 12. I find that this basic one-semester course puts students on a solid footing to write empirical papers, such as a term paper, a senior seminar paper, or a senior thesis. Chapter 9 contains more specialized topics that arise in analyzing cross-sectional data, including data problems such as outliers and nonrandom sampling; for a one-semester course, it can be skipped without loss of continuity.

The structure of the text makes it ideal for a course with a cross-sectional or policy analysis focus: the time series chapters can be skipped in lieu of topics from Chapters 9 or 15. Chapter 13 is advanced only in the sense that it treats two new data structures: independently pooled cross sections and two-period panel data analysis. Such data structures are especially useful for policy analysis, and the chapter provides several examples. Students with a good grasp of Chapters 1 through 8 will have little difficulty with Chapter 13. Chapter 14 covers more advanced panel data methods and would probably be covered only in a second course. A good way to end a course on cross-sectional methods is to cover the rudiments of instrumental variables estimation in Chapter 15.

I have used selected material in Part 3, including Chapters 13 and 17, in a senior seminar geared to producing a serious research paper. Along with the basic one-semester course, students who have been exposed to basic panel data analysis, instrumental variables estimation, and limited dependent variable models are in a position to read large segments of the applied social sciences literature. Chapter 17 provides an introduction to the most common limited dependent variable models.

The text is also well suited for an introductory master's level course where the emphasis is on applications rather than on derivations using matrix algebra.
Several instructors have used the text to teach policy analysis at the master's level. For instructors wanting to present the material in matrix form, Appendices D and E are self-contained treatments of the matrix algebra and the multiple regression model in matrix form.

At Michigan State, PhD students in many fields that require data analysis (including accounting, agricultural economics, development economics, economics of education, finance, international economics, labor economics, macroeconomics, political science, and public finance) have found the text to be a useful bridge between the empirical work that they read and the more theoretical econometrics they learn at the PhD level.

Design Features

Numerous in-text questions are scattered throughout, with answers supplied in Appendix F. These questions are intended to provide students with immediate feedback. Each chapter contains many numbered examples. Several of these are case studies drawn from recently published papers, where I have used my judgment to simplify the analysis, hopefully without sacrificing the main point.

The end-of-chapter problems and computer exercises are heavily oriented toward empirical work, rather than complicated derivations. The students are asked to reason carefully based on what they have learned. The computer exercises often expand on the in-text examples. Several exercises use data sets from published works, or similar data sets that are motivated by published research in economics and other fields.

A pioneering feature of this introductory econometrics text is the extensive glossary. The short definitions and descriptions are a helpful refresher for students studying for exams or reading empirical research that uses econometric methods. I have added and updated several entries for this edition.

Data Sets Available in Six Formats

This edition adds R data sets as an additional format for viewing and analyzing data. In response to popular demand, this edition also provides the Minitab format. With more than 100 data sets in six different formats, including Stata, EViews, Minitab, Microsoft Excel, and R, the instructor has many options for problem sets, examples, and term projects. Because most of the data sets come from actual research, some are very large. Except for partial lists of data sets to illustrate the various data structures, the data sets are not reported in the text. This book is geared to a course where computer work plays an integral role.

Updated Data Sets Handbook

An extensive data description manual is also available online. This manual contains a list of data sources, along with suggestions for ways to use the data sets that are not described in the text. This unique handbook, created by author Jeffrey M. Wooldridge, lists the source of all data sets for quick reference, along with how each might be used. Because the data book contains page numbers, it is easy to see how the author used the data in the text. Students may want to view the descriptions of each data set, and it can help guide instructors in generating new homework exercises, exam problems, or term projects.
The author also provides suggestions on improving the data sets in this detailed resource, which is available on the book's companion website at http://login.cengage.com; students can access it free at www.cengagebrain.com.

Instructor Supplements

Instructor's Manual with Solutions

The Instructor's Manual with Solutions contains answers to all problems and exercises, as well as teaching tips on how to present the material in each chapter. The Instructor's Manual also contains sources for each of the data files, with many suggestions for how to use them on problem sets, exams, and term papers. This supplement is available online only to instructors, at http://login.cengage.com.

PowerPoint Slides

Exceptional PowerPoint presentation slides help you create engaging, memorable lectures. You will find teaching slides for each chapter in this edition, including the advanced chapters in Part 3. You can modify or customize the slides for your specific course. PowerPoint slides are available for convenient download on the instructor-only, password-protected portion of the book's companion website at http://login.cengage.com.

Scientific Word Slides

Developed by the author, Scientific Word slides offer an alternative format for instructors who prefer the Scientific Word platform, the word processor created by MacKichan Software, Inc. for composing mathematical and technical documents using LaTeX typesetting. These slides are based on the author's actual lectures and are available in PDF and TeX formats for convenient download on the instructor-only, password-protected section of the book's companion website at http://login.cengage.com.

Test Bank

Cengage Learning Testing, powered by Cognero, is a flexible online system that allows you to import, edit, and manipulate content from the text's test bank or elsewhere. You have the flexibility to include your own favorite test questions, create multiple test versions in an instant, and deliver tests from your LMS, your classroom, or wherever you want. In the test bank for INTRODUCTORY ECONOMETRICS, 6E, you will find a wealth and variety of problems, ranging from multiple-choice, to questions that require simple statistical derivations, to questions that require interpreting computer output.

Student Supplements

MindTap

MindTap for INTRODUCTORY ECONOMETRICS, 6E provides you with the tools you need to better manage your limited time: you can complete assignments whenever and wherever you are ready to learn, with course material specially customized by your instructor and streamlined in one proven, easy-to-use interface. With an array of tools and apps, from note taking to flashcards, you will get a true understanding of course concepts, helping you to achieve better grades and setting the groundwork for your future courses.

Aplia

Millions of students use Aplia to better prepare for class and for their exams. Aplia assignments mean no surprises, with an at-a-glance view of current assignments organized by due date. You always know what's due and when. Aplia ties your lessons into real-world applications, so you get a bigger, better picture of how you'll use your education in your future workplace.
Automatic grading and immediate feedback help you master content the right way the first time.

Student Solutions Manual

Now you can maximize your study time and further your course success with this dynamic online resource. This helpful Solutions Manual includes detailed steps and solutions to odd-numbered problems, as well as computer exercises in the text. This supplement is available as a free resource at www.cengagebrain.com.

Suggestions for Designing Your Course

I have already commented on the contents of most of the chapters, as well as possible outlines for courses. Here I provide more specific comments about material in chapters that might be covered or skipped.

Chapter 9 has some interesting examples, such as a wage regression that includes IQ score as an explanatory variable. The rubric of proxy variables does not have to be formally introduced to present these kinds of examples, and I typically do so when finishing up cross-sectional analysis. In Chapter 12, for a one-semester course, I skip the material on serial correlation robust inference for ordinary least squares, as well as dynamic models of heteroskedasticity.

Even in a second course, I tend to spend only a little time on Chapter 16, which covers simultaneous equations analysis. I have found that instructors differ widely in their opinions on the importance of teaching simultaneous equations models to undergraduates. Some think this material is fundamental; others think it is rarely applicable. My own view is that simultaneous equations models are overused (see Chapter 16 for a discussion). If one reads applications carefully, omitted variables and measurement error are much more likely to be the reason one adopts instrumental variables estimation, and this is why I use omitted variables to motivate instrumental variables estimation in Chapter 15. Still, simultaneous equations models are indispensable for estimating demand and supply functions, and they apply in some other important cases as well.

Chapter 17 is the only chapter that considers models inherently nonlinear in their parameters, and this puts an extra burden on the student. The first material one should cover in this chapter is on probit and logit models for binary response. My presentation of Tobit models and censored regression still appears to be novel in introductory texts. I explicitly recognize that the Tobit model is applied to corner solution outcomes on random samples, while censored regression is applied when the data collection process censors the dependent variable at essentially arbitrary thresholds.

Chapter 18 covers some recent important topics from time series econometrics, including testing for unit roots and cointegration. I cover this material only in a second-semester course at either the undergraduate or master's level. A fairly detailed introduction to forecasting is also included in Chapter 18.

Chapter 19, which would be added to the syllabus for a course that requires a term paper, is much more extensive than similar chapters in other texts. It summarizes some of the methods appropriate for various kinds of problems and data structures, points out potential pitfalls, explains in some detail how to write a term paper in empirical economics, and includes suggestions for possible projects.
Acknowledgments

I would like to thank those who reviewed and provided helpful comments for this and previous editions of the text: Erica Johnson, Gonzaga University; Mary Ellen Benedict, Bowling Green State University; Yan Li, Temple University; Melissa Tartari, Yale University; Michael Allgrunn, University of South Dakota; Gregory Colman, Pace University; Yoo-Mi Chin, Missouri University of Science and Technology; Arsen Melkumian, Western Illinois University; Kevin J. Murphy, Oakland University; Kristine Grimsrud, University of New Mexico; Will Melick, Kenyon College; Philip H. Brown, Colby College; Argun Saatcioglu, University of Kansas; Ken Brown, University of Northern Iowa; Michael R. Jonas, University of San Francisco; Melissa Yeoh, Berry College; Nikolaos Papanikolaou, SUNY at New Paltz; Konstantin Golyaev, University of Minnesota; Soren Hauge, Ripon College; Kevin Williams, University of Minnesota; Hailong Qian, Saint Louis University; Rod Hissong, University of Texas at Arlington; Steven Cuellar, Sonoma State University; Yanan Di, Wagner College; John Fitzgerald, Bowdoin College; Philip N. Jefferson, Swarthmore College; Yongsheng Wang, Washington and Jefferson College; Sheng-Kai Chang, National Taiwan University; Damayanti Ghosh, Binghamton University; Susan Averett, Lafayette College; Kevin J. Mumford, Purdue University; Nicolai V. Kuminoff, Arizona State University; Subarna K. Samanta, The College of New Jersey; Jing Li, South Dakota State University; Gary Wagner, University of Arkansas-Little Rock; Kelly Cobourn, Boise State University; Timothy Dittmer, Central Washington University; Daniel Fischmar, Westminster College; Subha Mani, Fordham University; John Maluccio, Middlebury College; James Warner, College of Wooster; Christopher Magee, Bucknell University; Andrew Ewing, Eckerd College; Debra Israel, Indiana State University; Jay Goodliffe, Brigham Young University; Stanley R. Thompson, The Ohio State University; Michael Robinson, Mount Holyoke College; Ivan Jeliazkov, University of California, Irvine; Heather O'Neill, Ursinus College; Leslie Papke, Michigan State University; Timothy Vogelsang, Michigan State University; and Stephen Woodbury, Michigan State University.

Some of the changes I discussed earlier were driven by comments I received from people on this list, and I continue to mull over other specific suggestions made by one or more reviewers. Many students and teaching assistants, too numerous to list, have caught mistakes in earlier editions or have suggested rewording some paragraphs. I am grateful to them.
As always, it was a pleasure working with the team at Cengage Learning. Mike Worls, my longtime Product Director, has learned very well how to guide me with a firm yet gentle hand. Chris Rader has quickly mastered the difficult challenges of being the developmental editor of a dense, technical textbook. His careful reading of the manuscript and fine eye for detail have improved this sixth edition considerably.

This book is dedicated to my wife, Leslie Papke, who contributed materially to this edition by writing the initial versions of the Scientific Word slides for the chapters in Part 3; she then used the slides in her public policy course. Our children have contributed, too. Edmund has helped me keep the data handbook current, and Gwenyth keeps us entertained with her artistic talents.

Jeffrey M. Wooldridge

About the Author

Jeffrey M. Wooldridge is University Distinguished Professor of Economics at Michigan State University, where he has taught since 1991. From 1986 to 1991, he was an assistant professor of economics at the Massachusetts Institute of Technology. He received his bachelor of arts, with majors in computer science and economics, from the University of California, Berkeley, in 1982, and received his doctorate in economics in 1986 from the University of California, San Diego. He has published more than 60 articles in internationally recognized journals, as well as several book chapters. He is also the author of Econometric Analysis of Cross Section and Panel Data, second edition. His awards include an Alfred P. Sloan Research Fellowship, the Plura Scripsit award from Econometric Theory, the Sir Richard Stone prize from the Journal of Applied Econometrics, and three graduate teacher-of-the-year awards from MIT. He is a fellow of the Econometric Society and of the Journal of Econometrics. He is past editor of the Journal of Business and Economic Statistics and past econometrics coeditor of Economics Letters. He has served on the editorial boards of Econometric Theory, the Journal of Economic Literature, the Journal of Econometrics, the Review of Economics and Statistics, and the Stata Journal. He has also acted as an occasional econometrics consultant for Arthur Andersen, Charles River Associates, the Washington State Institute for Public Policy, Stratus Consulting, and Industrial Economics, Incorporated.
Chapter 1. The Nature of Econometrics and Economic Data

Chapter 1 discusses the scope of econometrics and raises general issues that arise in the application of econometric methods. Section 1.1 provides a brief discussion about the purpose and scope of econometrics and how it fits into economic analysis. Section 1.2 provides examples of how one can start with an economic theory and build a model that can be estimated using data. Section 1.3 examines the kinds of data sets that are used in business, economics, and other social sciences. Section 1.4 provides an intuitive discussion of the difficulties associated with the inference of causality in the social sciences.

1.1 What Is Econometrics?

Imagine that you are hired by your state government to evaluate the effectiveness of a publicly funded job training program. Suppose this program teaches workers various ways to use computers in the manufacturing process. The 20-week program offers courses during nonworking hours. Any hourly manufacturing worker may participate, and enrollment in all or part of the program is voluntary. You are to determine what, if any, effect the training program has on each worker's subsequent hourly wage.

Now suppose you work for an investment bank. You are to study the returns on different investment strategies involving short-term U.S. treasury bills to decide whether they comply with implied economic theories.

The task of answering such questions may seem daunting at first. At this point, you may only have a vague idea of the kind of data you would need to collect. By the end of this introductory econometrics course, you should know how to use econometric methods to formally evaluate a job training program or to test a simple economic theory.

Econometrics is based upon the development of statistical methods for estimating economic relationships, testing economic theories, and evaluating and implementing government and business policy. The most common application of econometrics is the forecasting of such important macroeconomic variables as interest rates, inflation rates, and gross domestic product (GDP). Whereas forecasts of economic indicators are highly visible and often widely published, econometric methods can be used in economic areas that have nothing to do with macroeconomic forecasting. For example, we will study the effects of political campaign expenditures on voting outcomes. We will consider the effect of school spending on student performance in the field of education. In addition, we will learn how to use econometric methods for forecasting economic time series.

Econometrics has evolved as a separate discipline from mathematical statistics because the former focuses on the problems inherent in collecting and analyzing nonexperimental economic data. Nonexperimental data are not accumulated through controlled experiments on individuals, firms, or segments of the economy. (Nonexperimental data are sometimes called observational data, or retrospective data, to emphasize the fact that the researcher is a passive collector of the data.)
Experimental data are often collected in laboratory environments in the natural sciences, but they are much more difficult to obtain in the social sciences. Although some social experiments can be devised, it is often impossible, prohibitively expensive, or morally repugnant to conduct the kinds of controlled experiments that would be needed to address economic issues. We give some specific examples of the differences between experimental and nonexperimental data in Section 1.4.

Naturally, econometricians have borrowed from mathematical statisticians whenever possible. The method of multiple regression analysis is the mainstay in both fields, but its focus and interpretation can differ markedly. In addition, economists have devised new techniques to deal with the complexities of economic data and to test the predictions of economic theories.

1.2 Steps in Empirical Economic Analysis

Econometric methods are relevant in virtually every branch of applied economics. They come into play either when we have an economic theory to test or when we have a relationship in mind that has some importance for business decisions or policy analysis. An empirical analysis uses data to test a theory or to estimate a relationship.

How does one go about structuring an empirical economic analysis? It may seem obvious, but it is worth emphasizing that the first step in any empirical analysis is the careful formulation of the question of interest. The question might deal with testing a certain aspect of an economic theory, or it might pertain to testing the effects of a government policy. In principle, econometric methods can be used to answer a wide range of questions.

In some cases, especially those that involve the testing of economic theories, a formal economic model is constructed. An economic model consists of mathematical equations that describe various relationships. Economists are well known for their building of models to describe a vast array of behaviors. For example, in intermediate microeconomics, individual consumption decisions, subject to a budget constraint, are described by mathematical models. The basic premise underlying these models is utility maximization. The assumption that individuals make choices to maximize their well-being, subject to resource constraints, gives us a very powerful framework for creating tractable economic models and making clear predictions.

In the context of consumption decisions, utility maximization leads to a set of demand equations. In a demand equation, the quantity demanded of each commodity depends on the price of the goods, the price of substitute and complementary goods, the consumer's income, and the individual's characteristics that affect taste. These equations can form the basis of an econometric analysis of consumer demand.

Economists have used basic economic tools, such as the utility maximization framework, to explain behaviors that at first glance may appear to be noneconomic in nature. A classic example is Becker's (1968) economic model of criminal behavior.
Example 1.1: Economic Model of Crime

In a seminal article, Nobel Prize winner Gary Becker postulated a utility maximization framework to describe an individual's participation in crime. Certain crimes have clear economic rewards, but most criminal behaviors have costs. The opportunity costs of crime prevent the criminal from participating in other activities, such as legal employment. In addition, there are costs associated with the possibility of being caught and then, if convicted, the costs associated with incarceration. From Becker's perspective, the decision to undertake illegal activity is one of resource allocation, with the benefits and costs of competing activities taken into account.

Under general assumptions, we can derive an equation describing the amount of time spent in criminal activity as a function of various factors. We might represent such a function as

    y = f(x1, x2, x3, x4, x5, x6, x7),    (1.1)

where
y = hours spent in criminal activities,
x1 = wage for an hour spent in criminal activity,
x2 = hourly wage in legal employment,
x3 = income other than from crime or employment,
x4 = probability of getting caught,
x5 = probability of being convicted if caught,
x6 = expected sentence if convicted, and
x7 = age.

Other factors generally affect a person's decision to participate in crime, but the list above is representative of what might result from a formal economic analysis. As is common in economic theory, we have not been specific about the function f in (1.1). This function depends on an underlying utility function, which is rarely known. Nevertheless, we can use economic theory, or introspection, to predict the effect that each variable would have on criminal activity. This is the basis for an econometric analysis of individual criminal activity.

Formal economic modeling is sometimes the starting point for empirical analysis, but it is more common to use economic theory less formally, or even to rely entirely on intuition. You may agree that the determinants of criminal behavior appearing in equation (1.1) are reasonable based on common sense; we might arrive at such an equation directly, without starting from utility maximization. This view has some merit, although there are cases in which formal derivations provide insights that intuition can overlook.

Next is an example of an equation that we can derive through somewhat informal reasoning.

Example 1.2: Job Training and Worker Productivity

Consider the problem posed at the beginning of Section 1.1. A labor economist would like to examine the effects of job training on worker productivity. In this case, there is little need for formal economic theory. Basic economic understanding is sufficient for realizing that factors such as education, experience, and training affect worker productivity. Also, economists are well aware that workers are paid commensurate with their productivity. This simple reasoning leads to a model such as

    wage = f(educ, exper, training),    (1.2)

where wage is the hourly wage, educ is years of formal education, exper is years of workforce experience, and training is weeks spent in job training. Again, other factors generally affect the wage rate, but equation (1.2) captures the essence of the problem.
After we specify an economic model, we need to turn it into what we call an econometric model. Because we will deal with econometric models throughout this text, it is important to know how an econometric model relates to an economic model. Take equation (1.1) as an example. The form of the function f must be specified before we can undertake an econometric analysis. A second issue concerning (1.1) is how to deal with variables that cannot reasonably be observed. For example, consider the wage that a person can earn in criminal activity. In principle, such a quantity is well defined, but it would be difficult if not impossible to observe this wage for a given individual. Even variables such as the probability of being arrested cannot realistically be obtained for a given individual, but at least we can observe relevant arrest statistics and derive a variable that approximates the probability of arrest. Many other factors affect criminal behavior that we cannot even list, let alone observe, but we must somehow account for them.

The ambiguities inherent in the economic model of crime are resolved by specifying a particular econometric model:

    crime = β0 + β1wagem + β2othinc + β3freqarr + β4freqconv + β5avgsen + β6age + u,    (1.3)

where
crime = some measure of the frequency of criminal activity,
wagem = the wage that can be earned in legal employment,
othinc = the income from other sources (assets, inheritance, and so on),
freqarr = the frequency of arrests for prior infractions (to approximate the probability of arrest),
freqconv = the frequency of conviction, and
avgsen = the average sentence length after conviction.

The choice of these variables is determined by the economic theory, as well as by data considerations. The term u contains unobserved factors, such as the wage for criminal activity, moral character, family background, and errors in measuring things like criminal activity and the probability of arrest. We could add family background variables to the model, such as number of siblings, parents' education, and so on, but we can never eliminate u entirely. In fact, dealing with this error term, or disturbance term, is perhaps the most important component of any econometric analysis.

The constants β0, β1, …, β6 are the parameters of the econometric model, and they describe the directions and strengths of the relationship between crime and the factors used to determine crime in the model.

A complete econometric model for Example 1.2 might be

    wage = β0 + β1educ + β2exper + β3training + u,    (1.4)

where the term u contains factors such as innate ability, quality of education, family background, and the myriad other factors that can influence a person's wage. If we are specifically concerned about the effects of job training, then β3 is the parameter of interest.

For the most part, econometric analysis begins by specifying an econometric model, without consideration of the details of the model's creation. We generally follow this approach, largely because careful derivation of something like the economic model of crime is time consuming and can take us into some specialized and often difficult areas of economic theory. Economic reasoning will play a role in our examples, and we will merge any underlying economic theory into the econometric model specification. In the economic model of crime example, we would start with an econometric model such as (1.3) and use economic reasoning and common sense as guides for choosing the variables.
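As a concrete aside (not part of the text's development), here is a brief sketch of how a model with the form of equation (1.4) could be estimated by ordinary least squares in Python. The data below are simulated purely for illustration, and the variable names simply mirror equation (1.4); with a real data set such as WAGE1, one would load the file instead of generating numbers.

```python
# A minimal sketch of estimating wage = b0 + b1*educ + b2*exper + b3*training + u
# by OLS. The data are simulated; the coefficients used to generate them are
# arbitrary and are not estimates from any real data set.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 500
df = pd.DataFrame({
    "educ": rng.integers(8, 18, size=n),       # years of formal education
    "exper": rng.integers(0, 40, size=n),      # years of workforce experience
    "training": rng.integers(0, 21, size=n),   # weeks spent in job training
})
# u collects unobserved factors (innate ability, family background, ...)
u = rng.normal(0, 2, size=n)
df["wage"] = 1.0 + 0.5 * df["educ"] + 0.1 * df["exper"] + 0.05 * df["training"] + u

# Estimate beta0, ..., beta3; the summary output includes the t statistics and
# p-values used to test hypotheses such as H0: beta3 = 0, discussed next.
res = smf.ols("wage ~ educ + exper + training", data=df).fit()
print(res.summary())
```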
Although this approach loses some of the richness of economic analysis, it is commonly and effectively applied by careful researchers.

Once an econometric model such as (1.3) or (1.4) has been specified, various hypotheses of interest can be stated in terms of the unknown parameters. For example, in equation (1.3) we might hypothesize that wagem, the wage that can be earned in legal employment, has no effect on criminal behavior. In the context of this particular econometric model, the hypothesis is equivalent to β1 = 0.

An empirical analysis, by definition, requires data. After data on the relevant variables have been collected, econometric methods are used to estimate the parameters in the econometric model and to formally test hypotheses of interest. In some cases, the econometric model is used to make predictions in either the testing of a theory or the study of a policy's impact. Because data collection is so important in empirical work, Section 1.3 will describe the kinds of data that we are likely to encounter.

1.3 The Structure of Economic Data

Economic data sets come in a variety of types. Whereas some econometric methods can be applied with little or no modification to many different kinds of data sets, the special features of some data sets must be accounted for or should be exploited. We next describe the most important data structures encountered in applied work.

1.3a Cross-Sectional Data

A cross-sectional data set consists of a sample of individuals, households, firms, cities, states, countries, or a variety of other units, taken at a given point in time. Sometimes, the data on all units do not correspond to precisely the same time period. For example, several families may be surveyed during different weeks within a year. In a pure cross-sectional analysis, we would ignore any minor timing differences in collecting the data. If a set of families was surveyed during different weeks of the same year, we would still view this as a cross-sectional data set.

An important feature of cross-sectional data is that we can often assume that they have been obtained by random sampling from the underlying population. For example, if we obtain information on wages, education, experience, and other characteristics by randomly drawing 500 people from the working population, then we have a random sample from the population of all working people. Random sampling is the sampling scheme covered in introductory statistics courses, and it simplifies the analysis of cross-sectional data. A review of random sampling is contained in Appendix C.

Sometimes, random sampling is not appropriate as an assumption for analyzing cross-sectional data. For example, suppose we are interested in studying factors that influence the accumulation of family wealth. We could survey a random sample of families, but some families might refuse to report their wealth. If, for example, wealthier families are less likely to disclose their wealth, then the resulting sample on wealth is not a random sample from the population of all families. This is an illustration of a sample selection problem, an advanced topic that we will discuss in Chapter 17.
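The distortion is easy to see in a simulation (a made-up illustration, not an exercise from the text): if the probability of reporting falls with wealth, the respondent average systematically understates the population average.

```python
# Hypothetical illustration of selective nonresponse: wealthier families are
# less likely to report, so the observed sample understates average wealth.
# All numbers are invented.
import numpy as np

rng = np.random.default_rng(1)
wealth = rng.lognormal(mean=12, sigma=1.0, size=100_000)   # population wealth

# Response probability falls with wealth (a nonrandom sampling mechanism).
respond = rng.random(wealth.size) < 1 / (1 + wealth / wealth.mean())

print(f"population mean: {wealth.mean():,.0f}")
print(f"respondent mean: {wealth[respond].mean():,.0f}")   # systematically lower
```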
Another violation of random sampling occurs when we sample from units that are large relative to the population, particularly geographical units. The potential problem in such cases is that the population is not large enough to reasonably assume the observations are independent draws. For example, if we want to explain new business activity across states as a function of wage rates, energy prices, corporate and property tax rates, services provided, quality of the workforce, and other state characteristics, it is unlikely that business activities in states near one another are independent. It turns out that the econometric methods that we discuss do work in such situations, but they sometimes need to be refined. For the most part, we will ignore the intricacies that arise in analyzing such situations and treat these problems in a random sampling framework, even when it is not technically correct to do so.

Cross-sectional data are widely used in economics and other social sciences. In economics, the analysis of cross-sectional data is closely aligned with the applied microeconomics fields, such as labor economics, state and local public finance, industrial organization, urban economics, demography, and health economics. Data on individuals, households, firms, and cities at a given point in time are important for testing microeconomic hypotheses and evaluating economic policies.

The cross-sectional data used for econometric analysis can be represented and stored in computers. Table 1.1 contains, in abbreviated form, a cross-sectional data set on 526 working individuals for the year 1976. (This is a subset of the data in the file WAGE1.) The variables include wage (in dollars per hour), educ (years of education), exper (years of potential labor force experience), female (an indicator for gender), and married (marital status). These last two variables are binary (zero-one) in nature and serve to indicate qualitative features of the individual (the person is female or not; the person is married or not). We will have much to say about binary variables in Chapter 7 and beyond.

Table 1.1 A Cross-Sectional Data Set on Wages and Other Individual Characteristics

obsno  wage   educ  exper  female  married
1      3.10   11    2      1       0
2      3.24   12    22     1       1
3      3.00   11    2      0       0
4      6.00   8     44     0       1
5      5.30   12    7      0       1
...    ...    ...   ...    ...     ...
525    11.56  16    5      0       1
526    3.50   14    5      1       0

The variable obsno in Table 1.1 is the observation number assigned to each person in the sample. Unlike the other variables, it is not a characteristic of the individual. All econometrics and statistics software packages assign an observation number to each data unit. Intuition should tell you that, for data such as that in Table 1.1, it does not matter which person is labeled as observation 1, which person is called observation 2, and so on. The fact that the ordering of the data does not matter for econometric analysis is a key feature of cross-sectional data sets obtained from random sampling.
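As an aside (not from the text), a data set with the layout of Table 1.1 maps directly onto the tabular structures used by statistical software. A sketch in Python, entering only the handful of rows printed above:

```python
# A hypothetical sketch of Table 1.1 as a pandas DataFrame; only the rows
# shown in the table are entered here, not the full 526-person sample.
import pandas as pd

wage1_excerpt = pd.DataFrame(
    {
        "wage":    [3.10, 3.24, 3.00, 6.00, 5.30],
        "educ":    [11, 12, 11, 8, 12],
        "exper":   [2, 22, 2, 44, 7],
        "female":  [1, 1, 0, 0, 0],
        "married": [0, 1, 0, 1, 1],
    },
    index=pd.RangeIndex(start=1, stop=6, name="obsno"),
)
print(wage1_excerpt)

# Because row order carries no information in a random sample, shuffling the
# rows would leave any econometric analysis of these data unchanged.
shuffled = wage1_excerpt.sample(frac=1, random_state=0)
```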
Different variables sometimes correspond to different time periods in cross-sectional data sets. For example, to determine the effects of government policies on long-term economic growth, economists have studied the relationship between growth in real per capita GDP over a certain period (say, 1960 to 1985) and variables determined in part by government policy in 1960 (government consumption as a percentage of GDP and adult secondary education rates). Such a data set might be represented as in Table 1.2, which constitutes part of the data set used in the study of cross-country growth rates by De Long and Summers (1991).

Table 1.2: A Data Set on Economic Growth Rates and Country Characteristics

obsno   country     gpcrgdp   govcons60   second60
1       Argentina   0.89      9           32
2       Austria     3.32      16          50
3       Belgium     2.56      13          69
4       Bolivia     1.24      18          12
...
61      Zimbabwe    2.30      17          6

The variable gpcrgdp represents average growth in real per capita GDP over the period 1960 to 1985. The fact that govcons60 (government consumption as a percentage of GDP) and second60 (percentage of adult population with a secondary education) correspond to the year 1960, while gpcrgdp is the average growth over the period from 1960 to 1985, does not lead to any special problems in treating this information as a cross-sectional data set. The observations are listed alphabetically by country, but nothing about this ordering affects any subsequent analysis.
1.3b Time Series Data

A time series data set consists of observations on a variable or several variables over time. Examples of time series data include stock prices, money supply, consumer price index, GDP, annual homicide rates, and automobile sales figures. Because past events can influence future events and lags in behavior are prevalent in the social sciences, time is an important dimension in a time series data set. Unlike the arrangement of cross-sectional data, the chronological ordering of observations in a time series conveys potentially important information.

A key feature of time series data that makes them more difficult to analyze than cross-sectional data is that economic observations can rarely, if ever, be assumed to be independent across time. Most economic and other time series are related, often strongly related, to their recent histories. For example, knowing something about the GDP from last quarter tells us quite a bit about the likely range of the GDP during this quarter, because GDP tends to remain fairly stable from one quarter to the next. Although most econometric procedures can be used with both cross-sectional and time series data, more needs to be done in specifying econometric models for time series data before standard econometric methods can be justified. In addition, modifications and embellishments to standard econometric techniques have been developed to account for and exploit the dependent nature of economic time series and to address other issues, such as the fact that some economic variables tend to display clear trends over time.

Another feature of time series data that can require special attention is the data frequency at which the data are collected. In economics, the most common frequencies are daily, weekly, monthly, quarterly, and annually. Stock prices are recorded at daily intervals (excluding Saturday and Sunday). The money supply in the U.S. economy is reported weekly. Many macroeconomic series are tabulated monthly, including inflation and unemployment rates. Other macro series are recorded less frequently, such as every three months (every quarter); GDP is an important example of a quarterly series. Other time series, such as infant mortality rates for states in the United States, are available only on an annual basis.

Many weekly, monthly, and quarterly economic time series display a strong seasonal pattern, which can be an important factor in a time series analysis. For example, monthly data on housing starts differ across the months simply due to changing weather conditions. We will learn how to deal with seasonal time series in Chapter 10.

Table 1.3 contains a time series data set obtained from an article by Castillo-Freeman and Freeman (1992) on minimum wage effects in Puerto Rico. The earliest year in the data set is the first observation, and the most recent year available is the last observation. When econometric methods are used to analyze time series data, the data should be stored in chronological order.

Table 1.3: Minimum Wage, Unemployment, and Related Data for Puerto Rico

obsno   year   avgmin   avgcov   prunemp   prgnp
1       1950   0.20     20.1     15.4      878.7
2       1951   0.21     20.7     16.0      925.0
3       1952   0.23     22.6     14.8      1015.9
...
37      1986   3.35     58.1     18.9      4281.6
38      1987   3.35     58.2     16.8      4496.7

The variable avgmin refers to the average minimum wage for the year, avgcov is the average coverage rate (the percentage of workers covered by the minimum wage law), prunemp is the unemployment rate, and prgnp is the gross national product, in millions of 1954 dollars. We will use these data later in a time series analysis of the effect of the minimum wage on employment.
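Since chronological order matters for time series, it is good practice to store the data with an explicit time index. A minimal pandas sketch, using a few invented annual values rather than the actual Table 1.3 data:

```python
import pandas as pd

# Illustrative annual series (invented values, not the actual Table 1.3 data).
ts = pd.DataFrame(
    {"avgmin": [0.20, 0.21, 0.23], "prunemp": [15.4, 16.0, 14.8]},
    index=pd.PeriodIndex([1950, 1951, 1952], freq="Y"),
)

# Unlike a cross section, order is meaningful here: a lag refers to the
# previous period, so the rows must be kept in chronological order.
ts["prunemp_lag1"] = ts["prunemp"].shift(1)  # last year's unemployment rate
print(ts)
```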
1.3c Pooled Cross Sections

Some data sets have both cross-sectional and time series features. For example, suppose that two cross-sectional household surveys are taken in the United States, one in 1985 and one in 1990. In 1985, a random sample of households is surveyed for variables such as income, savings, family size, and so on. In 1990, a new random sample of households is taken using the same survey questions. To increase our sample size, we can form a pooled cross section by combining the two years.

Pooling cross sections from different years is often an effective way of analyzing the effects of a new government policy. The idea is to collect data from the years before and after a key policy change. As an example, consider the following data set on housing prices taken in 1993 and 1995, before and after a reduction in property taxes in 1994. Suppose we have data on 250 houses for 1993 and on 270 houses for 1995. One way to store such a data set is given in Table 1.4. Observations 1 through 250 correspond to the houses sold in 1993, and observations 251 through 520 correspond to the 270 houses sold in 1995. Although the order in which we store the data turns out not to be crucial, keeping track of the year for each observation is usually very important. This is why we enter year as a separate variable.

Table 1.4: Pooled Cross Sections: Two Years of Housing Prices

obsno   year   hprice    proptax   sqrft   bdrms   bthrms
1       1993   85500     42        1600    3       2.0
2       1993   67300     36        1440    3       2.5
3       1993   134000    38        2000    4       2.5
...
250     1993   243600    41        2600    4       3.0
251     1995   65000     16        1250    2       1.0
252     1995   182400    20        2200    4       2.0
253     1995   97500     15        1540    3       2.0
...
520     1995   57200     16        1100    2       1.5

A pooled cross section is analyzed much like a standard cross section, except that we often need to account for secular differences in the variables across the time. In fact, in addition to increasing the sample size, the point of a pooled cross-sectional analysis is often to see how a key relationship has changed over time.
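Forming a pooled cross section amounts to stacking the two samples and recording the year. A minimal sketch (invented values echoing Table 1.4):

```python
import pandas as pd

# Two small illustrative cross sections (invented values, in the spirit of Table 1.4).
houses93 = pd.DataFrame({"hprice": [85500, 67300], "proptax": [42, 36]})
houses95 = pd.DataFrame({"hprice": [65000, 182400], "proptax": [16, 20]})

# Record the year explicitly before pooling: the rows are different houses,
# so the year variable is what lets us track secular changes across samples.
houses93["year"] = 1993
houses95["year"] = 1995
pooled = pd.concat([houses93, houses95], ignore_index=True)

print(pooled.groupby("year")["hprice"].mean())  # average price before vs. after
```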
1.3d Panel or Longitudinal Data

A panel data (or longitudinal data) set consists of a time series for each cross-sectional member in the data set. As an example, suppose we have wage, education, and employment history for a set of individuals followed over a 10-year period. Or we might collect information, such as investment and financial data, about the same set of firms over a five-year time period. Panel data can also be collected on geographical units. For example, we can collect data for the same set of counties in the United States on immigration flows, tax rates, wage rates, government expenditures, and so on, for the years 1980, 1985, and 1990.

The key feature of panel data that distinguishes them from a pooled cross section is that the same cross-sectional units (individuals, firms, or counties in the preceding examples) are followed over a given time period. The data in Table 1.4 are not considered a panel data set because the houses sold are likely to be different in 1993 and 1995; if there are any duplicates, the number is likely to be so small as to be unimportant. In contrast, Table 1.5 contains a two-year panel data set on crime and related statistics for 150 cities in the United States.

Table 1.5: A Two-Year Panel Data Set on City Crime Statistics

obsno   city   year   murders   population   unem   police
1       1      1986   5         350000       8.7    440
2       1      1990   8         359200       7.2    471
3       2      1986   2         64300        5.4    75
4       2      1990   1         65100        5.5    75
...
297     149    1986   10        260700       9.6    286
298     149    1990   6         245000       9.8    334
299     150    1986   25        543000       4.3    520
300     150    1990   32        546200       5.2    493

There are several interesting features in Table 1.5. First, each city has been given a number from 1 through 150. Which city we decide to call city 1, city 2, and so on, is irrelevant. As with a pure cross section, the ordering in the cross section of a panel data set does not matter. We could use the city name in place of a number, but it is often useful to have both.

A second point is that the two years of data for city 1 fill the first two rows or observations. Observations 3 and 4 correspond to city 2, and so on. Because each of the 150 cities has two rows of data, any econometrics package will view this as 300 observations. This data set can be treated as a pooled cross section, where the same cities happen to show up in each year. But as we will see in Chapters 13 and 14, we can also use the panel structure to analyze questions that cannot be answered by simply viewing this as a pooled cross section.

In organizing the observations in Table 1.5, we place the two years of data for each city adjacent to one another, with the first year coming before the second in all cases. For just about every practical purpose, this is the preferred way for ordering panel data sets. Contrast this organization with the way the pooled cross sections are stored in Table 1.4. In short, the reason for ordering panel data as in Table 1.5 is that we will need to perform data transformations for each city across the two years, as the short sketch following this subsection illustrates.

Because panel data require replication of the same units over time, panel data sets, especially those on individuals, households, and firms, are more difficult to obtain than pooled cross sections. Not surprisingly, observing the same units over time leads to several advantages over cross-sectional data or even pooled cross-sectional data. The benefit that we will focus on in this text is that having multiple observations on the same units allows us to control for certain unobserved characteristics of individuals, firms, and so on. As we will see, the use of more than one observation can facilitate causal inference in situations where inferring causality would be very difficult if only a single cross section were available. A second advantage of panel data is that they often allow us to study the importance of lags in behavior or the result of decision making. This information can be significant because many economic policies can be expected to have an impact only after some time has passed.

Most books at the undergraduate level do not contain a discussion of econometric methods for panel data. However, economists now recognize that some questions are difficult, if not impossible, to answer satisfactorily without panel data. As you will see, we can make considerable progress with simple panel data analysis, a method that is not much more difficult than dealing with a standard cross-sectional data set.
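To make the preferred ordering concrete, here is a minimal sketch (invented numbers in the spirit of Table 1.5) that sorts a two-year city panel by city and year and then computes the within-city change in murders, the kind of transformation the panel methods of Chapters 13 and 14 rely on:

```python
import pandas as pd

# A tiny two-city, two-year panel (invented values, echoing Table 1.5).
panel = pd.DataFrame({
    "city":    [1, 1, 2, 2],
    "year":    [1986, 1990, 1986, 1990],
    "murders": [5, 8, 2, 1],
})

# Keep the two years for each city adjacent, with the first year first.
panel = panel.sort_values(["city", "year"]).reset_index(drop=True)

# Within-city change across the two years: because the same units are
# followed, differencing removes city traits that are fixed over time.
panel["d_murders"] = panel.groupby("city")["murders"].diff()
print(panel)
```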
1.3e A Comment on Data Structures

Part 1 of this text is concerned with the analysis of cross-sectional data, because this poses the fewest conceptual and technical difficulties. At the same time, it illustrates most of the key themes of econometric analysis. We will use the methods and insights from cross-sectional analysis in the remainder of the text.

Although the econometric analysis of time series uses many of the same tools as cross-sectional analysis, it is more complicated because of the trending, highly persistent nature of many economic time series. Examples that have traditionally been used to illustrate the manner in which econometric methods can be applied to time series data are now widely believed to be flawed. It makes little sense to use such examples initially, since this practice will only reinforce poor econometric practice. Therefore, we will postpone the treatment of time series econometrics until Part 2, when the important issues concerning trends, persistence, dynamics, and seasonality will be introduced.

In Part 3, we will treat pooled cross sections and panel data explicitly. The analysis of independently pooled cross sections and simple panel data analysis are fairly straightforward extensions of pure cross-sectional analysis. Nevertheless, we will wait until Chapter 13 to deal with these topics.

1.4 Causality and the Notion of Ceteris Paribus in Econometric Analysis

In most tests of economic theory, and certainly for evaluating public policy, the economist's goal is to infer that one variable (such as education) has a causal effect on another variable (such as worker productivity). Simply finding an association between two or more variables might be suggestive, but unless causality can be established, it is rarely compelling.

The notion of ceteris paribus, which means "other (relevant) factors being equal," plays an important role in causal analysis. This idea has been implicit in some of our earlier discussion, particularly Examples 1.1 and 1.2, but thus far we have not explicitly mentioned it.

You probably remember from introductory economics that most economic questions are ceteris paribus by nature. For example, in analyzing consumer demand, we are interested in knowing the effect of changing the price of a good on its quantity demanded, while holding all other factors, such as income, prices of other goods, and individual tastes, fixed. If other factors are not held fixed, then we cannot know the causal effect of a price change on quantity demanded.

Holding other factors fixed is critical for policy analysis as well. In the job training example (Example 1.2), we might be interested in the effect of another week of job training on wages, with all other components being equal (in particular, education and experience). If we succeed in holding all other relevant factors fixed and then find a link between job training and wages, we can conclude that job training has a causal effect on worker productivity. Although this may seem pretty simple, even at this early stage it should be clear that, except in very special cases, it will not be possible to literally hold all else equal. The key question in most empirical studies is: Have enough other factors been held fixed to make a case for causality? Rarely is an econometric study evaluated without raising this issue.
fail to have the important features of an experimental data set We rely for now on your intuitive understanding of such terms as random independence and correlation all of which should be familiar from an introductory probability and statistics course These concepts are reviewed in Appendix B We begin with an example that illustrates some of these important issues ExamplE 13 Effects of Fertilizer on Crop Yield Some early econometric studies for example Griliches 1957 considered the effects of new fertilizers on crop yields Suppose the crop under consideration is soybeans Since fertilizer amount is only one factor affecting yieldssome others include rainfall quality of land and presence of para sitesthis issue must be posed as a ceteris paribus question One way to determine the causal effect of fertilizer amount on soybean yield is to conduct an experiment which might include the following steps Choose several oneacre plots of land Apply different amounts of fertilizer to each plot and subsequently measure the yields this gives us a crosssectional data set Then use statistical methods to be introduced in Chapter 2 to measure the association between yields and fertilizer amounts As described earlier this may not seem like a very good experiment because we have said noth ing about choosing plots of land that are identical in all respects except for the amount of fertilizer In fact choosing plots of land with this feature is not feasible some of the factors such as land quality cannot even be fully observed How do we know the results of this experiment can be used to measure the ceteris paribus effect of fertilizer The answer depends on the specifics of how fertilizer amounts are chosen If the levels of fertilizer are assigned to plots independently of other plot features that affect yieldthat is other characteristics of plots are completely ignored when deciding on fertilizer amountsthen we are in business We will justify this statement in Chapter 2 The next example is more representative of the difficulties that arise when inferring causality in applied economics ExamplE 14 measuring the Return to Education Labor economists and policy makers have long been interested in the return to education Somewhat informally the question is posed as follows If a person is chosen from the population and given an other year of education by how much will his or her wage increase As with the previous examples this is a ceteris paribus question which implies that all other factors are held fixed while another year of education is given to the person We can imagine a social planner designing an experiment to get at this issue much as the agri cultural researcher can design an experiment to estimate fertilizer effects Assume for the moment Copyright 2016 Cengage Learning All Rights Reserved May not be copied scanned or duplicated in whole or in part Due to electronic rights some third party content may be suppressed from the eBook andor eChapters Editorial review has deemed that any suppressed content does not materially affect the overall learning experience Cengage Learning reserves the right to remove additional content at any time if subsequent rights restrictions require it CHAPTER 1 The Nature of Econometrics and Economic Data 12 that the social planner has the ability to assign any level of education to any person How would this planner emulate the fertilizer experiment in Example 13 The planner would choose a group of people and randomly assign each person an amount of education some people are given 
The next example is more representative of the difficulties that arise when inferring causality in applied economics.

Example 1.4: Measuring the Return to Education

Labor economists and policy makers have long been interested in the return to education. Somewhat informally, the question is posed as follows: If a person is chosen from the population and given another year of education, by how much will his or her wage increase? As with the previous examples, this is a ceteris paribus question, which implies that all other factors are held fixed while another year of education is given to the person.

We can imagine a social planner designing an experiment to get at this issue, much as the agricultural researcher can design an experiment to estimate fertilizer effects. Assume, for the moment, that the social planner has the ability to assign any level of education to any person. How would this planner emulate the fertilizer experiment in Example 1.3? The planner would choose a group of people and randomly assign each person an amount of education; some people are given an eighth-grade education, some are given a high school education, some are given two years of college, and so on. Subsequently, the planner measures wages for this group of people (where we assume that each person then works in a job). The people here are like the plots in the fertilizer example, where education plays the role of fertilizer and wage rate plays the role of soybean yield. As with Example 1.3, if levels of education are assigned independently of other characteristics that affect productivity (such as experience and innate ability), then an analysis that ignores these other factors will yield useful results. Again, it will take some effort in Chapter 2 to justify this claim; for now, we state it without support.

Unlike the fertilizer-yield example, the experiment described in Example 1.4 is infeasible. The ethical issues, not to mention the economic costs, associated with randomly determining education levels for a group of individuals are obvious. As a logistical matter, we could not give someone only an eighth-grade education if he or she already has a college degree.

Even though experimental data cannot be obtained for measuring the return to education, we can certainly collect nonexperimental data on education levels and wages for a large group by sampling randomly from the population of working people. Such data are available from a variety of surveys used in labor economics, but these data sets have a feature that makes it difficult to estimate the ceteris paribus return to education: people choose their own levels of education; therefore, education levels are probably not determined independently of all other factors affecting wage. This problem is a feature shared by most nonexperimental data sets.

One factor that affects wage is experience in the workforce. Since pursuing more education generally requires postponing entering the workforce, those with more education usually have less experience. Thus, in a nonexperimental data set on wages and education, education is likely to be negatively associated with a key variable that also affects wage. It is also believed that people with more innate ability often choose higher levels of education. Since higher ability leads to higher wages, we again have a correlation between education and a critical factor that affects wage.

The omitted factors of experience and ability in the wage example have analogs in the fertilizer example. Experience is generally easy to measure, and therefore is similar to a variable such as rainfall. Ability, on the other hand, is nebulous and difficult to quantify; it is similar to land quality in the fertilizer example. As we will see throughout this text, accounting for other observed factors, such as experience, when estimating the ceteris paribus effect of another variable, such as education, is relatively straightforward. We will also find that accounting for inherently unobservable factors, such as ability, is much more problematic. It is fair to say that many of the advances in econometric methods have tried to deal with unobserved factors in econometric models.

One final parallel can be drawn between Examples 1.3 and 1.4. Suppose that, in the fertilizer example, the fertilizer amounts were not entirely determined at random. Instead, the assistant who chose the fertilizer levels thought it would be better to put more fertilizer on the higher-quality plots of land. (Agricultural researchers should have a rough idea about which plots of land are of better quality, even though they may not be able to fully quantify the differences.) This situation is completely analogous to the level of schooling being related to unobserved ability in Example 1.4. Because better land leads to higher yields and more fertilizer was used on the better plots, any observed relationship between yield and fertilizer might be spurious.
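The same simulation framework shows what goes wrong when assignment is related to an unobserved factor. Here (again with invented numbers), people with higher innate ability choose more education, and the simple wage-education association overstates the true return:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 10_000
true_return = 0.5     # true wage gain per year of education (invented)

ability = rng.normal(size=n)                  # unobserved
educ = 12 + 2 * ability + rng.normal(size=n)  # chosen: higher ability -> more educ
wage = 1 + true_return * educ + 3 * ability + rng.normal(size=n)

slope = np.cov(educ, wage, ddof=0)[0, 1] / np.var(educ)
print(slope)  # well above 0.5: the association mixes in the ability effect
```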
Difficulty in inferring causality can also arise when studying data at fairly high levels of aggregation, as the next example on city crime rates shows.

Example 1.5: The Effect of Law Enforcement on City Crime Levels

The issue of how best to prevent crime has been, and will probably continue to be, with us for some time. One especially important question in this regard is: Does the presence of more police officers on the street deter crime?

The ceteris paribus question is easy to state: If a city is randomly chosen and given, say, ten additional police officers, by how much would its crime rates fall? Another way to state the question is: If two cities are the same in all respects, except that city A has ten more police officers than city B, by how much would the two cities' crime rates differ?

It would be virtually impossible to find pairs of communities identical in all respects except for the size of their police force. Fortunately, econometric analysis does not require this. What we do need to know is whether the data we can collect on community crime levels and the size of the police force can be viewed as experimental. We can certainly imagine a true experiment involving a large collection of cities, where we dictate how many police officers each city will use for the upcoming year.

Although policies can be used to affect the size of police forces, we clearly cannot tell each city how many police officers it can hire. If, as is likely, a city's decision on how many police officers to hire is correlated with other city factors that affect crime, then the data must be viewed as nonexperimental. In fact, one way to view this problem is to see that a city's choice of police force size and the amount of crime are simultaneously determined. We will explicitly address such problems in Chapter 16.

The first three examples we have discussed have dealt with cross-sectional data at various levels of aggregation (for example, at the individual or city levels). The same hurdles arise when inferring causality in time series problems.

Example 1.6: The Effect of the Minimum Wage on Unemployment

An important, and perhaps contentious, policy issue concerns the effect of the minimum wage on unemployment rates for various groups of workers. Although this problem can be studied in a variety of data settings (cross-sectional, time series, or panel data), time series data are often used to look at aggregate effects. An example of a time series data set on unemployment rates and minimum wages was given in Table 1.3.

Standard supply and demand analysis implies that, as the minimum wage is increased above the market clearing wage, we slide up the demand curve for labor and total employment decreases. (Labor supply exceeds labor demand.) To quantify this effect, we can study the relationship between employment and the minimum wage over time.
In addition to some special difficulties that can arise in dealing with time series data, there are possible problems with inferring causality. The minimum wage in the United States is not determined in a vacuum. Various economic and political forces impinge on the final minimum wage for any given year. (The minimum wage, once determined, is usually in place for several years, unless it is indexed for inflation.) Thus, it is probable that the amount of the minimum wage is related to other factors that have an effect on employment levels.

We can imagine the U.S. government conducting an experiment to determine the employment effects of the minimum wage (as opposed to worrying about the welfare of low-wage workers). The minimum wage could be randomly set by the government each year, and then the employment outcomes could be tabulated. The resulting experimental time series data could then be analyzed using fairly simple econometric methods. But this scenario hardly describes how minimum wages are set.

If we can control enough other factors relating to employment, then we can still hope to estimate the ceteris paribus effect of the minimum wage on employment. In this sense, the problem is very similar to the previous cross-sectional examples.

Even when economic theories are not most naturally described in terms of causality, they often have predictions that can be tested using econometric methods. The following example demonstrates this approach.

Example 1.7: The Expectations Hypothesis

The expectations hypothesis from financial economics states that, given all information available to investors at the time of investing, the expected return on any two investments is the same. For example, consider two possible investments with a three-month investment horizon, purchased at the same time: (1) Buy a three-month T-bill with a face value of $10,000, for a price below $10,000; in three months, you receive $10,000. (2) Buy a six-month T-bill (at a price below $10,000) and, in three months, sell it as a three-month T-bill. Each investment requires roughly the same amount of initial capital, but there is an important difference. For the first investment, you know exactly what the return is at the time of purchase because you know the initial price of the three-month T-bill, along with its face value. This is not true for the second investment: although you know the price of a six-month T-bill when you purchase it, you do not know the price you can sell it for in three months. Therefore, there is uncertainty in this investment for someone who has a three-month investment horizon.

The actual returns on these two investments will usually be different. According to the expectations hypothesis, the expected return from the second investment, given all information at the time of investment, should equal the return from purchasing a three-month T-bill. This theory turns out to be fairly easy to test, as we will see in Chapter 11.

Summary

In this introductory chapter, we have discussed the purpose and scope of econometric analysis. Econometrics is used in all applied economics fields to test economic theories, to inform government and private policy makers, and to predict economic time series.
Sometimes, an econometric model is derived from a formal economic model, but in other cases, econometric models are based on informal economic reasoning and intuition. The goals of any econometric analysis are to estimate the parameters in the model and to test hypotheses about these parameters; the values and signs of the parameters determine the validity of an economic theory and the effects of certain policies.

Cross-sectional, time series, pooled cross-sectional, and panel data are the most common types of data structures that are used in applied econometrics. Data sets involving a time dimension, such as time series and panel data, require special treatment because of the correlation across time of most economic time series. Other issues, such as trends and seasonality, arise in the analysis of time series data but not cross-sectional data.

In Section 1.4, we discussed the notions of ceteris paribus and causal inference. In most cases, hypotheses in the social sciences are ceteris paribus in nature: all other relevant factors must be fixed when studying the relationship between two variables. Because of the nonexperimental nature of most data collected in the social sciences, uncovering causal relationships is very challenging.

Key Terms

Causal Effect; Ceteris Paribus; Cross-Sectional Data Set; Data Frequency; Econometric Model; Economic Model; Empirical Analysis; Experimental Data; Nonexperimental Data; Observational Data; Panel Data; Pooled Cross Section; Random Sampling; Retrospective Data; Time Series Data

Problems

1. Suppose that you are asked to conduct a study to determine whether smaller class sizes lead to improved student performance of fourth graders.
(i) If you could conduct any experiment you want, what would you do? Be specific.
(ii) More realistically, suppose you can collect observational data on several thousand fourth graders in a given state. You can obtain the size of their fourth-grade class and a standardized test score taken at the end of fourth grade. Why might you expect a negative correlation between class size and test score?
(iii) Would a negative correlation necessarily show that smaller class sizes cause better performance? Explain.

2. A justification for job training programs is that they improve worker productivity. Suppose that you are asked to evaluate whether more job training makes workers more productive. However, rather than having data on individual workers, you have access to data on manufacturing firms in Ohio. In particular, for each firm, you have information on hours of job training per worker (training) and number of nondefective items produced per worker hour (output).
(i) Carefully state the ceteris paribus thought experiment underlying this policy question.
(ii) Does it seem likely that a firm's decision to train its workers will be independent of worker characteristics? What are some of those measurable and unmeasurable worker characteristics?
(iii) Name a factor other than worker characteristics that can affect worker productivity.
(iv) If you find a positive correlation between output and training, would you have convincingly established that job training makes workers more productive? Explain.

3. Suppose at your university you are asked to find the relationship between weekly hours spent studying (study) and weekly hours spent working (work). Does it make sense to characterize the problem as inferring whether study "causes" work or work "causes" study? Explain.

4. States and provinces that have control over taxation sometimes reduce taxes in an attempt to spur economic growth. Suppose that you are hired by a state to estimate the effect of corporate tax rates on, say, the growth in per capita gross state product (GSP).
(i) What kind of data would you need to collect to undertake a statistical analysis?
(ii) Is it feasible to do a controlled experiment? What would be required?
(iii) Is a correlation analysis between GSP growth and tax rates likely to be convincing? Explain.

Computer Exercises

C1. Use the data in WAGE1 for this exercise.
(i) Find the average education level in the sample. What are the lowest and highest years of education?
(ii) Find the average hourly wage in the sample. Does it seem high or low?
(iii) The wage data are reported in 1976 dollars. Using the Internet or a printed source, find the Consumer Price Index (CPI) for the years 1976 and 2013.
(iv) Use the CPI values from part (iii) to find the average hourly wage in 2013 dollars. Now does the average hourly wage seem reasonable?
(v) How many women are in the sample? How many men?

C2. Use the data in BWGHT to answer this question.
(i) How many women are in the sample, and how many report smoking during pregnancy?
(ii) What is the average number of cigarettes smoked per day? Is the average a good measure of the "typical" woman in this case? Explain.
(iii) Among women who smoked during pregnancy, what is the average number of cigarettes smoked per day? How does this compare with your answer from part (ii), and why?
(iv) Find the average of fatheduc in the sample. Why are only 1,192 observations used to compute this average?
(v) Report the average family income and its standard deviation in dollars.

C3. The data in MEAP01 are for the state of Michigan in the year 2001. Use these data to answer the following questions.
(i) Find the largest and smallest values of math4. Does the range make sense? Explain.
(ii) How many schools have a perfect pass rate on the math test? What percentage is this of the total sample?
(iii) How many schools have math pass rates of exactly 50?
(iv) Compare the average pass rates for the math and reading scores. Which test is harder to pass?
(v) Find the correlation between math4 and read4. What do you conclude?
(vi) The variable exppp is expenditure per pupil. Find the average of exppp, along with its standard deviation. Would you say there is wide variation in per pupil spending?
(vii) Suppose School A spends $6,000 per student and School B spends $5,500 per student. By what percentage does School A's spending exceed School B's? Compare this to $100 \cdot [\log(6{,}000) - \log(5{,}500)]$, which is the approximation percentage difference based on the difference in the natural logs. (See Section A.4 in Appendix A.)
C4. The data in JTRAIN2 come from a job training experiment conducted for low-income men during 1976-1977. See Lalonde (1986).
(i) Use the indicator variable train to determine the fraction of men receiving job training.
(ii) The variable re78 is earnings from 1978, measured in thousands of 1982 dollars. Find the averages of re78 for the sample of men receiving job training and the sample not receiving job training. Is the difference economically large?
(iii) The variable unem78 is an indicator of whether a man is unemployed or not in 1978. What fraction of the men who received job training are unemployed? What about for men who did not receive job training? Comment on the difference.
(iv) From parts (ii) and (iii), does it appear that the job training program was effective? What would make our conclusions more convincing?

C5. The data in FERTIL2 were collected on women living in the Republic of Botswana in 1988. The variable children refers to the number of living children. The variable electric is a binary indicator equal to one if the woman's home has electricity, and zero if not.
(i) Find the smallest and largest values of children in the sample. What is the average of children?
(ii) What percentage of women have electricity in the home?
(iii) Compute the average of children for those without electricity and do the same for those with electricity. Comment on what you find.
(iv) From part (iii), can you infer that having electricity "causes" women to have fewer children? Explain.

C6. Use the data in COUNTYMURDERS to answer this question. Use only the year 1996. The variable murders is the number of murders reported in the county. The variable execs is the number of executions that took place of people sentenced to death in the given county. Most states in the United States have the death penalty, but several do not.
(i) How many counties are there in the data set? Of these, how many have zero murders? What percentage of counties have zero executions? (Remember, use only the 1996 data.)
(ii) What is the largest number of murders? What is the largest number of executions? Why is the average number of executions so small?
(iii) Compute the correlation coefficient between murders and execs and describe what you find.
(iv) You should have computed a positive correlation in part (iii). Do you think that more executions cause more murders to occur? What might explain the positive correlation?

C7. The data set in ALCOHOL contains information on a sample of men in the United States. Two key variables are self-reported employment status and alcohol abuse (along with many other variables). The variables employ and abuse are both binary, or indicator, variables: they take on only the values zero and one.
(i) What percentage of the men in the sample report abusing alcohol? What is the employment rate?
(ii) Consider the group of men who abuse alcohol. What is the employment rate?
(iii) What is the employment rate for the group of men who do not abuse alcohol?
(iv) Discuss the difference in your answers to parts (ii) and (iii). Does this allow you to conclude that alcohol abuse causes unemployment?
Part 1: Regression Analysis with Cross-Sectional Data

Part 1 of the text covers regression analysis with cross-sectional data. It builds upon a solid base of college algebra and basic concepts in probability and statistics. Appendices A, B, and C contain complete reviews of these topics.

Chapter 2 begins with the simple linear regression model, where we explain one variable in terms of another variable. Although simple regression is not widely used in applied econometrics, it is used occasionally and serves as a natural starting point because the algebra and interpretations are relatively straightforward.

Chapters 3 and 4 cover the fundamentals of multiple regression analysis, where we allow more than one variable to affect the variable we are trying to explain. Multiple regression is still the most commonly used method in empirical research, and so these chapters deserve careful attention. Chapter 3 focuses on the algebra of the method of ordinary least squares (OLS), while also establishing conditions under which the OLS estimator is unbiased and best linear unbiased. Chapter 4 covers the important topic of statistical inference.

Chapter 5 discusses the large sample, or asymptotic, properties of the OLS estimators. This provides justification of the inference procedures in Chapter 4 when the errors in a regression model are not normally distributed. Chapter 6 covers some additional topics in regression analysis, including advanced functional form issues, data scaling, prediction, and goodness-of-fit. Chapter 7 explains how qualitative information can be incorporated into multiple regression models.

Chapter 8 illustrates how to test for and correct the problem of heteroskedasticity, or nonconstant variance, in the error terms. We show how the usual OLS statistics can be adjusted, and we also present an extension of OLS, known as weighted least squares, which explicitly accounts for different variances in the errors. Chapter 9 delves further into the very important problem of correlation between the error term and one or more of the explanatory variables. We demonstrate how the availability of a proxy variable can solve the omitted variables problem. In addition, we establish the bias and inconsistency in the OLS estimators in the presence of certain kinds of measurement errors in the variables. Various data problems are also discussed, including the problem of outliers.
Chapter 2: The Simple Regression Model

The simple regression model can be used to study the relationship between two variables. For reasons we will see, the simple regression model has limitations as a general tool for empirical analysis. Nevertheless, it is sometimes appropriate as an empirical tool. Learning how to interpret the simple regression model is good practice for studying multiple regression, which we will do in subsequent chapters.

2.1 Definition of the Simple Regression Model

Much of applied econometric analysis begins with the following premise: y and x are two variables, representing some population, and we are interested in "explaining y in terms of x," or in "studying how y varies with changes in x." We discussed some examples in Chapter 1, including: y is soybean crop yield and x is amount of fertilizer; y is hourly wage and x is years of education; and y is a community crime rate and x is number of police officers.

In writing down a model that will "explain y in terms of x," we must confront three issues. First, since there is never an exact relationship between two variables, how do we allow for other factors to affect y? Second, what is the functional relationship between y and x? And third, how can we be sure we are capturing a ceteris paribus relationship between y and x (if that is a desired goal)?

We can resolve these ambiguities by writing down an equation relating y to x. A simple equation is

$$y = \beta_0 + \beta_1 x + u. \qquad (2.1)$$

Equation (2.1), which is assumed to hold in the population of interest, defines the simple linear regression model. It is also called the two-variable linear regression model or bivariate linear regression model, because it relates the two variables x and y. We now discuss the meaning of each of the quantities in equation (2.1). (Incidentally, the term "regression" has origins that are not especially important for most modern econometric applications, so we will not explain it here. See Stigler (1986) for an engaging history of regression analysis.)

When related by equation (2.1), the variables y and x have several different names used interchangeably, as follows: y is called the dependent variable, the explained variable, the response variable, the predicted variable, or the regressand; x is called the independent variable, the explanatory variable, the control variable, the predictor variable, or the regressor. (The term covariate is also used for x.) The terms "dependent variable" and "independent variable" are frequently used in econometrics. But be aware that the label "independent" here does not refer to the statistical notion of independence between random variables (see Appendix B).

The terms "explained" and "explanatory" variables are probably the most descriptive. "Response" and "control" are used mostly in the experimental sciences, where the variable x is under the experimenter's control. We will not use the terms "predicted variable" and "predictor," although you sometimes see these in applications that are purely about prediction and not causality.
Our terminology for simple regression is summarized in Table 2.1.

Table 2.1: Terminology for Simple Regression

y                      x
Dependent variable     Independent variable
Explained variable     Explanatory variable
Response variable      Control variable
Predicted variable     Predictor variable
Regressand             Regressor

The variable u, called the error term or disturbance in the relationship, represents factors other than x that affect y. A simple regression analysis effectively treats all factors affecting y other than x as being unobserved. You can usefully think of u as standing for "unobserved."

Equation (2.1) also addresses the issue of the functional relationship between y and x. If the other factors in u are held fixed, so that the change in u is zero, $\Delta u = 0$, then x has a linear effect on y:

$$\Delta y = \beta_1 \Delta x \quad \text{if } \Delta u = 0. \qquad (2.2)$$

Thus, the change in y is simply $\beta_1$ multiplied by the change in x. This means that $\beta_1$ is the slope parameter in the relationship between y and x, holding the other factors in u fixed; it is of primary interest in applied economics. The intercept parameter $\beta_0$, sometimes called the constant term, also has its uses, although it is rarely central to an analysis.

Example 2.1: Soybean Yield and Fertilizer

Suppose that soybean yield is determined by the model

$$yield = \beta_0 + \beta_1\, fertilizer + u, \qquad (2.3)$$

so that y = yield and x = fertilizer. The agricultural researcher is interested in the effect of fertilizer on yield, holding other factors fixed. This effect is given by $\beta_1$. The error term u contains factors such as land quality, rainfall, and so on. The coefficient $\beta_1$ measures the effect of fertilizer on yield, holding other factors fixed: $\Delta yield = \beta_1 \Delta fertilizer$.

Example 2.2: A Simple Wage Equation

A model relating a person's wage to observed education and other unobserved factors is

$$wage = \beta_0 + \beta_1\, educ + u. \qquad (2.4)$$

If wage is measured in dollars per hour and educ is years of education, then $\beta_1$ measures the change in hourly wage given another year of education, holding all other factors fixed. Some of those factors include labor force experience, innate ability, tenure with current employer, work ethic, and numerous other things.

The linearity of equation (2.1) implies that a one-unit change in x has the same effect on y, regardless of the initial value of x. This is unrealistic for many economic applications. For example, in the wage-education example, we might want to allow for increasing returns: the next year of education has a larger effect on wages than did the previous year. We will see how to allow for such possibilities in Section 2.4.
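To see the constant-slope implication numerically, the following sketch evaluates the wage equation (2.4) for a hypothetical parameter pair; the values $\beta_0 = -0.90$ and $\beta_1 = 0.54$ are invented for illustration, not estimates from any data set. Holding u fixed, two more years of education always change the wage implied by the model by $2\beta_1$, whatever the starting level of educ:

```python
# Hypothetical parameter values for equation (2.4); invented for illustration.
b0, b1 = -0.90, 0.54

def wage(educ, u=0.0):
    """Wage implied by the model for given education, holding other factors u fixed."""
    return b0 + b1 * educ + u

# The effect of two more years of education is 2 * b1 = 1.08 dollars per hour,
# regardless of whether we start from 8 or 16 years: the model is linear in educ.
print(wage(10) - wage(8))    # 1.08
print(wage(18) - wage(16))   # 1.08
```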
The most difficult issue to address is whether model (2.1) really allows us to draw ceteris paribus conclusions about how x affects y. We just saw in equation (2.2) that $\beta_1$ does measure the effect of x on y, holding all other factors (in u) fixed. Is this the end of the causality issue? Unfortunately, no. How can we hope to learn in general about the ceteris paribus effect of x on y, holding other factors fixed, when we are ignoring all those other factors?

Section 2.5 will show that we are only able to get reliable estimators of $\beta_0$ and $\beta_1$ from a random sample of data when we make an assumption restricting how the unobservable u is related to the explanatory variable x. Without such a restriction, we will not be able to estimate the ceteris paribus effect, $\beta_1$. Because u and x are random variables, we need a concept grounded in probability.

Before we state the key assumption about how x and u are related, we can always make one assumption about u. As long as the intercept $\beta_0$ is included in the equation, nothing is lost by assuming that the average value of u in the population is zero. Mathematically,

$$\mathrm{E}(u) = 0. \qquad (2.5)$$

Assumption (2.5) says nothing about the relationship between u and x, but simply makes a statement about the distribution of the unobserved factors in the population. Using the previous examples for illustration, we can see that assumption (2.5) is not very restrictive. In Example 2.1, we lose nothing by normalizing the unobserved factors affecting soybean yield, such as land quality, to have an average of zero in the population of all cultivated plots. The same is true of the unobserved factors in Example 2.2. Without loss of generality, we can assume that things such as average ability are zero in the population of all working people. If you are not convinced, you should work through Problem 2 to see that we can always redefine the intercept in equation (2.1) to make equation (2.5) true.

We now turn to the crucial assumption regarding how u and x are related. A natural measure of the association between two random variables is the correlation coefficient. (See Appendix B for definition and properties.) If u and x are uncorrelated, then, as random variables, they are not linearly related. Assuming that u and x are uncorrelated goes a long way toward defining the sense in which u and x should be unrelated in equation (2.1). But it does not go far enough, because correlation measures only linear dependence between u and x. Correlation has a somewhat counterintuitive feature: it is possible for u to be uncorrelated with x while being correlated with functions of x, such as $x^2$. (See Section B.4 for further discussion.) This possibility is not acceptable for most regression purposes, as it causes problems for interpreting the model and for deriving statistical properties. A better assumption involves the expected value of u given x.
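Before turning to that assumption, a quick simulation (an invented setup) shows how zero correlation can coexist with strong dependence: with x symmetric around zero and u built from $x^2$, the sample correlation of u with x is near zero even though u is a deterministic function of x.

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=100_000)   # symmetric around zero
u = x**2 - 1                   # depends entirely on x, yet E(u) = 0

print(np.corrcoef(x, u)[0, 1])     # approximately 0: no *linear* relation
print(np.corrcoef(x**2, u)[0, 1])  # 1: u is perfectly correlated with x^2
```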
independence is implied by full independence between u and x an assumption often used in basic probability and statistics When we combine mean independence with assumption 25 we obtain the zero conditional mean assumption E1u0x2 5 0 It is critical to remember that equation 26 is the assumption with impact assumption 25 essentially defines the intercept b0 Let us see what equation 26 entails in the wage example To simplify the discussion assume that u is the same as innate ability Then equation 26 requires that the average level of ability is the same regardless of years of education For example if E1abil082 denotes the average ability for the group of all people with eight years of education and E1abil0162 denotes the average ability among people in the population with sixteen years of education then equation 26 implies that these must be the same In fact the average ability level must be the same for all education levels If for exam ple we think that average ability increases with years of education then equation 26 is false This would happen if on average people with more ability choose to become more educated As we can not observe innate ability we have no way of know ing whether or not average ability is the same for all education levels But this is an issue that we must address before relying on simple regression analysis In the fertilizer example if fertilizer amounts are chosen independently of other features of the plots then equation 26 will hold the average land quality will not depend on the amount of fertilizer However if more fertilizer is put on the higherquality plots of land then the expected value of u changes with the level of fertilizer and equation 26 fails The zero conditional mean assumption gives b1 another interpretation that is often useful Taking the expected value of equation 21 conditional on x and using E1u0x2 5 0 gives E1y0x2 5 b0 1 b1x 28 Equation 28 shows that the population regression function PRF E1y0x2 is a linear function of x The linearity means that a oneunit increase in x changes the expected value of y by the amount b1 For any given value of x the distribution of y is centered about E1y0x2 as illustrated in Figure 21 It is important to understand that equation 28 tells us how the average value of y changes with x it does not say that y equals b0 1 b1x for all units in the population For example suppose that x is the high school grade point average and y is the college GPA and we happen to know that E1colGPA0hsGPA2 5 15 1 05 hsGPA Of course in practice we never know the population intercept and slope but it is useful to pretend momentarily that we do to understand the nature of equation 28 This GPA equation tells us the average college GPA among all students who have a given high school GPA So suppose that hsGPA 5 36 Then the average colGPA for all high school graduates who attend college with hsGPA 5 36 is 15 1 051362 5 33 We are certainly not say ing that every student with hsGPA 5 36 will have a 33 college GPA this is clearly false The PRF gives us a relationship between the average level of y at different levels of x Some students with hsGPA 5 36 will have a college GPA higher than 33 and some will have a lower college GPA Whether the actual colGPA is above or below 33 depends on the unobserved factors in u and those differ among students even within the slice of the population with hsGPA 5 36 Suppose that a score on a final exam score depends on classes attended attend and unobserved factors that affect exam perfor mance such as student ability 
Exploring Further 2.1: Suppose that the score on a final exam, score, depends on classes attended (attend) and on unobserved factors that affect exam performance (such as student ability):

score = β0 + β1 attend + u.   (2.7)

When would you expect this model to satisfy equation (2.6)?

Given the zero conditional mean assumption E(u|x) = 0, it is useful to view equation (2.1) as breaking y into two components. The piece β0 + β1x, which represents E(y|x), is called the systematic part of y, that is, the part of y explained by x, and u is called the unsystematic part, or the part of y not explained by x. In Chapter 3, when we introduce more than one explanatory variable, we will discuss how to determine how large the systematic part is relative to the unsystematic part.

In the next section, we will use assumptions (2.5) and (2.6) to motivate estimators of β0 and β1 given a random sample of data. The zero conditional mean assumption also plays a crucial role in the statistical analysis in Section 2.5.

2.2 Deriving the Ordinary Least Squares Estimates

Now that we have discussed the basic ingredients of the simple regression model, we will address the important issue of how to estimate the parameters β0 and β1 in equation (2.1). To do this, we need a sample from the population. Let {(xi, yi): i = 1, ..., n} denote a random sample of size n from the population. Because these data come from equation (2.1), we can write

yi = β0 + β1xi + ui   (2.9)

for each i. Here, ui is the error term for observation i because it contains all factors affecting yi other than xi.

As an example, xi might be the annual income and yi the annual savings for family i during a particular year. If we have collected data on 15 families, then n = 15. A scatterplot of such a data set is given in Figure 2.2, along with the (necessarily fictitious) population regression function. We must decide how to use these data to obtain estimates of the intercept and slope in the population regression of savings on income.

[Figure 2.1: E(y|x) as a linear function of x.]

There are several ways to motivate the following estimation procedure. We will use equation (2.5) and an important implication of assumption (2.6): in the population, u is uncorrelated with x. Therefore, we see that u has zero expected value and that the covariance between x and u is zero:

E(u) = 0   (2.10)

and

Cov(x, u) = E(xu) = 0,   (2.11)

where the first equality in (2.11) follows from (2.10). (See Section B.4 for the definition and properties of covariance.) In terms of the observable variables x and y and the unknown parameters β0 and β1, equations (2.10) and (2.11) can be written as

E(y − β0 − β1x) = 0   (2.12)

and

E[x(y − β0 − β1x)] = 0,   (2.13)

respectively. Equations (2.12) and (2.13) imply two restrictions on the joint probability distribution of (x, y) in the population.
Since there are two unknown parameters to estimate, we might hope that equations (2.12) and (2.13) can be used to obtain good estimators of β0 and β1. In fact, they can be. Given a sample of data, we choose estimates β̂0 and β̂1 to solve the sample counterparts of equations (2.12) and (2.13):

n^{-1} \sum_{i=1}^{n} (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i) = 0   (2.14)

and

n^{-1} \sum_{i=1}^{n} x_i (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i) = 0.   (2.15)

This is an example of the method of moments approach to estimation. (See Section C.4 for a discussion of different estimation approaches.) These equations can be solved for β̂0 and β̂1.

[Figure 2.2: Scatterplot of savings and income for 15 families, and the population regression E(savings|income) = β0 + β1 income.]

Using the basic properties of the summation operator from Appendix A, equation (2.14) can be rewritten as

\bar{y} = \hat{\beta}_0 + \hat{\beta}_1 \bar{x},   (2.16)

where \bar{y} = n^{-1} \sum_{i=1}^{n} y_i is the sample average of the yi, and likewise for \bar{x}. This equation allows us to write β̂0 in terms of β̂1, ȳ, and x̄:

\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}.   (2.17)

Therefore, once we have the slope estimate β̂1, it is straightforward to obtain the intercept estimate β̂0, given ȳ and x̄.

Dropping the n^{-1} in (2.15) (since it does not affect the solution) and plugging (2.17) into (2.15) yields

\sum_{i=1}^{n} x_i [y_i - (\bar{y} - \hat{\beta}_1 \bar{x}) - \hat{\beta}_1 x_i] = 0,

which, upon rearrangement, gives

\sum_{i=1}^{n} x_i (y_i - \bar{y}) = \hat{\beta}_1 \sum_{i=1}^{n} x_i (x_i - \bar{x}).

From basic properties of the summation operator [see (A.7) and (A.8)],

\sum_{i=1}^{n} x_i (x_i - \bar{x}) = \sum_{i=1}^{n} (x_i - \bar{x})^2 and \sum_{i=1}^{n} x_i (y_i - \bar{y}) = \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}).

Therefore, provided that

\sum_{i=1}^{n} (x_i - \bar{x})^2 > 0,   (2.18)

the estimated slope is

\hat{\beta}_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}.   (2.19)

Equation (2.19) is simply the sample covariance between xi and yi divided by the sample variance of xi. Using simple algebra, we can also write β̂1 as

\hat{\beta}_1 = \hat{\rho}_{xy} \cdot (\hat{\sigma}_y / \hat{\sigma}_x),

where \hat{\rho}_{xy} is the sample correlation between xi and yi, and \hat{\sigma}_x, \hat{\sigma}_y denote the sample standard deviations. (See Appendix C for definitions of correlation and standard deviation. Dividing all sums by n − 1 does not affect the formulas.) An immediate implication is that if xi and yi are positively correlated in the sample, then β̂1 > 0; if xi and yi are negatively correlated, then β̂1 < 0.

Not surprisingly, the formula for β̂1 in terms of the sample correlation and sample standard deviations is the sample analog of the population relationship

\beta_1 = \rho_{xy} \cdot (\sigma_y / \sigma_x),

where all quantities are defined for the entire population. Recognition that β1 is just a scaled version of ρxy highlights an important limitation of simple regression when we do not have experimental data: in effect, simple regression is an analysis of correlation between two variables, and so one must be careful in inferring causality.

Although the method for obtaining (2.17) and (2.19) is motivated by (2.6), the only assumption needed to compute the estimates for a particular sample is (2.18). This is hardly an assumption at all: (2.18) is true provided the xi in the sample are not all equal to the same value. If (2.18) fails, then we have either been unlucky in obtaining our sample from the population or we have not specified an interesting problem (x does not vary in the population).
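For concreteness, here is a minimal sketch of these formulas in Python with simulated data (the variable names and "true" parameter values are our own assumptions, not from the text). It computes β̂1 and β̂0 from (2.19) and (2.17), and checks the answer by solving the two first order conditions (2.14)-(2.15) as a 2×2 linear system, which is the method-of-moments route.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(10.0, 3.0, size=200)           # simulated explanatory variable
y = 2.0 + 0.7 * x + rng.normal(0, 1, 200)     # simulated y; "true" slope is 0.7

# Slope and intercept from equations (2.19) and (2.17):
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean())**2)
b0 = y.mean() - b1 * x.mean()

# Equivalent: solve the sample moment conditions (2.14)-(2.15), without the
# 1/n factor, as a 2x2 linear system in (b0, b1).
A = np.array([[len(x), x.sum()],
              [x.sum(), (x**2).sum()]])
rhs = np.array([y.sum(), (x * y).sum()])
print(np.linalg.solve(A, rhs))   # same (b0, b1) as the closed form
print(b0, b1)
```

Both routes give identical answers, which is exactly the point: the closed-form expressions are just the explicit solution of the two moment equations.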
For example, if y = wage and x = educ, then (2.18) fails only if everyone in the sample has the same amount of education (for example, if everyone is a high school graduate; see Figure 2.3). If just one person has a different amount of education, then (2.18) holds, and the estimates can be computed.

The estimates given in (2.17) and (2.19) are called the ordinary least squares (OLS) estimates of β0 and β1. To justify this name, for any β̂0 and β̂1, define a fitted value for y when x = xi as

\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i.   (2.20)

This is the value we predict for y when x = xi, for the given intercept and slope. There is a fitted value for each observation in the sample. The residual for observation i is the difference between the actual yi and its fitted value:

\hat{u}_i = y_i - \hat{y}_i = y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i.   (2.21)

Again, there are n such residuals. (These are not the same as the errors in (2.9), a point we return to in Section 2.5.) The fitted values and residuals are indicated in Figure 2.4.

Now, suppose we choose β̂0 and β̂1 to make the sum of squared residuals,

\sum_{i=1}^{n} \hat{u}_i^2 = \sum_{i=1}^{n} (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)^2,   (2.22)

as small as possible. The appendix to this chapter shows that the conditions necessary for (β̂0, β̂1) to minimize (2.22) are given exactly by equations (2.14) and (2.15), without n^{-1}. Equations (2.14) and (2.15) are often called the first order conditions for the OLS estimates, a term that comes from optimization using calculus (see Appendix A). From our previous calculations, we know that the solutions to the OLS first order conditions are given by (2.17) and (2.19). The name "ordinary least squares" comes from the fact that these estimates minimize the sum of squared residuals.

[Figure 2.3: A scatterplot of wage against education when educi = 12 for all i.]

When we view ordinary least squares as minimizing the sum of squared residuals, it is natural to ask: why not minimize some other function of the residuals, such as the absolute values of the residuals? In fact, as we will discuss in the more advanced Section 9.4, minimizing the sum of the absolute values of the residuals is sometimes very useful, but it does have some drawbacks. First, we cannot obtain formulas for the resulting estimators; given a data set, the estimates must be obtained by numerical optimization routines. As a consequence, the statistical theory for estimators that minimize the sum of the absolute residuals is very complicated. Minimizing other functions of the residuals, say, the sum of the residuals each raised to the fourth power, has similar drawbacks. (We would never choose our estimates to minimize, say, the sum of the residuals themselves, as residuals large in magnitude but with opposite signs would tend to cancel out.)
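To see the least squares interpretation numerically, the toy check below minimizes the sum of squared residuals (2.22) with a generic optimizer and confirms that it lands on the closed-form OLS estimates; the simulated data are again an illustrative assumption, not the text's.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
x = rng.normal(10.0, 3.0, size=200)
y = 2.0 + 0.7 * x + rng.normal(0, 1, 200)

def ssr(b):
    """Sum of squared residuals (2.22) as a function of (b0, b1)."""
    return np.sum((y - b[0] - b[1] * x) ** 2)

res = minimize(ssr, x0=np.zeros(2))   # generic numerical minimization of SSR

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean())**2)
b0 = y.mean() - b1 * x.mean()
print(res.x)      # numerical minimizer of (2.22)
print(b0, b1)     # closed-form OLS from (2.17) and (2.19): the same point
```

For OLS the detour through a numerical optimizer is unnecessary, precisely because the first order conditions have an explicit solution; for least absolute deviations, as the text notes, no such formula exists and the numerical route is the only one.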
With OLS, we will be able to derive unbiasedness, consistency, and other important statistical properties relatively easily. Plus, as the motivation in equations (2.12) and (2.13) suggests, and as we will see in Section 2.5, OLS is suited for estimating the parameters appearing in the conditional mean function (2.8).

Once we have determined the OLS intercept and slope estimates, we form the OLS regression line:

\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x,   (2.23)

where it is understood that β̂0 and β̂1 have been obtained using equations (2.17) and (2.19). The notation ŷ, read as "y hat," emphasizes that the predicted values from equation (2.23) are estimates. The intercept, β̂0, is the predicted value of y when x = 0, although in some cases it will not make sense to set x = 0. In those situations, β̂0 is not, in itself, very interesting. When using (2.23) to compute predicted values of y for various values of x, we must account for the intercept in the calculations. Equation (2.23) is also called the sample regression function (SRF) because it is the estimated version of the population regression function E(y|x) = β0 + β1x. It is important to remember that the PRF is something fixed but unknown in the population. Because the SRF is obtained for a given sample of data, a new sample will generate a different slope and intercept in equation (2.23).

In most cases, the slope estimate, which we can write as

\hat{\beta}_1 = \Delta\hat{y}/\Delta x,   (2.24)

is of primary interest. It tells us the amount by which ŷ changes when x increases by one unit. Equivalently,

\Delta\hat{y} = \hat{\beta}_1 \Delta x,   (2.25)

so that, given any change in x (whether positive or negative), we can compute the predicted change in y.

[Figure 2.4: Fitted values and residuals.]

We now present several examples of simple regression obtained by using real data. In other words, we find the intercept and slope estimates with equations (2.17) and (2.19). Since these examples involve many observations, the calculations were done using an econometrics software package. At this point, you should be careful not to read too much into these regressions; they are not necessarily uncovering a causal relationship. We have said nothing so far about the statistical properties of OLS. In Section 2.5, we consider statistical properties after we explicitly impose assumptions on the population model, equation (2.1).

Example 2.3 (CEO Salary and Return on Equity): For the population of chief executive officers, let y be annual salary (salary) in thousands of dollars. Thus, y = 856.3 indicates an annual salary of $856,300, and y = 1,452.6 indicates a salary of $1,452,600. Let x be the average return on equity (roe) for the CEO's firm for the previous three years. (Return on equity is defined in terms of net income as a percentage of common equity.) For example, if roe = 10, then average return on equity is 10%. To study the relationship between this measure of firm performance and CEO compensation, we postulate the simple model

salary = β0 + β1 roe + u.
The slope parameter β1 measures the change in annual salary, in thousands of dollars, when return on equity increases by one percentage point. Because a higher roe is good for the company, we think β1 > 0.

The data set CEOSAL1 contains information on 209 CEOs for the year 1990; these data were obtained from Business Week (5/6/91). In this sample, the average annual salary is $1,281,120, with the smallest and largest being $223,000 and $14,822,000, respectively. The average return on equity for the years 1988, 1989, and 1990 is 17.18%, with the smallest and largest values being 0.5% and 56.3%, respectively.

Using the data in CEOSAL1, the OLS regression line relating salary to roe is

\widehat{salary} = 963.191 + 18.501 roe   (2.26)
n = 209,

where the intercept and slope estimates have been rounded to three decimal places; we use "salary hat" to indicate that this is an estimated equation. How do we interpret the equation? First, if the return on equity is zero, roe = 0, then the predicted salary is the intercept, 963.191, which equals $963,191 since salary is measured in thousands. Next, we can write the predicted change in salary as a function of the change in roe: Δ\widehat{salary} = 18.501 (Δroe). This means that if the return on equity increases by one percentage point, Δroe = 1, then salary is predicted to change by about 18.5, or $18,500. Because (2.26) is a linear equation, this is the estimated change regardless of the initial salary.

We can easily use (2.26) to compare predicted salaries at different values of roe. Suppose roe = 30. Then \widehat{salary} = 963.191 + 18.501(30) = 1,518.221, which is just over $1.5 million. However, this does not mean that a particular CEO whose firm had roe = 30 earns $1,518,221. Many other factors affect salary. This is just our prediction from the OLS regression line (2.26). The estimated line is graphed in Figure 2.5, along with the population regression function E(salary|roe). We will never know the PRF, so we cannot tell how close the SRF is to the PRF. Another sample of data will give a different regression line, which may or may not be closer to the population regression line.

Example 2.4 (Wage and Education): For the population of people in the workforce in 1976, let y = wage, where wage is measured in dollars per hour. Thus, for a particular person, if wage = 6.75, the hourly wage is $6.75. Let x = educ denote years of schooling; for example, educ = 12 corresponds to a complete high school education. Since the average wage in the sample is $5.90, the Consumer Price Index indicates that this amount is equivalent to $19.06 in 2003 dollars.

Using the data in WAGE1, where n = 526 individuals, we obtain the following OLS regression line (or sample regression function):

\widehat{wage} = -0.90 + 0.54 educ   (2.27)
n = 526.

We must interpret this equation with caution. The intercept of -0.90 literally means that a person with no education has a predicted hourly wage of -90 cents an hour. This, of course, is silly. It turns out that only 18 people in the sample of 526 have less than eight years of education. Consequently, it is not surprising that the regression line does poorly at very low levels of education. For a person with eight years of education, the predicted wage is \widehat{wage} = -0.90 + 0.54(8) = 3.42, or $3.42 per hour (in 1976 dollars).
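Evaluating an estimated line such as (2.27) at different x values is simple arithmetic; the short sketch below just plugs in the coefficients as reported in the text (rounded, so the outputs are illustrative rather than re-estimated from WAGE1).

```python
# Predicted hourly wage from the estimated line (2.27): wage = -0.90 + 0.54*educ
b0_hat, b1_hat = -0.90, 0.54      # coefficients as reported in equation (2.27)

for educ in (0, 8, 12, 16):
    print(educ, "years of education:",
          round(b0_hat + b1_hat * educ, 2), "dollars/hour")
# Note the silly prediction at educ = 0, as discussed above: the line is not
# reliable at education levels barely represented in the sample.
```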
The slope estimate in (2.27) implies that one more year of education increases hourly wage by 54 cents an hour. Therefore, four more years of education increase the predicted wage by 4(0.54) = 2.16, or $2.16 per hour. These are fairly large effects.

[Figure 2.5: The OLS regression line \widehat{salary} = 963.191 + 18.501 roe and the (unknown) population regression function.]

Exploring Further 2.2: The estimated wage from (2.27), when educ = 8, is $3.42 in 1976 dollars. What is this value in 2003 dollars? (Hint: you have enough information in Example 2.4 to answer this question.)

Because of the linear nature of (2.27), another year of education increases the wage by the same amount, regardless of the initial level of education. In Section 2.4, we discuss some methods that allow for nonconstant marginal effects of our explanatory variables.

Example 2.5 (Voting Outcomes and Campaign Expenditures): The file VOTE1 contains data on election outcomes and campaign expenditures for 173 two-party races for the U.S. House of Representatives in 1988. There are two candidates in each race, A and B. Let voteA be the percentage of the vote received by Candidate A and shareA be the percentage of total campaign expenditures accounted for by Candidate A. Many factors other than shareA affect the election outcome (including the quality of the candidates and possibly the dollar amounts spent by A and B). Nevertheless, we can estimate a simple regression model to find out whether spending more relative to one's challenger implies a higher percentage of the vote.

The estimated equation using the 173 observations is

\widehat{voteA} = 26.81 + 0.464 shareA   (2.28)
n = 173.

This means that if Candidate A's share of spending increases by one percentage point, Candidate A receives almost one-half a percentage point (0.464) more of the total vote. Whether or not this is a causal effect is unclear, but it is not unbelievable. If shareA = 50, voteA is predicted to be about 50, or half the vote.

In some cases, regression analysis is not used to determine causality but to simply look at whether two variables are positively or negatively related, much like a standard correlation analysis. An example of this occurs in Computer Exercise C3, where you are asked to use data from Biddle and Hamermesh (1990), on time spent sleeping and working, to investigate the tradeoff between these two factors.

2.2a A Note on Terminology

In most cases, we will indicate the estimation of a relationship through OLS by writing an equation such as (2.26), (2.27), or (2.28). Sometimes, for the sake of brevity, it is useful to indicate that an OLS regression has been run without actually writing out the equation. We will often indicate that equation (2.23) has been obtained by OLS in saying that we

run the regression of y on x,   (2.29)

or simply that we regress y on x. The positions of y and x in (2.29) indicate which is the dependent variable and which is the independent variable: we always regress the dependent variable on the independent variable. For specific applications, we replace y and x with their names.
Thus, to obtain (2.26), we regress salary on roe, and to obtain (2.28), we regress voteA on shareA. When we use such terminology in (2.29), we will always mean that we plan to estimate the intercept, β̂0, along with the slope, β̂1. This case is appropriate for the vast majority of applications. Occasionally, we may want to estimate the relationship between y and x assuming that the intercept is zero (so that x = 0 implies that ŷ = 0); we cover this case briefly in Section 2.6. Unless explicitly stated otherwise, we always estimate an intercept along with a slope.

Exploring Further 2.3: In Example 2.5, what is the predicted vote for Candidate A if shareA = 60 (which means 60%)? Does this answer seem reasonable?

2.3 Properties of OLS on Any Sample of Data

In the previous section, we went through the algebra of deriving the formulas for the OLS intercept and slope estimates. In this section, we cover some further algebraic properties of the fitted OLS regression line. The best way to think about these properties is to remember that they hold, by construction, for any sample of data. The harder task, considering the properties of OLS across all possible random samples of data, is postponed until Section 2.5.

Several of the algebraic properties we are going to derive will appear mundane. Nevertheless, having a grasp of these properties helps us to figure out what happens to the OLS estimates and related statistics when the data are manipulated in certain ways, such as when the measurement units of the dependent and independent variables change.

2.3a Fitted Values and Residuals

We assume that the intercept and slope estimates, β̂0 and β̂1, have been obtained for the given sample of data. Given β̂0 and β̂1, we can obtain the fitted value ŷi for each observation. [This is given by equation (2.20).] By definition, each fitted value ŷi is on the OLS regression line. The OLS residual associated with observation i, ûi, is the difference between yi and its fitted value, as given in equation (2.21). If ûi is positive, the line underpredicts yi; if ûi is negative, the line overpredicts yi. The ideal case for observation i is when ûi = 0, but in most cases, every residual is not equal to zero. In other words, none of the data points need actually lie on the OLS line.

Example 2.6 (CEO Salary and Return on Equity): Table 2.2 contains a listing of the first 15 observations in the CEO data set, along with the fitted values (called salaryhat) and the residuals (called uhat). The first four CEOs have lower salaries than what we predicted from the OLS regression line (2.26); in other words, given only the firm's roe, these CEOs make less than what we predicted. As can be seen from the positive uhat, the fifth CEO makes more than predicted from the OLS regression line.

2.3b Algebraic Properties of OLS Statistics

There are several useful algebraic properties of OLS estimates and their associated statistics. We now cover the three most important of these.

1. The sum, and therefore the sample average, of the OLS residuals is zero. Mathematically,

\sum_{i=1}^{n} \hat{u}_i = 0.   (2.30)

This property needs no proof; it follows immediately from the OLS first order condition (2.14), when we remember that the residuals are defined by ûi = yi − β̂0 − β̂1xi.
In other words, the OLS estimates β̂0 and β̂1 are chosen to make the residuals add up to zero (for any data set). This says nothing about the residual for any particular observation i.

2. The sample covariance between the regressors and the OLS residuals is zero. This follows from the first order condition (2.15), which can be written in terms of the residuals as

\sum_{i=1}^{n} x_i \hat{u}_i = 0.   (2.31)

The sample average of the OLS residuals is zero, so the left-hand side of (2.31) is proportional to the sample covariance between xi and ûi.

3. The point (x̄, ȳ) is always on the OLS regression line. In other words, if we take equation (2.23) and plug in x̄ for x, then the predicted value is ȳ. This is exactly what equation (2.16) showed us.

Example 2.7 (Wage and Education): For the data in WAGE1, the average hourly wage in the sample is 5.90, rounded to two decimal places, and the average education is 12.56. If we plug educ = 12.56 into the OLS regression line (2.27), we get \widehat{wage} = -0.90 + 0.54(12.56) = 5.8824, which equals 5.9 when rounded to the first decimal place. These figures do not exactly agree because we have rounded the average wage and education, as well as the intercept and slope estimates. If we did not initially round any of the values, we would get the answers to agree more closely, but to little useful effect.

Table 2.2: Fitted Values and Residuals for the First 15 CEOs

obsno   roe    salary   salaryhat    uhat
 1      14.1   1095     1224.058    -129.0581
 2      10.9   1001     1164.854    -163.8542
 3      23.5   1122     1397.969    -275.9692
 4       5.9    578     1072.348    -494.3484
 5      13.8   1368     1218.508     149.4923
 6      20.0   1145     1333.215    -188.2151
 7      16.4   1078     1266.611    -188.6108
 8      16.3   1094     1264.761    -170.7606
 9      10.5   1237     1157.454      79.54626
10      26.3    833     1449.773    -616.7726
11      25.9    567     1442.372    -875.3721
12      26.8    933     1459.023    -526.0231
13      14.8   1339     1237.009     101.9911
14      22.3    937     1375.768    -438.7678
15      56.3   2011     2004.808       6.191895

Writing each yi as its fitted value, plus its residual, provides another way to interpret an OLS regression. For each i, write

y_i = \hat{y}_i + \hat{u}_i.   (2.32)

From property (1), the average of the residuals is zero; equivalently, the sample average of the fitted values, ŷi, is the same as the sample average of the yi. Further, properties (1) and (2) can be used to show that the sample covariance between ŷi and ûi is zero. Thus, we can view OLS as decomposing each yi into two parts, a fitted value and a residual. The fitted values and residuals are uncorrelated in the sample.
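These three properties are easy to verify numerically. The sketch below, using simulated data (our own assumption, not the text's), computes the OLS fit by hand and prints the quantities in (2.30) and (2.31), plus the check that (x̄, ȳ) lies on the line; up to floating-point rounding, the first two are zero and the last two numbers agree.

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(5.0, 2.0, size=100)
y = 1.0 + 0.5 * x + rng.normal(0, 1, 100)

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean())**2)
b0 = y.mean() - b1 * x.mean()
yhat = b0 + b1 * x        # fitted values (2.20)
uhat = y - yhat           # residuals (2.21)

print(uhat.sum())                    # property 1: ~0, equation (2.30)
print((x * uhat).sum())              # property 2: ~0, equation (2.31)
print(y.mean(), b0 + b1 * x.mean())  # property 3: (xbar, ybar) is on the line
```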
Define the total sum of squares (SST), the explained sum of squares (SSE), and the residual sum of squares (SSR) (also known as the sum of squared residuals) as follows:

SST \equiv \sum_{i=1}^{n} (y_i - \bar{y})^2.   (2.33)

SSE \equiv \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2.   (2.34)

SSR \equiv \sum_{i=1}^{n} \hat{u}_i^2.   (2.35)

SST is a measure of the total sample variation in the yi; that is, it measures how spread out the yi are in the sample. If we divide SST by n − 1, we obtain the sample variance of y, as discussed in Appendix C. Similarly, SSE measures the sample variation in the ŷi (where we use the fact that the average of the ŷi equals ȳ), and SSR measures the sample variation in the ûi. The total variation in y can always be expressed as the sum of the explained variation and the unexplained variation, SSR. Thus,

SST = SSE + SSR.   (2.36)

Proving (2.36) is not difficult, but it requires us to use all of the properties of the summation operator covered in Appendix A. Write

\sum_{i=1}^{n} (y_i - \bar{y})^2 = \sum_{i=1}^{n} [(y_i - \hat{y}_i) + (\hat{y}_i - \bar{y})]^2
= \sum_{i=1}^{n} [\hat{u}_i + (\hat{y}_i - \bar{y})]^2
= \sum_{i=1}^{n} \hat{u}_i^2 + 2 \sum_{i=1}^{n} \hat{u}_i (\hat{y}_i - \bar{y}) + \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2
= SSR + 2 \sum_{i=1}^{n} \hat{u}_i (\hat{y}_i - \bar{y}) + SSE.

Now, (2.36) holds if we show that

\sum_{i=1}^{n} \hat{u}_i (\hat{y}_i - \bar{y}) = 0.   (2.37)

But we have already claimed that the sample covariance between the residuals and the fitted values is zero, and this covariance is just (2.37) divided by n − 1. Thus, we have established (2.36).

Some words of caution about SST, SSE, and SSR are in order. There is no uniform agreement on the names or abbreviations for the three quantities defined in equations (2.33), (2.34), and (2.35). The total sum of squares is called either SST or TSS, so there is little confusion here. Unfortunately, the explained sum of squares is sometimes called the "regression sum of squares." If this term is given its natural abbreviation, it can easily be confused with the term "residual sum of squares." Some regression packages refer to the explained sum of squares as the "model sum of squares." To make matters even worse, the residual sum of squares is often called the "error sum of squares." This is especially unfortunate because, as we will see in Section 2.5, the errors and the residuals are different quantities. Thus, we will always call (2.35) the residual sum of squares or the sum of squared residuals. We prefer to use the abbreviation SSR to denote the sum of squared residuals, because it is more common in econometric packages.

2.3c Goodness-of-Fit

So far, we have no way of measuring how well the explanatory or independent variable, x, explains the dependent variable, y. It is often useful to compute a number that summarizes how well the OLS regression line fits the data. In the following discussion, be sure to remember that we assume that an intercept is estimated along with the slope.

Assuming that the total sum of squares, SST, is not equal to zero (which is true except in the very unlikely event that all the yi equal the same value), we can divide (2.36) by SST to get 1 = SSE/SST + SSR/SST. The R-squared of the regression, sometimes called the coefficient of determination, is defined as

R^2 \equiv SSE/SST = 1 - SSR/SST.   (2.38)

R² is the ratio of the explained variation to the total variation; thus, it is interpreted as the fraction of the sample variation in y that is explained by x. The second equality in (2.38) provides another way of computing R².
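A quick numerical check of the decomposition (2.36) and of the two expressions for R² in (2.38), again with simulated data (our own assumption):

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(0.0, 1.0, size=500)
y = 3.0 + 2.0 * x + rng.normal(0, 2, 500)

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean())**2)
b0 = y.mean() - b1 * x.mean()
yhat = b0 + b1 * x
uhat = y - yhat

sst = np.sum((y - y.mean())**2)       # total sum of squares (2.33)
sse = np.sum((yhat - y.mean())**2)    # explained sum of squares (2.34)
ssr = np.sum(uhat**2)                 # residual sum of squares (2.35)

print(sst, sse + ssr)                 # (2.36): equal up to rounding error
print(sse / sst, 1 - ssr / sst)       # R-squared, both ways in (2.38)
```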
From (2.36), the value of R² is always between zero and one, because SSE can be no greater than SST. When interpreting R², we usually multiply it by 100 to change it into a percent: 100·R² is the percentage of the sample variation in y that is explained by x.

If the data points all lie on the same line, OLS provides a perfect fit to the data. In this case, R² = 1. A value of R² that is nearly equal to zero indicates a poor fit of the OLS line: very little of the variation in the yi is captured by the variation in the ŷi (which all lie on the OLS regression line). In fact, it can be shown that R² is equal to the square of the sample correlation coefficient between yi and ŷi. This is where the term "R-squared" came from. (The letter R was traditionally used to denote an estimate of a population correlation coefficient, and its usage has survived in regression analysis.)

Example 2.8 (CEO Salary and Return on Equity): In the CEO salary regression, we obtain the following:

\widehat{salary} = 963.191 + 18.501 roe   (2.39)
n = 209, R² = 0.0132.

We have reproduced the OLS regression line and the number of observations for clarity. Using the R-squared (rounded to four decimal places) reported for this equation, we can see how much of the variation in salary is actually explained by the return on equity. The answer is: not much. The firm's return on equity explains only about 1.3% of the variation in salaries for this sample of 209 CEOs. That means that 98.7% of the salary variation for these CEOs is left unexplained! This lack of explanatory power may not be too surprising, because many other characteristics of both the firm and the individual CEO should influence salary; these factors are necessarily included in the errors in a simple regression analysis.

In the social sciences, low R-squareds in regression equations are not uncommon, especially for cross-sectional analysis. We will discuss this issue more generally under multiple regression analysis, but it is worth emphasizing now that a seemingly low R-squared does not necessarily mean that an OLS regression equation is useless. It is still possible that (2.39) is a good estimate of the ceteris paribus relationship between salary and roe; whether or not this is true does not depend directly on the size of R-squared. Students who are first learning econometrics tend to put too much weight on the size of the R-squared in evaluating regression equations. For now, be aware that using R-squared as the main gauge of success for an econometric analysis can lead to trouble.

Sometimes, the explanatory variable explains a substantial part of the sample variation in the dependent variable.

Example 2.9 (Voting Outcomes and Campaign Expenditures): In the voting outcome equation in (2.28), R² = 0.856. Thus, the share of campaign expenditures explains over 85% of the variation in the election outcomes for this sample. This is a sizable portion!
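The claim that R² equals the squared sample correlation between yi and ŷi is also easy to verify; the sketch below (simulated data, our own assumption) computes R² from (2.38) and from the correlation, and prints both.

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.normal(0.0, 1.0, size=300)
y = 1.0 + 0.8 * x + rng.normal(0, 1.5, 300)

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean())**2)
yhat = (y.mean() - b1 * x.mean()) + b1 * x
uhat = y - yhat

r2_from_ssr = 1 - np.sum(uhat**2) / np.sum((y - y.mean())**2)   # (2.38)
r2_from_corr = np.corrcoef(y, yhat)[0, 1] ** 2                  # squared correlation
print(r2_from_ssr, r2_from_corr)    # identical up to rounding error
```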
2.4 Units of Measurement and Functional Form

Two important issues in applied economics are (1) understanding how changing the units of measurement of the dependent and/or independent variables affects OLS estimates, and (2) knowing how to incorporate popular functional forms used in economics into regression analysis. The mathematics needed for a full understanding of functional form issues is reviewed in Appendix A.

2.4a The Effects of Changing Units of Measurement on OLS Statistics

In Example 2.3, we chose to measure annual salary in thousands of dollars, and the return on equity was measured as a percentage (rather than as a decimal). It is crucial to know how salary and roe are measured in this example in order to make sense of the estimates in equation (2.39).

We must also know that OLS estimates change in entirely expected ways when the units of measurement of the dependent and independent variables change. In Example 2.3, suppose that, rather than measuring salary in thousands of dollars, we measure it in dollars. Let salardol be salary in dollars (salardol = 845,761 would be interpreted as $845,761). Of course, salardol has a simple relationship to the salary measured in thousands of dollars: salardol = 1,000·salary. We do not need to actually run the regression of salardol on roe to know that the estimated equation is

\widehat{salardol} = 963,191 + 18,501 roe.   (2.40)

We obtain the intercept and slope in (2.40) simply by multiplying the intercept and the slope in (2.39) by 1,000. This gives equations (2.39) and (2.40) the same interpretation. Looking at (2.40), if roe = 0, then \widehat{salardol} = 963,191, so the predicted salary is $963,191 [the same value we obtained from equation (2.39)]. Furthermore, if roe increases by one, then the predicted salary increases by $18,501; again, this is what we concluded from our earlier analysis of equation (2.39).

Generally, it is easy to figure out what happens to the intercept and slope estimates when the dependent variable changes units of measurement. If the dependent variable is multiplied by the constant c (which means each value in the sample is multiplied by c), then the OLS intercept and slope estimates are also multiplied by c. (This assumes nothing has changed about the independent variable.) In the CEO salary example, c = 1,000 in moving from salary to salardol.

Exploring Further 2.4: Suppose that salary is measured in hundreds of dollars, rather than in thousands of dollars, say, salarhun. What will be the OLS intercept and slope estimates in the regression of salarhun on roe?

We can also use the CEO salary example to see what happens when we change the units of measurement of the independent variable. Define roedec = roe/100 to be the decimal equivalent of roe; thus, roedec = 0.23 means a return on equity of 23%. To focus on changing the units of measurement of the independent variable, we return to our original dependent variable, salary, which is measured in thousands of dollars. When we regress salary on roedec, we obtain

\widehat{salary} = 963.191 + 1,850.1 roedec.   (2.41)

The coefficient on roedec is 100 times the coefficient on roe in (2.39). This is as it should be. Changing roe by one percentage point is equivalent to Δroedec = 0.01. From (2.41), if Δroedec = 0.01, then Δ\widehat{salary} = 1,850.1(0.01) = 18.501, which is what is obtained by using (2.39). Note that, in moving from (2.39) to (2.41), the independent variable was divided by 100, and so the OLS slope estimate was multiplied by 100, preserving the interpretation of the equation.
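These scaling rules can be confirmed mechanically. The sketch below simulates data whose magnitudes loosely mimic the CEOSAL1 variables (the numbers are our own assumptions, so the estimates will not reproduce (2.39) exactly) and re-runs OLS after rescaling y and then x.

```python
import numpy as np

def ols(x, y):
    """OLS intercept and slope from equations (2.17) and (2.19)."""
    b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean())**2)
    return y.mean() - b1 * x.mean(), b1

rng = np.random.default_rng(5)
roe = rng.uniform(0.5, 56.3, size=209)                     # percent, CEOSAL1-like range
salary = 963.0 + 18.5 * roe + rng.normal(0, 1300, 209)     # thousands of dollars (simulated)

print(ols(roe, salary))          # baseline, analogous to (2.39)
print(ols(roe, 1000 * salary))   # y in dollars: intercept and slope x1000, as in (2.40)
print(ols(roe / 100, salary))    # x as a decimal: slope x100, intercept unchanged, as in (2.41)
```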
Generally, if the independent variable is divided or multiplied by some nonzero constant, c, then the OLS slope coefficient is multiplied or divided by c, respectively. The intercept has not changed in (2.41) because roedec = 0 still corresponds to a zero return on equity. In general, changing the units of measurement of only the independent variable does not affect the intercept.

In the previous section, we defined R-squared as a goodness-of-fit measure for OLS regression. We can also ask what happens to R² when the unit of measurement of either the independent or the dependent variable changes. Without doing any algebra, we should know the result: the goodness of fit of the model should not depend on the units of measurement of our variables. For example, the amount of variation in salary explained by the return on equity should not depend on whether salary is measured in dollars or in thousands of dollars, or on whether return on equity is a percentage or a decimal. This intuition can be verified mathematically: using the definition of R², it can be shown that R² is, in fact, invariant to changes in the units of y or x.

2.4b Incorporating Nonlinearities in Simple Regression

So far, we have focused on linear relationships between the dependent and independent variables. As we mentioned in Chapter 1, linear relationships are not nearly general enough for all economic applications. Fortunately, it is rather easy to incorporate many nonlinearities into simple regression analysis by appropriately defining the dependent and independent variables. Here, we will cover two possibilities that often appear in applied work.

In reading applied work in the social sciences, you will often encounter regression equations where the dependent variable appears in logarithmic form. Why is this done? Recall the wage-education example, where we regressed hourly wage on years of education. We obtained a slope estimate of 0.54 [see equation (2.27)], which means that each additional year of education is predicted to increase hourly wage by 54 cents. Because of the linear nature of (2.27), 54 cents is the increase for either the first year of education or the twentieth year; this may not be reasonable.

Probably a better characterization of how wage changes with education is that each year of education increases wage by a constant percentage. For example, an increase in education from 5 years to 6 years increases wage by, say, 8% (ceteris paribus), and an increase in education from 11 to 12 years also increases wage by 8%. A model that gives (approximately) a constant percentage effect is

log(wage) = β0 + β1 educ + u,   (2.42)

where log(·) denotes the natural logarithm. (See Appendix A for a review of logarithms.) In particular, if Δu = 0, then

%Δwage ≈ (100·β1)Δeduc.   (2.43)

Notice how we multiply β1 by 100 to get the percentage change in wage given one additional year of education. Since the percentage change in wage is the same for each additional year of education, the change in wage for an extra year of education increases as education increases; in other words, (2.42) implies an increasing return to education. By exponentiating (2.42), we can write wage = exp(β0 + β1 educ + u). This equation is graphed in Figure 2.6, with u = 0.
Example 2.10 (A Log Wage Equation): Using the same data as in Example 2.4, but using log(wage) as the dependent variable, we obtain the following relationship:

\widehat{log(wage)} = 0.584 + 0.083 educ   (2.44)
n = 526, R² = 0.186.

The coefficient on educ has a percentage interpretation when it is multiplied by 100: wage increases by 8.3% for every additional year of education. This is what economists mean when they refer to the "return to another year of education."

It is important to remember that the main reason for using the log of wage in (2.42) is to impose a constant percentage effect of education on wage. Once equation (2.44) is obtained, the natural log of wage is rarely mentioned. In particular, it is not correct to say that another year of education increases log(wage) by 8.3%.

The intercept in (2.44) is not very meaningful, because it gives the predicted log(wage) when educ = 0. The R-squared shows that educ explains about 18.6% of the variation in log(wage) (not wage). Finally, equation (2.44) might not capture all of the nonlinearity in the relationship between wage and schooling. If there are "diploma effects," then the twelfth year of education (graduation from high school) could be worth much more than the eleventh year. We will learn how to allow for this kind of nonlinearity in Chapter 7.

[Figure 2.6: wage = exp(β0 + β1 educ), with β1 > 0.]

Estimating a model such as (2.42) is straightforward when using simple regression. Just define the dependent variable, y, to be y = log(wage). The independent variable is represented by x = educ. The mechanics of OLS are the same as before: the intercept and slope estimates are given by the formulas (2.17) and (2.19). In other words, we obtain β̂0 and β̂1 from the OLS regression of log(wage) on educ.

Another important use of the natural log is in obtaining a constant elasticity model.

Example 2.11 (CEO Salary and Firm Sales): We can estimate a constant elasticity model relating CEO salary to firm sales. The data set is the same one used in Example 2.3, except we now relate salary to sales. Let sales be annual firm sales, measured in millions of dollars. A constant elasticity model is

log(salary) = β0 + β1 log(sales) + u,   (2.45)

where β1 is the elasticity of salary with respect to sales. This model falls under the simple regression model by defining the dependent variable to be y = log(salary) and the independent variable to be x = log(sales). Estimating this equation by OLS gives

\widehat{log(salary)} = 4.822 + 0.257 log(sales)   (2.46)
n = 209, R² = 0.211.

The coefficient on log(sales) is the estimated elasticity of salary with respect to sales. It implies that a 1% increase in firm sales increases CEO salary by about 0.257%, the usual interpretation of an elasticity.

The two functional forms covered in this section will often arise in the remainder of this text. We have covered models containing natural logarithms here because they appear so frequently in applied work.
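Mechanically, estimating (2.42) or (2.45) is just OLS after transforming the variables. The sketch below makes this point with simulated data whose parameters are chosen (as our own assumption) to resemble Examples 2.10 and 2.11; with real data you would simply replace the simulated arrays with, say, log(wage) and educ from WAGE1.

```python
import numpy as np

def ols(x, y):
    """OLS intercept and slope from (2.17) and (2.19)."""
    b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean())**2)
    return y.mean() - b1 * x.mean(), b1

rng = np.random.default_rng(6)

# Log-level model (2.42): each year of schooling raises wage by ~8.3%.
educ = rng.integers(8, 19, size=526).astype(float)
logwage = 0.58 + 0.083 * educ + rng.normal(0, 0.4, 526)
print(ols(educ, logwage))       # slope ~ 0.083; 100*slope is the % return

# Log-log (constant elasticity) model (2.45):
logsales = rng.normal(7.0, 1.5, size=209)
logsalary = 4.8 + 0.26 * logsales + rng.normal(0, 0.5, 209)
print(ols(logsales, logsalary))  # slope ~ 0.26 is the elasticity
```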
The interpretation of such models will not be much different in the multiple regression case.

It is also useful to note what happens to the intercept and slope estimates if we change the units of measurement of the dependent variable when it appears in logarithmic form. Because the change to logarithmic form approximates a proportionate change, it makes sense that nothing happens to the slope. We can see this by writing the rescaled variable as c1·yi for each observation i. The original equation is log(yi) = β0 + β1xi + ui. If we add log(c1) to both sides, we get log(c1) + log(yi) = [log(c1) + β0] + β1xi + ui, or

log(c1·yi) = [log(c1) + β0] + β1xi + ui.

(Remember that the sum of the logs is equal to the log of their product, as shown in Appendix A.) Therefore, the slope is still β1, but the intercept is now log(c1) + β0. Similarly, if the independent variable is log(x), and we change the units of measurement of x before taking the log, the slope remains the same, but the intercept changes. You will be asked to verify these claims in Problem 9.

We end this subsection by summarizing four combinations of functional forms available from using either the original variable or its natural log. In Table 2.3, x and y stand for the variables in their original form. The model with y as the dependent variable and x as the independent variable is called the level-level model, because each variable appears in its level form. The model with log(y) as the dependent variable and x as the independent variable is called the log-level model. We will not explicitly discuss the level-log model here, because it arises less often in practice. (In any case, we will see examples of this model in later chapters.) The last column in Table 2.3 gives the interpretation of β1. In the log-level model, 100·β1 is sometimes called the semi-elasticity of y with respect to x. As we mentioned in Example 2.11, in the log-log model, β1 is the elasticity of y with respect to x. Table 2.3 warrants careful study, as we will refer to it often in the remainder of the text.

Table 2.3: Summary of Functional Forms Involving Logarithms

Model        Dependent Variable   Independent Variable   Interpretation of β1
Level-level  y                    x                      Δy = β1 Δx
Level-log    y                    log(x)                 Δy = (β1/100) %Δx
Log-level    log(y)               x                      %Δy = (100 β1) Δx
Log-log      log(y)               log(x)                 %Δy = β1 %Δx

2.4c The Meaning of "Linear" Regression

The simple regression model that we have studied in this chapter is also called the simple linear regression model. Yet, as we have just seen, the general model also allows for certain nonlinear relationships. So what does "linear" mean here? You can see by looking at equation (2.1) that y = β0 + β1x + u. The key is that this equation is linear in the parameters β0 and β1. There are no restrictions on how y and x relate to the original explained and explanatory variables of interest. As we saw in Examples 2.10 and 2.11, y and x can be natural logs of variables, and this is quite common in applications. But we need not stop there. For example, nothing prevents us from using simple regression to estimate a model such as cons = β0 + β1√inc + u, where cons is annual consumption and inc is annual income. (The square root on inc, lost in some reproductions of this passage, is what makes the example nonlinear in the variables.)
Whereas the mechanics of simple regression do not depend on how y and x are defined, the interpretation of the coefficients does depend on their definitions. For successful empirical work, it is much more important to become proficient at interpreting coefficients than to become efficient at computing formulas such as (2.19). We will get much more practice with interpreting the estimates in OLS regression lines when we study multiple regression.

Plenty of models cannot be cast as a linear regression model because they are not linear in their parameters; an example is cons = 1/(β0 + β1 inc) + u. Estimation of such models takes us into the realm of the nonlinear regression model, which is beyond the scope of this text. For most applications, choosing a model that can be put into the linear regression framework is sufficient.

2.5 Expected Values and Variances of the OLS Estimators

In Section 2.1, we defined the population model y = β0 + β1x + u, and we claimed that the key assumption for simple regression analysis to be useful is that the expected value of u given any value of x is zero. In Sections 2.2, 2.3, and 2.4, we discussed the algebraic properties of OLS estimation. We now return to the population model and study the statistical properties of OLS. In other words, we now view β̂0 and β̂1 as estimators for the parameters β0 and β1 that appear in the population model. This means that we will study properties of the distributions of β̂0 and β̂1 over different random samples from the population. (Appendix C contains definitions of estimators and reviews some of their important properties.)

2.5a Unbiasedness of OLS

We begin by establishing the unbiasedness of OLS under a simple set of assumptions. For future reference, it is useful to number these assumptions using the prefix "SLR" for simple linear regression. The first assumption defines the population model.

Assumption SLR.1 (Linear in Parameters): In the population model, the dependent variable, y, is related to the independent variable, x, and the error (or disturbance), u, as

y = β0 + β1x + u,   (2.47)

where β0 and β1 are the population intercept and slope parameters, respectively.

To be realistic, y, x, and u are all viewed as random variables in stating the population model. We discussed the interpretation of this model at some length in Section 2.1 and gave several examples. In the previous section, we learned that equation (2.47) is not as restrictive as it initially seems; by choosing y and x appropriately, we can obtain interesting nonlinear relationships (such as constant elasticity models).
of random samples but many can be We can write 247 in terms of the random sample as yi 5 b0 1 b1xi 1 ui i 5 1 2 c n 248 where ui is the error or disturbance for observation i for example person i firm i city i and so on Thus ui contains the unobservables for observation i that affect yi The ui should not be confused with the residuals u i that we defined in Section 23 Later on we will explore the relationship between the errors and the residuals For interpreting b0 and b1 in a particular application 247 is most informa tive but 248 is also needed for some of the statistical derivations The relationship 248 can be plotted for a particular outcome of data as shown in Figure 27 As we already saw in Section 22 the OLS slope and intercept estimates are not defined unless we have some sample variation in the explanatory variable We now add variation in the xi to our list of assumptions y x1 xi x yi u1 y1 ui Eyx 5 0 1 1x PRF FiguRE 27 Graph of yi 5 b0 1 b1xi 1 ui Copyright 2016 Cengage Learning All Rights Reserved May not be copied scanned or duplicated in whole or in part Due to electronic rights some third party content may be suppressed from the eBook andor eChapters Editorial review has deemed that any suppressed content does not materially affect the overall learning experience Cengage Learning reserves the right to remove additional content at any time if subsequent rights restrictions require it PART 1 Regression Analysis with CrossSectional Data 42 Assumption SLR3 Sample Variation in the Explanatory Variable The sample outcomes on x namely 5xi i 5 1 c n6 are not all the same value This is a very weak assumptioncertainly not worth emphasizing but needed nevertheless If x varies in the population random samples on x will typically contain variation unless the population variation is minimal or the sample size is small Simple inspection of summary statistics on xi reveals whether Assumption SLR3 fails if the sample standard deviation of xi is zero then Assumption SLR3 fails otherwise it holds Finally in order to obtain unbiased estimators of b0 and b1 we need to impose the zero condi tional mean assumption that we discussed in some detail in Section 21 We now explicitly add it to our list of assumptions Assumption SLR4 Zero Conditional mean The error u has an expected value of zero given any value of the explanatory variable In other words E1u0x2 5 0 For a random sample this assumption implies that E1ui0xi2 5 0 for all i 5 1 2 c n In addition to restricting the relationship between u and x in the population the zero conditional mean assumptioncoupled with the random sampling assumptionallows for a convenient technical simplification In particular we can derive the statistical properties of the OLS estimators as conditional on the values of the xi in our sample Technically in statistical derivations conditioning on the sample values of the independent variable is the same as treating the xi as fixed in repeated samples which we think of as follows We first choose n sample values for x1 x2 c xn These can be repeated Given these values we then obtain a sample on y effectively by obtaining a random sample of the ui Next another sample of y is obtained using the same values for x1 x2 c xn Then another sample of y is obtained again using the same x1 x2 c xn And so on The fixedinrepeatedsamples scenario is not very realistic in nonexperimental contexts For instance in sampling individuals for the wageeducation example it makes little sense to think of choosing the values of educ ahead of time and 
Random sampling, where individuals are chosen randomly and their wage and education are both recorded, is representative of how most data sets are obtained for empirical analysis in the social sciences. Once we assume that E(u|x) = 0, and we have random sampling, nothing is lost in derivations by treating the xi as nonrandom. The danger is that the fixed-in-repeated-samples assumption always implies that ui and xi are independent. In deciding when simple regression analysis is going to produce unbiased estimators, it is critical to think in terms of Assumption SLR.4.

Now, we are ready to show that the OLS estimators are unbiased. To this end, we use the fact that \sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y}) = \sum_{i=1}^{n}(x_i - \bar{x})y_i (see Appendix A) to write the OLS slope estimator in equation (2.19) as

\hat{\beta}_1 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})y_i}{\sum_{i=1}^{n}(x_i - \bar{x})^2}.   (2.49)

Because we are now interested in the behavior of β̂1 across all possible samples, β̂1 is properly viewed as a random variable.

We can write β̂1 in terms of the population coefficients and errors by substituting the right-hand side of (2.48) into (2.49). We have

\hat{\beta}_1 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})y_i}{SST_x} = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(\beta_0 + \beta_1 x_i + u_i)}{SST_x},   (2.50)

where we have defined the total variation in the xi as SST_x = \sum_{i=1}^{n}(x_i - \bar{x})^2 to simplify the notation. (This is not quite the sample variance of the xi, because we do not divide by n − 1.) Using the algebra of the summation operator, write the numerator of β̂1 as

\sum_{i=1}^{n}(x_i - \bar{x})\beta_0 + \sum_{i=1}^{n}(x_i - \bar{x})\beta_1 x_i + \sum_{i=1}^{n}(x_i - \bar{x})u_i
= \beta_0 \sum_{i=1}^{n}(x_i - \bar{x}) + \beta_1 \sum_{i=1}^{n}(x_i - \bar{x})x_i + \sum_{i=1}^{n}(x_i - \bar{x})u_i.   (2.51)

As shown in Appendix A, \sum_{i=1}^{n}(x_i - \bar{x}) = 0 and \sum_{i=1}^{n}(x_i - \bar{x})x_i = \sum_{i=1}^{n}(x_i - \bar{x})^2 = SST_x. Therefore, we can write the numerator of β̂1 as \beta_1 SST_x + \sum_{i=1}^{n}(x_i - \bar{x})u_i. Putting this over the denominator gives

\hat{\beta}_1 = \beta_1 + \frac{\sum_{i=1}^{n}(x_i - \bar{x})u_i}{SST_x} = \beta_1 + (1/SST_x) \sum_{i=1}^{n} d_i u_i,   (2.52)

where di = xi − x̄. We now see that the estimator β̂1 equals the population slope, β1, plus a term that is a linear combination of the errors {u1, u2, ..., un}. Conditional on the values of xi, the randomness in β̂1 is due entirely to the errors in the sample. The fact that these errors are generally different from zero is what causes β̂1 to differ from β1.

Using the representation in (2.52), we can prove the first important statistical property of OLS.

Theorem 2.1 (Unbiasedness of OLS): Using Assumptions SLR.1 through SLR.4,

E(β̂0) = β0 and E(β̂1) = β1,   (2.53)

for any values of β0 and β1. In other words, β̂0 is unbiased for β0, and β̂1 is unbiased for β1.

PROOF: In this proof, the expected values are conditional on the sample values of the independent variable. Because SST_x and di are functions only of the xi, they are nonrandom in the conditioning. Therefore, from (2.52), and keeping the conditioning on {x1, x2, ..., xn} implicit, we have

E(\hat{\beta}_1) = \beta_1 + E[(1/SST_x) \sum_{i=1}^{n} d_i u_i] = \beta_1 + (1/SST_x) \sum_{i=1}^{n} E(d_i u_i)
= \beta_1 + (1/SST_x) \sum_{i=1}^{n} d_i E(u_i) = \beta_1 + (1/SST_x) \sum_{i=1}^{n} d_i \cdot 0 = \beta_1,

where we have used the fact that the expected value of each ui (conditional on {x1, x2, ..., xn}) is zero under Assumptions SLR.2 and SLR.4. Since unbiasedness holds for any outcome on {x1, x2, ..., xn}, unbiasedness also holds without conditioning on {x1, x2, ..., xn}.

The proof for β̂0 is now straightforward. Average (2.48) across i to get ȳ = β0 + β1x̄ + ū, and plug this into the formula for β̂0:

\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} = \beta_0 + \beta_1 \bar{x} + \bar{u} - \hat{\beta}_1 \bar{x} = \beta_0 + (\beta_1 - \hat{\beta}_1)\bar{x} + \bar{u}.

Then, conditional on the values of the xi,

E(\hat{\beta}_0) = \beta_0 + E[(\beta_1 - \hat{\beta}_1)\bar{x}] + E(\bar{u}) = \beta_0 + E[(\beta_1 - \hat{\beta}_1)]\bar{x},

since E(ū) = 0 by Assumptions SLR.2 and SLR.4. But we showed that E(β̂1) = β1, which implies that E[(β̂1 − β1)] = 0. Thus, E(β̂0) = β0. Both of these arguments are valid for any values of β0 and β1, and so we have established unbiasedness.
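Theorem 2.1 is a statement about averages across repeated samples, and a short Monte Carlo experiment, in the fixed-in-repeated-samples spirit described above, makes it concrete. The population values below are our own hypothetical choices; the point is only that the average of the OLS estimates across many samples is close to them.

```python
import numpy as np

rng = np.random.default_rng(7)
beta0, beta1 = 1.0, 0.5            # hypothetical population parameters
x = rng.normal(4.0, 1.0, size=50)  # x held fixed across replications (see the text)

intercepts, slopes = [], []
for _ in range(10_000):
    u = rng.normal(0.0, 1.0, size=x.size)   # fresh errors each replication; SLR.4 holds
    y = beta0 + beta1 * x + u
    b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean())**2)
    slopes.append(b1)
    intercepts.append(y.mean() - b1 * x.mean())

# Averages across samples are close to (1.0, 0.5), as Theorem 2.1 predicts;
# any single sample's estimates can still be far from the population values.
print(np.mean(intercepts), np.mean(slopes))
```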
Remember that unbiasedness is a feature of the sampling distributions of $\hat{\beta}_1$ and $\hat{\beta}_0$, which says nothing about the estimate that we obtain for a given sample. We hope that, if the sample we obtain is somehow "typical," then our estimate should be "near" the population value. Unfortunately, it is always possible that we could obtain an unlucky sample that would give us a point estimate far from $\beta_1$, and we can never know for sure whether this is the case. You may want to review the material on unbiased estimators in Appendix C, especially the simulation exercise in Table C.1 that illustrates the concept of unbiasedness.

Unbiasedness generally fails if any of our four assumptions fail. This means that it is important to think about the veracity of each assumption for a particular application. Assumption SLR.1 requires that y and x be linearly related, with an additive disturbance. This can certainly fail. But we also know that y and x can be chosen to yield interesting nonlinear relationships. Dealing with the failure of (2.47) requires more advanced methods that are beyond the scope of this text.

Later, we will have to relax Assumption SLR.2, the random sampling assumption, for time series analysis. But what about using it for cross-sectional analysis? Random sampling can fail in a cross section when samples are not representative of the underlying population; in fact, some data sets are constructed by intentionally oversampling different parts of the population. We will discuss problems of nonrandom sampling in Chapters 9 and 17.

As we have already discussed, Assumption SLR.3 almost always holds in interesting regression applications. Without it, we cannot even obtain the OLS estimates.

The assumption we should concentrate on for now is SLR.4. If SLR.4 holds, the OLS estimators are unbiased. Likewise, if SLR.4 fails, the OLS estimators generally will be biased. There are ways to determine the likely direction and size of the bias, which we will study in Chapter 3.

The possibility that x is correlated with u is almost always a concern in simple regression analysis with nonexperimental data, as we indicated with several examples in Section 2.1. Using simple regression when u contains factors affecting y that are also correlated with x can result in spurious correlation: that is, we find a relationship between y and x that is really due to other unobserved factors that affect y and also happen to be correlated with x.
Example 2.12 (Student Math Performance and the School Lunch Program). Let math10 denote the percentage of tenth graders at a high school receiving a passing score on a standardized mathematics exam. Suppose we wish to estimate the effect of the federally funded school lunch program on student performance. If anything, we expect the lunch program to have a positive ceteris paribus effect on performance: all other factors being equal, if a student who is too poor to eat regular meals becomes eligible for the school lunch program, his or her performance should improve. Let lnchprg denote the percentage of students who are eligible for the lunch program. Then, a simple regression model is

$$math10 = \beta_0 + \beta_1\, lnchprg + u, \qquad [2.54]$$

where u contains school and student characteristics that affect overall school performance. Using the data in MEAP93 on 408 Michigan high schools for the 1992-1993 school year, we obtain

$$\widehat{math10} = 32.14 - 0.319\, lnchprg$$
$$n = 408,\ R^2 = 0.171.$$

This equation predicts that if student eligibility in the lunch program increases by 10 percentage points, the percentage of students passing the math exam falls by about 3.2 percentage points. Do we really believe that higher participation in the lunch program actually causes worse performance? Almost certainly not. A better explanation is that the error term u in equation (2.54) is correlated with lnchprg. In fact, u contains factors such as the poverty rate of children attending school, which affects student performance and is highly correlated with eligibility in the lunch program. Variables such as school quality and resources are also contained in u, and these are likely correlated with lnchprg. It is important to remember that the estimate -0.319 is only for this particular sample, but its sign and magnitude make us suspect that u and x are correlated, so that simple regression is biased.

In addition to omitted variables, there are other reasons for x to be correlated with u in the simple regression model. Because the same issues arise in multiple regression analysis, we will postpone a systematic treatment of the problem until then.
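If the MEAP93 data are available on your machine (here assumed, purely for illustration, to be a CSV file named meap93.csv with columns math10 and lnchprg), the regression in Example 2.12 can be reproduced in a few lines of Python:

import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("meap93.csv")   # hypothetical file name and location
res = smf.ols("math10 ~ lnchprg", data=df).fit()

print(res.params)                # intercept and slope estimates
print(res.nobs, res.rsquared)    # sample size and R-squared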
2.5b Variances of the OLS Estimators

In addition to knowing that the sampling distribution of $\hat{\beta}_1$ is centered about $\beta_1$ ($\hat{\beta}_1$ is unbiased), it is important to know how far we can expect $\hat{\beta}_1$ to be away from $\beta_1$ on average. Among other things, this allows us to choose the best estimator among all, or at least a broad class of, unbiased estimators. The measure of spread in the distribution of $\hat{\beta}_1$ (and $\hat{\beta}_0$) that is easiest to work with is the variance or its square root, the standard deviation. (See Appendix C for a more detailed discussion.)

It turns out that the variance of the OLS estimators can be computed under Assumptions SLR.1 through SLR.4. However, these expressions would be somewhat complicated. Instead, we add an assumption that is traditional for cross-sectional analysis. This assumption states that the variance of the unobservable, u, conditional on x, is constant. This is known as the homoskedasticity or "constant variance" assumption.

Assumption SLR.5 (Homoskedasticity). The error u has the same variance given any value of the explanatory variable. In other words, $\text{Var}(u|x) = \sigma^2$.

We must emphasize that the homoskedasticity assumption is quite distinct from the zero conditional mean assumption, $E(u|x) = 0$. Assumption SLR.4 involves the expected value of u, while Assumption SLR.5 concerns the variance of u (both conditional on x). Recall that we established the unbiasedness of OLS without Assumption SLR.5: the homoskedasticity assumption plays no role in showing that $\hat{\beta}_0$ and $\hat{\beta}_1$ are unbiased. We add Assumption SLR.5 because it simplifies the variance calculations for $\hat{\beta}_0$ and $\hat{\beta}_1$ and because it implies that ordinary least squares has certain efficiency properties, which we will see in Chapter 3. If we were to assume that u and x are independent, then the distribution of u given x does not depend on x, and so $E(u|x) = E(u) = 0$ and $\text{Var}(u|x) = \sigma^2$. But independence is sometimes too strong of an assumption.

Because $\text{Var}(u|x) = E(u^2|x) - [E(u|x)]^2$ and $E(u|x) = 0$, $\sigma^2 = E(u^2|x)$, which means $\sigma^2$ is also the unconditional expectation of $u^2$. Therefore, $\sigma^2 = E(u^2) = \text{Var}(u)$, because $E(u) = 0$. In other words, $\sigma^2$ is the unconditional variance of u, and so $\sigma^2$ is often called the error variance or disturbance variance. The square root of $\sigma^2$, $\sigma$, is the standard deviation of the error. A larger $\sigma$ means that the distribution of the unobservables affecting y is more spread out.

It is often useful to write Assumptions SLR.4 and SLR.5 in terms of the conditional mean and conditional variance of y:

$$E(y|x) = \beta_0 + \beta_1 x. \qquad [2.55]$$
$$\text{Var}(y|x) = \sigma^2. \qquad [2.56]$$

In other words, the conditional expectation of y given x is linear in x, but the variance of y given x is constant. This situation is graphed in Figure 2.8, where $\beta_0 > 0$ and $\beta_1 > 0$.

[Figure 2.8: The simple regression model under homoskedasticity. The conditional densities $f(y|x)$ at $x_1$, $x_2$, $x_3$ have the same spread around $E(y|x) = \beta_0 + \beta_1 x$.]

When $\text{Var}(u|x)$ depends on x, the error term is said to exhibit heteroskedasticity (or nonconstant variance). Because $\text{Var}(u|x) = \text{Var}(y|x)$, heteroskedasticity is present whenever $\text{Var}(y|x)$ is a function of x.

Example 2.13 (Heteroskedasticity in a Wage Equation). In order to get an unbiased estimator of the ceteris paribus effect of educ on wage, we must assume that $E(u|educ) = 0$, and this implies $E(wage|educ) = \beta_0 + \beta_1 educ$. If we also make the homoskedasticity assumption, then $\text{Var}(u|educ) = \sigma^2$ does not depend on the level of education, which is the same as assuming $\text{Var}(wage|educ) = \sigma^2$. Thus, while average wage is allowed to increase with education level (it is this rate of increase that we are interested in estimating), the variability in wage about its mean is assumed to be constant across all education levels. This may not be realistic. It is likely that people with more education have a wider variety of interests and job opportunities, which could lead to more wage variability at higher levels of education. People with very low levels of education have fewer opportunities and often must work at the minimum wage; this serves to reduce wage variability at low education levels. This situation is shown in Figure 2.9.

[Figure 2.9: $\text{Var}(wage|educ)$ increasing with educ. The conditional densities $f(wage|educ)$ at educ = 8, 12, and 16 become more spread out at higher education levels.]

Ultimately, whether Assumption SLR.5 holds is an empirical issue, and in Chapter 8 we will show how to test Assumption SLR.5.
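A small simulation (all parameter values illustrative, not from the text) makes the distinction in Example 2.13 visible. Below, the error standard deviation is made proportional to educ, so the data are heteroskedastic by construction, and the sample variance of wage grows across education levels:

import numpy as np

rng = np.random.default_rng(1)
n = 100_000
educ = rng.integers(8, 17, size=n).astype(float)   # schooling from 8 to 16 years (assumed range)
u = rng.normal(0, 0.5 * educ)                      # sd grows with educ: heteroskedasticity
wage = -3.0 + 1.2 * educ + u                       # illustrative population line

for lvl in (8, 12, 16):
    print(lvl, wage[educ == lvl].var())            # conditional variance rises with educ

Under homoskedasticity, the three printed variances would be (approximately) equal; here they increase with educ, mirroring Figure 2.9.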
With the homoskedasticity assumption in place, we are ready to prove the following:

Theorem 2.2 (Sampling Variances of the OLS Estimators). Under Assumptions SLR.1 through SLR.5,

$$\text{Var}(\hat{\beta}_1) = \frac{\sigma^2}{\sum_{i=1}^n (x_i - \bar{x})^2} = \sigma^2/\text{SST}_x \qquad [2.57]$$

and

$$\text{Var}(\hat{\beta}_0) = \frac{\sigma^2\, n^{-1}\sum_{i=1}^n x_i^2}{\sum_{i=1}^n (x_i - \bar{x})^2}, \qquad [2.58]$$

where these are conditional on the sample values $\{x_1, \dots, x_n\}$.

PROOF: We derive the formula for $\text{Var}(\hat{\beta}_1)$, leaving the other derivation as Problem 10. The starting point is equation (2.52): $\hat{\beta}_1 = \beta_1 + (1/\text{SST}_x)\sum_{i=1}^n d_i u_i$. Because $\beta_1$ is just a constant, and we are conditioning on the $x_i$, $\text{SST}_x$ and $d_i = x_i - \bar{x}$ are also nonrandom. Furthermore, because the $u_i$ are independent random variables across i (by random sampling), the variance of the sum is the sum of the variances. Using these facts, we have

$$\text{Var}(\hat{\beta}_1) = (1/\text{SST}_x)^2\,\text{Var}\Big(\sum_{i=1}^n d_i u_i\Big) = (1/\text{SST}_x)^2 \sum_{i=1}^n d_i^2\,\text{Var}(u_i) = (1/\text{SST}_x)^2 \sum_{i=1}^n d_i^2 \sigma^2 \quad [\text{since } \text{Var}(u_i) = \sigma^2 \text{ for all } i]$$
$$= \sigma^2 (1/\text{SST}_x)^2 \sum_{i=1}^n d_i^2 = \sigma^2 (1/\text{SST}_x)^2\, \text{SST}_x = \sigma^2/\text{SST}_x,$$

which is what we wanted to show.

Equations (2.57) and (2.58) are the "standard" formulas for simple regression analysis, which are invalid in the presence of heteroskedasticity. This will be important when we turn to confidence intervals and hypothesis testing in multiple regression analysis.

For most purposes, we are interested in $\text{Var}(\hat{\beta}_1)$. It is easy to summarize how this variance depends on the error variance, $\sigma^2$, and the total variation in $\{x_1, x_2, \dots, x_n\}$, $\text{SST}_x$. First, the larger the error variance, the larger is $\text{Var}(\hat{\beta}_1)$. This makes sense since more variation in the unobservables affecting y makes it more difficult to precisely estimate $\beta_1$. On the other hand, more variability in the independent variable is preferred: as the variability in the $x_i$ increases, the variance of $\hat{\beta}_1$ decreases. This also makes intuitive sense, since the more spread out is the sample of independent variables, the easier it is to trace out the relationship between $E(y|x)$ and x. That is, the easier it is to estimate $\beta_1$. If there is little variation in the $x_i$, then it can be hard to pinpoint how $E(y|x)$ varies with x. As the sample size increases, so does the total variation in the $x_i$. Therefore, a larger sample size results in a smaller variance for $\hat{\beta}_1$.

This analysis shows that, if we are interested in $\beta_1$ and we have a choice, then we should choose the $x_i$ to be as spread out as possible. This is sometimes possible with experimental data, but rarely do we have this luxury in the social sciences: usually, we must take the $x_i$ that we obtain via random sampling. Sometimes, we have an opportunity to obtain larger sample sizes, although this can be costly.

Exploring Further 2.5: Show that, when estimating $\beta_0$, it is best to have $\bar{x} = 0$. What is $\text{Var}(\hat{\beta}_0)$ in this case? [Hint: For any sample of numbers, $\sum_{i=1}^n x_i^2 \ge \sum_{i=1}^n (x_i - \bar{x})^2$, with equality only if $\bar{x} = 0$.]

For the purposes of constructing confidence intervals and deriving test statistics, we will need to work with the standard deviations of $\hat{\beta}_1$ and $\hat{\beta}_0$, $\text{sd}(\hat{\beta}_1)$ and $\text{sd}(\hat{\beta}_0)$. Recall that these are obtained by taking the square roots of the variances in (2.57) and (2.58). In particular, $\text{sd}(\hat{\beta}_1) = \sigma/\sqrt{\text{SST}_x}$, where $\sigma$ is the square root of $\sigma^2$ and $\sqrt{\text{SST}_x}$ is the square root of $\text{SST}_x$.
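As a numerical check on equation (2.57), the following sketch (all population values assumed for illustration) holds one set of x values fixed, redraws the errors many times, and compares the simulated variance of $\hat{\beta}_1$ with $\sigma^2/\text{SST}_x$:

import numpy as np

rng = np.random.default_rng(7)
n, sigma, beta0, beta1 = 50, 2.0, 1.0, 0.5   # illustrative population values
x = rng.uniform(0, 10, size=n)               # held fixed in repeated samples
d = x - x.mean()
sst_x = d @ d

slopes = np.empty(20_000)
for r in range(slopes.size):
    y = beta0 + beta1 * x + rng.normal(0, sigma, size=n)
    slopes[r] = (d @ y) / sst_x

print(slopes.var())        # simulated Var(beta1_hat) across samples
print(sigma**2 / sst_x)    # theoretical value from equation (2.57)

The two printed numbers should agree closely; making the x values more spread out (say, Uniform(0, 100)) raises SST_x and shrinks both.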
2.5c Estimating the Error Variance

The formulas in (2.57) and (2.58) allow us to isolate the factors that contribute to $\text{Var}(\hat{\beta}_1)$ and $\text{Var}(\hat{\beta}_0)$. But these formulas are unknown, except in the extremely rare case that $\sigma^2$ is known. Nevertheless, we can use the data to estimate $\sigma^2$, which then allows us to estimate $\text{Var}(\hat{\beta}_1)$ and $\text{Var}(\hat{\beta}_0)$.

This is a good place to emphasize the difference between the errors (or disturbances) and the residuals, since this distinction is crucial for constructing an estimator of $\sigma^2$. Equation (2.48) shows how to write the population model in terms of a randomly sampled observation as $y_i = \beta_0 + \beta_1 x_i + u_i$, where $u_i$ is the error for observation i. We can also express $y_i$ in terms of its fitted value and residual, as in equation (2.32): $y_i = \hat{\beta}_0 + \hat{\beta}_1 x_i + \hat{u}_i$. Comparing these two equations, we see that the error shows up in the equation containing the population parameters, $\beta_0$ and $\beta_1$. On the other hand, the residuals show up in the estimated equation with $\hat{\beta}_0$ and $\hat{\beta}_1$. The errors are never observed, while the residuals are computed from the data.

We can use equations (2.32) and (2.48) to write the residuals as a function of the errors:

$$\hat{u}_i = y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i = (\beta_0 + \beta_1 x_i + u_i) - \hat{\beta}_0 - \hat{\beta}_1 x_i,$$

or

$$\hat{u}_i = u_i - (\hat{\beta}_0 - \beta_0) - (\hat{\beta}_1 - \beta_1)x_i. \qquad [2.59]$$

Although the expected value of $\hat{\beta}_0$ equals $\beta_0$, and similarly for $\hat{\beta}_1$, $\hat{u}_i$ is not the same as $u_i$. The difference between them does have an expected value of zero.

Now that we understand the difference between the errors and the residuals, we can return to estimating $\sigma^2$. First, $\sigma^2 = E(u^2)$, so an "unbiased estimator" of $\sigma^2$ is $n^{-1}\sum_{i=1}^n u_i^2$. Unfortunately, this is not a true estimator, because we do not observe the errors $u_i$. But we do have estimates of the $u_i$, namely, the OLS residuals $\hat{u}_i$. If we replace the errors with the OLS residuals, we have $n^{-1}\sum_{i=1}^n \hat{u}_i^2 = \text{SSR}/n$. This is a true estimator, because it gives a computable rule for any sample of data on x and y. One slight drawback to this estimator is that it turns out to be biased (although for large n the bias is small). Because it is easy to compute an unbiased estimator, we use that instead.

The estimator SSR/n is biased essentially because it does not account for two restrictions that must be satisfied by the OLS residuals. These restrictions are given by the two OLS first order conditions:

$$\sum_{i=1}^n \hat{u}_i = 0, \quad \sum_{i=1}^n x_i \hat{u}_i = 0. \qquad [2.60]$$

One way to view these restrictions is this: if we know $n - 2$ of the residuals, we can always get the other two residuals by using the restrictions implied by the first order conditions in (2.60).
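The two restrictions in (2.60) are easy to verify numerically. The sketch below (simulated data with illustrative parameter values) fits OLS by hand and confirms that the residuals obey (2.60) up to rounding error, while the unobserved errors do not:

import numpy as np

rng = np.random.default_rng(3)
n = 200
x = rng.uniform(0, 10, size=n)
u = rng.normal(0, 2, size=n)
y = 1.0 + 0.5 * x + u                      # illustrative population line

d = x - x.mean()
b1 = (d @ y) / (d @ d)
b0 = y.mean() - b1 * x.mean()
uhat = y - b0 - b1 * x                     # OLS residuals

print(uhat.sum(), (x * uhat).sum())        # both ~ 0: equation (2.60)
print(u.sum(), (x * u).sum())              # the errors need not satisfy (2.60)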
There are thus only $n - 2$ degrees of freedom in the OLS residuals, as opposed to n degrees of freedom in the errors. (It is important to understand that if we replace $\hat{u}_i$ with $u_i$ in (2.60), the restrictions would no longer hold.)

The unbiased estimator of $\sigma^2$ that we will use makes a degrees of freedom adjustment:

$$\hat{\sigma}^2 = \frac{1}{n-2}\sum_{i=1}^n \hat{u}_i^2 = \text{SSR}/(n-2). \qquad [2.61]$$

(This estimator is sometimes denoted as $s^2$, but we continue to use the convention of putting "hats" over estimators.)

Theorem 2.3 (Unbiased Estimation of $\sigma^2$). Under Assumptions SLR.1 through SLR.5,

$$E(\hat{\sigma}^2) = \sigma^2.$$

PROOF: If we average equation (2.59) across all i and use the fact that the OLS residuals average out to zero, we have $0 = \bar{u} - (\hat{\beta}_0 - \beta_0) - (\hat{\beta}_1 - \beta_1)\bar{x}$; subtracting this from (2.59) gives $\hat{u}_i = (u_i - \bar{u}) - (\hat{\beta}_1 - \beta_1)(x_i - \bar{x})$. Therefore, $\hat{u}_i^2 = (u_i - \bar{u})^2 + (\hat{\beta}_1 - \beta_1)^2(x_i - \bar{x})^2 - 2(u_i - \bar{u})(\hat{\beta}_1 - \beta_1)(x_i - \bar{x})$. Summing across all i gives

$$\sum_{i=1}^n \hat{u}_i^2 = \sum_{i=1}^n (u_i - \bar{u})^2 + (\hat{\beta}_1 - \beta_1)^2 \sum_{i=1}^n (x_i - \bar{x})^2 - 2(\hat{\beta}_1 - \beta_1)\sum_{i=1}^n u_i(x_i - \bar{x}).$$

Now, the expected value of the first term is $(n-1)\sigma^2$, something that is shown in Appendix C. The expected value of the second term is simply $\sigma^2$, because $E[(\hat{\beta}_1 - \beta_1)^2] = \text{Var}(\hat{\beta}_1) = \sigma^2/\text{SST}_x$. Finally, the third term can be written as $-2(\hat{\beta}_1 - \beta_1)^2\,\text{SST}_x$; taking expectations gives $-2\sigma^2$. Putting these three terms together gives $E(\sum_{i=1}^n \hat{u}_i^2) = (n-1)\sigma^2 + \sigma^2 - 2\sigma^2 = (n-2)\sigma^2$, so that $E[\text{SSR}/(n-2)] = \sigma^2$.

If $\hat{\sigma}^2$ is plugged into the variance formulas (2.57) and (2.58), then we have unbiased estimators of $\text{Var}(\hat{\beta}_1)$ and $\text{Var}(\hat{\beta}_0)$. Later on, we will need estimators of the standard deviations of $\hat{\beta}_1$ and $\hat{\beta}_0$, and this requires estimating $\sigma$. The natural estimator of $\sigma$ is

$$\hat{\sigma} = \sqrt{\hat{\sigma}^2} \qquad [2.62]$$

and is called the standard error of the regression (SER). (Other names for $\hat{\sigma}$ are the standard error of the estimate and the root mean squared error, but we will not use these.) Although $\hat{\sigma}$ is not an unbiased estimator of $\sigma$, we can show that it is a consistent estimator of $\sigma$ (see Appendix C), and it will serve our purposes well.

The estimate $\hat{\sigma}$ is interesting because it is an estimate of the standard deviation in the unobservables affecting y; equivalently, it estimates the standard deviation in y after the effect of x has been taken out. Most regression packages report the value of $\hat{\sigma}$ along with the R-squared, intercept, slope, and other OLS statistics (under one of the several names listed above). For now, our primary interest is in using $\hat{\sigma}$ to estimate the standard deviations of $\hat{\beta}_0$ and $\hat{\beta}_1$. Since $\text{sd}(\hat{\beta}_1) = \sigma/\sqrt{\text{SST}_x}$, the natural estimator of $\text{sd}(\hat{\beta}_1)$ is

$$\text{se}(\hat{\beta}_1) = \hat{\sigma}/\sqrt{\text{SST}_x} = \hat{\sigma}\Big/\Big(\sum_{i=1}^n (x_i - \bar{x})^2\Big)^{1/2};$$

this is called the standard error of $\hat{\beta}_1$. Note that $\text{se}(\hat{\beta}_1)$ is viewed as a random variable when we think of running OLS over different samples of y; this is true because $\hat{\sigma}$ varies with different samples. For a given sample, $\text{se}(\hat{\beta}_1)$ is a number, just as $\hat{\beta}_1$ is simply a number when we compute it from the given data. Similarly, $\text{se}(\hat{\beta}_0)$ is obtained from $\text{sd}(\hat{\beta}_0)$ by replacing $\sigma$ with $\hat{\sigma}$. The standard error of any estimate gives us an idea of how precise the estimator is.
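The following sketch (simulated data, illustrative values) computes $\hat{\sigma}$ and $\text{se}(\hat{\beta}_1)$ directly from formulas (2.61) and (2.62) and checks them against a packaged regression routine:

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(5)
n = 120
x = rng.uniform(0, 10, size=n)
y = 1.0 + 0.5 * x + rng.normal(0, 2, size=n)   # illustrative data

res = sm.OLS(y, sm.add_constant(x)).fit()
uhat = res.resid
sigma_hat = np.sqrt((uhat @ uhat) / (n - 2))   # SER: equations (2.61) and (2.62)
sst_x = ((x - x.mean()) ** 2).sum()

print(sigma_hat)                    # the standard error of the regression
print(sigma_hat / np.sqrt(sst_x))   # se(beta1_hat), built by hand
print(res.bse[1])                   # the package's standard error agrees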
Standard errors play a central role throughout this text; we will use them to construct test statistics and confidence intervals for every econometric procedure we cover, starting in Chapter 4.

2.6 Regression through the Origin and Regression on a Constant

In rare cases, we wish to impose the restriction that, when x = 0, the expected value of y is zero. There are certain relationships for which this is reasonable. For example, if income (x) is zero, then income tax revenues (y) must also be zero. In addition, there are settings where a model that originally has a nonzero intercept is transformed into a model without an intercept.

Formally, we now choose a slope estimator, which we call $\tilde{\beta}_1$, and a line of the form

$$\tilde{y} = \tilde{\beta}_1 x, \qquad [2.63]$$

where the tildes over $\tilde{\beta}_1$ and $\tilde{y}$ are used to distinguish this problem from the much more common problem of estimating an intercept along with a slope. Obtaining (2.63) is called regression through the origin because the line (2.63) passes through the point x = 0, $\tilde{y} = 0$. To obtain the slope estimate in (2.63), we still rely on the method of ordinary least squares, which in this case minimizes the sum of squared residuals:

$$\sum_{i=1}^n (y_i - \tilde{\beta}_1 x_i)^2. \qquad [2.64]$$

Using one-variable calculus, it can be shown that $\tilde{\beta}_1$ must solve the first order condition

$$\sum_{i=1}^n x_i(y_i - \tilde{\beta}_1 x_i) = 0. \qquad [2.65]$$

From this, we can solve for $\tilde{\beta}_1$:

$$\tilde{\beta}_1 = \frac{\sum_{i=1}^n x_i y_i}{\sum_{i=1}^n x_i^2}, \qquad [2.66]$$

provided that not all the $x_i$ are zero, a case we rule out.

Note how $\tilde{\beta}_1$ compares with the slope estimate when we also estimate the intercept (rather than set it equal to zero). These two estimates are the same if, and only if, $\bar{x} = 0$. [See equation (2.49) for $\hat{\beta}_1$.] Obtaining an estimate of $\beta_1$ using regression through the origin is not done very often in applied work, and for good reason: if the intercept $\beta_0 \ne 0$, then $\tilde{\beta}_1$ is a biased estimator of $\beta_1$. You will be asked to prove this in Problem 8.

In cases where regression through the origin is deemed appropriate, one must be careful in interpreting the R-squared that is typically reported with such regressions. Usually, unless stated otherwise, the R-squared is obtained without removing the sample average of $\{y_i: i = 1, \dots, n\}$ in obtaining SST. In other words, the R-squared is computed as

$$1 - \frac{\sum_{i=1}^n (y_i - \tilde{\beta}_1 x_i)^2}{\sum_{i=1}^n y_i^2}. \qquad [2.67]$$

The numerator here makes sense because it is the sum of squared residuals, but the denominator acts as if we know the average value of y in the population is zero. One reason this version of the R-squared is used is that, if we use the usual total sum of squares, that is, we compute R-squared as

$$1 - \frac{\sum_{i=1}^n (y_i - \tilde{\beta}_1 x_i)^2}{\sum_{i=1}^n (y_i - \bar{y})^2}, \qquad [2.68]$$

it can actually be negative. If expression (2.68) is negative, then it means that using the sample average $\bar{y}$ to predict $y_i$ provides a better fit than using $x_i$ in a regression through the origin. Therefore, (2.68) is actually more attractive than equation (2.67) because equation (2.68) tells us whether using x is better than ignoring x altogether.

This discussion about regression through the origin, and different ways to measure goodness-of-fit, prompts another question: what happens if we only regress on a constant?
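The following sketch (simulated data with a nonzero intercept, all values illustrative) compares the through-the-origin slope (2.66) with the usual OLS slope, and previews the constant-only regression discussed next:

import numpy as np

rng = np.random.default_rng(9)
n = 500
x = rng.uniform(1, 10, size=n)
y = 5.0 + 0.5 * x + rng.normal(0, 1, size=n)   # true intercept beta0 = 5 is nonzero

b1_origin = (x @ y) / (x @ x)                  # equation (2.66)
d = x - x.mean()
b1_ols = (d @ y) / (d @ d)                     # usual OLS slope, equation (2.49)

print(b1_origin)   # biased upward here, since beta0 > 0 and the x values are positive
print(b1_ols)      # close to the true slope 0.5
print(y.mean())    # the "regression on a constant" estimate discussed next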
Regressing only on a constant means we set the slope to zero (so we need not even have an x) and estimate an intercept only. The answer is simple: the intercept is $\bar{y}$. This fact is usually shown in basic statistics, where it is shown that the constant that produces the smallest sum of squared deviations is always the sample average. In this light, equation (2.68) can be seen as comparing regression on x through the origin with regression only on a constant.

Summary

We have introduced the simple linear regression model in this chapter, and we have covered its basic properties. Given a random sample, the method of ordinary least squares is used to estimate the slope and intercept parameters in the population model. We have demonstrated the algebra of the OLS regression line, including computation of fitted values and residuals, and the obtaining of predicted changes in the dependent variable for a given change in the independent variable. In Section 2.4, we discussed two issues of practical importance: (1) the behavior of the OLS estimates when we change the units of measurement of the dependent variable or the independent variable and (2) the use of the natural log to allow for constant elasticity and constant semi-elasticity models.

In Section 2.5, we showed that, under the four Assumptions SLR.1 through SLR.4, the OLS estimators are unbiased. The key assumption is that the error term u has zero mean given any value of the independent variable x. Unfortunately, there are reasons to think this is false in many social science applications of simple regression, where the omitted factors in u are often correlated with x. When we add the assumption that the variance of the error given x is constant, we get simple formulas for the sampling variances of the OLS estimators. As we saw, the variance of the slope estimator $\hat{\beta}_1$ increases as the error variance increases, and it decreases when there is more sample variation in the independent variable. We also derived an unbiased estimator for $\sigma^2 = \text{Var}(u)$.

In Section 2.6, we briefly discussed regression through the origin, where the slope estimator is obtained under the assumption that the intercept is zero. Sometimes, this is useful, but it appears infrequently in applied work.

Much work is left to be done. For example, we still do not know how to test hypotheses about the population parameters, $\beta_0$ and $\beta_1$. Thus, although we know that OLS is unbiased for the population parameters under Assumptions SLR.1 through SLR.4, we have no way of drawing inferences about the population. Other topics, such as the efficiency of OLS relative to other possible procedures, have also been omitted. The issues of confidence intervals, hypothesis testing, and efficiency are central to multiple regression analysis as well. Since the way we construct confidence intervals and test statistics is very similar for multiple regression (and because simple regression is a special case of multiple regression), our time is better spent moving on to multiple regression, which is much more widely applicable than simple regression. Our purpose in Chapter 2 was to get you thinking about the issues that arise in econometric analysis in a fairly simple setting.
The Gauss-Markov Assumptions for Simple Regression

For convenience, we summarize the Gauss-Markov assumptions that we used in this chapter. It is important to remember that only SLR.1 through SLR.4 are needed to show $\hat{\beta}_0$ and $\hat{\beta}_1$ are unbiased. We added the homoskedasticity assumption, SLR.5, to obtain the usual OLS variance formulas (2.57) and (2.58).

Assumption SLR.1 (Linear in Parameters). In the population model, the dependent variable, y, is related to the independent variable, x, and the error (or disturbance), u, as $y = \beta_0 + \beta_1 x + u$, where $\beta_0$ and $\beta_1$ are the population intercept and slope parameters, respectively.

Assumption SLR.2 (Random Sampling). We have a random sample of size n, $\{(x_i, y_i): i = 1, 2, \dots, n\}$, following the population model in Assumption SLR.1.

Assumption SLR.3 (Sample Variation in the Explanatory Variable). The sample outcomes on x, namely $\{x_i, i = 1, \dots, n\}$, are not all the same value.

Assumption SLR.4 (Zero Conditional Mean). The error u has an expected value of zero given any value of the explanatory variable. In other words, $E(u|x) = 0$.

Assumption SLR.5 (Homoskedasticity). The error u has the same variance given any value of the explanatory variable. In other words, $\text{Var}(u|x) = \sigma^2$.

Key Terms

Coefficient of Determination; Constant Elasticity Model; Control Variable; Covariate; Degrees of Freedom; Dependent Variable; Elasticity; Error Term (Disturbance); Error Variance; Explained Sum of Squares (SSE); Explained Variable; Explanatory Variable; First Order Conditions; Fitted Value; Gauss-Markov Assumptions; Heteroskedasticity; Homoskedasticity; Independent Variable; Intercept Parameter; Mean Independent; OLS Regression Line; Ordinary Least Squares (OLS); Population Regression Function (PRF); Predicted Variable; Predictor Variable; Regressand; Regression through the Origin; Regressor; Residual; Residual Sum of Squares (SSR); Response Variable; R-squared; Sample Regression Function (SRF); Semi-elasticity; Simple Linear Regression Model; Slope Parameter; Standard Error of $\hat{\beta}_1$; Standard Error of the Regression (SER); Sum of Squared Residuals (SSR); Total Sum of Squares (SST); Zero Conditional Mean Assumption

Problems

1. Let kids denote the number of children ever born to a woman, and let educ denote years of education for the woman. A simple model relating fertility to years of education is $kids = \beta_0 + \beta_1 educ + u$, where u is the unobserved error.
(i) What kinds of factors are contained in u? Are these likely to be correlated with level of education?
(ii) Will a simple regression analysis uncover the ceteris paribus effect of education on fertility? Explain.

2. In the simple linear regression model $y = \beta_0 + \beta_1 x + u$, suppose that $E(u) \ne 0$. Letting $\alpha_0 = E(u)$, show that the model can always be rewritten with the same slope, but a new intercept and error, where the new error has a zero expected value.

3. The following table contains the ACT scores and the GPA (grade point average) for eight college students. Grade point average is based on a four-point scale and has been rounded to one digit after the decimal.

Student   GPA   ACT
1         2.8   21
2         3.4   24
3         3.0   26
4         3.5   27
5         3.6   29
6         3.0   25
7         2.7   25
8         3.7   30
(i) Estimate the relationship between GPA and ACT using OLS; that is, obtain the intercept and slope estimates in the equation $\widehat{GPA} = \hat{\beta}_0 + \hat{\beta}_1 ACT$. Comment on the direction of the relationship. Does the intercept have a useful interpretation here? Explain. How much higher is the GPA predicted to be if the ACT score is increased by five points?
(ii) Compute the fitted values and residuals for each observation, and verify that the residuals (approximately) sum to zero.
(iii) What is the predicted value of GPA when ACT = 20?
(iv) How much of the variation in GPA for these eight students is explained by ACT? Explain.

4. The data set BWGHT contains data on births to women in the United States. Two variables of interest are the dependent variable, infant birth weight in ounces (bwght), and an explanatory variable, average number of cigarettes the mother smoked per day during pregnancy (cigs). The following simple regression was estimated using data on n = 1,388 births:

$$\widehat{bwght} = 119.77 - 0.514\, cigs$$

(i) What is the predicted birth weight when cigs = 0? What about when cigs = 20 (one pack per day)? Comment on the difference.
(ii) Does this simple regression necessarily capture a causal relationship between the child's birth weight and the mother's smoking habits? Explain.
(iii) To predict a birth weight of 125 ounces, what would cigs have to be? Comment.
(iv) The proportion of women in the sample who do not smoke while pregnant is about .85. Does this help reconcile your finding from part (iii)?

5. In the linear consumption function

$$\widehat{cons} = \hat{\beta}_0 + \hat{\beta}_1 inc,$$

the (estimated) marginal propensity to consume (MPC) out of income is simply the slope, $\hat{\beta}_1$, while the average propensity to consume (APC) is $\widehat{cons}/inc = \hat{\beta}_0/inc + \hat{\beta}_1$. Using observations for 100 families on annual income and consumption (both measured in dollars), the following equation is obtained:

$$\widehat{cons} = -124.84 + 0.853\, inc$$
$$n = 100,\ R^2 = 0.692.$$

(i) Interpret the intercept in this equation, and comment on its sign and magnitude.
(ii) What is the predicted consumption when family income is $30,000?
(iii) With inc on the x-axis, draw a graph of the estimated MPC and APC.

6. Using data from 1988 for houses sold in Andover, Massachusetts, from Kiel and McClain (1995), the following equation relates housing price (price) to the distance from a recently built garbage incinerator (dist):

$$\widehat{\log(price)} = 9.40 + 0.312\, \log(dist)$$
$$n = 135,\ R^2 = 0.162.$$

(i) Interpret the coefficient on log(dist). Is the sign of this estimate what you expect it to be?
(ii) Do you think simple regression provides an unbiased estimator of the ceteris paribus elasticity of price with respect to dist? (Think about the city's decision on where to put the incinerator.)
(iii) What other factors about a house affect its price? Might these be correlated with distance from the incinerator?

7. Consider the savings function

$$sav = \beta_0 + \beta_1 inc + u, \quad u = \sqrt{inc}\cdot e,$$

where e is a random variable with $E(e) = 0$ and $\text{Var}(e) = \sigma_e^2$. Assume that e is independent of inc.
(i) Show that $E(u|inc) = 0$, so that the key zero conditional mean assumption (Assumption SLR.4) is satisfied. [Hint: If e is independent of inc, then $E(e|inc) = E(e)$.]
(ii) Show that $\text{Var}(u|inc) = \sigma_e^2\, inc$, so that the homoskedasticity Assumption SLR.5 is violated. In particular, the variance of sav increases with inc. [Hint: $\text{Var}(e|inc) = \text{Var}(e)$ if e and inc are independent.]
(iii) Provide a discussion that supports the assumption that the variance of savings increases with family income.

8. Consider the standard simple regression model $y = \beta_0 + \beta_1 x + u$ under the Gauss-Markov Assumptions SLR.1 through SLR.5. The usual OLS estimators $\hat{\beta}_0$ and $\hat{\beta}_1$ are unbiased for their respective population parameters. Let $\tilde{\beta}_1$ be the estimator of $\beta_1$ obtained by assuming the intercept is zero (see Section 2.6).
(i) Find $E(\tilde{\beta}_1)$ in terms of the $x_i$, $\beta_0$, and $\beta_1$. Verify that $\tilde{\beta}_1$ is unbiased for $\beta_1$ when the population intercept ($\beta_0$) is zero. Are there other cases where $\tilde{\beta}_1$ is unbiased?
(ii) Find the variance of $\tilde{\beta}_1$. (Hint: The variance does not depend on $\beta_0$.)
(iii) Show that $\text{Var}(\tilde{\beta}_1) \le \text{Var}(\hat{\beta}_1)$. [Hint: For any sample of data, $\sum_{i=1}^n x_i^2 \ge \sum_{i=1}^n (x_i - \bar{x})^2$, with strict inequality unless $\bar{x} = 0$.]
(iv) Comment on the tradeoff between bias and variance when choosing between $\hat{\beta}_1$ and $\tilde{\beta}_1$.

9. (i) Let $\hat{\beta}_0$ and $\hat{\beta}_1$ be the intercept and slope from the regression of $y_i$ on $x_i$, using n observations. Let $c_1$ and $c_2$, with $c_2 \ne 0$, be constants. Let $\tilde{\beta}_0$ and $\tilde{\beta}_1$ be the intercept and slope from the regression of $c_1 y_i$ on $c_2 x_i$. Show that $\tilde{\beta}_1 = (c_1/c_2)\hat{\beta}_1$ and $\tilde{\beta}_0 = c_1\hat{\beta}_0$, thereby verifying the claims on units of measurement in Section 2.4. [Hint: To obtain $\tilde{\beta}_1$, plug the scaled versions of x and y into (2.19). Then, use (2.17) for $\tilde{\beta}_0$, being sure to plug in the scaled x and y and the correct slope.]
(ii) Now, let $\tilde{\beta}_0$ and $\tilde{\beta}_1$ be from the regression of $(c_1 + y_i)$ on $(c_2 + x_i)$ (with no restriction on $c_1$ or $c_2$). Show that $\tilde{\beta}_1 = \hat{\beta}_1$ and $\tilde{\beta}_0 = \hat{\beta}_0 + c_1 - c_2\hat{\beta}_1$.
(iii) Now, let $\hat{\beta}_0$ and $\hat{\beta}_1$ be the OLS estimates from the regression of $\log(y_i)$ on $x_i$, where we must assume $y_i > 0$ for all i. For $c_1 > 0$, let $\tilde{\beta}_0$ and $\tilde{\beta}_1$ be the intercept and slope from the regression of $\log(c_1 y_i)$ on $x_i$. Show that $\tilde{\beta}_1 = \hat{\beta}_1$ and $\tilde{\beta}_0 = \log(c_1) + \hat{\beta}_0$.
(iv) Now, assuming that $x_i > 0$ for all i, let $\tilde{\beta}_0$ and $\tilde{\beta}_1$ be the intercept and slope from the regression of $y_i$ on $\log(c_2 x_i)$. How do $\tilde{\beta}_0$ and $\tilde{\beta}_1$ compare with the intercept and slope from the regression of $y_i$ on $\log(x_i)$?

10. Let $\hat{\beta}_0$ and $\hat{\beta}_1$ be the OLS intercept and slope estimators, respectively, and let $\bar{u}$ be the sample average of the errors (not the residuals!).
(i) Show that $\hat{\beta}_1$ can be written as $\hat{\beta}_1 = \beta_1 + \sum_{i=1}^n w_i u_i$, where $w_i = d_i/\text{SST}_x$ and $d_i = x_i - \bar{x}$.
(ii) Use part (i), along with $\sum_{i=1}^n w_i = 0$, to show that $\hat{\beta}_1$ and $\bar{u}$ are uncorrelated. [Hint: You are being asked to show that $E[(\hat{\beta}_1 - \beta_1)\cdot\bar{u}] = 0$.]
(iii) Show that $\hat{\beta}_0$ can be written as $\hat{\beta}_0 = \beta_0 + \bar{u} - (\hat{\beta}_1 - \beta_1)\bar{x}$.
(iv) Use parts (ii) and (iii) to show that $\text{Var}(\hat{\beta}_0) = \sigma^2/n + \sigma^2(\bar{x})^2/\text{SST}_x$.
(v) Do the algebra to simplify the expression in part (iv) to equation (2.58). [Hint: $\text{SST}_x/n = n^{-1}\sum_{i=1}^n x_i^2 - (\bar{x})^2$.]

11. Suppose you are interested in estimating the effect of hours spent in an SAT preparation course (hours) on total SAT score (sat). The population is all college-bound high school seniors for a particular year.
(i) Suppose you are given a grant to run a controlled experiment. Explain how you would structure the experiment in order to estimate the causal effect of hours on sat.
(ii) Consider the more realistic case where students choose how much time to spend in a preparation course, and you can only randomly sample sat and hours from the population. Write the population model as $sat = \beta_0 + \beta_1 hours + u$, where, as usual in a model with an intercept, we can assume $E(u) = 0$. List at least two factors contained in u. Are these likely to have positive or negative correlation with hours?
(iii) In the equation from part (ii), what should be the sign of $\beta_1$ if the preparation course is effective?
(iv) In the equation from part (ii), what is the interpretation of $\beta_0$?

12. Consider the problem described at the end of Section 2.6: running a regression and only estimating an intercept.
(i) Given a sample $\{y_i: i = 1, 2, \dots, n\}$, let $\tilde{\beta}_0$ be the solution to

$$\min_{b_0}\ \sum_{i=1}^n (y_i - b_0)^2.$$

Show that $\tilde{\beta}_0 = \bar{y}$, that is, the sample average minimizes the sum of squared residuals. (Hint: You may use one-variable calculus, or you can show the result directly by adding and subtracting $\bar{y}$ inside the squared residual and then doing a little algebra.)
(ii) Define residuals $\tilde{u}_i = y_i - \bar{y}$. Argue that these residuals always sum to zero.

Computer Exercises

C1. The data in 401K are a subset of data analyzed by Papke (1995) to study the relationship between participation in a 401(k) pension plan and the generosity of the plan. The variable prate is the percentage of eligible workers with an active account; this is the variable we would like to explain. The measure of generosity is the plan match rate, mrate. This variable gives the average amount the firm contributes to each worker's plan for each $1 contribution by the worker. For example, if mrate = 0.50, then a $1 contribution by the worker is matched by a 50-cent contribution by the firm.
(i) Find the average participation rate and the average match rate in the sample of plans.
(ii) Now, estimate the simple regression equation $\widehat{prate} = \hat{\beta}_0 + \hat{\beta}_1 mrate$, and report the results along with the sample size and R-squared.
(iii) Interpret the intercept in your equation. Interpret the coefficient on mrate.
(iv) Find the predicted prate when mrate = 3.5. Is this a reasonable prediction? Explain what is happening here.
(v) How much of the variation in prate is explained by mrate? Is this a lot in your opinion?

C2. The data set in CEOSAL2 contains information on chief executive officers for U.S. corporations. The variable salary is annual compensation, in thousands of dollars, and ceoten is prior number of years as company CEO.
(i) Find the average salary and the average tenure in the sample.
(ii) How many CEOs are in their first year as CEO (that is, ceoten = 0)? What is the longest tenure as a CEO?
(iii) Estimate the simple regression model $\log(salary) = \beta_0 + \beta_1 ceoten + u$, and report your results in the usual form. What is the (approximate) predicted percentage increase in salary given one more year as a CEO?

C3. Use the data in SLEEP75 from Biddle and Hamermesh (1990) to study whether there is a tradeoff between the time spent sleeping per week and the time spent in paid work. We could use either variable as the dependent variable.
For concreteness, estimate the model

$$sleep = \beta_0 + \beta_1 totwrk + u,$$

where sleep is minutes spent sleeping at night per week and totwrk is total minutes worked during the week.
(i) Report your results in equation form along with the number of observations and $R^2$. What does the intercept in this equation mean?
(ii) If totwrk increases by 2 hours, by how much is sleep estimated to fall? Do you find this to be a large effect?

C4. Use the data in WAGE2 to estimate a simple regression explaining monthly salary (wage) in terms of IQ score (IQ).
(i) Find the average salary and average IQ in the sample. What is the sample standard deviation of IQ? (IQ scores are standardized so that the average in the population is 100 with a standard deviation equal to 15.)
(ii) Estimate a simple regression model where a one-point increase in IQ changes wage by a constant dollar amount. Use this model to find the predicted increase in wage for an increase in IQ of 15 points. Does IQ explain most of the variation in wage?
(iii) Now, estimate a model where each one-point increase in IQ has the same percentage effect on wage. If IQ increases by 15 points, what is the approximate percentage increase in predicted wage?

C5. For the population of firms in the chemical industry, let rd denote annual expenditures on research and development, and let sales denote annual sales (both are in millions of dollars).
(i) Write down a model (not an estimated equation) that implies a constant elasticity between rd and sales. Which parameter is the elasticity?
(ii) Now, estimate the model using the data in RDCHEM. Write out the estimated equation in the usual form. What is the estimated elasticity of rd with respect to sales? Explain in words what this elasticity means.

C6. We used the data in MEAP93 for Example 2.12. Now we want to explore the relationship between the math pass rate (math10) and spending per student (expend).
(i) Do you think each additional dollar spent has the same effect on the pass rate, or does a diminishing effect seem more appropriate? Explain.
(ii) In the population model $math10 = \beta_0 + \beta_1 \log(expend) + u$, argue that $\beta_1/10$ is the percentage point change in math10 given a 10% increase in expend.
(iii) Use the data in MEAP93 to estimate the model from part (ii). Report the estimated equation in the usual way, including the sample size and R-squared.
(iv) How big is the estimated spending effect? Namely, if spending increases by 10%, what is the estimated percentage point increase in math10?
(v) One might worry that regression analysis can produce fitted values for math10 that are greater than 100. Why is this not much of a worry in this data set?

C7. Use the data in CHARITY [obtained from Franses and Paap (2001)] to answer the following questions:
(i) What is the average gift in the sample of 4,268 people (in Dutch guilders)? What percentage of people gave no gift?
(ii) What is the average mailings per year? What are the minimum and maximum values?
(iii) Estimate the model $gift = \beta_0 + \beta_1 mailsyear + u$ by OLS and report the results in the usual way, including the sample size and R-squared.
(iv) Interpret the slope coefficient. If each mailing costs one guilder, is the charity expected to make a net gain on each mailing? Does this mean the charity makes a net gain on every mailing? Explain.
(v) What is the smallest predicted charitable contribution in the sample? Using this simple regression analysis, can you ever predict zero for gift?

C8. To complete this exercise you need a software package that allows you to generate data from the uniform and normal distributions.
(i) Start by generating 500 observations on $x_i$ (the explanatory variable) from the uniform distribution with range [0, 10]. (Most statistical packages have a command for the Uniform(0,1) distribution; just multiply those observations by 10.) What are the sample mean and sample standard deviation of the $x_i$?
(ii) Randomly generate 500 errors, $u_i$, from the Normal(0, 36) distribution. (If you generate a Normal(0,1), as is commonly available, simply multiply the outcomes by six.) Is the sample average of the $u_i$ exactly zero? Why or why not? What is the sample standard deviation of the $u_i$?
(iii) Now, generate the $y_i$ as

$$y_i = 1 + 2x_i + u_i = \beta_0 + \beta_1 x_i + u_i;$$

that is, the population intercept is one and the population slope is two. Use the data to run the regression of $y_i$ on $x_i$. What are your estimates of the intercept and slope? Are they equal to the population values in the above equation? Explain.
(iv) Obtain the OLS residuals, $\hat{u}_i$, and verify that equation (2.60) holds (subject to rounding error).
(v) Compute the same quantities in equation (2.60) but use the errors, $u_i$, in place of the residuals. Now what do you conclude?
(vi) Repeat parts (i), (ii), and (iii) with a new sample of data, starting with generating the $x_i$. Now what do you obtain for $\hat{\beta}_0$ and $\hat{\beta}_1$? Why are these different from what you obtained in part (iii)?
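For readers working in Python, one way to carry out parts (i) through (v) of C8 is sketched below (any package with uniform and normal generators will do; numpy is used here):

import numpy as np

rng = np.random.default_rng()          # no seed, so each run is a new sample
x = 10 * rng.uniform(size=500)         # part (i): Uniform(0, 10)
u = 6 * rng.normal(size=500)           # part (ii): Normal(0, 36)
y = 1 + 2 * x + u                      # part (iii): beta0 = 1, beta1 = 2

d = x - x.mean()
b1 = (d @ y) / (d @ d)
b0 = y.mean() - b1 * x.mean()
uhat = y - b0 - b1 * x

print(x.mean(), x.std(ddof=1))         # part (i)
print(u.mean(), u.std(ddof=1))         # part (ii): mean is close to, not exactly, zero
print(b0, b1)                          # part (iii): near 1 and 2, but not equal
print(uhat.sum(), (x * uhat).sum())    # part (iv): both ~ 0 by construction
print(u.sum(), (x * u).sum())          # part (v): generally not zero

Rerunning the script without a fixed seed answers part (vi): the estimates change from sample to sample because they are random variables.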
intercept reported in part ii have a meaningful interpretation Explain iv Are you surprised by the b 1 that you found What about R2 v Suppose that you present your findings to a superintendent of a school district and the superintendent says Your findings show that to improve math scores we just need to improve reading scores so we should hire more reading tutors How would you respond to this comment Hint If you instead run the regression of read12 on math12 what would you expect to find Copyright 2016 Cengage Learning All Rights Reserved May not be copied scanned or duplicated in whole or in part Due to electronic rights some third party content may be suppressed from the eBook andor eChapters Editorial review has deemed that any suppressed content does not materially affect the overall learning experience Cengage Learning reserves the right to remove additional content at any time if subsequent rights restrictions require it CHAPTER 2 The Simple Regression Model 59 APPEndix 2A Minimizing the sum of squared residuals We show that the OLS estimates b 0 and b 1 do minimize the sum of squared residuals as asserted in Section 22 Formally the problem is to characterize the solutions b 0 and b 1 to the minimization problem min b0b1 a n i51 1yi 2 b0 2 b1xi2 2 where b0 and b1 are the dummy arguments for the optimization problem for simplicity call this function Q1b0 b12 By a fundamental result from multivariable calculus see Appendix A a nec essary condition for b 0 and b 1 to solve the minimization problem is that the partial derivatives of Q1b0 b12 with respect to b0 and b1 must be zero when evaluated at b 0 b 1 Q1b 0 b 12b0 5 0 and Q1b 0 b 12b1 5 0 Using the chain rule from calculus these two equations become 22 a n i51 1yi 2 b 0 2 b 1xi2 5 0 22 a n i51xi1yi 2 b 0 2 b 1xi2 5 0 These two equations are just 214 and 215 multiplied by 2n and therefore are solved by the same b 0 and b 1 How do we know that we have actually minimized the sum of squared residuals The first order conditions are necessary but not sufficient conditions One way to verify that we have minimized the sum of squared residuals is to write for any b0 and b1 Q1b0 b12 5 a n i51 3yi 2 b 0 2 b 1xi 1 1b 0 2 b02 1 1b 1 2 b12xi42 5 a n i51 1u i 1 1b 0 2 b02 1 1b 1 2 b12xi42 5 a n i51u 2 i 1 n1b 0 2 b02 2 1 1b 1 2 b12 2 a n i51x2 i 1 21b 0 2 b02 1b 1 2 b12 a n i51xi where we have used equations 230 and 231 The first term does not depend on b0 or b1 while the sum of the last three terms can be written as a n i51 3 1b 0 2 b02 1 1b 1 2 b12xi42 as can be verified by straightforward algebra Because this is a sum of squared terms the smallest it can be is zero Therefore it is smallest when b0 5 b 0 and b1 5 b 1 Copyright 2016 Cengage Learning All Rights Reserved May not be copied scanned or duplicated in whole or in part Due to electronic rights some third party content may be suppressed from the eBook andor eChapters Editorial review has deemed that any suppressed content does not materially affect the overall learning experience Cengage Learning reserves the right to remove additional content at any time if subsequent rights restrictions require it 60 c h a p t e r 3 Multiple Regression Analysis Estimation I n Chapter 2 we learned how to use simple regression analysis to explain a dependent variable y as a function of a single independent variable x The primary drawback in using simple regression analysis for empirical work is that it is very difficult to draw ceteris paribus conclusions about how x affects y the key assumption SLR4that 
all other factors affecting y are uncorrelated with xis often unrealistic Multiple regression analysis is more amenable to ceteris paribus analysis because it allows us to explicitly control for many other factors that simultaneously affect the dependent variable This is important both for testing economic theories and for evaluating policy effects when we must rely on nonexperimental data Because multiple regression models can accommodate many explanatory variables that may be cor related we can hope to infer causality in cases where simple regression analysis would be misleading Naturally if we add more factors to our model that are useful for explaining y then more of the variation in y can be explained Thus multiple regression analysis can be used to build better models for predicting the dependent variable An additional advantage of multiple regression analysis is that it can incorporate fairly general functional form relationships In the simple regression model only one function of a single explana tory variable can appear in the equation As we will see the multiple regression model allows for much more flexibility Section 31 formally introduces the multiple regression model and further discusses the advan tages of multiple regression over simple regression In Section 32 we demonstrate how to esti mate the parameters in the multiple regression model using the method of ordinary least squares Copyright 2016 Cengage Learning All Rights Reserved May not be copied scanned or duplicated in whole or in part Due to electronic rights some third party content may be suppressed from the eBook andor eChapters Editorial review has deemed that any suppressed content does not materially affect the overall learning experience Cengage Learning reserves the right to remove additional content at any time if subsequent rights restrictions require it CHAPTER 3 Multiple Regression Analysis Estimation 61 In Sections 33 34 and 35 we describe various statistical properties of the OLS estimators includ ing unbiasedness and efficiency The multiple regression model is still the most widely used vehicle for empirical analysis in eco nomics and other social sciences Likewise the method of ordinary least squares is popularly used for estimating the parameters of the multiple regression model 31 Motivation for Multiple Regression 31a The Model with Two Independent Variables We begin with some simple examples to show how multiple regression analysis can be used to solve problems that cannot be solved by simple regression The first example is a simple variation of the wage equation introduced in Chapter 2 for obtaining the effect of education on hourly wage wage 5 b0 1 b1educ 1 b2exper 1 u 31 where exper is years of labor market experience Thus wage is determined by the two explanatory or independent variables education and experience and by other unobserved factors which are con tained in u We are still primarily interested in the effect of educ on wage holding fixed all other fac tors affecting wage that is we are interested in the parameter b1 Compared with a simple regression analysis relating wage to educ equation 31 effectively takes exper out of the error term and puts it explicitly in the equation Because exper appears in the equation its coefficient b2 measures the ceteris paribus effect of exper on wage which is also of some interest Not surprisingly just as with simple regression we will have to make assumptions about how u in 31 is related to the independent variables educ and exper However as we will 
see in Section 32 there is one thing of which we can be confident because 31 contains experience explicitly we will be able to measure the effect of education on wage holding experience fixed In a simple regression analysiswhich puts exper in the error termwe would have to assume that experience is uncorrelated with education a tenuous assumption As a second example consider the problem of explaining the effect of perstudent spending expend on the average standardized test score avgscore at the high school level Suppose that the average test score depends on funding average family income avginc and other unobserved factors avgscore 5 b0 1 b1expend 1 b2avginc 1 u 32 The coefficient of interest for policy purposes is b1 the ceteris paribus effect of expend on avgscore By including avginc explicitly in the model we are able to control for its effect on avgscore This is likely to be important because average family income tends to be correlated with perstudent spend ing spending levels are often determined by both property and local income taxes In simple regres sion analysis avginc would be included in the error term which would likely be correlated with expend causing the OLS estimator of b1 in the twovariable model to be biased In the two previous similar examples we have shown how observable factors other than the vari able of primary interest educ in equation 31 and expend in equation 32 can be included in a regression model Generally we can write a model with two independent variables as y 5 b0 1 b1x1 1 b2x2 1 u 33 where b0 is the intercept b1 measures the change in y with respect to x1 holding other factors fixed b2 measures the change in y with respect to x2 holding other factors fixed Copyright 2016 Cengage Learning All Rights Reserved May not be copied scanned or duplicated in whole or in part Due to electronic rights some third party content may be suppressed from the eBook andor eChapters Editorial review has deemed that any suppressed content does not materially affect the overall learning experience Cengage Learning reserves the right to remove additional content at any time if subsequent rights restrictions require it PART 1 Regression Analysis with CrossSectional Data 62 Multiple regression analysis is also useful for generalizing functional relationships between variables As an example suppose family consumption cons is a quadratic function of family income inc cons 5 b0 1 b1inc 1 b2inc2 1 u 34 where u contains other factors affecting consumption In this model consumption depends on only one observed factor income so it might seem that it can be handled in a simple regression frame work But the model falls outside simple regression because it contains two functions of income inc and inc2 and therefore three parameters b0 b1 and b2 Nevertheless the consumption function is easily written as a regression model with two independent variables by letting x1 5 inc and x2 5 inc2 Mechanically there will be no difference in using the method of ordinary least squares intro duced in Section 32 to estimate equations as different as 31 and 34 Each equation can be writ ten as 33 which is all that matters for computation There is however an important difference in how one interprets the parameters In equation 31 b1 is the ceteris paribus effect of educ on wage The parameter b1 has no such interpretation in 34 In other words it makes no sense to measure the effect of inc on cons while holding inc2 fixed because if inc changes then so must inc2 Instead the change in consumption with respect to 
In the model with two independent variables, the key assumption about how u is related to $x_1$ and $x_2$ is

$E(u \mid x_1, x_2) = 0.$  (3.5)

The interpretation of condition (3.5) is similar to the interpretation of Assumption SLR.4 for simple regression analysis. It means that, for any values of $x_1$ and $x_2$ in the population, the average of the unobserved factors is equal to zero. As with simple regression, the important part of the assumption is that the expected value of u is the same for all combinations of $x_1$ and $x_2$; that this common value is zero is no assumption at all as long as the intercept $\beta_0$ is included in the model (see Section 2-1).

How can we interpret the zero conditional mean assumption in the previous examples? In equation (3.1), the assumption is $E(u \mid educ, exper) = 0$. This implies that other factors affecting wage are not related, on average, to educ and exper. Therefore, if we think innate ability is part of u, then we will need average ability levels to be the same across all combinations of education and experience in the working population. This may or may not be true, but, as we will see in Section 3-3, this is the question we need to ask in order to determine whether the method of ordinary least squares produces unbiased estimators.

[Exploring Further 3.1: A simple model to explain city murder rates (murdrate) in terms of the probability of conviction (prbconv) and average sentence length (avgsen) is $murdrate = \beta_0 + \beta_1 prbconv + \beta_2 avgsen + u$. What are some factors contained in u? Do you think the key assumption (3.5) is likely to hold?]

The example measuring student performance, equation (3.2), is similar to the wage equation. The zero conditional mean assumption is $E(u \mid expend, avginc) = 0$, which means that other factors affecting test scores (school or student characteristics) are, on average, unrelated to per-student funding and average family income.

When applied to the quadratic consumption function in (3.4), the zero conditional mean assumption has a slightly different interpretation. Written literally, equation (3.5) becomes $E(u \mid inc, inc^2) = 0$. Since inc² is known when inc is known, including inc² in the expectation is redundant: $E(u \mid inc, inc^2) = 0$ is the same as $E(u \mid inc) = 0$. Nothing is wrong with putting inc² along with inc in the expectation when stating the assumption, but $E(u \mid inc) = 0$ is more concise.
3-1b The Model with k Independent Variables

Once we are in the context of multiple regression, there is no need to stop with two independent variables. Multiple regression analysis allows many observed factors to affect y. In the wage example, we might also include amount of job training, years of tenure with the current employer, measures of ability, and even demographic variables like the number of siblings or mother's education. In the school funding example, additional variables might include measures of teacher quality and school size.

The general multiple linear regression (MLR) model (also called the multiple regression model) can be written in the population as

$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \dots + \beta_k x_k + u,$  (3.6)

where $\beta_0$ is the intercept, $\beta_1$ is the parameter associated with $x_1$, $\beta_2$ is the parameter associated with $x_2$, and so on. Since there are k independent variables and an intercept, equation (3.6) contains k + 1 (unknown) population parameters. For shorthand purposes, we will sometimes refer to the parameters other than the intercept as slope parameters, even though this is not always literally what they are. [See equation (3.4), where neither $\beta_1$ nor $\beta_2$ is itself a slope, but together they determine the slope of the relationship between consumption and income.]

The terminology for multiple regression is similar to that for simple regression and is given in Table 3.1. Just as in simple regression, the variable u is the error term or disturbance. It contains factors other than $x_1, x_2, \dots, x_k$ that affect y. No matter how many explanatory variables we include in our model, there will always be factors we cannot include, and these are collectively contained in u.

Table 3.1 Terminology for Multiple Regression

  y                      x1, x2, ..., xk
  Dependent variable     Independent variables
  Explained variable     Explanatory variables
  Response variable      Control variables
  Predicted variable     Predictor variables
  Regressand             Regressors

When applying the general multiple regression model, we must know how to interpret the parameters. We will get plenty of practice now and in subsequent chapters, but it is useful at this point to be reminded of some things we already know. Suppose that CEO salary (salary) is related to firm sales (sales) and CEO tenure (ceoten) with the firm by

$\log(salary) = \beta_0 + \beta_1 \log(sales) + \beta_2 ceoten + \beta_3 ceoten^2 + u.$  (3.7)

This fits into the multiple regression model (with k = 3) by defining $y = \log(salary)$, $x_1 = \log(sales)$, $x_2 = ceoten$, and $x_3 = ceoten^2$. As we know from Chapter 2, the parameter $\beta_1$ is the (ceteris paribus) elasticity of salary with respect to sales. If $\beta_3 = 0$, then $100\beta_2$ is approximately the ceteris paribus percentage increase in salary when ceoten increases by one year. When $\beta_3 \neq 0$, the effect of ceoten on salary is more complicated. We will postpone a detailed treatment of general models with quadratics until Chapter 6.

Equation (3.7) provides an important reminder about multiple regression analysis. The term "linear" in a multiple linear regression model means that equation (3.6) is linear in the parameters, $\beta_j$. Equation (3.7) is an example of a multiple regression model that, while linear in the $\beta_j$, is a nonlinear relationship between salary and the variables sales and ceoten. Many applications of multiple linear regression involve nonlinear relationships among the underlying variables.
The key assumption for the general multiple regression model is easy to state in terms of a conditional expectation:

$E(u \mid x_1, x_2, \dots, x_k) = 0.$  (3.8)

At a minimum, equation (3.8) requires that all factors in the unobserved error term be uncorrelated with the explanatory variables. It also means that we have correctly accounted for the functional relationships between the explained and explanatory variables. Any problem that causes u to be correlated with any of the independent variables causes (3.8) to fail. In Section 3-3, we will show that assumption (3.8) implies that OLS is unbiased and will derive the bias that arises when a key variable has been omitted from the equation. In Chapters 15 and 16, we will study other reasons that might cause (3.8) to fail and show what can be done in cases where it does fail.

3-2 Mechanics and Interpretation of Ordinary Least Squares

We now summarize some computational and algebraic features of the method of ordinary least squares as it applies to a particular set of data. We also discuss how to interpret the estimated equation.

3-2a Obtaining the OLS Estimates

We first consider estimating the model with two independent variables. The estimated OLS equation is written in a form similar to the simple regression case:

$\hat{y} = \hat\beta_0 + \hat\beta_1 x_1 + \hat\beta_2 x_2,$  (3.9)

where $\hat\beta_0$ is the estimate of $\beta_0$, $\hat\beta_1$ is the estimate of $\beta_1$, and $\hat\beta_2$ is the estimate of $\beta_2$.

But how do we obtain $\hat\beta_0$, $\hat\beta_1$, and $\hat\beta_2$? The method of ordinary least squares chooses the estimates to minimize the sum of squared residuals. That is, given n observations on y, $x_1$, and $x_2$, $\{(x_{i1}, x_{i2}, y_i): i = 1, 2, \dots, n\}$, the estimates $\hat\beta_0$, $\hat\beta_1$, and $\hat\beta_2$ are chosen simultaneously to make

$\sum_{i=1}^{n} (y_i - \hat\beta_0 - \hat\beta_1 x_{i1} - \hat\beta_2 x_{i2})^2$  (3.10)

as small as possible.

To understand what OLS is doing, it is important to master the meaning of the indexing of the independent variables in (3.10). The independent variables have two subscripts here: i, followed by either 1 or 2. The i subscript refers to the observation number. Thus, the sum in (3.10) is over all i = 1 to n observations. The second index is simply a method of distinguishing between different independent variables. In the example relating wage to educ and exper, $x_{i1} = educ_i$ is education for person i in the sample, and $x_{i2} = exper_i$ is experience for person i. The sum of squared residuals in equation (3.10) is $\sum_{i=1}^{n} (wage_i - \hat\beta_0 - \hat\beta_1 educ_i - \hat\beta_2 exper_i)^2$. In what follows, the i subscript is reserved for indexing the observation number. If we write $x_{ij}$, then this means the ith observation on the jth independent variable. (Some authors prefer to switch the order of the observation number and the variable number, so that $x_{1i}$ is observation i on variable one. But this is just a matter of notational taste.)

In the general case with k independent variables, we seek estimates $\hat\beta_0, \hat\beta_1, \dots, \hat\beta_k$ in the equation

$\hat{y} = \hat\beta_0 + \hat\beta_1 x_1 + \hat\beta_2 x_2 + \dots + \hat\beta_k x_k.$  (3.11)

The OLS estimates, k + 1 of them, are chosen to minimize the sum of squared residuals:

$\sum_{i=1}^{n} (y_i - \hat\beta_0 - \hat\beta_1 x_{i1} - \dots - \hat\beta_k x_{ik})^2.$  (3.12)

This minimization problem can be solved using multivariable calculus (see Appendix 3A). This leads to k + 1 linear equations in k + 1 unknowns $\hat\beta_0, \hat\beta_1, \dots, \hat\beta_k$:

$\sum_{i=1}^{n} (y_i - \hat\beta_0 - \hat\beta_1 x_{i1} - \dots - \hat\beta_k x_{ik}) = 0$
$\sum_{i=1}^{n} x_{i1}(y_i - \hat\beta_0 - \hat\beta_1 x_{i1} - \dots - \hat\beta_k x_{ik}) = 0$
$\sum_{i=1}^{n} x_{i2}(y_i - \hat\beta_0 - \hat\beta_1 x_{i1} - \dots - \hat\beta_k x_{ik}) = 0$
$\vdots$
$\sum_{i=1}^{n} x_{ik}(y_i - \hat\beta_0 - \hat\beta_1 x_{i1} - \dots - \hat\beta_k x_{ik}) = 0.$  (3.13)

These are often called the OLS first order conditions. As with the simple regression model in Section 2-2, the OLS first order conditions can be obtained by the method of moments: under assumption (3.8), $E(u) = 0$ and $E(x_j u) = 0$, where j = 1, 2, ..., k. The equations in (3.13) are the sample counterparts of these population moments, although we have omitted the division by the sample size n.
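In matrix form, the system (3.13) is the set of normal equations $X'X\hat\beta = X'y$, where the first column of X is a column of ones for the intercept. A minimal numpy sketch, using simulated data purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Simulated data for y = 1 + 2*x1 - 3*x2 + u (illustrative only)
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 1 + 2 * x1 - 3 * x2 + rng.normal(size=n)

# Design matrix with a column of ones for the intercept
X = np.column_stack([np.ones(n), x1, x2])

# Solve the normal equations X'X b = X'y, the matrix form of (3.13)
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_hat)  # approximately [1, 2, -3]
```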
For even moderately sized n and k, solving the equations in (3.13) by hand calculations is tedious. Nevertheless, modern computers running standard statistics and econometrics software can solve these equations with large n and k very quickly.

There is only one slight caveat: we must assume that the equations in (3.13) can be solved uniquely for the $\hat\beta_j$. For now, we just assume this, as it is usually the case in well-specified models. In Section 3-3, we state the assumption needed for unique OLS estimates to exist (see Assumption MLR.3).

As in simple regression analysis, equation (3.11) is called the OLS regression line or the sample regression function (SRF). We will call $\hat\beta_0$ the OLS intercept estimate and $\hat\beta_1, \dots, \hat\beta_k$ the OLS slope estimates (corresponding to the independent variables $x_1, x_2, \dots, x_k$).

To indicate that an OLS regression has been run, we will either write out equation (3.11) with y and $x_1, \dots, x_k$ replaced by their variable names (such as wage, educ, and exper), or we will say that we "ran an OLS regression of y on $x_1, x_2, \dots, x_k$" or that we "regressed y on $x_1, x_2, \dots, x_k$." These are shorthand for saying that the method of ordinary least squares was used to obtain the OLS equation (3.11). Unless explicitly stated otherwise, we always estimate an intercept along with the slopes.

3-2b Interpreting the OLS Regression Equation

More important than the details underlying the computation of the $\hat\beta_j$ is the interpretation of the estimated equation. We begin with the case of two independent variables:

$\hat{y} = \hat\beta_0 + \hat\beta_1 x_1 + \hat\beta_2 x_2.$  (3.14)

The intercept $\hat\beta_0$ in equation (3.14) is the predicted value of y when $x_1 = 0$ and $x_2 = 0$. Sometimes, setting $x_1$ and $x_2$ both equal to zero is an interesting scenario; in other cases, it will not make sense. Nevertheless, the intercept is always needed to obtain a prediction of y from the OLS regression line, as (3.14) makes clear.

The estimates $\hat\beta_1$ and $\hat\beta_2$ have partial effect, or ceteris paribus, interpretations. From equation (3.14), we have

$\Delta\hat{y} = \hat\beta_1 \Delta x_1 + \hat\beta_2 \Delta x_2,$

so we can obtain the predicted change in y given the changes in $x_1$ and $x_2$. (Note how the intercept has nothing to do with the changes in y.) In particular, when $x_2$ is held fixed, so that $\Delta x_2 = 0$, then

$\Delta\hat{y} = \hat\beta_1 \Delta x_1$, holding $x_2$ fixed.

The key point is that, by including $x_2$ in our model, we obtain a coefficient on $x_1$ with a ceteris paribus interpretation. This is why multiple regression analysis is so useful. Similarly,

$\Delta\hat{y} = \hat\beta_2 \Delta x_2$, holding $x_1$ fixed.
Example 3.1 Determinants of College GPA

The variables in GPA1 include the college grade point average (colGPA), high school GPA (hsGPA), and achievement test score (ACT) for a sample of 141 students from a large university; both college and high school GPAs are on a four-point scale. We obtain the following OLS regression line to predict college GPA from high school GPA and achievement test score:

$\widehat{colGPA} = 1.29 + .453\, hsGPA + .0094\, ACT$
$n = 141.$  (3.15)

How do we interpret this equation? First, the intercept 1.29 is the predicted college GPA if hsGPA and ACT are both set at zero. Since no one who attends college has either a zero high school GPA or a zero on the achievement test, the intercept in this equation is not, by itself, meaningful.

More interesting estimates are the slope coefficients on hsGPA and ACT. As expected, there is a positive partial relationship between colGPA and hsGPA: holding ACT fixed, another point on hsGPA is associated with .453 of a point on the college GPA, or almost half a point. In other words, if we choose two students, A and B, and these students have the same ACT score, but the high school GPA of Student A is one point higher than the high school GPA of Student B, then we predict Student A to have a college GPA .453 higher than that of Student B. (This says nothing about any two actual people, but it is our best prediction.)

The sign on ACT implies that, while holding hsGPA fixed, a change in the ACT score of 10 points (a very large change, since the maximum ACT score is 36 and the average score in the sample is about 24, with a standard deviation less than three) affects colGPA by less than one-tenth of a point. This is a small effect, and it suggests that, once high school GPA is accounted for, the ACT score is not a strong predictor of college GPA. (Naturally, there are many other factors that contribute to GPA, but here we focus on statistics available for high school students.) Later, after we discuss statistical inference, we will show that not only is the coefficient on ACT practically small, it is also statistically insignificant.

If we focus on a simple regression analysis relating colGPA to ACT only, we obtain

$\widehat{colGPA} = 2.40 + .0271\, ACT$
$n = 141;$

thus, the coefficient on ACT is almost three times as large as the estimate in (3.15). But this equation does not allow us to compare two people with the same high school GPA; it corresponds to a different experiment. We say more about the differences between multiple and simple regression later.
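Regressions like (3.15) are easy to reproduce in software. A minimal sketch with statsmodels, assuming the GPA1 data are available locally; here they are loaded through the third-party `wooldridge` data package, which mirrors the textbook data sets:

```python
import statsmodels.formula.api as smf
import wooldridge  # third-party package that ships the textbook data sets

# Load the GPA1 data used in Example 3.1
df = wooldridge.data('gpa1')

# OLS regression of colGPA on hsGPA and ACT; an intercept is included by default
res = smf.ols('colGPA ~ hsGPA + ACT', data=df).fit()
print(res.params)   # intercept about 1.29, slopes about .453 and .0094
print(res.nobs)     # 141 observations
```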
The case with more than two independent variables is similar. The OLS regression line is

$\hat{y} = \hat\beta_0 + \hat\beta_1 x_1 + \hat\beta_2 x_2 + \dots + \hat\beta_k x_k.$  (3.16)

Written in terms of changes,

$\Delta\hat{y} = \hat\beta_1 \Delta x_1 + \hat\beta_2 \Delta x_2 + \dots + \hat\beta_k \Delta x_k.$  (3.17)

The coefficient on $x_1$ measures the change in $\hat{y}$ due to a one-unit increase in $x_1$, holding all other independent variables fixed. That is,

$\Delta\hat{y} = \hat\beta_1 \Delta x_1,$  (3.18)

holding $x_2, x_3, \dots, x_k$ fixed. Thus, we have controlled for the variables $x_2, x_3, \dots, x_k$ when estimating the effect of $x_1$ on y. The other coefficients have a similar interpretation. The following is an example with three independent variables.

Example 3.2 Hourly Wage Equation

Using the 526 observations on workers in WAGE1, we include educ (years of education), exper (years of labor market experience), and tenure (years with the current employer) in an equation explaining log(wage). The estimated equation is

$\widehat{\log(wage)} = .284 + .092\, educ + .0041\, exper + .022\, tenure$
$n = 526.$  (3.19)

As in the simple regression case, the coefficients have a percentage interpretation; the only difference here is that they also have a ceteris paribus interpretation. The coefficient .092 means that, holding exper and tenure fixed, another year of education is predicted to increase log(wage) by .092, which translates into an approximate 9.2% [100(.092)] increase in wage. Alternatively, if we take two people with the same levels of experience and job tenure, the coefficient on educ is the proportionate difference in predicted wage when their education levels differ by one year. This measure of the return to education at least keeps two important productivity factors fixed; whether it is a good estimate of the ceteris paribus return to another year of education requires us to study the statistical properties of OLS (see Section 3-3).

3-2c On the Meaning of "Holding Other Factors Fixed" in Multiple Regression

The partial effect interpretation of slope coefficients in multiple regression analysis can cause some confusion, so we provide a further discussion now.

In Example 3.1, we observed that the coefficient on ACT measures the predicted difference in colGPA, holding hsGPA fixed. The power of multiple regression analysis is that it provides this ceteris paribus interpretation even though the data have not been collected in a ceteris paribus fashion. In giving the coefficient on ACT a partial effect interpretation, it may seem that we actually went out and sampled people with the same high school GPA but possibly with different ACT scores. This is not the case. The data are a random sample from a large university: there were no restrictions placed on the sample values of hsGPA or ACT in obtaining the data. Rarely do we have the luxury of holding certain variables fixed in obtaining our sample. If we could collect a sample of individuals with the same high school GPA, then we could perform a simple regression analysis relating colGPA to ACT. Multiple regression effectively allows us to mimic this situation without restricting the values of any independent variables.

The power of multiple regression analysis is that it allows us to do in nonexperimental environments what natural scientists are able to do in a controlled laboratory setting: keep other factors fixed.

3-2d Changing More Than One Independent Variable Simultaneously

Sometimes, we want to change more than one independent variable at the same time to find the resulting effect on the dependent variable. This is easily done using equation (3.17). For example, in equation (3.19), we can obtain the estimated effect on wage when an individual stays at the same firm for another year: exper (general workforce experience) and tenure both increase by one year. The total effect (holding educ fixed) is

$\Delta\widehat{\log(wage)} = .0041\, \Delta exper + .022\, \Delta tenure = .0041 + .022 = .0261,$

or about 2.6%. Since exper and tenure each increase by one year, we just add the coefficients on exper and tenure and multiply by 100 to turn the effect into a percentage.
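The same calculation can be read straight off a fitted model's parameter vector. A sketch in the spirit of the earlier one, again assuming the third-party `wooldridge` package provides the WAGE1 data:

```python
import numpy as np
import statsmodels.formula.api as smf
import wooldridge

# Fit the wage equation (3.19) on the WAGE1 data
wage1 = wooldridge.data('wage1')
res = smf.ols('np.log(wage) ~ educ + exper + tenure', data=wage1).fit()

# One more year at the same firm raises exper and tenure by one each, so the
# predicted change in log(wage), holding educ fixed, is the sum of the slopes
delta = res.params['exper'] + res.params['tenure']
print(f"predicted change: {delta:.4f} (about {100 * delta:.1f}%)")
```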
3-2e OLS Fitted Values and Residuals

After obtaining the OLS regression line (3.11), we can obtain a fitted or predicted value for each observation. For observation i, the fitted value is simply

$\hat{y}_i = \hat\beta_0 + \hat\beta_1 x_{i1} + \hat\beta_2 x_{i2} + \dots + \hat\beta_k x_{ik},$  (3.20)

which is just the predicted value obtained by plugging the values of the independent variables for observation i into equation (3.11). We should not forget about the intercept in obtaining the fitted values; otherwise, the answer can be very misleading. As an example, if in (3.15) $hsGPA_i = 3.5$ and $ACT_i = 24$, then $\widehat{colGPA}_i = 1.29 + .453(3.5) + .0094(24) = 3.101$ (rounded to three places after the decimal).

Normally, the actual value $y_i$ for any observation i will not equal the predicted value $\hat{y}_i$: OLS minimizes the average squared prediction error, which says nothing about the prediction error for any particular observation. The residual for observation i is defined just as in the simple regression case,

$\hat{u}_i = y_i - \hat{y}_i.$  (3.21)

There is a residual for each observation. If $\hat{u}_i > 0$, then $\hat{y}_i$ is below $y_i$, which means that, for this observation, $y_i$ is underpredicted. If $\hat{u}_i < 0$, then $y_i < \hat{y}_i$, and $y_i$ is overpredicted.

[Exploring Further 3.2: In Example 3.1, the OLS fitted line explaining college GPA in terms of high school GPA and ACT score is $\widehat{colGPA} = 1.29 + .453\, hsGPA + .0094\, ACT$. If the average high school GPA is about 3.4 and the average ACT score is about 24.2, what is the average college GPA in the sample?]

The OLS fitted values and residuals have some important properties that are immediate extensions from the single variable case:

1. The sample average of the residuals is zero, and so $\bar{\hat{y}} = \bar{y}$.
2. The sample covariance between each independent variable and the OLS residuals is zero. Consequently, the sample covariance between the OLS fitted values and the OLS residuals is zero.
3. The point $(\bar{x}_1, \bar{x}_2, \dots, \bar{x}_k, \bar{y})$ is always on the OLS regression line: $\bar{y} = \hat\beta_0 + \hat\beta_1\bar{x}_1 + \hat\beta_2\bar{x}_2 + \dots + \hat\beta_k\bar{x}_k$.

The first two properties are immediate consequences of the set of equations used to obtain the OLS estimates. The first equation in (3.13) says that the sum of the residuals is zero. The remaining equations are of the form $\sum_{i=1}^{n} x_{ij}\hat{u}_i = 0$, which implies that each independent variable has zero sample covariance with $\hat{u}_i$. Property 3 follows immediately from property 1.
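These properties are mechanical consequences of the first order conditions, so they can be checked numerically on any OLS fit. A small sketch on simulated data (illustrative only):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500
x1, x2 = rng.normal(size=n), rng.normal(size=n)
y = 0.5 + 1.0 * x1 - 2.0 * x2 + rng.normal(size=n)

X = np.column_stack([np.ones(n), x1, x2])
b = np.linalg.solve(X.T @ X, X.T @ y)
fitted = X @ b
resid = y - fitted

print(resid.mean())                  # property 1: essentially zero
print(np.cov(x1, resid)[0, 1])       # property 2: essentially zero
print(np.cov(fitted, resid)[0, 1])   # fitted values and residuals uncorrelated
print(fitted.mean(), y.mean())       # property 1 again: the averages agree
```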
3-2f A "Partialling Out" Interpretation of Multiple Regression

When applying OLS, we do not need to know explicit formulas for the $\hat\beta_j$ that solve the system of equations in (3.13). Nevertheless, for certain derivations, we do need explicit formulas for the $\hat\beta_j$. These formulas also shed further light on the workings of OLS.

Consider again the case with k = 2 independent variables, $\hat{y} = \hat\beta_0 + \hat\beta_1 x_1 + \hat\beta_2 x_2$. For concreteness, we focus on $\hat\beta_1$. One way to express $\hat\beta_1$ is

$\hat\beta_1 = \left( \sum_{i=1}^{n} \hat{r}_{i1} y_i \right) \Big/ \left( \sum_{i=1}^{n} \hat{r}_{i1}^2 \right),$  (3.22)

where the $\hat{r}_{i1}$ are the OLS residuals from a simple regression of $x_1$ on $x_2$, using the sample at hand. We regress our first independent variable, $x_1$, on our second independent variable, $x_2$, and then obtain the residuals (y plays no role here). Equation (3.22) shows that we can then do a simple regression of y on $\hat{r}_1$ to obtain $\hat\beta_1$. (Note that the residuals $\hat{r}_{i1}$ have a zero sample average, and so $\hat\beta_1$ is the usual slope estimate from simple regression.)

The representation in equation (3.22) gives another demonstration of $\hat\beta_1$'s partial effect interpretation. The residuals $\hat{r}_{i1}$ are the part of $x_{i1}$ that is uncorrelated with $x_{i2}$. Another way of saying this is that $\hat{r}_{i1}$ is $x_{i1}$ after the effects of $x_{i2}$ have been partialled out, or netted out. Thus, $\hat\beta_1$ measures the sample relationship between y and $x_1$ after $x_2$ has been partialled out.

In simple regression analysis, there is no partialling out of other variables because no other variables are included in the regression. Computer Exercise C5 steps you through the partialling out process using the wage data from Example 3.2. For practical purposes, the important thing is that $\hat\beta_1$ in the equation $\hat{y} = \hat\beta_0 + \hat\beta_1 x_1 + \hat\beta_2 x_2$ measures the change in y, given a one-unit increase in $x_1$, holding $x_2$ fixed.

In the general model with k explanatory variables, $\hat\beta_1$ can still be written as in equation (3.22), but the residuals $\hat{r}_{i1}$ come from the regression of $x_1$ on $x_2, \dots, x_k$. Thus, $\hat\beta_1$ measures the effect of $x_1$ on y after $x_2, \dots, x_k$ have been partialled or netted out. In econometrics, the general partialling out result is usually called the Frisch-Waugh theorem. It has many uses in theoretical and applied econometrics. We will see applications to time series regressions in Chapter 10.
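The partialling out result is easy to verify numerically: the slope from regressing y on the residuals $\hat{r}_{i1}$ reproduces the multiple regression coefficient exactly. A sketch on simulated data (illustrative only):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 300
x2 = rng.normal(size=n)
x1 = 0.6 * x2 + rng.normal(size=n)         # x1 correlated with x2
y = 1 + 2 * x1 + 3 * x2 + rng.normal(size=n)

def ols(X, y):
    return np.linalg.solve(X.T @ X, X.T @ y)

ones = np.ones(n)
# Full multiple regression of y on x1 and x2
b_multiple = ols(np.column_stack([ones, x1, x2]), y)

# Partialling out: residuals from regressing x1 on x2 ...
g = ols(np.column_stack([ones, x2]), x1)
r1 = x1 - np.column_stack([ones, x2]) @ g
# ... then a simple regression of y on those residuals, equation (3.22)
beta1_fw = (r1 @ y) / (r1 @ r1)

print(b_multiple[1], beta1_fw)             # the two agree up to rounding
```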
3-2g Comparison of Simple and Multiple Regression Estimates

Two special cases exist in which the simple regression of y on $x_1$ will produce the same OLS estimate on $x_1$ as the regression of y on $x_1$ and $x_2$. To be more precise, write the simple regression of y on $x_1$ as $\tilde{y} = \tilde\beta_0 + \tilde\beta_1 x_1$, and write the multiple regression as $\hat{y} = \hat\beta_0 + \hat\beta_1 x_1 + \hat\beta_2 x_2$. We know that the simple regression coefficient $\tilde\beta_1$ does not usually equal the multiple regression coefficient $\hat\beta_1$. It turns out there is a simple relationship between $\tilde\beta_1$ and $\hat\beta_1$, which allows for interesting comparisons between simple and multiple regression:

$\tilde\beta_1 = \hat\beta_1 + \hat\beta_2 \tilde\delta_1,$  (3.23)

where $\tilde\delta_1$ is the slope coefficient from the simple regression of $x_{i2}$ on $x_{i1}$, i = 1, ..., n. This equation shows how $\tilde\beta_1$ differs from the partial effect of $x_1$ on y. The confounding term is the partial effect of $x_2$ on y times the slope in the sample regression of $x_2$ on $x_1$. (See Section 3A-4 in the chapter appendix for a more general verification.)

The relationship between $\tilde\beta_1$ and $\hat\beta_1$ also shows there are two distinct cases where they are equal:

1. The partial effect of $x_2$ on y is zero in the sample. That is, $\hat\beta_2 = 0$.
2. $x_1$ and $x_2$ are uncorrelated in the sample. That is, $\tilde\delta_1 = 0$.

Even though simple and multiple regression estimates are almost never identical, we can use the above formula to characterize why they might be either very different or quite similar. For example, if $\hat\beta_2$ is small, we might expect the multiple and simple regression estimates of $\beta_1$ to be similar. In Example 3.1, the sample correlation between hsGPA and ACT is about .346, which is a nontrivial correlation. But the coefficient on ACT is fairly small. It is not surprising to find that the simple regression of colGPA on hsGPA produces a slope estimate of .482, which is not much different from the estimate .453 in (3.15).

Example 3.3 Participation in 401(k) Pension Plans

We use the data in 401K to estimate the effect of a plan's match rate (mrate) on the participation rate (prate) in its 401(k) pension plan. The match rate is the amount the firm contributes to a worker's fund for each dollar the worker contributes (up to some limit); thus, mrate = .75 means that the firm contributes 75 cents for each dollar contributed by the worker. The participation rate is the percentage of eligible workers having a 401(k) account. The variable age is the age of the 401(k) plan. There are 1,534 plans in the data set, the average prate is 87.36, the average mrate is .732, and the average age is 13.2.

Regressing prate on mrate, age gives

$\widehat{prate} = 80.12 + 5.52\, mrate + .243\, age$
$n = 1{,}534.$

Thus, both mrate and age have the expected effects. What happens if we do not control for age? The estimated effect of age is not trivial, and so we might expect a large change in the estimated effect of mrate if age is dropped from the regression. However, the simple regression of prate on mrate yields $\widehat{prate} = 83.08 + 5.86\, mrate$. The simple regression estimate of the effect of mrate on prate is clearly different from the multiple regression estimate, but the difference is not very big. (The simple regression estimate is only about 6.2% larger than the multiple regression estimate.) This can be explained by the fact that the sample correlation between mrate and age is only .12.

In the case with k independent variables, the simple regression of y on $x_1$ and the multiple regression of y on $x_1, x_2, \dots, x_k$ produce an identical estimate of $x_1$ only if (1) the OLS coefficients on $x_2$ through $x_k$ are all zero or (2) $x_1$ is uncorrelated with each of $x_2, \dots, x_k$. Neither of these is very likely in practice. But if the coefficients on $x_2$ through $x_k$ are small, or the sample correlations between $x_1$ and the other independent variables are insubstantial, then the simple and multiple regression estimates of the effect of $x_1$ on y can be similar.
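Equation (3.23) is an exact algebraic identity in any sample, which the following sketch confirms on simulated data (illustrative only):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 400
x1 = rng.normal(size=n)
x2 = 0.5 * x1 + rng.normal(size=n)
y = 1 + 2 * x1 + 4 * x2 + rng.normal(size=n)

def ols(X, y):
    return np.linalg.solve(X.T @ X, X.T @ y)

ones = np.ones(n)
b_simple = ols(np.column_stack([ones, x1]), y)       # beta_tilde
b_mult = ols(np.column_stack([ones, x1, x2]), y)     # beta_hats
delta = ols(np.column_stack([ones, x1]), x2)         # regression of x2 on x1

lhs = b_simple[1]                        # beta_tilde_1
rhs = b_mult[1] + b_mult[2] * delta[1]   # beta_hat_1 + beta_hat_2 * delta_tilde_1
print(lhs, rhs)                          # identical up to floating point error
```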
3-2h Goodness-of-Fit

As with simple regression, we can define the total sum of squares (SST), the explained sum of squares (SSE), and the residual sum of squares (or sum of squared residuals, SSR) as

$SST \equiv \sum_{i=1}^{n} (y_i - \bar{y})^2$  (3.24)
$SSE \equiv \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2$  (3.25)
$SSR \equiv \sum_{i=1}^{n} \hat{u}_i^2.$  (3.26)

Using the same argument as in the simple regression case, we can show that

$SST = SSE + SSR.$  (3.27)

In other words, the total variation in $\{y_i\}$ is the sum of the total variations in $\{\hat{y}_i\}$ and in $\{\hat{u}_i\}$.

Assuming that the total variation in y is nonzero, as is the case unless $y_i$ is constant in the sample, we can divide (3.27) by SST to get SSR/SST + SSE/SST = 1. Just as in the simple regression case, the R-squared is defined to be

$R^2 \equiv SSE/SST = 1 - SSR/SST,$  (3.28)

and it is interpreted as the proportion of the sample variation in $y_i$ that is explained by the OLS regression line. By definition, R² is a number between zero and one.

R² can also be shown to equal the squared correlation coefficient between the actual $y_i$ and the fitted values $\hat{y}_i$. That is,

$R^2 = \dfrac{\left( \sum_{i=1}^{n} (y_i - \bar{y})(\hat{y}_i - \bar{\hat{y}}) \right)^2}{\left( \sum_{i=1}^{n} (y_i - \bar{y})^2 \right)\left( \sum_{i=1}^{n} (\hat{y}_i - \bar{\hat{y}})^2 \right)}.$  (3.29)

We have put the average of the $\hat{y}_i$ in (3.29) to be true to the formula for a correlation coefficient; we know that this average equals $\bar{y}$ because the sample average of the residuals is zero and $y_i = \hat{y}_i + \hat{u}_i$.

Example 3.4 Determinants of College GPA

From the grade point average regression that we did earlier, the equation with R² is

$\widehat{colGPA} = 1.29 + .453\, hsGPA + .0094\, ACT$
$n = 141,\ R^2 = .176.$

This means that hsGPA and ACT together explain about 17.6% of the variation in college GPA for this sample of students. This may not seem like a high percentage, but we must remember that there are many other factors, including family background, personality, quality of high school education, and affinity for college, that contribute to a student's college performance. If hsGPA and ACT explained almost all of the variation in colGPA, then performance in college would be preordained by high school performance!

An important fact about R² is that it never decreases, and it usually increases, when another independent variable is added to a regression and the same set of observations is used for both regressions. This algebraic fact follows because, by definition, the sum of squared residuals never increases when additional regressors are added to the model. For example, the last digit of one's social security number has nothing to do with one's hourly wage, but adding this digit to a wage equation will increase the R² (by a little, at least).

An important caveat to the previous assertion about R-squared is that it assumes we do not have missing data on the explanatory variables. If two regressions use different sets of observations, then, in general, we cannot tell how the R-squareds will compare, even if one regression uses a subset of regressors. For example, suppose we have a full set of data on the variables y, $x_1$, and $x_2$, but for some units in our sample data are missing on $x_3$. Then we cannot say that the R-squared from regressing y on $x_1$, $x_2$ will be less than that from regressing y on $x_1$, $x_2$, and $x_3$: it could go either way. Missing data can be an important practical issue, and we will return to it in Chapter 9.

The fact that R² never decreases when any variable is added to a regression makes it a poor tool for deciding whether one variable or several variables should be added to a model. The factor that should determine whether an explanatory variable belongs in a model is whether the explanatory variable has a nonzero partial effect on y in the population. We will show how to test this hypothesis in Chapter 4, when we cover statistical inference. We will also see that, when used properly, R² allows us to test a group of variables to see if it is important for explaining y. For now, we use it as a goodness-of-fit measure for a given model.
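Both expressions for R², the ratio form in (3.28) and the squared correlation in (3.29), can be computed side by side; they agree on any OLS fit that includes an intercept. A sketch on simulated data (illustrative only):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 250
x1, x2 = rng.normal(size=n), rng.normal(size=n)
y = 1 + x1 + 0.5 * x2 + rng.normal(size=n)

X = np.column_stack([np.ones(n), x1, x2])
fitted = X @ np.linalg.solve(X.T @ X, X.T @ y)
resid = y - fitted

sst = np.sum((y - y.mean()) ** 2)
ssr = np.sum(resid ** 2)
r2_ratio = 1 - ssr / sst                     # equation (3.28)
r2_corr = np.corrcoef(y, fitted)[0, 1] ** 2  # equation (3.29)
print(r2_ratio, r2_corr)                     # the same number
```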
Example 3.5 Explaining Arrest Records

CRIME1 contains data on arrests during the year 1986 and other information on 2,725 men born in either 1960 or 1961 in California. Each man in the sample was arrested at least once prior to 1986. The variable narr86 is the number of times the man was arrested during 1986: it is zero for most men in the sample (72.29%), and it varies from 0 to 12. (The percentage of men arrested once during 1986 was 20.51.) The variable pcnv is the proportion (not percentage) of arrests prior to 1986 that led to conviction, avgsen is average sentence length served for prior convictions (zero for most people), ptime86 is months spent in prison in 1986, and qemp86 is the number of quarters during which the man was employed in 1986 (from zero to four).

A linear model explaining arrests is

$narr86 = \beta_0 + \beta_1 pcnv + \beta_2 avgsen + \beta_3 ptime86 + \beta_4 qemp86 + u,$

where pcnv is a proxy for the likelihood of being convicted of a crime and avgsen is a measure of expected severity of punishment, if convicted. The variable ptime86 captures the incarcerative effects of crime: if an individual is in prison, he cannot be arrested for a crime outside of prison. Labor market opportunities are crudely captured by qemp86.

First, we estimate the model without the variable avgsen. We obtain

$\widehat{narr86} = .712 - .150\, pcnv - .034\, ptime86 - .104\, qemp86$
$n = 2{,}725,\ R^2 = .0413.$

This equation says that, as a group, the three variables pcnv, ptime86, and qemp86 explain about 4.1% of the variation in narr86.

Each of the OLS slope coefficients has the anticipated sign. An increase in the proportion of convictions lowers the predicted number of arrests. If we increase pcnv by .50 (a large increase in the probability of conviction), then, holding the other factors fixed, $\Delta\widehat{narr86} = -.150(.50) = -.075$. This may seem unusual because an arrest cannot change by a fraction. But we can use this value to obtain the predicted change in expected arrests for a large group of men. For example, among 100 men, the predicted fall in arrests when pcnv increases by .50 is 7.5.

Similarly, a longer prison term leads to a lower predicted number of arrests. In fact, if ptime86 increases from 0 to 12, predicted arrests for a particular man fall by .034(12) = .408. Another quarter in which legal employment is reported lowers predicted arrests by .104, which would be 10.4 arrests among 100 men.

If avgsen is added to the model, we know that R² will increase. The estimated equation is

$\widehat{narr86} = .707 - .151\, pcnv + .0074\, avgsen - .037\, ptime86 - .103\, qemp86$
$n = 2{,}725,\ R^2 = .0422.$

Thus, adding the average sentence variable increases R² from .0413 to .0422, a practically small effect. The sign of the coefficient on avgsen is also unexpected: it says that a longer average sentence length increases criminal activity.

Example 3.5 deserves a final word of caution. The fact that the four explanatory variables included in the second regression explain only about 4.2% of the variation in narr86 does not necessarily mean that the equation is useless. Even though these variables collectively do not explain much of the variation in arrests, it is still possible that the OLS estimates are reliable estimates of the ceteris paribus effects of each independent variable on narr86. As we will see, whether this is the case does not directly depend on the size of R². Generally, a low R² indicates that it is hard to predict individual outcomes on y with much accuracy, something we study in more detail in Chapter 6. In the arrest example, the small R² reflects what we already suspect in the social sciences: it is generally very difficult to predict individual behavior.
3-2i Regression through the Origin

Sometimes, an economic theory or common sense suggests that $\beta_0$ should be zero, and so we should briefly mention OLS estimation when the intercept is zero. Specifically, we now seek an equation of the form

$\tilde{y} = \tilde\beta_1 x_1 + \tilde\beta_2 x_2 + \dots + \tilde\beta_k x_k,$  (3.30)

where the symbol "~" over the estimates is used to distinguish them from the OLS estimates obtained along with the intercept [as in (3.11)]. In (3.30), when $x_1 = 0, x_2 = 0, \dots, x_k = 0$, the predicted value is zero. In this case, $\tilde\beta_1, \dots, \tilde\beta_k$ are said to be the OLS estimates from the regression of y on $x_1, x_2, \dots, x_k$ through the origin.

The OLS estimates in (3.30), as always, minimize the sum of squared residuals, but with the intercept set at zero. You should be warned that the properties of OLS that we derived earlier no longer hold for regression through the origin. In particular, the OLS residuals no longer have a zero sample average. Further, if R² is defined as 1 − SSR/SST, where SST is given in (3.24) and SSR is now $\sum_{i=1}^{n} (y_i - \tilde\beta_1 x_{i1} - \dots - \tilde\beta_k x_{ik})^2$, then R² can actually be negative. This means that the sample average, $\bar{y}$, "explains" more of the variation in the $y_i$ than the explanatory variables. Either we should include an intercept in the regression or conclude that the explanatory variables poorly explain y. To always have a nonnegative R-squared, some economists prefer to calculate R² as the squared correlation coefficient between the actual and fitted values of y, as in (3.29). (In this case, the average fitted value must be computed directly, since it no longer equals $\bar{y}$.) However, there is no set rule on computing R-squared for regression through the origin.

One serious drawback with regression through the origin is that, if the intercept $\beta_0$ in the population model is different from zero, then the OLS estimators of the slope parameters will be biased. The bias can be severe in some cases. The cost of estimating an intercept when $\beta_0$ is truly zero is that the variances of the OLS slope estimators are larger.
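The warnings above are easy to see numerically: fit a through-origin regression to data generated with a large intercept, and the residuals no longer average to zero while 1 − SSR/SST can turn negative. A sketch (illustrative only):

```python
import numpy as np

rng = np.random.default_rng(5)
n = 200
x = rng.normal(size=n)
y = 10 + 0.2 * x + rng.normal(size=n)   # large true intercept

# Regression through the origin: slope = sum(x*y) / sum(x*x)
b_tilde = (x @ y) / (x @ x)
resid = y - b_tilde * x

print(resid.mean())                      # far from zero
sst = np.sum((y - y.mean()) ** 2)
ssr = np.sum(resid ** 2)
print(1 - ssr / sst)                     # strongly negative in this design
```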
3-3 The Expected Value of the OLS Estimators

We now turn to the statistical properties of OLS for estimating the parameters in an underlying population model. In this section, we derive the expected value of the OLS estimators. In particular, we state and discuss four assumptions, which are direct extensions of the simple regression model assumptions, under which the OLS estimators are unbiased for the population parameters. We also explicitly obtain the bias in OLS when an important variable has been omitted from the regression.

You should remember that statistical properties have nothing to do with a particular sample, but rather with the property of estimators when random sampling is done repeatedly. Thus, Sections 3-3, 3-4, and 3-5 are somewhat abstract. Although we give examples of deriving bias for particular models, it is not meaningful to talk about the statistical properties of a set of estimates obtained from a single sample.

The first assumption we make simply defines the multiple linear regression (MLR) model.

Assumption MLR.1 (Linear in Parameters)
The model in the population can be written as

$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_k x_k + u,$  (3.31)

where $\beta_0, \beta_1, \dots, \beta_k$ are the unknown parameters (constants) of interest and u is an unobserved random error or disturbance term.

Equation (3.31) formally states the population model, sometimes called the true model, to allow for the possibility that we might estimate a model that differs from (3.31). The key feature is that the model is linear in the parameters $\beta_0, \beta_1, \dots, \beta_k$. As we know, (3.31) is quite flexible because y and the independent variables can be arbitrary functions of the underlying variables of interest, such as natural logarithms and squares [see, for example, equation (3.7)].

Assumption MLR.2 (Random Sampling)
We have a random sample of n observations, $\{(x_{i1}, x_{i2}, \dots, x_{ik}, y_i): i = 1, 2, \dots, n\}$, following the population model in Assumption MLR.1.

Sometimes, we need to write the equation for a particular observation i: for a randomly drawn observation from the population, we have

$y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_k x_{ik} + u_i.$  (3.32)

Remember that i refers to the observation, and the second subscript on x is the variable number. For example, we can write a CEO salary equation for a particular CEO i as

$\log(salary_i) = \beta_0 + \beta_1 \log(sales_i) + \beta_2 ceoten_i + \beta_3 ceoten_i^2 + u_i.$  (3.33)

The term $u_i$ contains the unobserved factors for CEO i that affect his or her salary. For applications, it is usually easiest to write the model in population form, as in (3.31). It contains less clutter and emphasizes the fact that we are interested in estimating a population relationship.

In light of model (3.31), the OLS estimators $\hat\beta_0, \hat\beta_1, \hat\beta_2, \dots, \hat\beta_k$ from the regression of y on $x_1, \dots, x_k$ are now considered to be estimators of $\beta_0, \beta_1, \dots, \beta_k$. In Section 3-2, we saw that OLS chooses the intercept and slope estimates for a particular sample so that the residuals average to zero and the sample correlation between each independent variable and the residuals is zero. Still, we did not include conditions under which the OLS estimates are well defined for a given sample. The next assumption fills that gap.

Assumption MLR.3 (No Perfect Collinearity)
In the sample (and therefore in the population), none of the independent variables is constant, and there are no exact linear relationships among the independent variables.

Assumption MLR.3 is more complicated than its counterpart for simple regression because we must now look at relationships between all independent variables. If an independent variable in (3.31) is an exact linear combination of the other independent variables, then we say the model suffers from perfect collinearity, and it cannot be estimated by OLS.

It is important to note that Assumption MLR.3 does allow the independent variables to be correlated; they just cannot be perfectly correlated. If we did not allow for any correlation among the independent variables, then multiple regression would be of very limited use for econometric analysis.
For example, in the model relating test scores to educational expenditures and average family income, $avgscore = \beta_0 + \beta_1 expend + \beta_2 avginc + u$, we fully expect expend and avginc to be correlated: school districts with high average family incomes tend to spend more per student on education. In fact, the primary motivation for including avginc in the equation is that we suspect it is correlated with expend, and so we would like to hold it fixed in the analysis. Assumption MLR.3 only rules out perfect correlation between expend and avginc in our sample. We would be very unlucky to obtain a sample where per-student expenditures are perfectly correlated with average family income. But some correlation, perhaps a substantial amount, is expected and certainly allowed.

The simplest way that two independent variables can be perfectly correlated is when one variable is a constant multiple of another. This can happen when a researcher inadvertently puts the same variable measured in different units into a regression equation. For example, in estimating a relationship between consumption and income, it makes no sense to include as independent variables income measured in dollars as well as income measured in thousands of dollars. One of these is redundant. What sense would it make to hold income measured in dollars fixed while changing income measured in thousands of dollars?

We already know that different nonlinear functions of the same variable can appear among the regressors. For example, the model $cons = \beta_0 + \beta_1 inc + \beta_2 inc^2 + u$ does not violate Assumption MLR.3: even though $x_2 = inc^2$ is an exact function of $x_1 = inc$, inc² is not an exact linear function of inc. Including inc² in the model is a useful way to generalize functional form, unlike including income measured in dollars and in thousands of dollars.

Common sense tells us not to include the same explanatory variable measured in different units in the same regression equation. There are also more subtle ways that one independent variable can be a multiple of another. Suppose we would like to estimate an extension of a constant elasticity consumption function. It might seem natural to specify a model such as

$\log(cons) = \beta_0 + \beta_1 \log(inc) + \beta_2 \log(inc^2) + u,$  (3.34)

where $x_1 = \log(inc)$ and $x_2 = \log(inc^2)$. Using the basic properties of the natural log (see Appendix A), $\log(inc^2) = 2\log(inc)$. That is, $x_2 = 2x_1$, and naturally this holds for all observations in the sample. This violates Assumption MLR.3. What we should do instead is include $[\log(inc)]^2$, not $\log(inc^2)$, along with $\log(inc)$. This is a sensible extension of the constant elasticity model, and we will see how to interpret such models in Chapter 6.
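Perfect collinearity shows up in software as a rank-deficient design matrix, so the normal equations have no unique solution. The sketch below reproduces the units mistake, income in thousands of dollars and in dollars, on simulated data (illustrative only):

```python
import numpy as np

rng = np.random.default_rng(6)
n = 100
inc = rng.uniform(20.0, 100.0, size=n)      # income in thousands of dollars
inc_dollars = 1000.0 * inc                  # the same income, in dollars

X = np.column_stack([np.ones(n), inc, inc_dollars])
print(np.linalg.matrix_rank(X))             # 2, not 3: columns are linearly dependent

# With a rank-deficient X, OLS cannot separate the two income variables;
# least-squares routines report the deficient rank rather than unique slopes.
y = 5 + 0.8 * inc + rng.normal(size=n)
coef, residuals, rank, sv = np.linalg.lstsq(X, y, rcond=None)
print(rank)                                 # again 2: Assumption MLR.3 fails
```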
Another way that independent variables can be perfectly collinear is when one independent variable can be expressed as an exact linear function of two or more of the other independent variables. For example, suppose we want to estimate the effect of campaign spending on campaign outcomes. For simplicity, assume that each election has two candidates. Let voteA be the percentage of the vote for Candidate A, let expendA be campaign expenditures by Candidate A, let expendB be campaign expenditures by Candidate B, and let totexpend be total campaign expenditures; the latter three variables are all measured in dollars. It may seem natural to specify the model as

$voteA = \beta_0 + \beta_1 expendA + \beta_2 expendB + \beta_3 totexpend + u,$  (3.35)

in order to isolate the effects of spending by each candidate and the total amount of spending. But this model violates Assumption MLR.3 because $x_3 = x_1 + x_2$ by definition. Trying to interpret this equation in a ceteris paribus fashion reveals the problem. The parameter $\beta_1$ in equation (3.35) is supposed to measure the effect of increasing expenditures by Candidate A by one dollar on Candidate A's vote, holding Candidate B's spending and total spending fixed. This is nonsense, because if expendB and totexpend are held fixed, then we cannot increase expendA.

The solution to the perfect collinearity in (3.35) is simple: drop any one of the three variables from the model. We would probably drop totexpend, and then the coefficient on expendA would measure the effect of increasing expenditures by A on the percentage of the vote received by A, holding the spending by B fixed.

[Exploring Further 3.3: In the previous example, if we use as explanatory variables expendA, expendB, and shareA, where $shareA = 100 \cdot (expendA/totexpend)$ is the percentage share of total campaign expenditures made by Candidate A, does this violate Assumption MLR.3?]

The prior examples show that Assumption MLR.3 can fail if we are not careful in specifying our model. Assumption MLR.3 also fails if the sample size, n, is too small in relation to the number of parameters being estimated. In the general regression model in equation (3.31), there are k + 1 parameters, and MLR.3 fails if n < k + 1. Intuitively, this makes sense: to estimate k + 1 parameters, we need at least k + 1 observations. Not surprisingly, it is better to have as many observations as possible, something we will see with our variance calculations in Section 3-4.

If the model is carefully specified and n ≥ k + 1, Assumption MLR.3 can fail in rare cases due to bad luck in collecting the sample. For example, in a wage equation with education and experience as variables, it is possible that we could obtain a random sample where each individual has exactly twice as much education as years of experience. This scenario would cause Assumption MLR.3 to fail, but it can be considered very unlikely unless we have an extremely small sample size.

The final, and most important, assumption needed for unbiasedness is a direct extension of Assumption SLR.4.

Assumption MLR.4 (Zero Conditional Mean)
The error u has an expected value of zero given any values of the independent variables. In other words,

$E(u \mid x_1, x_2, \dots, x_k) = 0.$  (3.36)

One way that Assumption MLR.4 can fail is if the functional relationship between the explained and explanatory variables is misspecified in equation (3.31): for example, if we forget to include the quadratic term inc² in the consumption function $cons = \beta_0 + \beta_1 inc + \beta_2 inc^2 + u$ when we estimate the model. Another functional form misspecification occurs when we use the level of a variable when the log of the variable is what actually shows up in the population model, or vice versa. For example, if the true model has log(wage) as the dependent variable but we use wage as the dependent variable in our regression analysis, then the estimators will be biased. Intuitively, this should be pretty clear. We will discuss ways of detecting functional form misspecification in Chapter 9.

Omitting an important factor that is correlated with any of $x_1, x_2, \dots, x_k$ causes Assumption MLR.4 to fail also. With multiple regression analysis, we are able to include many factors among the explanatory variables, and omitted variables are less likely to be a problem in multiple regression analysis than in simple regression analysis.
Nevertheless, in any application, there are always factors that, due to data limitations or ignorance, we will not be able to include. If we think these factors should be controlled for and they are correlated with one or more of the independent variables, then Assumption MLR.4 will be violated. We will derive this bias later.

There are other ways that u can be correlated with an explanatory variable. In Chapters 9 and 15, we will discuss the problem of measurement error in an explanatory variable. In Chapter 16, we cover the conceptually more difficult problem in which one or more of the explanatory variables is determined jointly with y, as occurs when we view quantities and prices as being determined by the intersection of supply and demand curves. We must postpone our study of these problems until we have a firm grasp of multiple regression analysis under an ideal set of assumptions.

When Assumption MLR.4 holds, we often say that we have exogenous explanatory variables. If $x_j$ is correlated with u for any reason, then $x_j$ is said to be an endogenous explanatory variable. The terms "exogenous" and "endogenous" originated in simultaneous equations analysis (see Chapter 16), but the term "endogenous explanatory variable" has evolved to cover any case in which an explanatory variable may be correlated with the error term.

Before we show the unbiasedness of the OLS estimators under MLR.1 to MLR.4, a word of caution. Beginning students of econometrics sometimes confuse Assumptions MLR.3 and MLR.4, but they are quite different. Assumption MLR.3 rules out certain relationships among the independent or explanatory variables and has nothing to do with the error, u. You will know immediately when carrying out OLS estimation whether or not Assumption MLR.3 holds. On the other hand, Assumption MLR.4, the much more important of the two, restricts the relationship between the unobserved factors in u and the explanatory variables. Unfortunately, we will never know for sure whether the average value of the unobserved factors is unrelated to the explanatory variables. But this is the critical assumption.

We are now ready to show unbiasedness of OLS under the first four multiple regression assumptions. As in the simple regression case, the expectations are conditional on the values of the explanatory variables in the sample, something we show explicitly in Appendix 3A but not in the text.

Theorem 3.1 (Unbiasedness of OLS)
Under Assumptions MLR.1 through MLR.4,

$E(\hat\beta_j) = \beta_j,\quad j = 0, 1, \dots, k,$  (3.37)

for any values of the population parameter $\beta_j$. In other words, the OLS estimators are unbiased estimators of the population parameters.

In our previous empirical examples, Assumption MLR.3 has been satisfied (because we have been able to compute the OLS estimates). Furthermore, for the most part, the samples are randomly chosen from a well-defined population.
If we believe that the specified models are correct under the key Assumption MLR.4, then we can conclude that OLS is unbiased in these examples.

Since we are approaching the point where we can use multiple regression in serious empirical work, it is useful to remember the meaning of unbiasedness. It is tempting, in examples such as the wage equation in (3.19), to say something like "9.2% is an unbiased estimate of the return to education." As we know, an estimate cannot be unbiased: an estimate is a fixed number, obtained from a particular sample, which usually is not equal to the population parameter. When we say that OLS is unbiased under Assumptions MLR.1 through MLR.4, we mean that the procedure by which the OLS estimates are obtained is unbiased when we view the procedure as being applied across all possible random samples. We hope that we have obtained a sample that gives us an estimate close to the population value, but, unfortunately, this cannot be assured. What is assured is that we have no reason to believe our estimate is more likely to be too big or more likely to be too small.

3-3a Including Irrelevant Variables in a Regression Model

One issue that we can dispense with fairly quickly is that of inclusion of an irrelevant variable, or overspecifying the model, in multiple regression analysis. This means that one (or more) of the independent variables is included in the model even though it has no partial effect on y in the population. (That is, its population coefficient is zero.)

To illustrate the issue, suppose we specify the model as

$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + u,$  (3.38)

and this model satisfies Assumptions MLR.1 through MLR.4. However, $x_3$ has no effect on y after $x_1$ and $x_2$ have been controlled for, which means that $\beta_3 = 0$. The variable $x_3$ may or may not be correlated with $x_1$ or $x_2$; all that matters is that, once $x_1$ and $x_2$ are controlled for, $x_3$ has no effect on y. In terms of conditional expectations, $E(y \mid x_1, x_2, x_3) = E(y \mid x_1, x_2) = \beta_0 + \beta_1 x_1 + \beta_2 x_2$.

Because we do not know that $\beta_3 = 0$, we are inclined to estimate the equation including $x_3$:

$\hat{y} = \hat\beta_0 + \hat\beta_1 x_1 + \hat\beta_2 x_2 + \hat\beta_3 x_3.$  (3.39)

We have included the irrelevant variable, $x_3$, in our regression. What is the effect of including $x_3$ in (3.39) when its coefficient in the population model (3.38) is zero? In terms of the unbiasedness of $\hat\beta_1$ and $\hat\beta_2$, there is no effect. This conclusion requires no special derivation, as it follows immediately from Theorem 3.1. Remember, unbiasedness means $E(\hat\beta_j) = \beta_j$ for any value of $\beta_j$, including $\beta_j = 0$. Thus, we can conclude that $E(\hat\beta_0) = \beta_0$, $E(\hat\beta_1) = \beta_1$, $E(\hat\beta_2) = \beta_2$, and $E(\hat\beta_3) = 0$ (for any values of $\beta_0$, $\beta_1$, and $\beta_2$). Even though $\hat\beta_3$ itself will never be exactly zero, its average value across all random samples will be zero.

The conclusion of the preceding example is much more general: including one or more irrelevant variables in a multiple regression model, or overspecifying the model, does not affect the unbiasedness of the OLS estimators. Does this mean it is harmless to include irrelevant variables? No. As we will see in Section 3-4, including irrelevant variables can have undesirable effects on the variances of the OLS estimators.
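A quick Monte Carlo check of this claim: add an irrelevant regressor to a correctly specified model, and across repeated samples its estimated coefficient averages out to zero while the other slopes stay centered on their true values. A sketch (illustrative only):

```python
import numpy as np

rng = np.random.default_rng(7)
n, reps = 100, 2000
estimates = np.empty((reps, 4))   # intercept, b1, b2, b3

for r in range(reps):
    x1, x2 = rng.normal(size=n), rng.normal(size=n)
    x3 = 0.5 * x1 + rng.normal(size=n)              # irrelevant but correlated with x1
    y = 1 + 2 * x1 - 1 * x2 + rng.normal(size=n)    # true beta3 = 0
    X = np.column_stack([np.ones(n), x1, x2, x3])
    estimates[r] = np.linalg.solve(X.T @ X, X.T @ y)

print(estimates.mean(axis=0))    # approximately [1, 2, -1, 0]
```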
3-3b Omitted Variable Bias: The Simple Case

Now suppose that, rather than including an irrelevant variable, we omit a variable that actually belongs in the true (or population) model. This is often called the problem of excluding a relevant variable or underspecifying the model. We claimed in Chapter 2 and earlier in this chapter that this problem generally causes the OLS estimators to be biased. It is time to show this explicitly and, just as importantly, to derive the direction and size of the bias.

Deriving the bias caused by omitting an important variable is an example of misspecification analysis. We begin with the case where the true population model has two explanatory variables and an error term:
$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + u$,   (3.40)
and we assume that this model satisfies Assumptions MLR.1 through MLR.4.

Suppose that our primary interest is in $\beta_1$, the partial effect of $x_1$ on $y$. For example, $y$ is hourly wage (or log of hourly wage), $x_1$ is education, and $x_2$ is a measure of innate ability. In order to get an unbiased estimator of $\beta_1$, we should run a regression of $y$ on $x_1$ and $x_2$ (which gives unbiased estimators of $\beta_0$, $\beta_1$, and $\beta_2$). However, due to our ignorance or data unavailability, we estimate the model by excluding $x_2$. In other words, we perform a simple regression of $y$ on $x_1$ only, obtaining the equation
$\tilde y = \tilde\beta_0 + \tilde\beta_1 x_1$.   (3.41)
We use the symbol "~" rather than "^" to emphasize that $\tilde\beta_1$ comes from an underspecified model.

When first learning about the omitted variable problem, it can be difficult to distinguish between the underlying true model, (3.40) in this case, and the model that we actually estimate, which is captured by the regression in (3.41). It may seem silly to omit the variable $x_2$ if it belongs in the model, but often we have no choice. For example, suppose that wage is determined by
$wage = \beta_0 + \beta_1 educ + \beta_2 abil + u$.   (3.42)
Since ability is not observed, we instead estimate the model
$wage = \beta_0 + \beta_1 educ + v$,
where $v = \beta_2 abil + u$. The estimator of $\beta_1$ from the simple regression of $wage$ on $educ$ is what we are calling $\tilde\beta_1$.

We derive the expected value of $\tilde\beta_1$ conditional on the sample values of $x_1$ and $x_2$. Deriving this expectation is not difficult because $\tilde\beta_1$ is just the OLS slope estimator from a simple regression, and we have already studied this estimator extensively in Chapter 2. The difference here is that we must analyze its properties when the simple regression model is misspecified due to an omitted variable.

As it turns out, we have done almost all of the work to derive the bias in the simple regression estimator of $\beta_1$. From equation (3.23), we have the algebraic relationship
$\tilde\beta_1 = \hat\beta_1 + \hat\beta_2\tilde\delta_1$,
where $\hat\beta_1$ and $\hat\beta_2$ are the slope estimators (if we could have them) from the multiple regression
$y_i$ on $x_{i1}, x_{i2}$, $i = 1, \ldots, n$,   (3.43)
and $\tilde\delta_1$ is the slope from the simple regression
$x_{i2}$ on $x_{i1}$, $i = 1, \ldots, n$.   (3.44)
Because $\tilde\delta_1$ depends only on the independent variables in the sample, we treat it as fixed (nonrandom) when computing $E(\tilde\beta_1)$. Further, since the model in (3.40) satisfies Assumptions MLR.1
through MLR.4, we know that $\hat\beta_1$ and $\hat\beta_2$ would be unbiased for $\beta_1$ and $\beta_2$, respectively. Therefore,
$E(\tilde\beta_1) = E(\hat\beta_1 + \hat\beta_2\tilde\delta_1) = E(\hat\beta_1) + E(\hat\beta_2)\tilde\delta_1 = \beta_1 + \beta_2\tilde\delta_1$,   (3.45)
which implies the bias in $\tilde\beta_1$ is
$\mathrm{Bias}(\tilde\beta_1) = E(\tilde\beta_1) - \beta_1 = \beta_2\tilde\delta_1$.   (3.46)
Because the bias in this case arises from omitting the explanatory variable $x_2$, the term on the right-hand side of equation (3.46) is often called the omitted variable bias.

From equation (3.46), we see that there are two cases where $\tilde\beta_1$ is unbiased. The first is pretty obvious: if $\beta_2 = 0$ (so that $x_2$ does not appear in the true model (3.40)) then $\tilde\beta_1$ is unbiased. We already know this from the simple regression analysis in Chapter 2. The second case is more interesting. If $\tilde\delta_1 = 0$, then $\tilde\beta_1$ is unbiased for $\beta_1$, even if $\beta_2 \neq 0$.

Because $\tilde\delta_1$ is the sample covariance between $x_1$ and $x_2$ over the sample variance of $x_1$, $\tilde\delta_1 = 0$ if, and only if, $x_1$ and $x_2$ are uncorrelated in the sample. Thus, we have the important conclusion that, if $x_1$ and $x_2$ are uncorrelated in the sample, then $\tilde\beta_1$ is unbiased. This is not surprising: in Section 3-2, we showed that the simple regression estimator $\tilde\beta_1$ and the multiple regression estimator $\hat\beta_1$ are the same when $x_1$ and $x_2$ are uncorrelated in the sample. [We can also show that $\tilde\beta_1$ is unbiased without conditioning on the $x_{i2}$: if $E(x_2|x_1) = E(x_2)$, then, for estimating $\beta_1$, leaving $x_2$ in the error term does not violate the zero conditional mean assumption for the error, once we adjust the intercept.]

When $x_1$ and $x_2$ are correlated, $\tilde\delta_1$ has the same sign as the correlation between $x_1$ and $x_2$: $\tilde\delta_1 > 0$ if $x_1$ and $x_2$ are positively correlated and $\tilde\delta_1 < 0$ if $x_1$ and $x_2$ are negatively correlated. The sign of the bias in $\tilde\beta_1$ depends on the signs of both $\beta_2$ and $\tilde\delta_1$ and is summarized in Table 3.2 for the four possible cases when there is bias.

Table 3.2  Summary of Bias in $\tilde\beta_1$ When $x_2$ Is Omitted in Estimating Equation (3.40)

                  | Corr(x1, x2) > 0 | Corr(x1, x2) < 0
$\beta_2 > 0$     | Positive bias    | Negative bias
$\beta_2 < 0$     | Negative bias    | Positive bias

Table 3.2 warrants careful study. For example, the bias in $\tilde\beta_1$ is positive if $\beta_2 > 0$ ($x_2$ has a positive effect on $y$) and $x_1$ and $x_2$ are positively correlated, the bias is negative if $\beta_2 > 0$ and $x_1$ and $x_2$ are negatively correlated, and so on.

Table 3.2 summarizes the direction of the bias, but the size of the bias is also very important. A small bias of either sign need not be a cause for concern. For example, if the return to education in the population is 8.6% and the bias in the OLS estimator is 0.1% (a tenth of one percentage point), then we would not be very concerned. On the other hand, a bias on the order of three percentage points would be much more serious. The size of the bias is determined by the sizes of $\beta_2$ and $\tilde\delta_1$.

In practice, since $\beta_2$ is an unknown population parameter, we cannot be certain whether $\beta_2$ is positive or negative. Nevertheless, we usually have a pretty good idea about the direction of the partial effect of $x_2$ on $y$. Further, even though the sign of the correlation between $x_1$ and $x_2$ cannot be known if $x_2$ is not observed, in many cases, we can make an educated guess about whether $x_1$ and $x_2$ are positively or negatively correlated.
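The algebraic identity $\tilde\beta_1 = \hat\beta_1 + \hat\beta_2\tilde\delta_1$ holds exactly in any sample, not just in expectation. The sketch below (illustrative only; the data-generating values are made up) verifies it numerically by running the three regressions in (3.41), (3.43), and (3.44).

```python
# Numerical check of the identity beta1-tilde = beta1-hat + beta2-hat * delta1-tilde,
# which underlies the omitted variable bias formula (3.46). Illustrative data only.
import numpy as np

rng = np.random.default_rng(1)
n = 500
x1 = rng.normal(size=n)
x2 = 0.5 * x1 + rng.normal(size=n)           # x1 and x2 positively correlated
y = 1.0 + 0.8 * x1 + 1.2 * x2 + rng.normal(size=n)

def ols(X, y):
    """OLS coefficients with an intercept prepended."""
    X = np.column_stack([np.ones(len(y)), X])
    return np.linalg.lstsq(X, y, rcond=None)[0]

b_tilde = ols(x1, y)[1]                      # simple regression of y on x1, eq. (3.41)
b_hat = ols(np.column_stack([x1, x2]), y)    # multiple regression, eq. (3.43)
delta1 = ols(x1, x2)[1]                      # regression of x2 on x1, eq. (3.44)

print(b_tilde, b_hat[1] + b_hat[2] * delta1)  # the two numbers agree exactly
# With beta2 > 0 and delta1 > 0, Table 3.2 says b_tilde overstates beta1 = 0.8
# on average, which is what repeated draws of this data set would show.
```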
In the wage equation (3.42), by definition, more ability leads to higher productivity and therefore higher wages: $\beta_2 > 0$. Also, there are reasons to believe that $educ$ and $abil$ are positively correlated: on average, individuals with more innate ability choose higher levels of education. Thus, the OLS estimates from the simple regression equation $wage = \beta_0 + \beta_1 educ + v$ are on average too large. This does not mean that the estimate obtained from our sample is too big. We can only say that, if we collect many random samples and obtain the simple regression estimates each time, then the average of these estimates will be greater than $\beta_1$.

Example 3.6 (Hourly Wage Equation). Suppose the model
$\log(wage) = \beta_0 + \beta_1 educ + \beta_2 abil + u$
satisfies Assumptions MLR.1 through MLR.4. The data set in WAGE1 does not contain data on ability, so we estimate $\beta_1$ from the simple regression
$\widehat{\log(wage)} = .584 + .083\, educ$, $n = 526$, $R^2 = .186$.   (3.47)
This is the result from only a single sample, so we cannot say that .083 is greater than $\beta_1$; the true return to education could be lower or higher than 8.3% (and we will never know for sure). Nevertheless, we know that the average of the estimates across all random samples would be too large.

As a second example, suppose that, at the elementary school level, the average score for students on a standardized exam is determined by
$avgscore = \beta_0 + \beta_1 expend + \beta_2 povrate + u$,   (3.48)
where $expend$ is expenditure per student and $povrate$ is the poverty rate of the children in the school. Using school district data, we only have observations on the percentage of students with a passing grade and per-student expenditures; we do not have information on poverty rates. Thus, we estimate $\beta_1$ from the simple regression of $avgscore$ on $expend$.

We can again obtain the likely bias in $\tilde\beta_1$. First, $\beta_2$ is probably negative: there is ample evidence that children living in poverty score lower, on average, on standardized tests. Second, the average expenditure per student is probably negatively correlated with the poverty rate: the higher the poverty rate, the lower the average per-student spending, so that $\mathrm{Corr}(x_1, x_2) < 0$. From Table 3.2, $\tilde\beta_1$ will have a positive bias. This observation has important implications. It could be that the true effect of spending is zero; that is, $\beta_1 = 0$. However, the simple regression estimate of $\beta_1$ will usually be greater than zero, and this could lead us to conclude that expenditures are important when they are not.

When reading and performing empirical work in economics, it is important to master the terminology associated with biased estimators. In the context of omitting a variable from model (3.40), if $E(\tilde\beta_1) > \beta_1$, then we say that $\tilde\beta_1$ has an upward bias. When $E(\tilde\beta_1) < \beta_1$, $\tilde\beta_1$ has a downward bias. These definitions are the same whether $\beta_1$ is positive or negative. The phrase biased toward zero refers to cases where $E(\tilde\beta_1)$ is closer to zero than is $\beta_1$. Therefore, if $\beta_1$ is positive, then $\tilde\beta_1$ is biased toward zero if it has a downward bias. On the other hand, if $\beta_1 < 0$, then $\tilde\beta_1$ is biased toward zero if it has an upward bias.
3-3c Omitted Variable Bias: More General Cases

Deriving the sign of omitted variable bias when there are multiple regressors in the estimated model is more difficult. We must remember that correlation between a single explanatory variable and the error generally results in all OLS estimators being biased. For example, suppose the population model
$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + u$   (3.49)
satisfies Assumptions MLR.1 through MLR.4. But we omit $x_3$ and estimate the model as
$\tilde y = \tilde\beta_0 + \tilde\beta_1 x_1 + \tilde\beta_2 x_2$.   (3.50)
Now, suppose that $x_2$ and $x_3$ are uncorrelated, but that $x_1$ is correlated with $x_3$. In other words, $x_1$ is correlated with the omitted variable, but $x_2$ is not. It is tempting to think that, while $\tilde\beta_1$ is probably biased based on the derivation in the previous subsection, $\tilde\beta_2$ is unbiased because $x_2$ is uncorrelated with $x_3$. Unfortunately, this is not generally the case: both $\tilde\beta_1$ and $\tilde\beta_2$ will normally be biased. The only exception to this is when $x_1$ and $x_2$ are also uncorrelated.

Even in the fairly simple model above, it can be difficult to obtain the direction of bias in $\tilde\beta_1$ and $\tilde\beta_2$. This is because $x_1$, $x_2$, and $x_3$ can all be pairwise correlated. Nevertheless, an approximation is often practically useful. If we assume that $x_1$ and $x_2$ are uncorrelated, then we can study the bias in $\tilde\beta_1$ as if $x_2$ were absent from both the population and the estimated models. In fact, when $x_1$ and $x_2$ are uncorrelated, it can be shown that
$E(\tilde\beta_1) = \beta_1 + \beta_3 \dfrac{\sum_{i=1}^n (x_{i1} - \bar x_1)x_{i3}}{\sum_{i=1}^n (x_{i1} - \bar x_1)^2}$.
This is just like equation (3.45), but $\beta_3$ replaces $\beta_2$, and $x_3$ replaces $x_2$ in regression (3.44). Therefore, the bias in $\tilde\beta_1$ is obtained by replacing $\beta_2$ with $\beta_3$ and $x_2$ with $x_3$ in Table 3.2. If $\beta_3 > 0$ and $\mathrm{Corr}(x_1, x_3) > 0$, the bias in $\tilde\beta_1$ is positive, and so on.

As an example, suppose we add $exper$ to the wage model:
$wage = \beta_0 + \beta_1 educ + \beta_2 exper + \beta_3 abil + u$.
If $abil$ is omitted from the model, the estimators of both $\beta_1$ and $\beta_2$ are biased, even if we assume $exper$ is uncorrelated with $abil$. We are mostly interested in the return to education, so it would be nice if we could conclude that $\tilde\beta_1$ has an upward or a downward bias due to omitted ability. This conclusion is not possible without further assumptions. As an approximation, let us suppose that, in addition to $exper$ and $abil$ being uncorrelated, $educ$ and $exper$ are also uncorrelated. (In reality, they are somewhat negatively correlated.) Since $\beta_3 > 0$ and $educ$ and $abil$ are positively correlated, $\tilde\beta_1$ would have an upward bias, just as if $exper$ were not in the model.

The reasoning used in the previous example is often followed as a rough guide for obtaining the likely bias in estimators in more complicated models. Usually, the focus is on the relationship between a particular explanatory variable, say $x_1$, and the key omitted factor. Strictly speaking, ignoring all other explanatory variables is a valid practice only when each one is uncorrelated with $x_1$, but it is still a useful guide. Appendix 3A contains a more careful analysis of omitted variable bias with multiple explanatory variables.
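A quick simulation illustrates the warning above: when $x_3$ is omitted, $\tilde\beta_2$ can be biased even though $x_2$ and $x_3$ are uncorrelated, because $x_1$ is correlated with both. This is an illustrative sketch with invented correlations and parameter values, not an example from the text.

```python
# Omitted variable bias in the general case, eqs. (3.49)-(3.50): omitting x3
# biases BOTH slope estimators when x1 is correlated with x2 and x3, even
# though Corr(x2, x3) = 0 in the population. Illustrative values only.
import numpy as np

rng = np.random.default_rng(2)
n, reps = 300, 4000
# Population correlations: Corr(x1,x2) = Corr(x1,x3) = 0.5, Corr(x2,x3) = 0
cov = np.array([[1.0, 0.5, 0.5],
                [0.5, 1.0, 0.0],
                [0.5, 0.0, 1.0]])

tilde = np.empty((reps, 2))
for r in range(reps):
    x = rng.multivariate_normal(np.zeros(3), cov, size=n)
    y = x[:, 0] + x[:, 1] + x[:, 2] + rng.normal(size=n)   # beta1=beta2=beta3=1
    X = np.column_stack([np.ones(n), x[:, 0], x[:, 1]])    # x3 omitted, as in (3.50)
    tilde[r] = np.linalg.lstsq(X, y, rcond=None)[0][1:]

print("mean of (b1-tilde, b2-tilde):", tilde.mean(axis=0))
# Both means differ from (1.0, 1.0) -- roughly 1.67 and 0.67 under these values:
# b1-tilde is biased upward, and b2-tilde picks up an offsetting bias because
# x2 is correlated with x1, exactly as the text cautions.
```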
3-4 The Variance of the OLS Estimators

We now obtain the variance of the OLS estimators so that, in addition to knowing the central tendencies of the $\hat\beta_j$, we also have a measure of the spread in its sampling distribution. Before finding the variances, we add a homoskedasticity assumption, as in Chapter 2. We do this for two reasons. First, the formulas are simplified by imposing the constant error variance assumption. Second, in Section 3-5, we will see that OLS has an important efficiency property if we add the homoskedasticity assumption.

In the multiple regression framework, homoskedasticity is stated as follows:

Assumption MLR.5 (Homoskedasticity). The error $u$ has the same variance given any value of the explanatory variables. In other words, $\mathrm{Var}(u|x_1, \ldots, x_k) = \sigma^2$.

Assumption MLR.5 means that the variance in the error term, $u$, conditional on the explanatory variables, is the same for all combinations of outcomes of the explanatory variables. If this assumption fails, then the model exhibits heteroskedasticity, just as in the two-variable case.

In the equation
$wage = \beta_0 + \beta_1 educ + \beta_2 exper + \beta_3 tenure + u$,
homoskedasticity requires that the variance of the unobserved error, $u$, does not depend on the levels of education, experience, or tenure. That is, $\mathrm{Var}(u|educ, exper, tenure) = \sigma^2$. If this variance changes with any of the three explanatory variables, then heteroskedasticity is present.

Assumptions MLR.1 through MLR.5 are collectively known as the Gauss-Markov assumptions (for cross-sectional regression). So far, our statements of the assumptions are suitable only when applied to cross-sectional analysis with random sampling. As we will see, the Gauss-Markov assumptions for time series analysis, and for other situations such as panel data analysis, are more difficult to state, although there are many similarities.

In the discussion that follows, we will use the symbol $\mathbf{x}$ to denote the set of all independent variables, $(x_1, \ldots, x_k)$. Thus, in the wage regression with $educ$, $exper$, and $tenure$ as independent variables, $\mathbf{x} = (educ, exper, tenure)$. Then we can write Assumptions MLR.1 and MLR.4 as $E(y|\mathbf{x}) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k$, and Assumption MLR.5 is the same as $\mathrm{Var}(y|\mathbf{x}) = \sigma^2$. Stating the assumptions in this way clearly illustrates how Assumption MLR.5 differs greatly from Assumption MLR.4. Assumption MLR.4 says that the expected value of $y$, given $\mathbf{x}$, is linear in the parameters, but it certainly depends on $x_1, x_2, \ldots, x_k$. Assumption MLR.5 says that the variance of $y$, given $\mathbf{x}$, does not depend on the values of the independent variables.

We can now obtain the variances of the $\hat\beta_j$, where we again condition on the sample values of the independent variables. The proof is in the appendix to this chapter.

Theorem 3.2 (Sampling Variances of the OLS Slope Estimators). Under Assumptions MLR.1 through MLR.5, conditional on the sample values of the independent variables,
$\mathrm{Var}(\hat\beta_j) = \dfrac{\sigma^2}{SST_j(1 - R_j^2)}$,   (3.51)
for $j = 1, 2, \ldots, k$, where $SST_j = \sum_{i=1}^n (x_{ij} - \bar x_j)^2$ is the total sample variation in $x_j$, and $R_j^2$ is the R-squared from regressing $x_j$ on all other independent variables (and including an intercept).
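Because (3.51) conditions on the regressors, we can check it by Monte Carlo: hold one design matrix fixed, redraw only the errors with a known $\sigma^2$, and compare the empirical variance of $\hat\beta_1$ with $\sigma^2/[SST_1(1 - R_1^2)]$. The sketch below uses made-up values and is only meant to illustrate the formula.

```python
# Monte Carlo check of Theorem 3.2: with the regressors held fixed and sigma^2
# known, the sampling variance of beta1-hat matches eq. (3.51). Illustrative values.
import numpy as np

rng = np.random.default_rng(3)
n, reps, sigma = 100, 20000, 2.0
x1 = rng.normal(size=n)
x2 = 0.8 * x1 + rng.normal(size=n)           # one fixed design; x1, x2 correlated
X = np.column_stack([np.ones(n), x1, x2])

b1_draws = np.empty(reps)
for r in range(reps):
    y = 1.0 + 0.5 * x1 - 0.5 * x2 + sigma * rng.normal(size=n)
    b1_draws[r] = np.linalg.lstsq(X, y, rcond=None)[0][1]

sst1 = np.sum((x1 - x1.mean()) ** 2)
# R_1^2 from the auxiliary regression of x1 on x2 (with an intercept)
Z = np.column_stack([np.ones(n), x2])
resid = x1 - Z @ np.linalg.lstsq(Z, x1, rcond=None)[0]
r2_1 = 1.0 - resid @ resid / sst1

print("empirical Var(b1-hat):", b1_draws.var())
print("formula (3.51):       ", sigma**2 / (sst1 * (1.0 - r2_1)))
# The two numbers agree up to Monte Carlo error.
```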
The careful reader may be wondering whether there is a simple formula for the variance of $\hat\beta_j$ where we do not condition on the sample outcomes of the explanatory variables. The answer is: none that is useful. The formula in (3.51) is a highly nonlinear function of the $x_{ij}$, making averaging out across the population distribution of the explanatory variables virtually impossible. Fortunately, for any practical purpose, equation (3.51) is what we want. Even when we turn to approximate, large-sample properties of OLS in Chapter 5, it turns out that (3.51) estimates the quantity we need for large-sample analysis, provided Assumptions MLR.1 through MLR.5 hold.

Before we study equation (3.51) in more detail, it is important to know that all of the Gauss-Markov assumptions are used in obtaining this formula. Whereas we did not need the homoskedasticity assumption to conclude that OLS is unbiased, we do need it to justify equation (3.51).

The size of $\mathrm{Var}(\hat\beta_j)$ is practically important. A larger variance means a less precise estimator, and this translates into larger confidence intervals and less accurate hypotheses tests (as we will see in Chapter 4). In the next subsection, we discuss the elements comprising (3.51).

3-4a The Components of the OLS Variances: Multicollinearity

Equation (3.51) shows that the variance of $\hat\beta_j$ depends on three factors: $\sigma^2$, $SST_j$, and $R_j^2$. Remember that the index $j$ simply denotes any one of the independent variables (such as education or poverty rate). We now consider each of the factors affecting $\mathrm{Var}(\hat\beta_j)$ in turn.

The Error Variance, $\sigma^2$. From equation (3.51), a larger $\sigma^2$ means larger sampling variances for the OLS estimators. This is not at all surprising: more "noise" in the equation (a larger $\sigma^2$) makes it more difficult to estimate the partial effect of any of the independent variables on $y$, and this is reflected in higher variances for the OLS slope estimators. Because $\sigma^2$ is a feature of the population, it has nothing to do with the sample size. It is the one component of (3.51) that is unknown. We will see later how to obtain an unbiased estimator of $\sigma^2$. For a given dependent variable $y$, there is really only one way to reduce the error variance, and that is to add more explanatory variables to the equation (take some factors out of the error term). Unfortunately, it is not always possible to find additional legitimate factors that affect $y$.

The Total Sample Variation in $x_j$, $SST_j$. From equation (3.51), we see that the larger the total variation in $x_j$ is, the smaller is $\mathrm{Var}(\hat\beta_j)$. Thus, everything else being equal, for estimating $\beta_j$ we prefer to have as much sample variation in $x_j$ as possible. We already discovered this in the simple regression case in Chapter 2. Although it is rarely possible for us to choose the sample values of the independent variables, there is a way to increase the sample variation in each of the independent variables: increase the sample size. In fact, when one randomly samples from a population, $SST_j$ increases without bound as the sample size increases (roughly as a linear function of $n$). This is the component of the variance that systematically depends on the sample size.

When $SST_j$ is small, $\mathrm{Var}(\hat\beta_j)$ can get very large, but a small $SST_j$ is not a violation of Assumption MLR.3. Technically, as $SST_j$ goes to zero, $\mathrm{Var}(\hat\beta_j)$ approaches infinity. The extreme case of no sample variation in $x_j$, $SST_j = 0$, is not allowed by Assumption MLR.3.

The Linear Relationships among the Independent Variables, $R_j^2$. The term $R_j^2$ in equation (3.51) is the most difficult of the three components to understand. This term does not appear in simple regression analysis because there is only one independent variable in such cases. It is important to see that this R-squared is distinct from the R-squared in the regression of $y$ on $x_1, x_2, \ldots, x_k$:
$R_j^2$ is obtained from a regression involving only the independent variables in the original model, where $x_j$ plays the role of a dependent variable.

Consider, first, the $k = 2$ case: $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + u$. Then $\mathrm{Var}(\hat\beta_1) = \sigma^2/[SST_1(1 - R_1^2)]$, where $R_1^2$ is the R-squared from the simple regression of $x_1$ on $x_2$ (and an intercept, as always). Because the R-squared measures goodness-of-fit, a value of $R_1^2$ close to one indicates that $x_2$ explains much of the variation in $x_1$ in the sample. This means that $x_1$ and $x_2$ are highly correlated.

As $R_1^2$ increases to one, $\mathrm{Var}(\hat\beta_1)$ gets larger and larger. Thus, a high degree of linear relationship between $x_1$ and $x_2$ can lead to large variances for the OLS slope estimators. (A similar argument applies to $\hat\beta_2$.) See Figure 3.1 for the relationship between $\mathrm{Var}(\hat\beta_1)$ and the R-squared from the regression of $x_1$ on $x_2$.

[Figure 3.1: $\mathrm{Var}(\hat\beta_1)$ as a function of $R_1^2$.]

In the general case, $R_j^2$ is the proportion of the total variation in $x_j$ that can be explained by the other independent variables appearing in the equation. For a given $\sigma^2$ and $SST_j$, the smallest $\mathrm{Var}(\hat\beta_j)$ is obtained when $R_j^2 = 0$, which happens if, and only if, $x_j$ has zero sample correlation with every other independent variable. This is the best case for estimating $\beta_j$, but it is rarely encountered.

The other extreme case, $R_j^2 = 1$, is ruled out by Assumption MLR.3, because $R_j^2 = 1$ means that, in the sample, $x_j$ is a perfect linear combination of some of the other independent variables in the regression. A more relevant case is when $R_j^2$ is "close" to one. From equation (3.51) and Figure 3.1, we see that this can cause $\mathrm{Var}(\hat\beta_j)$ to be large: $\mathrm{Var}(\hat\beta_j) \to \infty$ as $R_j^2 \to 1$. High (but not perfect) correlation between two or more independent variables is called multicollinearity.

Before we discuss the multicollinearity issue further, it is important to be very clear on one thing: a case where $R_j^2$ is close to one is not a violation of Assumption MLR.3. Since multicollinearity violates none of our assumptions, the "problem" of multicollinearity is not really well defined. When we say that multicollinearity arises for estimating $\beta_j$ when $R_j^2$ is "close" to one, we put "close" in quotation marks because there is no absolute number that we can cite to conclude that multicollinearity is a problem. For example, $R_j^2 = .9$ means that 90% of the sample variation in $x_j$ can be explained by the other independent variables in the regression model. Unquestionably, this means that $x_j$ has a strong linear relationship to the other independent variables. But whether this translates into a $\mathrm{Var}(\hat\beta_j)$ that is too large to be useful depends on the sizes of $\sigma^2$ and $SST_j$. As we will see in Chapter 4, for statistical inference, what ultimately matters is how big $\hat\beta_j$ is in relation to its standard deviation.
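Since $\sigma^2$ and $SST_1$ are held fixed in this comparison, the blow-up in (3.51) is driven entirely by the factor $1/(1 - R_1^2)$. A few lines of code (illustrative only) tabulate how quickly this multiplier grows as $R_1^2$ approaches one, which is the pattern Figure 3.1 depicts.

```python
# The variance multiplier 1/(1 - R1^2) from eq. (3.51): Var(beta1-hat) is
# proportional to this factor when sigma^2 and SST_1 are held fixed.
for r2 in [0.0, 0.5, 0.8, 0.9, 0.99, 0.999]:
    print(f"R1^2 = {r2:5.3f}  ->  variance multiplier = {1.0 / (1.0 - r2):8.1f}")
# The multiplier is 1 at R1^2 = 0, doubles at 0.5, reaches 10 at 0.9,
# and 1,000 at 0.999: slow growth at first, then an explosion near one.
```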
Just as a large value of $R_j^2$ can cause a large $\mathrm{Var}(\hat\beta_j)$, so can a small value of $SST_j$. Therefore, a small sample size can lead to large sampling variances, too. Worrying about high degrees of correlation among the independent variables in the sample is really no different from worrying about a small sample size: both work to increase $\mathrm{Var}(\hat\beta_j)$. The famous University of Wisconsin econometrician Arthur Goldberger, reacting to econometricians' obsession with multicollinearity, has (tongue in cheek) coined the term micronumerosity, which he defines as the "problem of small sample size." [For an engaging discussion of multicollinearity and micronumerosity, see Goldberger (1991).]

Although the problem of multicollinearity cannot be clearly defined, one thing is clear: everything else being equal, for estimating $\beta_j$, it is better to have less correlation between $x_j$ and the other independent variables. This observation often leads to a discussion of how to "solve" the multicollinearity problem. In the social sciences, where we are usually passive collectors of data, there is no good way to reduce variances of unbiased estimators other than to collect more data. For a given data set, we can try dropping other independent variables from the model in an effort to reduce multicollinearity. Unfortunately, dropping a variable that belongs in the population model can lead to bias, as we saw in Section 3-3.

Perhaps an example at this point will help clarify some of the issues raised concerning multicollinearity. Suppose we are interested in estimating the effect of various school expenditure categories on student performance. It is likely that expenditures on teacher salaries, instructional materials, athletics, and so on are highly correlated: wealthier schools tend to spend more on everything, and poorer schools spend less on everything. Not surprisingly, it can be difficult to estimate the effect of any particular expenditure category on student performance when there is little variation in one category that cannot largely be explained by variations in the other expenditure categories (this leads to high $R_j^2$ for each of the expenditure variables). Such multicollinearity problems can be mitigated by collecting more data, but in a sense we have imposed the problem on ourselves: we are asking questions that may be too subtle for the available data to answer with any precision. We can probably do much better by changing the scope of the analysis and lumping all expenditure categories together, since we would no longer be trying to estimate the partial effect of each separate category.

Another important point is that a high degree of correlation between certain independent variables can be irrelevant as to how well we can estimate other parameters in the model. For example, consider a model with three independent variables:
$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + u$,
where $x_2$ and $x_3$ are highly correlated. Then $\mathrm{Var}(\hat\beta_2)$ and $\mathrm{Var}(\hat\beta_3)$ may be large. But the amount of correlation between $x_2$ and $x_3$ has no direct effect on $\mathrm{Var}(\hat\beta_1)$. In fact, if $x_1$ is uncorrelated with $x_2$ and $x_3$, then $R_1^2 = 0$ and $\mathrm{Var}(\hat\beta_1) = \sigma^2/SST_1$, regardless of how much correlation there is between $x_2$ and $x_3$. If $\beta_1$ is the parameter of interest, we do not really care about the amount of correlation between $x_2$ and $x_3$.

The previous observation is important because economists often include many control variables in order to isolate the
causal effect of a particular variable. For example, in looking at the relationship between loan approval rates and percentage of minorities in a neighborhood, we might include variables like average income, average housing value, measures of creditworthiness, and so on, because these factors need to be accounted for in order to draw causal conclusions about discrimination. Income, housing prices, and creditworthiness are generally highly correlated with each other. But high correlations among these controls do not make it more difficult to determine the effects of discrimination.

[Exploring Further 3.4] Suppose you postulate a model explaining final exam score in terms of class attendance. Thus, the dependent variable is final exam score, and the key explanatory variable is number of classes attended. To control for student abilities and efforts outside the classroom, you include among the explanatory variables cumulative GPA, SAT score, and measures of high school performance. Someone says, "You cannot hope to learn anything from this exercise because cumulative GPA, SAT score, and high school performance are likely to be highly collinear." What should be your response?

Some researchers find it useful to compute statistics intended to determine the severity of multicollinearity in a given application. Unfortunately, it is easy to misuse such statistics because, as we have discussed, we cannot specify how much correlation among explanatory variables is "too much." Some multicollinearity "diagnostics" are omnibus statistics in the sense that they detect a strong linear relationship among any subset of explanatory variables. For reasons that we just saw, such statistics are of questionable value because they might reveal a "problem" simply because two control variables, whose coefficients we do not care about, are highly correlated. [Probably the most common omnibus multicollinearity statistic is the so-called condition number, which is defined in terms of the full data matrix and is beyond the scope of this text. See, for example, Belsley, Kuh, and Welsh (1980).]

Somewhat more useful, but still prone to misuse, are statistics for individual coefficients. The most common of these is the variance inflation factor (VIF), which is obtained directly from equation (3.51). The VIF for slope coefficient $j$ is simply $VIF_j = 1/(1 - R_j^2)$, precisely the term in $\mathrm{Var}(\hat\beta_j)$ that is determined by correlation between $x_j$ and the other explanatory variables. We can write $\mathrm{Var}(\hat\beta_j)$ in equation (3.51) as
$\mathrm{Var}(\hat\beta_j) = \dfrac{\sigma^2}{SST_j} \cdot VIF_j$,
which shows that $VIF_j$ is the factor by which $\mathrm{Var}(\hat\beta_j)$ is higher because $x_j$ is not uncorrelated with the other explanatory variables. Because $VIF_j$ is a function of $R_j^2$ (indeed, Figure 3.1 is essentially a graph of $VIF_1$), our previous discussion can be cast entirely in terms of the VIF.
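The definition $VIF_j = 1/(1 - R_j^2)$ translates directly into code: regress each $x_j$ on the remaining regressors and invert $1 - R_j^2$. The sketch below is a generic implementation on simulated data, not a routine from the text; the function name and data-generating values are invented for illustration.

```python
# Variance inflation factors computed from the auxiliary regressions behind
# eq. (3.51): VIF_j = 1 / (1 - R_j^2). Simulated data for illustration.
import numpy as np

def vifs(X):
    """VIF for each column of X (regressors only, no constant column)."""
    n, k = X.shape
    out = np.empty(k)
    for j in range(k):
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef = np.linalg.lstsq(others, X[:, j], rcond=None)[0]
        resid = X[:, j] - others @ coef
        sst = np.sum((X[:, j] - X[:, j].mean()) ** 2)
        r2_j = 1.0 - resid @ resid / sst
        out[j] = 1.0 / (1.0 - r2_j)
    return out

rng = np.random.default_rng(4)
x1 = rng.normal(size=400)
x2 = 0.95 * x1 + 0.3 * rng.normal(size=400)   # x2 nearly collinear with x1
x3 = rng.normal(size=400)                      # x3 unrelated to the others
print(vifs(np.column_stack([x1, x2, x3])))
# x1 and x2 show large VIFs; x3's VIF is near 1, so Var(beta3-hat) is unaffected
# by the collinearity between the other two, as the loan-approval example argued.
```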
For example, if we had the choice, we would like $VIF_j$ to be smaller (other things equal). But we rarely have the choice. If we think certain explanatory variables need to be included in a regression to infer causality of $x_j$, then we are hesitant to drop them, and whether we think $VIF_j$ is "too high" cannot really affect that decision. If, say, our main interest is in the causal effect of $x_1$ on $y$, then we should ignore entirely the VIFs of other coefficients. Finally, setting a cutoff value for VIF above which we conclude multicollinearity is a "problem" is arbitrary and not especially helpful. Sometimes the value 10 is chosen: if $VIF_j$ is above 10 (equivalently, $R_j^2$ is above .9), then we conclude that multicollinearity is a "problem" for estimating $\beta_j$. But a $VIF_j$ above 10 does not mean that the standard deviation of $\hat\beta_j$ is too large to be useful, because the standard deviation also depends on $\sigma$ and $SST_j$, and the latter can be increased by increasing the sample size. Therefore, just as with looking at the size of $R_j^2$ directly, looking at the size of $VIF_j$ is of limited use, although one might want to do so out of curiosity.

3-4b Variances in Misspecified Models

The choice of whether to include a particular variable in a regression model can be made by analyzing the tradeoff between bias and variance. In Section 3-3, we derived the bias induced by leaving out a relevant variable when the true model contains two explanatory variables. We continue the analysis of this model by comparing the variances of the OLS estimators.

Write the true population model, which satisfies the Gauss-Markov assumptions, as
$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + u$.
We consider two estimators of $\beta_1$. The estimator $\hat\beta_1$ comes from the multiple regression
$\hat y = \hat\beta_0 + \hat\beta_1 x_1 + \hat\beta_2 x_2$.   (3.52)
In other words, we include $x_2$, along with $x_1$, in the regression model. The estimator $\tilde\beta_1$ is obtained by omitting $x_2$ from the model and running a simple regression of $y$ on $x_1$:
$\tilde y = \tilde\beta_0 + \tilde\beta_1 x_1$.   (3.53)
When $\beta_2 \neq 0$, equation (3.53) excludes a relevant variable from the model and, as we saw in Section 3-3, this induces a bias in $\tilde\beta_1$ (unless $x_1$ and $x_2$ are uncorrelated). On the other hand, $\hat\beta_1$ is unbiased for $\beta_1$ for any value of $\beta_2$, including $\beta_2 = 0$. It follows that, if bias is used as the only criterion, $\hat\beta_1$ is preferred to $\tilde\beta_1$.

The conclusion that $\hat\beta_1$ is always preferred to $\tilde\beta_1$ does not carry over when we bring variance into the picture. Conditioning on the values of $x_1$ and $x_2$ in the sample, we have, from (3.51),
$\mathrm{Var}(\hat\beta_1) = \sigma^2/[SST_1(1 - R_1^2)]$,   (3.54)
where $SST_1$ is the total variation in $x_1$, and $R_1^2$ is the R-squared from the regression of $x_1$ on $x_2$. Further, a simple modification of the proof in Chapter 2 for two-variable regression shows that
$\mathrm{Var}(\tilde\beta_1) = \sigma^2/SST_1$.   (3.55)
Comparing (3.55) to (3.54) shows that $\mathrm{Var}(\tilde\beta_1)$ is always smaller than $\mathrm{Var}(\hat\beta_1)$, unless $x_1$ and $x_2$ are uncorrelated in the sample, in which case the two estimators $\tilde\beta_1$ and $\hat\beta_1$ are the same. Assuming that $x_1$ and $x_2$ are not uncorrelated, we can draw the following conclusions:

1. When $\beta_2 \neq 0$, $\tilde\beta_1$ is biased, $\hat\beta_1$ is unbiased, and $\mathrm{Var}(\tilde\beta_1) < \mathrm{Var}(\hat\beta_1)$.
2. When $\beta_2 = 0$, $\tilde\beta_1$ and $\hat\beta_1$ are both unbiased, and $\mathrm{Var}(\tilde\beta_1) < \mathrm{Var}(\hat\beta_1)$.

From the second conclusion, it is clear that $\tilde\beta_1$ is preferred if $\beta_2 = 0$. Intuitively, if $x_2$ does not have a partial effect on $y$, then including it in the model can only exacerbate the multicollinearity problem, which leads to a less efficient estimator of $\beta_1$. A higher variance for the estimator of $\beta_1$ is the cost of including an irrelevant variable in a model.
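A simulation makes the second conclusion concrete: when $\beta_2 = 0$ but $x_1$ and $x_2$ are correlated, both estimators are centered on $\beta_1$, yet the simple regression estimator has the smaller spread. This is a sketch with invented parameter values, not an exercise from the text.

```python
# The bias/variance tradeoff of Section 3-4b in the beta2 = 0 case: both
# estimators of beta1 are unbiased, but Var(b1-tilde) < Var(b1-hat) because
# including the irrelevant x2 inflates the variance through 1/(1 - R1^2).
import numpy as np

rng = np.random.default_rng(5)
n, reps, beta1 = 100, 10000, 1.0
b_tilde = np.empty(reps)
b_hat = np.empty(reps)
for r in range(reps):
    x1 = rng.normal(size=n)
    x2 = 0.9 * x1 + 0.5 * rng.normal(size=n)   # correlated with x1; beta2 = 0
    y = beta1 * x1 + rng.normal(size=n)
    b_tilde[r] = np.linalg.lstsq(np.column_stack([np.ones(n), x1]), y, rcond=None)[0][1]
    b_hat[r] = np.linalg.lstsq(np.column_stack([np.ones(n), x1, x2]), y, rcond=None)[0][1]

print("means:    ", b_tilde.mean(), b_hat.mean())   # both near beta1 = 1.0
print("variances:", b_tilde.var(), b_hat.var())     # tilde variance is smaller
```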
The case where $\beta_2 \neq 0$ is more difficult. Leaving $x_2$ out of the model results in a biased estimator of $\beta_1$. Traditionally, econometricians have suggested comparing the likely size of the bias due to omitting $x_2$ with the reduction in the variance (summarized in the size of $R_1^2$) to decide whether $x_2$ should be included. However, when $\beta_2 \neq 0$, there are two favorable reasons for including $x_2$ in the model. The most important of these is that any bias in $\tilde\beta_1$ does not shrink as the sample size grows; in fact, the bias does not necessarily follow any pattern. Therefore, we can usefully think of the bias as being roughly the same for any sample size. On the other hand, $\mathrm{Var}(\tilde\beta_1)$ and $\mathrm{Var}(\hat\beta_1)$ both shrink to zero as $n$ gets large, which means that the multicollinearity induced by adding $x_2$ becomes less important as the sample size grows. In large samples, we would prefer $\hat\beta_1$.

The other reason for favoring $\hat\beta_1$ is more subtle. The variance formula in (3.55) is conditional on the values of $x_{i1}$ and $x_{i2}$ in the sample, which provides the best scenario for $\tilde\beta_1$. When $\beta_2 \neq 0$, the variance of $\tilde\beta_1$ conditional only on $x_1$ is larger than that presented in (3.55). Intuitively, when $\beta_2 \neq 0$ and $x_2$ is excluded from the model, the error variance increases because the error effectively contains part of $x_2$. But the expression in equation (3.55) ignores the increase in the error variance because it treats both regressors as nonrandom. For practical purposes, the $\sigma^2$ term in equation (3.55) increases when $x_2$ is dropped from the equation. A full discussion of the proper conditioning argument when computing the OLS variances would lead us too far astray. Suffice it to say that equation (3.55) is too generous when it comes to measuring the precision of $\tilde\beta_1$. Fortunately, statistical packages report the proper variance estimator, and so we need not worry about the subtleties in the theoretical formulas. After reading the next subsection, you might want to study Problems 14 and 15 for further insight.

3-4c Estimating $\sigma^2$: Standard Errors of the OLS Estimators

We now show how to choose an unbiased estimator of $\sigma^2$, which then allows us to obtain unbiased estimators of $\mathrm{Var}(\hat\beta_j)$.

Because $\sigma^2 = E(u^2)$, an unbiased "estimator" of $\sigma^2$ is the sample average of the squared errors: $n^{-1}\sum_{i=1}^n u_i^2$. Unfortunately, this is not a true estimator because we do not observe the $u_i$. Nevertheless, recall that the errors can be written as $u_i = y_i - \beta_0 - \beta_1 x_{i1} - \beta_2 x_{i2} - \cdots - \beta_k x_{ik}$, and so the reason we do not observe the $u_i$ is that we do not know the $\beta_j$. When we replace each $\beta_j$ with its OLS estimator, we get the OLS residuals:
$\hat u_i = y_i - \hat\beta_0 - \hat\beta_1 x_{i1} - \hat\beta_2 x_{i2} - \cdots - \hat\beta_k x_{ik}$.
It seems natural to estimate $\sigma^2$ by replacing $u_i$ with the $\hat u_i$. In the simple regression case, we saw that this leads to a biased estimator. The unbiased estimator of $\sigma^2$ in the general multiple regression case is
$\hat\sigma^2 = \left(\sum_{i=1}^n \hat u_i^2\right)\Big/(n - k - 1) = SSR/(n - k - 1)$.   (3.56)
We already encountered this estimator in the $k = 1$ case in simple regression.

The term $n - k - 1$ in (3.56) is the degrees of freedom
(df) for the general OLS problem with $n$ observations and $k$ independent variables. Since there are $k + 1$ parameters in a regression model with $k$ independent variables and an intercept, we can write
$df = n - (k + 1) =$ (number of observations) $-$ (number of estimated parameters).   (3.57)
This is the easiest way to compute the degrees of freedom in a particular application: count the number of parameters, including the intercept, and subtract this amount from the number of observations. (In the rare case that an intercept is not estimated, the number of parameters decreases by one.)

Technically, the division by $n - k - 1$ in (3.56) comes from the fact that the expected value of the sum of squared residuals is $E(SSR) = (n - k - 1)\sigma^2$. Intuitively, we can figure out why the degrees of freedom adjustment is necessary by returning to the first order conditions for the OLS estimators. These can be written $\sum_{i=1}^n \hat u_i = 0$ and $\sum_{i=1}^n x_{ij}\hat u_i = 0$, where $j = 1, 2, \ldots, k$. Thus, in obtaining the OLS estimates, $k + 1$ restrictions are imposed on the OLS residuals. This means that, given $n - (k + 1)$ of the residuals, the remaining $k + 1$ residuals are known: there are only $n - (k + 1)$ degrees of freedom in the residuals. (This can be contrasted with the errors $u_i$, which have $n$ degrees of freedom in the sample.)

For reference, we summarize this discussion with Theorem 3.3. We proved this theorem for the case of simple regression analysis in Chapter 2 (see Theorem 2.3). (A general proof that requires matrix algebra is provided in Appendix E.)

Theorem 3.3 (Unbiased Estimation of $\sigma^2$). Under the Gauss-Markov assumptions MLR.1 through MLR.5, $E(\hat\sigma^2) = \sigma^2$.

The positive square root of $\hat\sigma^2$, denoted $\hat\sigma$, is called the standard error of the regression (SER). The SER is an estimator of the standard deviation of the error term. This estimate is usually reported by regression packages, although it is called different things by different packages. (In addition to SER, $\hat\sigma$ is also called the standard error of the estimate and the root mean squared error.)

Note that $\hat\sigma$ can either decrease or increase when another independent variable is added to a regression (for a given sample). This is because, although SSR must fall when another explanatory variable is added, the degrees of freedom also falls by one. Because SSR is in the numerator and df is in the denominator, we cannot tell beforehand which effect will dominate.

For constructing confidence intervals and conducting tests in Chapter 4, we will need to estimate the standard deviation of $\hat\beta_j$, which is just the square root of the variance:
$\mathrm{sd}(\hat\beta_j) = \sigma/[SST_j(1 - R_j^2)]^{1/2}$.
Since $\sigma$ is unknown, we replace it with its estimator, $\hat\sigma$. This gives us the standard error of $\hat\beta_j$:
$\mathrm{se}(\hat\beta_j) = \hat\sigma/[SST_j(1 - R_j^2)]^{1/2}$.   (3.58)
Just as the OLS estimates can be obtained for any given sample, so can the standard errors. Since $\mathrm{se}(\hat\beta_j)$ depends on $\hat\sigma$, the standard error has a sampling distribution, which will play a role in Chapter 4.
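Equations (3.56) through (3.58) can be computed directly from the residuals of a single regression. The sketch below builds $\hat\sigma^2$ and $\mathrm{se}(\hat\beta_1)$ by hand on simulated data and, assuming the statsmodels package is available, checks them against the standard errors that package reports by default.

```python
# Building sigma-hat^2 = SSR/(n-k-1), eq. (3.56), and se(beta1-hat), eq. (3.58),
# by hand, then checking against statsmodels. Simulated data for illustration.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(6)
n, k = 200, 2
x1 = rng.normal(size=n)
x2 = 0.6 * x1 + rng.normal(size=n)
y = 1.0 + 0.7 * x1 - 0.4 * x2 + rng.normal(size=n)

X = sm.add_constant(np.column_stack([x1, x2]))
fit = sm.OLS(y, X).fit()

sigma2_hat = fit.ssr / (n - k - 1)           # eq. (3.56); df = n - (k+1)
sst1 = np.sum((x1 - x1.mean()) ** 2)
aux = sm.OLS(x1, sm.add_constant(x2)).fit()  # auxiliary regression gives R_1^2
se_b1 = np.sqrt(sigma2_hat / (sst1 * (1.0 - aux.rsquared)))

print(se_b1, fit.bse[1])   # identical: (3.58) is the default reported se
```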
We should emphasize one thing about standard errors. Because (3.58) is obtained directly from the variance formula in (3.51), and because (3.51) relies on the homoskedasticity Assumption MLR.5, it follows that the standard error formula in (3.58) is not a valid estimator of $\mathrm{sd}(\hat\beta_j)$ if the errors exhibit heteroskedasticity. Thus, while the presence of heteroskedasticity does not cause bias in the $\hat\beta_j$, it does lead to bias in the usual formula for $\mathrm{Var}(\hat\beta_j)$, which then invalidates the standard errors. This is important because any regression package computes (3.58) as the default standard error for each coefficient (with a somewhat different representation for the intercept). If we suspect heteroskedasticity, then the "usual" OLS standard errors are invalid, and some corrective action should be taken. We will see in Chapter 8 what methods are available for dealing with heteroskedasticity.

For some purposes, it is helpful to write
$\mathrm{se}(\hat\beta_j) = \dfrac{\hat\sigma}{\sqrt{n}\,\mathrm{sd}(x_j)\sqrt{1 - R_j^2}}$,   (3.59)
in which we take $\mathrm{sd}(x_j) = \sqrt{n^{-1}\sum_{i=1}^n (x_{ij} - \bar x_j)^2}$ to be the sample standard deviation where the total sum of squares is divided by $n$ (rather than $n - 1$). The importance of equation (3.59) is that it shows how the sample size, $n$, directly affects the standard errors. The other three terms in the formula ($\hat\sigma$, $\mathrm{sd}(x_j)$, and $R_j^2$) will change with different samples, but as $n$ gets large they settle down to constants. Therefore, we can see from equation (3.59) that the standard errors shrink to zero at the rate $1/\sqrt{n}$. This formula demonstrates the value of getting more data: the precision of the $\hat\beta_j$ increases as $n$ increases. (By contrast, recall that unbiasedness holds for any sample size subject to being able to compute the estimators.) We will talk more about large sample properties of OLS in Chapter 5.

3-5 Efficiency of OLS: The Gauss-Markov Theorem

In this section, we state and discuss the important Gauss-Markov Theorem, which justifies the use of the OLS method rather than using a variety of competing estimators. We know one justification for OLS already: under Assumptions MLR.1 through MLR.4, OLS is unbiased. However, there are many unbiased estimators of the $\beta_j$ under these assumptions (for example, see Problem 13). Might there be other unbiased estimators with variances smaller than the OLS estimators?

If we limit the class of competing estimators appropriately, then we can show that OLS is best within this class. Specifically, we will argue that, under Assumptions MLR.1 through MLR.5, the OLS estimator $\hat\beta_j$ for $\beta_j$ is the best linear unbiased estimator (BLUE). To state the theorem, we need to understand each component of the acronym "BLUE." First, we know what an estimator is: it is a rule that can be applied to any sample of data to produce an estimate. We also know what an unbiased estimator is: in the current context, an estimator, say $\tilde\beta_j$, of $\beta_j$ is an unbiased estimator of $\beta_j$ if $E(\tilde\beta_j) = \beta_j$ for any $\beta_0, \beta_1, \ldots, \beta_k$.

What about the meaning of the term "linear"? In the current context, an estimator $\tilde\beta_j$ of $\beta_j$ is linear if, and only if, it can be expressed as a linear function of the data on the dependent variable:
$\tilde\beta_j = \sum_{i=1}^n w_{ij} y_i$,   (3.60)
where each $w_{ij}$ can be a function of the sample values of all the independent variables. The OLS estimators are linear, as can be seen from equation (3.22).

Finally, how do we define "best"? For the current theorem, best is defined as having the smallest variance. Given two unbiased estimators, it is logical to prefer the one with the smallest variance (see Appendix C).
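Equation (3.60) can be verified numerically. By the partialling-out representation of OLS, $\hat\beta_1 = \sum_i \hat r_{i1} y_i / \sum_i \hat r_{i1}^2$, where the $\hat r_{i1}$ are the residuals from regressing $x_1$ on the other regressors, so the weights $w_{i1} = \hat r_{i1}/\sum_i \hat r_{i1}^2$ depend only on the independent variables. A sketch on simulated data (invented values, for illustration only):

```python
# OLS slope estimators are linear in y, eq. (3.60): beta1-hat = sum_i w_i1 * y_i,
# with weights built from the regressors alone via partialling-out residuals.
import numpy as np

rng = np.random.default_rng(7)
n = 150
x1 = rng.normal(size=n)
x2 = 0.5 * x1 + rng.normal(size=n)
y = 2.0 + 1.5 * x1 - 1.0 * x2 + rng.normal(size=n)

# Residuals r1 from regressing x1 on a constant and x2
Z = np.column_stack([np.ones(n), x2])
r1 = x1 - Z @ np.linalg.lstsq(Z, x1, rcond=None)[0]
w = r1 / (r1 @ r1)          # weights w_i1: functions of the x's only

X = np.column_stack([np.ones(n), x1, x2])
b1_ols = np.linalg.lstsq(X, y, rcond=None)[0][1]
print(w @ y, b1_ols)         # identical: OLS is a linear estimator
```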
Now, let $\hat\beta_0, \hat\beta_1, \ldots, \hat\beta_k$ denote the OLS estimators in model (3.31) under Assumptions MLR.1 through MLR.5. The Gauss-Markov Theorem says that, for any estimator $\tilde\beta_j$ that is linear and unbiased, $\mathrm{Var}(\hat\beta_j) \le \mathrm{Var}(\tilde\beta_j)$, and the inequality is usually strict. In other words, in the class of linear unbiased estimators, OLS has the smallest variance (under the five Gauss-Markov assumptions). Actually, the theorem says more than this. If we want to estimate any linear function of the $\beta_j$, then the corresponding linear combination of the OLS estimators achieves the smallest variance among all linear unbiased estimators. We conclude with a theorem, which is proven in Appendix 3A.

Theorem 3.4 (Gauss-Markov Theorem). Under Assumptions MLR.1 through MLR.5, $\hat\beta_0, \hat\beta_1, \ldots, \hat\beta_k$ are the best linear unbiased estimators (BLUEs) of $\beta_0, \beta_1, \ldots, \beta_k$, respectively.

It is because of this theorem that Assumptions MLR.1 through MLR.5 are known as the Gauss-Markov assumptions (for cross-sectional analysis).

The importance of the Gauss-Markov Theorem is that, when the standard set of assumptions holds, we need not look for alternative unbiased estimators of the form in (3.60): none will be better than OLS. Equivalently, if we are presented with an estimator that is both linear and unbiased, then we know that the variance of this estimator is at least as large as the OLS variance; no additional calculation is needed to show this.

For our purposes, Theorem 3.4 justifies the use of OLS to estimate multiple regression models. If any of the Gauss-Markov assumptions fail, then this theorem no longer holds. We already know that failure of the zero conditional mean assumption (Assumption MLR.4) causes OLS to be biased, so Theorem 3.4 also fails. We also know that heteroskedasticity (failure of Assumption MLR.5) does not cause OLS to be biased. However, OLS no longer has the smallest variance among linear unbiased estimators in the presence of heteroskedasticity. In Chapter 8, we analyze an estimator that improves upon OLS when we know the brand of heteroskedasticity.

3-6 Some Comments on the Language of Multiple Regression Analysis

It is common for beginners, and not unheard of for experienced empirical researchers, to report that they "estimated an OLS model." While we can usually figure out what someone means by this statement, it is important to understand that it is wrong, on more than just an aesthetic level, and reflects a misunderstanding about the components of a multiple regression analysis.

The first thing to remember is that ordinary least squares (OLS) is an estimation method, not a model. A model describes an underlying population and depends on unknown parameters. The linear model that we have been studying in this chapter can be written, in the population, as
$y = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k + u$,   (3.61)
where the parameters are the $\beta_j$. Importantly, we can talk about the meaning of the $\beta_j$ without ever looking at data. It is true we cannot hope to learn much about the $\beta_j$ without data, but the interpretation of the $\beta_j$ is obtained from the linear model in equation (3.61).

Once we have a sample of data, we can estimate the parameters. While it is true that we have so far only discussed OLS as a possibility, there are actually many more ways to use the data than we can even list. We have focused on OLS due to its widespread use, which is justified by using the statistical considerations we covered previously in this chapter. But the various justifications for OLS rely on the assumptions we have made (MLR.1
through MLR.5). As we will see in later chapters, under different assumptions, different estimation methods are preferred, even though our model can still be represented by equation (3.61). Just a few examples include weighted least squares in Chapter 8, least absolute deviations in Chapter 9, and instrumental variables in Chapter 15.

One might argue that the discussion here is overly pedantic, and that the phrase "estimating an OLS model" should be taken as a useful shorthand for "I estimated a linear model by OLS." This stance has some merit, but we must remember that we have studied the properties of the OLS estimators under different assumptions. For example, we know OLS is unbiased under the first four Gauss-Markov assumptions, but it has no special efficiency properties without Assumption MLR.5. We have also seen, through the study of the omitted variables problem, that OLS is biased if we do not have Assumption MLR.4. The problem with using imprecise language is that it leads to vagueness on the most important considerations: what assumptions are being made on the underlying linear model? The issue of the assumptions we are using is conceptually different from the estimator we wind up applying.

Ideally, one writes down an equation like (3.61), with variable names that are easy to decipher, such as
$math4 = \beta_0 + \beta_1 classize4 + \beta_2 math3 + \beta_3 \log(income) + \beta_4 motheduc + \beta_5 fatheduc + u$,   (3.62)
if we are trying to explain outcomes on a fourth-grade math test. Then, in the context of equation (3.62), one includes a discussion of whether it is reasonable to maintain Assumption MLR.4, focusing on the factors that might still be in $u$ and whether more complicated functional relationships are needed (a topic we study in detail in Chapter 6).

Next, one describes the data source (which ideally is obtained via random sampling), as well as the OLS estimates obtained from the sample. A proper way to introduce a discussion of the estimates is to say, "I estimated equation (3.62) by ordinary least squares. Under the assumption that no important variables have been omitted from the equation, and assuming random sampling, the OLS estimator of the class size effect, $\beta_1$, is unbiased. If the error term $u$ has constant variance, the OLS estimator is actually best linear unbiased." As we will see in Chapters 4 and 5, we can often say even more about OLS. Of course, one might want to admit that, while controlling for third-grade math score, family income, and parents' education might account for important differences across students, it might not be enough (for example, $u$ can include motivation of the student or parents), in which case OLS might be biased.

A more subtle reason for being careful in distinguishing between an underlying population model and an estimation method used to estimate a model is that estimation methods such as OLS can be used essentially as an exercise in curve fitting or prediction, without explicitly worrying about an underlying model and the usual statistical properties of unbiasedness and efficiency. For example, we might just want to use OLS to estimate a line that allows us to
predict future college GPA for a set of high school students with given characteristics.

Summary

1. The multiple regression model allows us to effectively hold other factors fixed while examining the effects of a particular independent variable on the dependent variable. It explicitly allows the independent variables to be correlated.
2. Although the model is linear in its parameters, it can be used to model nonlinear relationships by appropriately choosing the dependent and independent variables.
3. The method of ordinary least squares is easily applied to estimate the multiple regression model. Each slope estimate measures the partial effect of the corresponding independent variable on the dependent variable, holding all other independent variables fixed.
4. $R^2$ is the proportion of the sample variation in the dependent variable explained by the independent variables, and it serves as a goodness-of-fit measure. It is important not to put too much weight on the value of $R^2$ when evaluating econometric models.
5. Under the first four Gauss-Markov assumptions (MLR.1 through MLR.4), the OLS estimators are unbiased. This implies that including an irrelevant variable in a model has no effect on the unbiasedness of the intercept and other slope estimators. On the other hand, omitting a relevant variable causes OLS to be biased. In many circumstances, the direction of the bias can be determined.
6. Under the five Gauss-Markov assumptions, the variance of an OLS slope estimator is given by $\mathrm{Var}(\hat\beta_j) = \sigma^2/[SST_j(1 - R_j^2)]$. As the error variance $\sigma^2$ increases, so does $\mathrm{Var}(\hat\beta_j)$, while $\mathrm{Var}(\hat\beta_j)$ decreases as the sample variation in $x_j$, $SST_j$, increases. The term $R_j^2$ measures the amount of collinearity between $x_j$ and the other explanatory variables. As $R_j^2$ approaches one, $\mathrm{Var}(\hat\beta_j)$ is unbounded.
7. Adding an irrelevant variable to an equation generally increases the variances of the remaining OLS estimators because of multicollinearity.
8. Under the Gauss-Markov assumptions (MLR.1 through MLR.5), the OLS estimators are the best linear unbiased estimators (BLUEs).
9. Beginning in Chapter 4, we will use the standard errors of the OLS coefficients to compute confidence intervals for the population parameters and to obtain test statistics for testing hypotheses about the population parameters. Therefore, in reporting regression results, we now include the standard errors along with the associated OLS estimates. In equation form, standard errors are usually put in parentheses below the OLS estimates, and the same convention is often used in tables of OLS output.

The Gauss-Markov Assumptions

The following is a summary of the five Gauss-Markov assumptions that we used in this chapter. Remember, the first four were used to establish unbiasedness of OLS, whereas the fifth was added to derive the usual variance formulas and to conclude that OLS is best linear unbiased.

Assumption MLR.1 (Linear in Parameters). The model in the population can be written as
$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k + u$,
where $\beta_0, \beta_1, \ldots, \beta_k$ are the unknown parameters (constants) of interest and $u$ is an unobserved random
er ror or disturbance term assumption MLr2 random sampling We have a random sample of n observations 5 1xi1 xi2 p xik yi2 i 5 1 2 p n6 following the population model in Assumption MLR1 assumption MLr3 no perfect Collinearity In the sample and therefore in the population none of the independent variables is constant and there are no exact linear relationships among the independent variables assumption MLr4 Zero Conditional Mean The error u has an expected value of zero given any values of the independent variables In other words E1u0x1 x2 p xk2 5 0 assumption MLr5 homoskedasticity The error u has the same variance given any value of the explanatory variables In other words Var1u0x1 p xk2 5 s2 Copyright 2016 Cengage Learning All Rights Reserved May not be copied scanned or duplicated in whole or in part Due to electronic rights some third party content may be suppressed from the eBook andor eChapters Editorial review has deemed that any suppressed content does not materially affect the overall learning experience Cengage Learning reserves the right to remove additional content at any time if subsequent rights restrictions require it CHAPTER 3 Multiple Regression Analysis Estimation 93 Key Terms Best Linear Unbiased Estimator BLUE Biased Toward Zero Ceteris Paribus Degrees of Freedom df Disturbance Downward Bias Endogenous Explanatory Variable Error Term Excluding a Relevant Variable Exogenous Explanatory Variable Explained Sum of Squares SSE First Order Conditions FrischWaugh Theorem GaussMarkov Assumptions GaussMarkov Theorem Inclusion of an Irrelevant Variable Intercept Micronumerosity Misspecification Analysis Multicollinearity Multiple Linear Regression MLR Model Multiple Regression Analysis OLS Intercept Estimate OLS Regression Line OLS Slope Estimate Omitted Variable Bias Ordinary Least Squares Overspecifying the Model Partial Effect Perfect Collinearity Population Model Residual Residual Sum of Squares Sample Regression Function SRF Slope Parameter Standard Deviation of b j Standard Error of b j Standard Error of the Regression SER Sum of Squared Residuals SSR Total Sum of Squares SST True Model Underspecifying the Model Upward Bias Variance Inflation Factor VIF Problems 1 Using the data in GPA2 on 4137 college students the following equation was estimated by OLS colgpa 5 1392 2 0135 hsperc 1 00148 sat n 5 4137 R2 5 273 where colgpa is measured on a fourpoint scale hsperc is the percentile in the high school graduating class defined so that for example hsperc 5 means the top 5 of the class and sat is the combined math and verbal scores on the student achievement test i Why does it make sense for the coefficient on hsperc to be negative ii What is the predicted college GPA when hsperc 20 and sat 1050 iii Suppose that two high school graduates A and B graduated in the same percentile from high school but Student As SAT score was 140 points higher about one standard deviation in the sample What is the predicted difference in college GPA for these two students Is the differ ence large iv Holding hsperc fixed what difference in SAT scores leads to a predicted colgpa difference of 50 or onehalf of a grade point Comment on your answer 2 The data in WAGE2 on working men was used to estimate the following equation educ 5 1036 2 094 sibs 1 131 meduc 1 210 feduc n 5 722 R2 5 214 where educ is years of schooling sibs is number of siblings meduc is mothers years of schooling and feduc is fathers years of schooling i Does sibs have the expected effect Explain Holding meduc and feduc fixed by how 
3. The following model is a simplified version of the multiple regression model used by Biddle and Hamermesh (1990) to study the tradeoff between time spent sleeping and working and to look at other factors affecting sleep:
$$sleep = \beta_0 + \beta_1 totwrk + \beta_2 educ + \beta_3 age + u,$$
where sleep and totwrk (total work) are measured in minutes per week and educ and age are measured in years. (See also Computer Exercise C3 in Chapter 2.)
(i) If adults trade off sleep for work, what is the sign of $\beta_1$?
(ii) What signs do you think $\beta_2$ and $\beta_3$ will have?
(iii) Using the data in SLEEP75, the estimated equation is
$$\widehat{sleep} = 3{,}638.25 - .148\,totwrk - 11.13\,educ + 2.20\,age$$
$$n = 706,\ R^2 = .113.$$
If someone works five more hours per week, by how many minutes is sleep predicted to fall? Is this a large tradeoff?
(iv) Discuss the sign and magnitude of the estimated coefficient on educ.
(v) Would you say totwrk, educ, and age explain much of the variation in sleep? What other factors might affect the time spent sleeping? Are these likely to be correlated with totwrk?

4. The median starting salary for new law school graduates is determined by
$$\log(salary) = \beta_0 + \beta_1 LSAT + \beta_2 GPA + \beta_3 \log(libvol) + \beta_4 \log(cost) + \beta_5 rank + u,$$
where LSAT is the median LSAT score for the graduating class, GPA is the median college GPA for the class, libvol is the number of volumes in the law school library, cost is the annual cost of attending law school, and rank is a law school ranking (with rank = 1 being the best).
(i) Explain why we expect $\beta_5 \le 0$.
(ii) What signs do you expect for the other slope parameters? Justify your answers.
(iii) Using the data in LAWSCH85, the estimated equation is
$$\widehat{\log(salary)} = 8.34 + .0047\,LSAT + .248\,GPA + .095\,\log(libvol) + .038\,\log(cost) - .0033\,rank$$
$$n = 136,\ R^2 = .842.$$
What is the predicted ceteris paribus difference in salary for schools with a median GPA different by one point? (Report your answer as a percentage.)
(iv) Interpret the coefficient on the variable log(libvol).
(v) Would you say it is better to attend a higher ranked law school? How much is a difference in ranking of 20 worth in terms of predicted starting salary?

5. In a study relating college grade point average to time spent in various activities, you distribute a survey to several students. The students are asked how many hours they spend each week in four activities: studying, sleeping, working, and leisure. Any activity is put into one of the four categories, so that for each student, the sum of hours in the four activities must be 168.
(i) In the model
$$GPA = \beta_0 + \beta_1 study + \beta_2 sleep + \beta_3 work + \beta_4 leisure + u,$$
does it make sense to hold sleep, work, and leisure fixed while changing study?
(ii) Explain why this model violates Assumption MLR.3.
(iii) How could you reformulate the model so that its parameters have a useful interpretation and it satisfies Assumption MLR.3?

6. Consider the multiple regression model containing three independent variables, under Assumptions MLR.1 through MLR.4:
$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + u.$$
You are interested in estimating the sum of the parameters on $x_1$ and $x_2$; call this $\theta_1 = \beta_1 + \beta_2$.
(i) Show that $\hat\theta_1 = \hat\beta_1 + \hat\beta_2$ is an unbiased estimator of $\theta_1$.
(ii) Find $\mathrm{Var}(\hat\theta_1)$ in terms of $\mathrm{Var}(\hat\beta_1)$, $\mathrm{Var}(\hat\beta_2)$, and $\mathrm{Corr}(\hat\beta_1, \hat\beta_2)$. (A numerical sketch of this calculation appears after Problem 10.)

7. Which of the following can cause OLS estimators to be biased?
(i) Heteroskedasticity.
(ii) Omitting an important variable.
(iii) A sample correlation coefficient of .95 between two independent variables both included in the model.

8. Suppose that average worker productivity at manufacturing firms (avgprod) depends on two factors, average hours of training (avgtrain) and average worker ability (avgabil):
$$avgprod = \beta_0 + \beta_1 avgtrain + \beta_2 avgabil + u.$$
Assume that this equation satisfies the Gauss-Markov assumptions. If grants have been given to firms whose workers have less than average ability, so that avgtrain and avgabil are negatively correlated, what is the likely bias in $\tilde\beta_1$ obtained from the simple regression of avgprod on avgtrain?

9. The following equation describes the median housing price in a community in terms of amount of pollution (nox, for nitrous oxide) and the average number of rooms in houses in the community (rooms):
$$\log(price) = \beta_0 + \beta_1 \log(nox) + \beta_2 rooms + u.$$
(i) What are the probable signs of $\beta_1$ and $\beta_2$? What is the interpretation of $\beta_1$? Explain.
(ii) Why might nox [or, more precisely, log(nox)] and rooms be negatively correlated? If this is the case, does the simple regression of log(price) on log(nox) produce an upward or a downward biased estimator of $\beta_1$?
(iii) Using the data in HPRICE2, the following equations were estimated:
$$\widehat{\log(price)} = 11.71 - 1.043\,\log(nox),\quad n = 506,\ R^2 = .264$$
$$\widehat{\log(price)} = 9.23 - .718\,\log(nox) + .306\,rooms,\quad n = 506,\ R^2 = .514.$$
Is the relationship between the simple and multiple regression estimates of the elasticity of price with respect to nox what you would have predicted, given your answer in part (ii)? Does this mean that −.718 is definitely closer to the true elasticity than −1.043?

10. Suppose that you are interested in estimating the ceteris paribus relationship between y and x1. For this purpose, you can collect data on two control variables, x2 and x3. (For concreteness, you might think of y as final exam score, x1 as class attendance, x2 as GPA up through the previous semester, and x3 as SAT or ACT score.) Let $\tilde\beta_1$ be the simple regression estimate from y on x1 and let $\hat\beta_1$ be the multiple regression estimate from y on x1, x2, x3.
(i) If x1 is highly correlated with x2 and x3 in the sample, and x2 and x3 have large partial effects on y, would you expect $\tilde\beta_1$ and $\hat\beta_1$ to be similar or very different? Explain.
(ii) If x1 is almost uncorrelated with x2 and x3, but x2 and x3 are highly correlated, will $\tilde\beta_1$ and $\hat\beta_1$ tend to be similar or very different? Explain.
(iii) If x1 is highly correlated with x2 and x3, and x2 and x3 have small partial effects on y, would you expect $\mathrm{se}(\tilde\beta_1)$ or $\mathrm{se}(\hat\beta_1)$ to be smaller? Explain.
(iv) If x1 is almost uncorrelated with x2 and x3, x2 and x3 have large partial effects on y, and x2 and x3 are highly correlated, would you expect $\mathrm{se}(\tilde\beta_1)$ or $\mathrm{se}(\hat\beta_1)$ to be smaller? Explain.
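The following sketch, referenced in Problem 6(ii), is not from the text; it illustrates on synthetic data that the variance of a sum of OLS coefficients involves their covariance, not just their variances. It assumes statsmodels is available; all names and numbers are made up.

```python
import numpy as np
import statsmodels.api as sm

# Var(theta1_hat) = Var(b1) + Var(b2) + 2*Cov(b1, b2); the covariance
# comes from the estimated covariance matrix of the OLS estimators.
rng = np.random.default_rng(1)
n = 500
x = rng.normal(size=(n, 3))
y = 1 + x @ np.array([0.4, 0.8, -0.2]) + rng.normal(size=n)

res = sm.OLS(y, sm.add_constant(x)).fit()
V = res.cov_params()                       # estimated Var(beta_hat | X)
theta1_hat = res.params[1] + res.params[2]
var_theta1 = V[1, 1] + V[2, 2] + 2 * V[1, 2]
print(theta1_hat, np.sqrt(var_theta1))     # estimate and its standard error
```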
11. Suppose that the population model determining y is
$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + u,$$
and this model satisfies Assumptions MLR.1 through MLR.4. However, we estimate the model that omits $x_3$. Let $\tilde\beta_0$, $\tilde\beta_1$, and $\tilde\beta_2$ be the OLS estimators from the regression of y on $x_1$ and $x_2$. Show that the expected value of $\tilde\beta_1$ (given the values of the independent variables in the sample) is
$$E(\tilde\beta_1) = \beta_1 + \beta_3 \frac{\sum_{i=1}^n \hat r_{i1} x_{i3}}{\sum_{i=1}^n \hat r_{i1}^2},$$
where the $\hat r_{i1}$ are the OLS residuals from the regression of $x_1$ on $x_2$. [Hint: The formula for $\tilde\beta_1$ comes from equation (3.22). Plug $y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_3 x_{i3} + u_i$ into this equation. After some algebra, take the expectation treating $x_{i3}$ and $\hat r_{i1}$ as nonrandom.]

12. The following equation represents the effects of tax revenue mix on subsequent employment growth for the population of counties in the United States:
$$growth = \beta_0 + \beta_1 share_P + \beta_2 share_I + \beta_3 share_S + \text{other factors},$$
where growth is the percentage change in employment from 1980 to 1990, shareP is the share of property taxes in total tax revenue, shareI is the share of income tax revenues, and shareS is the share of sales tax revenues. All of these variables are measured in 1980. The omitted share, shareF, includes fees and miscellaneous taxes. By definition, the four shares add up to one. Other factors would include expenditures on education, infrastructure, and so on (all measured in 1980).
(i) Why must we omit one of the tax share variables from the equation?
(ii) Give a careful interpretation of $\beta_1$.

13. (i) Consider the simple regression model $y = \beta_0 + \beta_1 x + u$ under the first four Gauss-Markov assumptions. For some function $g(x)$, for example $g(x) = x^2$ or $g(x) = \log(1 + x^2)$, define $z_i = g(x_i)$. Define a slope estimator as
$$\tilde\beta_1 = \left(\sum_{i=1}^n (z_i - \bar z) y_i\right) \bigg/ \left(\sum_{i=1}^n (z_i - \bar z) x_i\right).$$
Show that $\tilde\beta_1$ is linear and unbiased. [Remember, because $E(u \mid x) = 0$, you can treat both $x_i$ and $z_i$ as nonrandom in your derivation.]
(ii) Add the homoskedasticity assumption, MLR.5. Show that
$$\mathrm{Var}(\tilde\beta_1) = \sigma^2 \left(\sum_{i=1}^n (z_i - \bar z)^2\right) \bigg/ \left(\sum_{i=1}^n (z_i - \bar z) x_i\right)^2.$$
(iii) Show directly that, under the Gauss-Markov assumptions, $\mathrm{Var}(\hat\beta_1) \le \mathrm{Var}(\tilde\beta_1)$, where $\hat\beta_1$ is the OLS estimator. [Hint: The Cauchy-Schwartz inequality in Appendix B implies that
$$\left(n^{-1}\sum_{i=1}^n (z_i - \bar z)(x_i - \bar x)\right)^2 \le \left(n^{-1}\sum_{i=1}^n (z_i - \bar z)^2\right)\left(n^{-1}\sum_{i=1}^n (x_i - \bar x)^2\right);$$
notice that we can drop $\bar x$ from the sample covariance.]
14. Suppose you have a sample of size n on three variables, y, x1, and x2, and you are primarily interested in the effect of x1 on y. Let $\tilde\beta_1$ be the coefficient on x1 from the simple regression and $\hat\beta_1$ the coefficient on x1 from the regression y on x1, x2. The standard errors reported by any regression package are
$$\mathrm{se}(\tilde\beta_1) = \frac{\tilde\sigma}{\sqrt{\mathrm{SST}_1}}$$
$$\mathrm{se}(\hat\beta_1) = \frac{\hat\sigma}{\sqrt{\mathrm{SST}_1}}\,\sqrt{\mathrm{VIF}_1},$$
where $\tilde\sigma$ is the SER from the simple regression, $\hat\sigma$ is the SER from the multiple regression, $\mathrm{VIF}_1 = 1/(1 - R_1^2)$, and $R_1^2$ is the R-squared from the regression of x1 on x2. Explain why $\mathrm{se}(\hat\beta_1)$ can be smaller or larger than $\mathrm{se}(\tilde\beta_1)$.

15. The following estimated equations use the data in MLB1, which contains information on major league baseball salaries. The dependent variable, lsalary, is the log of salary. The two explanatory variables are years in the major leagues (years) and runs batted in per year (rbisyr):
$$\widehat{lsalary} = 12.373 + .1770\,years$$
$$(.098)\quad\ (.0132)$$
$$n = 353,\ \mathrm{SSR} = 326.196,\ \mathrm{SER} = .964,\ R^2 = .337$$
$$\widehat{lsalary} = 11.861 + .0904\,years + .0302\,rbisyr$$
$$(.084)\quad\ (.0118)\quad\ (.0020)$$
$$n = 353,\ \mathrm{SSR} = 198.475,\ \mathrm{SER} = .753,\ R^2 = .597.$$
(i) How many degrees of freedom are in each regression? How come the SER is smaller in the second regression than the first?
(ii) The sample correlation coefficient between years and rbisyr is about .487. Does this make sense? What is the variance inflation factor (there is only one) for the slope coefficients in the multiple regression? Would you say there is little, moderate, or strong collinearity between years and rbisyr?
(iii) How come the standard error for the coefficient on years in the multiple regression is lower than its counterpart in the simple regression?

16. The following equations were estimated using the data in LAWSCH85:
$$\widehat{lsalary} = 9.90 - .0041\,rank + .294\,GPA$$
$$(.24)\quad (.0003)\quad (.069)$$
$$n = 142,\ R^2 = .8238$$
$$\widehat{lsalary} = 9.86 - .0038\,rank + .295\,GPA + .00017\,age$$
$$(.29)\quad (.0004)\quad (.083)\quad (.00036)$$
$$n = 99,\ R^2 = .8036.$$
How can it be that the R-squared is smaller when the variable age is added to the equation?
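As a quick arithmetic check on Problem 15(ii), and not part of the original text, the following sketch uses only the correlation reported above; with a single other regressor, $R_1^2$ is the squared sample correlation.

```python
# VIF_1 = 1 / (1 - R_1^2), with R_1^2 = corr(years, rbisyr)^2 = 0.487^2
r = 0.487
vif1 = 1.0 / (1.0 - r**2)
print(round(vif1, 2))   # about 1.31, i.e., fairly modest collinearity
```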
Computer Exercises

C1 A problem of interest to health officials (and others) is to determine the effects of smoking during pregnancy on infant health. One measure of infant health is birth weight; a birth weight that is too low can put an infant at risk for contracting various illnesses. Since factors other than cigarette smoking that affect birth weight are likely to be correlated with smoking, we should take those factors into account. For example, higher income generally results in access to better prenatal care, as well as better nutrition for the mother. An equation that recognizes this is
$$bwght = \beta_0 + \beta_1 cigs + \beta_2 faminc + u.$$
(i) What is the most likely sign for $\beta_2$?
(ii) Do you think cigs and faminc are likely to be correlated? Explain why the correlation might be positive or negative.
(iii) Now, estimate the equation with and without faminc, using the data in BWGHT. Report the results in equation form, including the sample size and R-squared. Discuss your results, focusing on whether adding faminc substantially changes the estimated effect of cigs on bwght.

C2 Use the data in HPRICE1 to estimate the model
$$price = \beta_0 + \beta_1 sqrft + \beta_2 bdrms + u,$$
where price is the house price measured in thousands of dollars.
(i) Write out the results in equation form.
(ii) What is the estimated increase in price for a house with one more bedroom, holding square footage constant?
(iii) What is the estimated increase in price for a house with an additional bedroom that is 140 square feet in size? Compare this to your answer in part (ii).
(iv) What percentage of the variation in price is explained by square footage and number of bedrooms?
(v) The first house in the sample has sqrft = 2,438 and bdrms = 4. Find the predicted selling price for this house from the OLS regression line.
(vi) The actual selling price of the first house in the sample was $300,000 (so price = 300). Find the residual for this house. Does it suggest that the buyer underpaid or overpaid for the house?

C3 The file CEOSAL2 contains data on 177 chief executive officers and can be used to examine the effects of firm performance on CEO salary.
(i) Estimate a model relating annual salary to firm sales and market value. Make the model of the constant elasticity variety for both independent variables. Write the results out in equation form.
(ii) Add profits to the model from part (i). Why can this variable not be included in logarithmic form? Would you say that these firm performance variables explain most of the variation in CEO salaries?
(iii) Add the variable ceoten to the model in part (ii). What is the estimated percentage return for another year of CEO tenure, holding other factors fixed?
(iv) Find the sample correlation coefficient between the variables log(mktval) and profits. Are these variables highly correlated? What does this say about the OLS estimators?

C4 Use the data in ATTEND for this exercise.
(i) Obtain the minimum, maximum, and average values for the variables atndrte, priGPA, and ACT.
(ii) Estimate the model
$$atndrte = \beta_0 + \beta_1 priGPA + \beta_2 ACT + u,$$
and write the results in equation form. Interpret the intercept. Does it have a useful meaning?
(iii) Discuss the estimated slope coefficients. Are there any surprises?
(iv) What is the predicted atndrte if priGPA = 3.65 and ACT = 20? What do you make of this result? Are there any students in the sample with these values of the explanatory variables?
(v) If Student A has priGPA = 3.1 and ACT = 21 and Student B has priGPA = 2.1 and ACT = 26, what is the predicted difference in their attendance rates?

C5 Confirm the partialling out interpretation of the OLS estimates by explicitly doing the partialling out for Example 3.2. This first requires regressing educ on exper and tenure and saving the residuals, $\hat r_1$. Then, regress log(wage) on $\hat r_1$. Compare the coefficient on $\hat r_1$ with the coefficient on educ in the regression of log(wage) on educ, exper, and tenure.

C6 Use the data set in WAGE2 for this problem. As usual, be sure all of the following regressions contain an intercept.
(i) Run a simple regression of IQ on educ to obtain the slope coefficient, say, $\tilde\delta_1$.
(ii) Run the simple regression of log(wage) on educ, and obtain the slope coefficient, $\tilde\beta_1$.
(iii) Run the multiple regression of log(wage) on educ and IQ, and obtain the slope coefficients, $\hat\beta_1$ and $\hat\beta_2$, respectively.
(iv) Verify that $\tilde\beta_1 = \hat\beta_1 + \hat\beta_2 \tilde\delta_1$.
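The partialling-out idea behind C5 and C6 can be previewed on simulated stand-in data; this sketch is not from the text, the variables educ, exper, and lwage below are synthetic (not the WAGE1/WAGE2 files), and statsmodels is assumed to be available.

```python
import numpy as np
import statsmodels.api as sm

# Partialling out (Frisch-Waugh): the slope from regressing y on the
# residualized regressor equals the multiple-regression coefficient.
rng = np.random.default_rng(2)
n = 800
exper = rng.normal(size=n)
educ = 0.5 * exper + rng.normal(size=n)
lwage = 0.6 + 0.09 * educ + 0.02 * exper + rng.normal(scale=0.4, size=n)

# Step 1: residuals from regressing educ on the other regressor(s)
r1 = sm.OLS(educ, sm.add_constant(exper)).fit().resid
# Step 2: regress lwage on those residuals
b_partial = sm.OLS(lwage, sm.add_constant(r1)).fit().params[1]
b_multiple = sm.OLS(lwage, sm.add_constant(np.column_stack([educ, exper]))).fit().params[1]
print(b_partial, b_multiple)   # the two slopes agree
```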
C7 Use the data in MEAP93 to answer this question.
(i) Estimate the model
$$math10 = \beta_0 + \beta_1 \log(expend) + \beta_2 lnchprg + u,$$
and report the results in the usual form, including the sample size and R-squared. Are the signs of the slope coefficients what you expected? Explain.
(ii) What do you make of the intercept you estimated in part (i)? In particular, does it make sense to set the two explanatory variables to zero? [Hint: Recall that log(1) = 0.]
(iii) Now run the simple regression of math10 on log(expend), and compare the slope coefficient with the estimate obtained in part (i). Is the estimated spending effect now larger or smaller than in part (i)?
(iv) Find the correlation between lexpend = log(expend) and lnchprg. Does its sign make sense to you?
(v) Use part (iv) to explain your findings in part (iii).

C8 Use the data in DISCRIM to answer this question. These are ZIP code-level data on prices for various items at fast-food restaurants, along with characteristics of the zip code population, in New Jersey and Pennsylvania. The idea is to see whether fast-food restaurants charge higher prices in areas with a larger concentration of blacks.
(i) Find the average values of prpblck and income in the sample, along with their standard deviations. What are the units of measurement of prpblck and income?
(ii) Consider a model to explain the price of soda, psoda, in terms of the proportion of the population that is black and median income:
$$psoda = \beta_0 + \beta_1 prpblck + \beta_2 income + u.$$
Estimate this model by OLS and report the results in equation form, including the sample size and R-squared. (Do not use scientific notation when reporting the estimates.) Interpret the coefficient on prpblck. Do you think it is economically large?
(iii) Compare the estimate from part (ii) with the simple regression estimate from psoda on prpblck. Is the discrimination effect larger or smaller when you control for income?
(iv) A model with a constant price elasticity with respect to income may be more appropriate. Report estimates of the model
$$\log(psoda) = \beta_0 + \beta_1 prpblck + \beta_2 \log(income) + u.$$
If prpblck increases by .20 (20 percentage points), what is the estimated percentage change in psoda? (Hint: The answer is 2.xx, where you fill in the xx.)
(v) Now add the variable prppov to the regression in part (iv). What happens to $\hat\beta_{prpblck}$?
(vi) Find the correlation between log(income) and prppov. Is it roughly what you expected?
(vii) Evaluate the following statement: "Because log(income) and prppov are so highly correlated, they have no business being in the same regression."
C9 Use the data in CHARITY to answer the following questions.
(i) Estimate the equation
$$gift = \beta_0 + \beta_1 mailsyear + \beta_2 giftlast + \beta_3 propresp + u$$
by OLS and report the results in the usual way, including the sample size and R-squared. How does the R-squared compare with that from the simple regression that omits giftlast and propresp?
(ii) Interpret the coefficient on mailsyear. Is it bigger or smaller than the corresponding simple regression coefficient?
(iii) Interpret the coefficient on propresp. Be careful to notice the units of measurement of propresp.
(iv) Now add the variable avggift to the equation. What happens to the estimated effect of mailsyear?
(v) In the equation from part (iv), what has happened to the coefficient on giftlast? What do you think is happening?

C10 Use the data in HTV to answer this question. The data set includes information on wages, education, parents' education, and several other variables for 1,230 working men in 1991.
(i) What is the range of the educ variable in the sample? What percentage of men completed twelfth grade but no higher grade? Do the men or their parents have, on average, higher levels of education?
(ii) Estimate the regression model
$$educ = \beta_0 + \beta_1 motheduc + \beta_2 fatheduc + u$$
by OLS and report the results in the usual form. How much sample variation in educ is explained by parents' education? Interpret the coefficient on motheduc.
(iii) Add the variable abil (a measure of cognitive ability) to the regression from part (ii), and report the results in equation form. Does "ability" help to explain variations in education, even after controlling for parents' education? Explain.
(iv) (Requires calculus) Now estimate an equation where abil appears in quadratic form:
$$educ = \beta_0 + \beta_1 motheduc + \beta_2 fatheduc + \beta_3 abil + \beta_4 abil^2 + u.$$
Using the estimates $\hat\beta_3$ and $\hat\beta_4$, use calculus to find the value of abil, call it abil*, where educ is minimized. (The other coefficients and values of parents' education variables have no effect; we are holding parents' education fixed.) Notice that abil is measured so that negative values are permissible. You might also verify that the second derivative is positive, so that you do indeed have a minimum.
(v) Argue that only a small fraction of men in the sample have "ability" less than the value calculated in part (iv). Why is this important?
(vi) If you have access to a statistical program that includes graphing capabilities, use the estimates in part (iv) to graph the relationship between the predicted education and abil. Set motheduc and fatheduc at their average values in the sample, 12.18 and 12.45, respectively.

C11 Use the data in MEAPSINGLE to study the effects of single-parent households on student math performance. These data are for a subset of schools in southeast Michigan for the year 2000. The socioeconomic variables are obtained at the ZIP code level (where ZIP code is assigned to schools based on their mailing addresses).
(i) Run the simple regression of math4 on pctsgle and report the results in the usual format. Interpret the slope coefficient. Does the effect of single parenthood seem large or small?
(ii) Add the variables lmedinc and free to the equation. What happens to the coefficient on pctsgle? Explain what is happening.
(iii) Find the sample correlation between lmedinc and free. Does it have the sign you expect?
(iv) Does the substantial correlation between lmedinc and free mean that you should drop one from the regression to better estimate the causal effect of single parenthood on student performance? Explain.
(v) Find the variance inflation factors (VIFs) for each of the explanatory variables appearing in the regression in part (ii). Which variable has the largest VIF? Does this knowledge affect the model you would use to study the causal effect of single parenthood on math performance?
C12 The data in ECONMATH contain grade point averages and standardized test scores, along with performance in an introductory economics course, for students at a large public university. The variable to be explained is score, the final score in the course measured as a percentage.
(i) How many students received a perfect score for the course? What was the average score? Find the means and standard deviations of actmth and acteng, and discuss how they compare.
(ii) Estimate a linear equation relating score to colgpa, actmth, and acteng, where colgpa is measured at the beginning of the term. Report the results in the usual form.
(iii) Would you say the math or English ACT score is a better predictor of performance in the economics course? Explain.
(iv) Discuss the size of the R-squared in the regression.

Appendix 3A

3A.1 Derivation of the First Order Conditions in Equation (3.13)

The analysis is very similar to the simple regression case. We must characterize the solutions to the problem
$$\min_{b_0, b_1, \ldots, b_k} \sum_{i=1}^n (y_i - b_0 - b_1 x_{i1} - \cdots - b_k x_{ik})^2.$$
Taking the partial derivatives with respect to each of the $b_j$ (see Appendix A), evaluating them at the solutions, and setting them equal to zero gives
$$-2\sum_{i=1}^n (y_i - \hat\beta_0 - \hat\beta_1 x_{i1} - \cdots - \hat\beta_k x_{ik}) = 0$$
$$-2\sum_{i=1}^n x_{ij}(y_i - \hat\beta_0 - \hat\beta_1 x_{i1} - \cdots - \hat\beta_k x_{ik}) = 0,\ \text{for all}\ j = 1, \ldots, k.$$
Canceling the −2 gives the first order conditions in (3.13).

3A.2 Derivation of Equation (3.22)

To derive (3.22), write $x_{i1}$ in terms of its fitted value and its residual from the regression of $x_1$ on $x_2, \ldots, x_k$: $x_{i1} = \hat x_{i1} + \hat r_{i1}$, for all $i = 1, \ldots, n$. Now, plug this into the second equation in (3.13):
$$\sum_{i=1}^n (\hat x_{i1} + \hat r_{i1})(y_i - \hat\beta_0 - \hat\beta_1 x_{i1} - \cdots - \hat\beta_k x_{ik}) = 0. \quad (3.63)$$
By the definition of the OLS residual $\hat u_i$, since $\hat x_{i1}$ is just a linear function of the explanatory variables $x_{i2}, \ldots, x_{ik}$, it follows that $\sum_{i=1}^n \hat x_{i1}\hat u_i = 0$. Therefore, equation (3.63) can be expressed as
$$\sum_{i=1}^n \hat r_{i1}(y_i - \hat\beta_0 - \hat\beta_1 x_{i1} - \cdots - \hat\beta_k x_{ik}) = 0. \quad (3.64)$$
Since the $\hat r_{i1}$ are the residuals from regressing $x_1$ on $x_2, \ldots, x_k$, $\sum_{i=1}^n x_{ij}\hat r_{i1} = 0$ for all $j = 2, \ldots, k$. Therefore, (3.64) is equivalent to $\sum_{i=1}^n \hat r_{i1}(y_i - \hat\beta_1 x_{i1}) = 0$. Finally, we use the fact that $\sum_{i=1}^n \hat x_{i1}\hat r_{i1} = 0$, which means that $\hat\beta_1$ solves
$$\sum_{i=1}^n \hat r_{i1}(y_i - \hat\beta_1 \hat r_{i1}) = 0.$$
Now, straightforward algebra gives (3.22), provided, of course, that $\sum_{i=1}^n \hat r_{i1}^2 > 0$; this is ensured by Assumption MLR.3.
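A direct numerical check of equation (3.22), $\hat\beta_1 = \sum_i \hat r_{i1} y_i / \sum_i \hat r_{i1}^2$, can be done on toy data; this sketch is an addition, not from the text, and uses only numpy with illustrative numbers.

```python
import numpy as np

# Check of equation (3.22) on synthetic data.
rng = np.random.default_rng(3)
n = 300
x2 = rng.normal(size=n)
x1 = 0.7 * x2 + rng.normal(size=n)
y = 2 + 1.5 * x1 - 1.0 * x2 + rng.normal(size=n)

X = np.column_stack([np.ones(n), x1, x2])
beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]

# r1: residuals from regressing x1 on the other regressors
Z = np.column_stack([np.ones(n), x2])
r1 = x1 - Z @ np.linalg.lstsq(Z, x1, rcond=None)[0]
print(beta_hat[1], (r1 @ y) / (r1 @ r1))   # identical up to rounding
```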
3A.3 Proof of Theorem 3.1

We prove Theorem 3.1 for $\hat\beta_1$; the proof for the other slope parameters is virtually identical. (See Appendix E for a more succinct proof using matrices.) Under Assumption MLR.3, the OLS estimators exist, and we can write $\hat\beta_1$ as in (3.22). Under Assumption MLR.1, we can write $y_i$ as in (3.32); substitute this for $y_i$ in (3.22). Then, using $\sum_{i=1}^n \hat r_{i1} = 0$, $\sum_{i=1}^n x_{ij}\hat r_{i1} = 0$ for all $j = 2, \ldots, k$, and $\sum_{i=1}^n x_{i1}\hat r_{i1} = \sum_{i=1}^n \hat r_{i1}^2$, we have
$$\hat\beta_1 = \beta_1 + \left(\sum_{i=1}^n \hat r_{i1} u_i\right) \bigg/ \left(\sum_{i=1}^n \hat r_{i1}^2\right). \quad (3.65)$$
Now, under Assumptions MLR.2 and MLR.4, the expected value of each $u_i$, given all independent variables in the sample, is zero. Since the $\hat r_{i1}$ are just functions of the sample independent variables, it follows that
$$E(\hat\beta_1 \mid X) = \beta_1 + \left(\sum_{i=1}^n \hat r_{i1} E(u_i \mid X)\right) \bigg/ \left(\sum_{i=1}^n \hat r_{i1}^2\right) = \beta_1 + \left(\sum_{i=1}^n \hat r_{i1} \cdot 0\right) \bigg/ \left(\sum_{i=1}^n \hat r_{i1}^2\right) = \beta_1,$$
where X denotes the data on all independent variables and $E(\hat\beta_1 \mid X)$ is the expected value of $\hat\beta_1$, given $x_{i1}, \ldots, x_{ik}$, for all $i = 1, \ldots, n$. This completes the proof.

3A.4 General Omitted Variable Bias

We can derive the omitted variable bias in the general model in equation (3.31) under the first four Gauss-Markov assumptions. In particular, let the $\hat\beta_j$, $j = 0, 1, \ldots, k$, be the OLS estimators from the regression using the full set of explanatory variables. Let the $\tilde\beta_j$, $j = 0, 1, \ldots, k-1$, be the OLS estimators from the regression that leaves out $x_k$. Let $\tilde\delta_j$, $j = 1, \ldots, k-1$, be the slope coefficient on $x_j$ in the auxiliary regression of $x_{ik}$ on $x_{i1}, x_{i2}, \ldots, x_{i,k-1}$, $i = 1, \ldots, n$. A useful fact is that
$$\tilde\beta_j = \hat\beta_j + \hat\beta_k \tilde\delta_j. \quad (3.66)$$
This shows explicitly that, when we do not control for $x_k$ in the regression, the estimated partial effect of $x_j$ equals the partial effect when we include $x_k$, plus the partial effect of $x_k$ on y times the partial relationship between the omitted variable, $x_k$, and $x_j$, $j < k$.

Conditional on the entire set of explanatory variables, X, we know that the $\hat\beta_j$ are all unbiased for the corresponding $\beta_j$, $j = 1, \ldots, k$. Further, since $\tilde\delta_j$ is just a function of X, we have
$$E(\tilde\beta_j \mid X) = E(\hat\beta_j \mid X) + E(\hat\beta_k \mid X)\tilde\delta_j = \beta_j + \beta_k \tilde\delta_j. \quad (3.67)$$
Equation (3.67) shows that $\tilde\beta_j$ is biased for $\beta_j$ unless $\beta_k = 0$ (in which case $x_k$ has no partial effect in the population) or $\tilde\delta_j$ equals zero, which means that $x_{ik}$ and $x_{ij}$ are partially uncorrelated in the sample.

The key to obtaining equation (3.67) is equation (3.66). To show equation (3.66), we can use equation (3.22) a couple of times. For simplicity, we look at $j = 1$. Now, $\tilde\beta_1$ is the slope coefficient in the simple regression of $y_i$ on $\tilde r_{i1}$, $i = 1, \ldots, n$, where the $\tilde r_{i1}$ are the OLS residuals from the regression of $x_{i1}$ on $x_{i2}, x_{i3}, \ldots, x_{i,k-1}$. Consider the numerator of the expression for $\tilde\beta_1$: $\sum_{i=1}^n \tilde r_{i1} y_i$. But for each $i$, we can write $y_i = \hat\beta_0 + \hat\beta_1 x_{i1} + \cdots + \hat\beta_k x_{ik} + \hat u_i$ and plug in for $y_i$. Now, by properties of the OLS residuals, the $\tilde r_{i1}$ have zero sample average and are uncorrelated with $x_{i2}, x_{i3}, \ldots, x_{i,k-1}$ in the sample. Similarly, the $\hat u_i$ have zero sample average and zero sample correlation with $x_{i1}, x_{i2}, \ldots, x_{ik}$. It follows that the $\tilde r_{i1}$ and $\hat u_i$ are uncorrelated in the sample (since the $\tilde r_{i1}$ are just linear combinations of $x_{i1}, x_{i2}, \ldots, x_{i,k-1}$). So
$$\sum_{i=1}^n \tilde r_{i1} y_i = \hat\beta_1 \left(\sum_{i=1}^n \tilde r_{i1} x_{i1}\right) + \hat\beta_k \left(\sum_{i=1}^n \tilde r_{i1} x_{ik}\right). \quad (3.68)$$
Now, $\sum_{i=1}^n \tilde r_{i1} x_{i1} = \sum_{i=1}^n \tilde r_{i1}^2$, which is also the denominator of $\tilde\beta_1$. Therefore, we have shown that
$$\tilde\beta_1 = \hat\beta_1 + \hat\beta_k \left(\sum_{i=1}^n \tilde r_{i1} x_{ik}\right) \bigg/ \left(\sum_{i=1}^n \tilde r_{i1}^2\right) = \hat\beta_1 + \hat\beta_k \tilde\delta_1.$$
This is the relationship we wanted to show.
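The algebraic identity in equation (3.66) can also be verified numerically; the sketch below is an addition (not from the text) using purely synthetic data, with numpy assumed available and all coefficients made up.

```python
import numpy as np

# Numerical check of (3.66): the short-regression slope equals
# beta1_hat + betak_hat * delta1_tilde.
rng = np.random.default_rng(4)
n = 500
x1 = rng.normal(size=n)
xk = 0.8 * x1 + rng.normal(size=n)          # omitted variable, correlated with x1
y = 1 + 0.5 * x1 + 0.9 * xk + rng.normal(size=n)

ones = np.ones(n)
long_fit = np.linalg.lstsq(np.column_stack([ones, x1, xk]), y, rcond=None)[0]
short_fit = np.linalg.lstsq(np.column_stack([ones, x1]), y, rcond=None)[0]
delta1 = np.linalg.lstsq(np.column_stack([ones, x1]), xk, rcond=None)[0][1]

print(short_fit[1], long_fit[1] + long_fit[2] * delta1)  # the two agree
```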
3A.5 Proof of Theorem 3.2

Again, we prove this for $j = 1$. Write $\hat\beta_1$ as in equation (3.65). Now, under MLR.5, $\mathrm{Var}(u_i \mid X) = \sigma^2$ for all $i = 1, \ldots, n$. Under random sampling, the $u_i$ are independent, even conditional on X, and the $\hat r_{i1}$ are nonrandom conditional on X. Therefore,
$$\mathrm{Var}(\hat\beta_1 \mid X) = \left(\sum_{i=1}^n \hat r_{i1}^2\, \mathrm{Var}(u_i \mid X)\right) \bigg/ \left(\sum_{i=1}^n \hat r_{i1}^2\right)^2 = \left(\sum_{i=1}^n \hat r_{i1}^2\, \sigma^2\right) \bigg/ \left(\sum_{i=1}^n \hat r_{i1}^2\right)^2 = \sigma^2 \bigg/ \left(\sum_{i=1}^n \hat r_{i1}^2\right).$$
Now, since $\sum_{i=1}^n \hat r_{i1}^2$ is the sum of squared residuals from regressing $x_1$ on $x_2, \ldots, x_k$, $\sum_{i=1}^n \hat r_{i1}^2 = \mathrm{SST}_1(1 - R_1^2)$. This completes the proof.

3A.6 Proof of Theorem 3.4

We show that, for any other linear unbiased estimator $\tilde\beta_1$ of $\beta_1$, $\mathrm{Var}(\tilde\beta_1) \ge \mathrm{Var}(\hat\beta_1)$, where $\hat\beta_1$ is the OLS estimator. The focus on $j = 1$ is without loss of generality. For $\tilde\beta_1$ as in equation (3.60), we can plug in for $y_i$ to obtain
$$\tilde\beta_1 = \beta_0 \sum_{i=1}^n w_{i1} + \beta_1 \sum_{i=1}^n w_{i1} x_{i1} + \beta_2 \sum_{i=1}^n w_{i1} x_{i2} + \cdots + \beta_k \sum_{i=1}^n w_{i1} x_{ik} + \sum_{i=1}^n w_{i1} u_i.$$
Now, since the $w_{i1}$ are functions of the $x_{ij}$,
$$E(\tilde\beta_1 \mid X) = \beta_0 \sum_{i=1}^n w_{i1} + \beta_1 \sum_{i=1}^n w_{i1} x_{i1} + \beta_2 \sum_{i=1}^n w_{i1} x_{i2} + \cdots + \beta_k \sum_{i=1}^n w_{i1} x_{ik}$$
because $E(u_i \mid X) = 0$ for all $i = 1, \ldots, n$ under MLR.2 and MLR.4. Therefore, for $E(\tilde\beta_1 \mid X)$ to equal $\beta_1$ for any values of the parameters, we must have
$$\sum_{i=1}^n w_{i1} = 0,\quad \sum_{i=1}^n w_{i1} x_{i1} = 1,\quad \sum_{i=1}^n w_{i1} x_{ij} = 0,\ j = 2, \ldots, k. \quad (3.69)$$
Now, let $\hat r_{i1}$ be the residuals from the regression of $x_{i1}$ on $x_{i2}, \ldots, x_{ik}$. Then, from (3.69), it follows that
$$\sum_{i=1}^n w_{i1} \hat r_{i1} = 1 \quad (3.70)$$
because $x_{i1} = \hat x_{i1} + \hat r_{i1}$ and $\sum_{i=1}^n w_{i1}\hat x_{i1} = 0$. Now, consider the difference between $\mathrm{Var}(\tilde\beta_1 \mid X)$ and $\mathrm{Var}(\hat\beta_1 \mid X)$ under MLR.1 through MLR.5:
$$\sigma^2 \sum_{i=1}^n w_{i1}^2 - \sigma^2 \bigg/ \left(\sum_{i=1}^n \hat r_{i1}^2\right). \quad (3.71)$$
Because of (3.70), we can write the difference in (3.71), without $\sigma^2$, as
$$\sum_{i=1}^n w_{i1}^2 - \left(\sum_{i=1}^n w_{i1}\hat r_{i1}\right)^2 \bigg/ \left(\sum_{i=1}^n \hat r_{i1}^2\right). \quad (3.72)$$
But (3.72) is simply
$$\sum_{i=1}^n (w_{i1} - \hat\gamma_1 \hat r_{i1})^2, \quad (3.73)$$
where $\hat\gamma_1 = \left(\sum_{i=1}^n w_{i1}\hat r_{i1}\right) \big/ \left(\sum_{i=1}^n \hat r_{i1}^2\right)$, as can be seen by squaring each term in (3.73), summing, and then canceling terms. Because (3.73) is just the sum of squared residuals from the simple regression of $w_{i1}$ on $\hat r_{i1}$ (remember that the sample average of $\hat r_{i1}$ is zero), (3.73) must be nonnegative. This completes the proof.

Chapter 4
Multiple Regression Analysis: Inference

This chapter continues our treatment of multiple regression analysis. We now turn to the problem of testing hypotheses about the parameters in the population regression model. We begin in Section 4.1 by finding the distributions of the OLS estimators under the added assumption that the population error is normally distributed. Sections 4.2 and 4.3 cover hypothesis testing about individual parameters, while Section 4.4 discusses how to test a single hypothesis involving more than one parameter. We focus on testing multiple restrictions in Section 4.5 and pay particular attention to determining whether a group of independent variables can be omitted from a model.

4.1 Sampling Distributions of the OLS Estimators

Up to this point, we have formed a set of assumptions under which OLS is unbiased; we have also derived and discussed the bias caused by omitted variables. In Section 3.4, we obtained the variances of the OLS estimators under the Gauss-Markov assumptions. In Section 3.5, we showed that this variance is smallest among linear unbiased estimators.
Knowing the expected value and variance of the OLS estimators is useful for describing the precision of the OLS estimators. However, in order to perform statistical inference, we need to know more than just the first two moments of $\hat\beta_j$; we need to know the full sampling distribution of the $\hat\beta_j$. Even under the Gauss-Markov assumptions, the distribution of $\hat\beta_j$ can have virtually any shape.

When we condition on the values of the independent variables in our sample, it is clear that the sampling distributions of the OLS estimators depend on the underlying distribution of the errors. To make the sampling distributions of the $\hat\beta_j$ tractable, we now assume that the unobserved error is normally distributed in the population. We call this the normality assumption.

Assumption MLR.6 (Normality)
The population error u is independent of the explanatory variables $x_1, x_2, \ldots, x_k$ and is normally distributed with zero mean and variance $\sigma^2$: $u \sim \mathrm{Normal}(0, \sigma^2)$.

Assumption MLR.6 is much stronger than any of our previous assumptions. In fact, since u is independent of the $x_j$ under MLR.6, $E(u \mid x_1, \ldots, x_k) = E(u) = 0$ and $\mathrm{Var}(u \mid x_1, \ldots, x_k) = \mathrm{Var}(u) = \sigma^2$. Thus, if we make Assumption MLR.6, then we are necessarily assuming MLR.4 and MLR.5. To emphasize that we are assuming more than before, we will refer to the full set of Assumptions MLR.1 through MLR.6.

For cross-sectional regression applications, Assumptions MLR.1 through MLR.6 are called the classical linear model (CLM) assumptions. Thus, we will refer to the model under these six assumptions as the classical linear model. It is best to think of the CLM assumptions as containing all of the Gauss-Markov assumptions plus the assumption of a normally distributed error term.

Under the CLM assumptions, the OLS estimators $\hat\beta_0, \hat\beta_1, \ldots, \hat\beta_k$ have a stronger efficiency property than they would under the Gauss-Markov assumptions. It can be shown that the OLS estimators are the minimum variance unbiased estimators, which means that OLS has the smallest variance among unbiased estimators; we no longer have to restrict our comparison to estimators that are linear in the $y_i$. This property of OLS under the CLM assumptions is discussed further in Appendix E.

A succinct way to summarize the population assumptions of the CLM is
$$y \mid \mathbf{x} \sim \mathrm{Normal}(\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k,\ \sigma^2),$$
where $\mathbf{x}$ is again shorthand for $(x_1, \ldots, x_k)$. Thus, conditional on $\mathbf{x}$, y has a normal distribution with mean linear in $x_1, \ldots, x_k$ and a constant variance. For a single independent variable x, this situation is shown in Figure 4.1.

[Figure 4.1 The homoskedastic normal distribution with a single explanatory variable.]
The argument justifying the normal distribution for the errors usually runs something like this: Because u is the sum of many different unobserved factors affecting y, we can invoke the central limit theorem (CLT) (see Appendix C) to conclude that u has an approximate normal distribution. This argument has some merit, but it is not without weaknesses. First, the factors in u can have very different distributions in the population (for example, ability and quality of schooling in the error in a wage equation). Although the CLT can still hold in such cases, the normal approximation can be poor depending on how many factors appear in u and how different their distributions are.

A more serious problem with the CLT argument is that it assumes that all unobserved factors affect y in a separate, additive fashion. Nothing guarantees that this is so. If u is a complicated function of the unobserved factors, then the CLT argument does not really apply.

In any application, whether normality of u can be assumed is really an empirical matter. For example, there is no theorem that says wage conditional on educ, exper, and tenure is normally distributed. If anything, simple reasoning suggests that the opposite is true: since wage can never be less than zero, it cannot, strictly speaking, have a normal distribution. Further, because there are minimum wage laws, some fraction of the population earns exactly the minimum wage, which also violates the normality assumption. Nevertheless, as a practical matter, we can ask whether the conditional wage distribution is "close" to being normal. Past empirical evidence suggests that normality is not a good assumption for wages.

Often, using a transformation, especially taking the log, yields a distribution that is closer to normal. For example, something like log(price) tends to have a distribution that looks more normal than the distribution of price. Again, this is an empirical issue. We will discuss the consequences of nonnormality for statistical inference in Chapter 5.

There are some applications where MLR.6 is clearly false, as can be demonstrated with simple introspection. Whenever y takes on just a few values, it cannot have anything close to a normal distribution. The dependent variable in Example 3.5 provides a good example. The variable narr86, the number of times a young man was arrested in 1986, takes on a small range of integer values and is zero for most men. Thus, narr86 is far from being normally distributed. What can be done in these cases? As we will see in Chapter 5, and this is important, nonnormality of the errors is not a serious problem with large sample sizes. For now, we just make the normality assumption.

Normality of the error term translates into normal sampling distributions of the OLS estimators:

Theorem 4.1 (Normal Sampling Distributions)
Under the CLM assumptions MLR.1 through MLR.6, conditional on the sample values of the independent variables,
$$\hat\beta_j \sim \mathrm{Normal}[\beta_j,\ \mathrm{Var}(\hat\beta_j)], \quad (4.1)$$
where $\mathrm{Var}(\hat\beta_j)$ was given in Chapter 3 [equation (3.51)]. Therefore,
$$(\hat\beta_j - \beta_j)/\mathrm{sd}(\hat\beta_j) \sim \mathrm{Normal}(0, 1).$$
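A small Monte Carlo sketch (an addition, not from the text) can make Theorem 4.1 tangible: with normal errors and fixed regressors, the standardized slope estimate should behave like a standard normal across repeated samples. The setup below is entirely synthetic and assumes numpy.

```python
import numpy as np

# Monte Carlo sketch of Theorem 4.1 with a single regressor.
rng = np.random.default_rng(5)
n, reps, beta1, sigma = 50, 5000, 0.5, 1.0
x = rng.normal(size=n)                      # regressors held fixed across draws
sst = np.sum((x - x.mean()) ** 2)
sd_b1 = sigma / np.sqrt(sst)                # sd(beta1_hat) in simple regression

z = np.empty(reps)
for r in range(reps):
    y = 1.0 + beta1 * x + rng.normal(scale=sigma, size=n)
    b1 = np.sum((x - x.mean()) * y) / sst   # OLS slope
    z[r] = (b1 - beta1) / sd_b1

print(z.mean(), z.std())                    # close to 0 and 1
```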
The proof of (4.1) is not that difficult, given the properties of normally distributed random variables in Appendix B. Each $\hat\beta_j$ can be written as $\hat\beta_j = \beta_j + \sum_{i=1}^n w_{ij} u_i$, where $w_{ij} = \hat r_{ij}/\mathrm{SSR}_j$, $\hat r_{ij}$ is the i-th residual from the regression of $x_j$ on all the other independent variables, and $\mathrm{SSR}_j$ is the sum of squared residuals from this regression [see equation (3.62)]. Since the $w_{ij}$ depend only on the independent variables, they can be treated as nonrandom. Thus, $\hat\beta_j$ is just a linear combination of the errors in the sample, $\{u_i: i = 1, 2, \ldots, n\}$. Under Assumption MLR.6 (and the random sampling Assumption MLR.2), the errors are independent, identically distributed $\mathrm{Normal}(0, \sigma^2)$ random variables. An important fact about independent normal random variables is that a linear combination of such random variables is normally distributed (see Appendix B). This basically completes the proof. In Section 3.3, we showed that $E(\hat\beta_j) = \beta_j$, and we derived $\mathrm{Var}(\hat\beta_j)$ in Section 3.4; there is no need to re-derive these facts.

The second part of this theorem follows immediately from the fact that when we standardize a normal random variable by subtracting off its mean and dividing by its standard deviation, we end up with a standard normal random variable.

The conclusions of Theorem 4.1 can be strengthened. In addition to (4.1), any linear combination of the $\hat\beta_0, \hat\beta_1, \ldots, \hat\beta_k$ is also normally distributed, and any subset of the $\hat\beta_j$ has a joint normal distribution. These facts underlie the testing results in the remainder of this chapter. In Chapter 5, we will show that the normality of the OLS estimators is still approximately true in large samples even without normality of the errors.

4.2 Testing Hypotheses about a Single Population Parameter: The t Test

This section covers the very important topic of testing hypotheses about any single parameter in the population regression function. The population model can be written as
$$y = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k + u, \quad (4.2)$$
and we assume that it satisfies the CLM assumptions. We know that OLS produces unbiased estimators of the $\beta_j$. In this section, we study how to test hypotheses about a particular $\beta_j$. For a full understanding of hypothesis testing, one must remember that the $\beta_j$ are unknown features of the population, and we will never know them with certainty. Nevertheless, we can hypothesize about the value of $\beta_j$ and then use statistical inference to test our hypothesis.

In order to construct hypotheses tests, we need the following result:

Theorem 4.2 (t Distribution for the Standardized Estimators)
Under the CLM assumptions MLR.1 through MLR.6,
$$(\hat\beta_j - \beta_j)/\mathrm{se}(\hat\beta_j) \sim t_{n-k-1} = t_{df}, \quad (4.3)$$
where $k + 1$ is the number of unknown parameters in the population model $y = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k + u$ (k slope parameters and the intercept $\beta_0$) and $n - k - 1$ is the degrees of freedom (df).

This result differs from Theorem 4.1 in some notable respects. Theorem 4.1 showed that, under the CLM assumptions, $(\hat\beta_j - \beta_j)/\mathrm{sd}(\hat\beta_j) \sim \mathrm{Normal}(0, 1)$. The t distribution in (4.3) comes from the fact that the constant $\sigma$ in $\mathrm{sd}(\hat\beta_j)$ has been replaced with the random variable $\hat\sigma$. The proof that this leads to a t distribution with $n - k - 1$ degrees of freedom is difficult and not especially instructive. Essentially, the proof shows that (4.3) can be written as the ratio of the standard normal random variable $(\hat\beta_j - \beta_j)/\mathrm{sd}(\hat\beta_j)$ over the square root of $\hat\sigma^2/\sigma^2$. These random variables can be shown to be independent, and $(n - k - 1)\hat\sigma^2/\sigma^2 \sim \chi^2_{n-k-1}$. The result then follows from the definition of a t random variable (see Section B.5).

Theorem 4.2 is important in that it allows us to test hypotheses involving the $\beta_j$.
In most applications, our primary interest lies in testing the null hypothesis
$$H_0: \beta_j = 0, \quad (4.4)$$
where j corresponds to any of the k independent variables.

[Exploring Further 4.1: Suppose that u is independent of the explanatory variables, and it takes on the values −2, −1, 0, 1, and 2, each with probability 1/5. Does this violate the Gauss-Markov assumptions? Does this violate the CLM assumptions?]

It is important to understand what (4.4) means and to be able to describe this hypothesis in simple language for a particular application. Since $\beta_j$ measures the partial effect of $x_j$ on (the expected value of) y, after controlling for all other independent variables, (4.4) means that, once $x_1, x_2, \ldots, x_{j-1}, x_{j+1}, \ldots, x_k$ have been accounted for, $x_j$ has no effect on the expected value of y. We cannot state the null hypothesis as "$x_j$ does have a partial effect on y" because this is true for any value of $\beta_j$ other than zero. Classical testing is suited for testing simple hypotheses like (4.4).

As an example, consider the wage equation
$$\log(wage) = \beta_0 + \beta_1 educ + \beta_2 exper + \beta_3 tenure + u.$$
The null hypothesis $H_0: \beta_2 = 0$ means that, once education and tenure have been accounted for, the number of years in the workforce (exper) has no effect on hourly wage. This is an economically interesting hypothesis. If it is true, it implies that a person's work history prior to the current employment does not affect wage. If $\beta_2 > 0$, then prior work experience contributes to productivity, and hence to wage.

You probably remember, from your statistics course, the rudiments of hypothesis testing for the mean from a normal population. (This is reviewed in Appendix C.) The mechanics of testing (4.4) in the multiple regression context are very similar. The hard part is obtaining the coefficient estimates, the standard errors, and the critical values, but most of this work is done automatically by econometrics software. Our job is to learn how regression output can be used to test hypotheses of interest.

The statistic we use to test (4.4) (against any alternative) is called "the" t statistic or "the" t ratio of $\hat\beta_j$ and is defined as
$$t_{\hat\beta_j} \equiv \hat\beta_j/\mathrm{se}(\hat\beta_j). \quad (4.5)$$
We have put "the" in quotation marks because, as we will see shortly, a more general form of the t statistic is needed for testing other hypotheses about $\beta_j$. For now, it is important to know that (4.5) is suitable only for testing (4.4). For particular applications, it is helpful to index t statistics using the name of the independent variable; for example, $t_{educ}$ would be the t statistic for $\hat\beta_{educ}$.

The t statistic for $\hat\beta_j$ is simple to compute given $\hat\beta_j$ and its standard error. In fact, most regression packages do the division for you and report the t statistic along with each coefficient and its standard error.

Before discussing how to use (4.5) formally to test $H_0: \beta_j = 0$, it is useful to see why $t_{\hat\beta_j}$ has features that make it reasonable as a test statistic to detect $\beta_j \ne 0$. First, since $\mathrm{se}(\hat\beta_j)$ is always positive, $t_{\hat\beta_j}$ has the same sign as $\hat\beta_j$: if $\hat\beta_j$ is positive, then so is $t_{\hat\beta_j}$, and if $\hat\beta_j$ is negative, so is $t_{\hat\beta_j}$. Second, for a given value of $\mathrm{se}(\hat\beta_j)$, a larger value of $\hat\beta_j$ leads to larger values of $t_{\hat\beta_j}$. If $\hat\beta_j$ becomes more negative, so does $t_{\hat\beta_j}$.
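Equation (4.5) is exactly the ratio regression packages report; a minimal sketch (an addition, not from the text, using synthetic data and assuming statsmodels) makes this visible.

```python
import numpy as np
import statsmodels.api as sm

# The reported t statistic is just the coefficient over its standard error.
rng = np.random.default_rng(6)
n = 200
X = sm.add_constant(rng.normal(size=(n, 2)))
y = X @ np.array([1.0, 0.5, 0.0]) + rng.normal(size=n)

res = sm.OLS(y, X).fit()
print(res.params / res.bse)   # equation (4.5), computed by hand
print(res.tvalues)            # matches the package's t statistics
```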
Since we are testing $H_0: \beta_j = 0$, it is only natural to look at our unbiased estimator of $\beta_j$, $\hat\beta_j$, for guidance. In any interesting application, the point estimate $\hat\beta_j$ will never exactly be zero, whether or not $H_0$ is true. The question is: How far is $\hat\beta_j$ from zero? A sample value of $\hat\beta_j$ very far from zero provides evidence against $H_0: \beta_j = 0$. However, we must recognize that there is a sampling error in our estimate $\hat\beta_j$, so the size of $\hat\beta_j$ must be weighed against its sampling error. Since the standard error of $\hat\beta_j$ is an estimate of the standard deviation of $\hat\beta_j$, $t_{\hat\beta_j}$ measures how many estimated standard deviations $\hat\beta_j$ is away from zero. This is precisely what we do in testing whether the mean of a population is zero, using the standard t statistic from introductory statistics. Values of $t_{\hat\beta_j}$ sufficiently far from zero will result in a rejection of $H_0$. The precise rejection rule depends on the alternative hypothesis and the chosen significance level of the test.

Determining a rule for rejecting (4.4) at a given significance level, that is, the probability of rejecting $H_0$ when it is true, requires knowing the sampling distribution of $t_{\hat\beta_j}$ when $H_0$ is true. From Theorem 4.2, we know this to be $t_{n-k-1}$. This is the key theoretical result needed for testing (4.4).

Before proceeding, it is important to remember that we are testing hypotheses about the population parameters. We are not testing hypotheses about the estimates from a particular sample. Thus, it never makes sense to state a null hypothesis as "$H_0: \hat\beta_1 = 0$" or, even worse, as "$H_0: .237 = 0$" when the estimate of a parameter is .237 in the sample. We are testing whether the unknown population value, $\beta_1$, is zero.

Some treatments of regression analysis define the t statistic as the absolute value of (4.5), so that the t statistic is always positive. This practice has the drawback of making testing against one-sided alternatives clumsy. Throughout this text, the t statistic always has the same sign as the corresponding OLS coefficient estimate.

4.2a Testing against One-Sided Alternatives

To determine a rule for rejecting $H_0$, we need to decide on the relevant alternative hypothesis. First, consider a one-sided alternative of the form
$$H_1: \beta_j > 0. \quad (4.6)$$
When we state the alternative as in equation (4.6), we are really saying that the null hypothesis is $H_0: \beta_j \le 0$. For example, if $\beta_j$ is the coefficient on education in a wage regression, we only care about detecting that $\beta_j$ is different from zero when $\beta_j$ is actually positive. You may remember from introductory statistics that the null value that is hardest to reject in favor of (4.6) is $\beta_j = 0$. In other words, if we reject the null $\beta_j = 0$, then we automatically reject $\beta_j < 0$. Therefore, it suffices to act as if we are testing $H_0: \beta_j = 0$ against $H_1: \beta_j > 0$, effectively ignoring $\beta_j < 0$, and that is the approach we take in this book.

How should we choose a rejection rule? We must first decide on a significance level ("level" for short), or the probability of rejecting $H_0$ when it is in fact true. For concreteness, suppose we have decided on a 5% significance level, as this is the most popular choice.
Thus, we are willing to mistakenly reject $H_0$ when it is true 5% of the time. Now, while $t_{\hat\beta_j}$ has a t distribution under $H_0$, so that it has zero mean, under the alternative $\beta_j > 0$, the expected value of $t_{\hat\beta_j}$ is positive. Thus, we are looking for a "sufficiently large" positive value of $t_{\hat\beta_j}$ in order to reject $H_0: \beta_j = 0$ in favor of $H_1: \beta_j > 0$. Negative values of $t_{\hat\beta_j}$ provide no evidence in favor of $H_1$.

The definition of "sufficiently large," with a 5% significance level, is the 95th percentile in a t distribution with $n - k - 1$ degrees of freedom; denote this by c. In other words, the rejection rule is that $H_0$ is rejected in favor of $H_1$ at the 5% significance level if
$$t_{\hat\beta_j} > c. \quad (4.7)$$
By our choice of the critical value c, rejection of $H_0$ will occur for 5% of all random samples when $H_0$ is true.

The rejection rule in (4.7) is an example of a one-tailed test. To obtain c, we only need the significance level and the degrees of freedom. For example, for a 5% level test and with $n - k - 1 = 28$ degrees of freedom, the critical value is c = 1.701. If $t_{\hat\beta_j} \le 1.701$, then we fail to reject $H_0$ in favor of (4.6) at the 5% level. Note that a negative value for $t_{\hat\beta_j}$, no matter how large in absolute value, leads to a failure in rejecting $H_0$ in favor of (4.6). (See Figure 4.2.)

[Figure 4.2 5% rejection rule for the alternative $H_1: \beta_j > 0$ with 28 df.]

The same procedure can be used with other significance levels. For a 10% level test and if df = 21, the critical value is c = 1.323. For a 1% significance level and if df = 21, c = 2.518. All of these critical values are obtained directly from Table G.2. You should note a pattern in the critical values: As the significance level falls, the critical value increases, so that we require a larger and larger value of $t_{\hat\beta_j}$ in order to reject $H_0$. Thus, if $H_0$ is rejected at, say, the 5% level, then it is automatically rejected at the 10% level as well. It makes no sense to reject the null hypothesis at, say, the 5% level and then to redo the test to determine the outcome at the 10% level.

As the degrees of freedom in the t distribution get large, the t distribution approaches the standard normal distribution. For example, when $n - k - 1 = 120$, the 5% critical value for the one-sided alternative (4.7) is 1.658, compared with the standard normal value of 1.645. These are close enough for practical purposes; for degrees of freedom greater than 120, one can use the standard normal critical values.
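In place of Table G.2, the same critical values can be pulled from any statistical library; this short sketch (an addition, not from the text) assumes scipy and reproduces the numbers quoted above.

```python
from scipy import stats

# One-sided critical values come from t-distribution percentiles.
print(round(stats.t.ppf(0.95, 28), 3))    # 1.701 (5% one-sided, 28 df)
print(round(stats.t.ppf(0.90, 21), 3))    # 1.323 (10% one-sided, 21 df)
print(round(stats.t.ppf(0.99, 21), 3))    # 2.518 (1% one-sided, 21 df)
print(round(stats.t.ppf(0.95, 120), 3))   # 1.658, close to the normal 1.645
```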
Example 4.1 (Hourly Wage Equation)
Using the data in WAGE1 gives the estimated equation
$$\widehat{\log(wage)} = .284 + .092\,educ + .0041\,exper + .022\,tenure$$
$$(.104)\quad (.007)\quad (.0017)\quad (.003)$$
$$n = 526,\ R^2 = .316,$$
where standard errors appear in parentheses below the estimated coefficients. We will follow this convention throughout the text. This equation can be used to test whether the return to exper, controlling for educ and tenure, is zero in the population, against the alternative that it is positive. Write this as $H_0: \beta_{exper} = 0$ versus $H_1: \beta_{exper} > 0$. In applications, indexing a parameter by its associated variable name is a nice way to label parameters, since the numerical indices that we use in the general model are arbitrary and can cause confusion. Remember that $\beta_{exper}$ denotes the unknown population parameter. It is nonsense to write "$H_0: .0041 = 0$" or "$H_0: \hat\beta_{exper} = 0$."

Since we have 522 degrees of freedom, we can use the standard normal critical values. The 5% critical value is 1.645, and the 1% critical value is 2.326. The t statistic for $\hat\beta_{exper}$ is
$$t_{exper} = .0041/.0017 \approx 2.41,$$
and so $\hat\beta_{exper}$, or exper, is statistically significant even at the 1% level. We also say that "$\hat\beta_{exper}$ is statistically greater than zero at the 1% significance level."

The estimated return for another year of experience, holding tenure and education fixed, is not especially large. For example, adding three more years increases log(wage) by 3(.0041) = .0123, so wage is only about 1.2% higher. Nevertheless, we have persuasively shown that the partial effect of experience is positive in the population.

The one-sided alternative that the parameter is less than zero,
$$H_1: \beta_j < 0, \quad (4.8)$$
also arises in applications. The rejection rule for alternative (4.8) is just the mirror image of the previous case. Now, the critical value comes from the left tail of the t distribution. In practice, it is easiest to think of the rejection rule as
$$t_{\hat\beta_j} < -c, \quad (4.9)$$
where c is the critical value for the alternative $H_1: \beta_j > 0$. For simplicity, we always assume c is positive, since this is how critical values are reported in t tables, and so the critical value $-c$ is a negative number.

For example, if the significance level is 5% and the degrees of freedom is 18, then c = 1.734, and so $H_0: \beta_j = 0$ is rejected in favor of $H_1: \beta_j < 0$ at the 5% level if $t_{\hat\beta_j} < -1.734$. It is important to remember that, to reject $H_0$ against the negative alternative (4.8), we must get a negative t statistic. A positive t ratio, no matter how large, provides no evidence in favor of (4.8). The rejection rule is illustrated in Figure 4.3.

[Exploring Further 4.2: Let community loan approval rates be determined by
$$apprate = \beta_0 + \beta_1 percmin + \beta_2 avginc + \beta_3 avgwlth + \beta_4 avgdebt + u,$$
where percmin is the percentage minority in the community, avginc is average income, avgwlth is average wealth, and avgdebt is some measure of average debt obligations. How do you state the null hypothesis that there is no difference in loan rates across neighborhoods due to racial and ethnic composition, when average income, average wealth, and average debt have been controlled for? How do you state the alternative that there is discrimination against minorities in loan approval rates?]

[Figure 4.3 5% rejection rule for the alternative $H_1: \beta_j < 0$ with 18 df.]
Example 4.2 Student Performance and School Size

There is much interest in the effect of school size on student performance. (See, for example, The New York Times Magazine, 5/28/95.) One claim is that, everything else being equal, students at smaller schools fare better than those at larger schools. This hypothesis is assumed to be true even after accounting for differences in class sizes across schools.

The file MEAP93 contains data on 408 high schools in Michigan for the year 1993. We can use these data to test the null hypothesis that school size has no effect on standardized test scores against the alternative that size has a negative effect. Performance is measured by the percentage of students receiving a passing score on the Michigan Educational Assessment Program (MEAP) standardized tenth-grade math test ($math10$). School size is measured by student enrollment ($enroll$). The null hypothesis is $H_0\colon \beta_{enroll} = 0$, and the alternative is $H_1\colon \beta_{enroll} < 0$. For now, we will control for two other factors: average annual teacher compensation ($totcomp$) and the number of staff per one thousand students ($staff$). Teacher compensation is a measure of teacher quality, and staff size is a rough measure of how much attention students receive.

The estimated equation, with standard errors in parentheses, is

$$\widehat{math10} = \underset{(6.113)}{2.274} + \underset{(.00010)}{.00046}\,totcomp + \underset{(.040)}{.048}\,staff - \underset{(.00022)}{.00020}\,enroll$$
$$n = 408,\quad R^2 = .0541.$$

The coefficient on $enroll$, $-.00020$, is in accordance with the conjecture that larger schools hamper performance: higher enrollment leads to a lower percentage of students with a passing tenth-grade math score. (The coefficients on $totcomp$ and $staff$ also have the signs we expect.) The fact that $enroll$ has an estimated coefficient different from zero could just be due to sampling error; to be convinced of an effect, we need to conduct a $t$ test.

Since $n - k - 1 = 408 - 4 = 404$, we use the standard normal critical value. At the 5% level, the critical value is $-1.65$; the $t$ statistic on $enroll$ must be less than $-1.65$ to reject $H_0$ at the 5% level.

The $t$ statistic on $enroll$ is $-.00020/.00022 \approx -.91$, which is larger than $-1.65$: we fail to reject $H_0$ in favor of $H_1$ at the 5% level. In fact, the 15% critical value is $-1.04$, and since $-.91 > -1.04$, we fail to reject $H_0$ even at the 15% level. We conclude that $enroll$ is not statistically significant at the 15% level.

The variable $totcomp$ is statistically significant even at the 1% significance level because its $t$ statistic is 4.6. On the other hand, the $t$ statistic for $staff$ is 1.2, and so we cannot reject $H_0\colon \beta_{staff} = 0$ against $H_1\colon \beta_{staff} > 0$ even at the 10% significance level. (The critical value is $c = 1.28$ from the standard normal distribution.)

To illustrate how changing functional form can affect our conclusions, we also estimate the model with all independent variables in logarithmic form. This allows, for example, the school size effect to diminish as school size increases. The estimated equation is

$$\widehat{math10} = \underset{(48.70)}{-207.66} + \underset{(4.06)}{21.16}\,\log(totcomp) + \underset{(4.19)}{3.98}\,\log(staff) - \underset{(0.69)}{1.29}\,\log(enroll)$$
$$n = 408,\quad R^2 = .0654.$$

The $t$ statistic on $\log(enroll)$ is about $-1.87$; since this is below the 5% critical value $-1.65$, we reject $H_0\colon \beta_{\log(enroll)} = 0$ in favor of $H_1\colon \beta_{\log(enroll)} < 0$ at the 5% level.

In Chapter 2, we encountered a model where the dependent variable appeared in its original form (called level form), while the independent variable appeared in log form (called a level-log model). The interpretation of the parameters is the same in the multiple regression context, except, of course, that we can give the parameters a ceteris paribus interpretation. Holding $totcomp$ and $staff$ fixed, we have $\Delta\widehat{math10} = -1.29[\Delta\log(enroll)]$, so that

$$\Delta\widehat{math10} \approx -(1.29/100)(\%\Delta enroll) \approx -.013(\%\Delta enroll).$$
Once again, we have used the fact that the change in $\log(enroll)$, when multiplied by 100, is approximately the percentage change in $enroll$. Thus, if enrollment is 10% higher at a school, $\widehat{math10}$ is predicted to be $.013(10) = .13$ percentage points lower ($math10$ is measured as a percentage).

Which model do we prefer: the one using the level of $enroll$ or the one using $\log(enroll)$? In the level-level model, enrollment does not have a statistically significant effect, but in the level-log model it does. This translates into a higher R-squared for the level-log model, which means we explain more of the variation in $math10$ by using enrollment in logarithmic form (6.5% to 5.4%). The level-log model is preferred because it more closely captures the relationship between $math10$ and $enroll$. We will say more about using R-squared to choose functional form in Chapter 6.

4.2b Two-Sided Alternatives

In applications, it is common to test the null hypothesis $H_0\colon \beta_j = 0$ against a two-sided alternative; that is,

$$H_1\colon \beta_j \ne 0. \qquad (4.10)$$

Under this alternative, $x_j$ has a ceteris paribus effect on $y$ without specifying whether the effect is positive or negative. This is the relevant alternative when the sign of $\beta_j$ is not well determined by theory (or common sense). Even when we know whether $\beta_j$ is positive or negative under the alternative, a two-sided test is often prudent. At a minimum, using a two-sided alternative prevents us from looking at the estimated equation and then basing the alternative on whether $\hat\beta_j$ is positive or negative. Using the regression estimates to help us formulate the null or alternative hypotheses is not allowed, because classical statistical inference presumes that we state the null and alternative about the population before looking at the data. For example, we should not first estimate the equation relating math performance to enrollment, note that the estimated effect is negative, and then decide the relevant alternative is $H_1\colon \beta_{enroll} < 0$.

When the alternative is two-sided, we are interested in the absolute value of the $t$ statistic. The rejection rule for $H_0\colon \beta_j = 0$ against (4.10) is

$$|t_{\hat\beta_j}| > c, \qquad (4.11)$$

where $|\cdot|$ denotes absolute value and $c$ is an appropriately chosen critical value. To find $c$, we again specify a significance level, say 5%. For a two-tailed test, $c$ is chosen to make the area in each tail of the $t$ distribution equal 2.5%. In other words, $c$ is the 97.5th percentile in the $t$ distribution with $n - k - 1$ degrees of freedom. When $n - k - 1 = 25$, the 5% critical value for a two-sided test is $c = 2.060$. Figure 4.4 provides an illustration of this distribution.

[Figure 4.4: 5% rejection rule for the alternative $H_1\colon \beta_j \ne 0$ with 25 df; rejection regions of area .025 lie to the left of $-2.06$ and to the right of 2.06.]
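As a quick numerical check of these two-tailed critical values, here is a short sketch (again assuming scipy; the 137 df case anticipates Example 4.3 below):

```python
# Two-tailed tests put alpha/2 in each tail, so c is the 97.5th percentile.
from scipy import stats

print(stats.t.ppf(0.975, df=25))   # ~2.060, as in Figure 4.4
print(stats.t.ppf(0.975, df=137))  # ~1.98, already close to the normal value
print(stats.norm.ppf(0.975))       # ~1.960
```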
When a specific alternative is not stated, it is usually considered to be two-sided. In the remainder of this text, the default will be a two-sided alternative, and 5% will be the default significance level. When carrying out empirical econometric analysis, it is always a good idea to be explicit about the alternative and the significance level. If $H_0$ is rejected in favor of (4.10) at the 5% level, we usually say that "$x_j$ is statistically significant, or statistically different from zero, at the 5% level." If $H_0$ is not rejected, we say that "$x_j$ is statistically insignificant at the 5% level."

Example 4.3 Determinants of College GPA

We use the data in GPA1 to estimate a model explaining college GPA ($colGPA$), with the average number of lectures missed per week ($skipped$) as an additional explanatory variable. The estimated model is

$$\widehat{colGPA} = \underset{(.33)}{1.39} + \underset{(.094)}{.412}\,hsGPA + \underset{(.011)}{.015}\,ACT - \underset{(.026)}{.083}\,skipped$$
$$n = 141,\quad R^2 = .234.$$

We can easily compute $t$ statistics to see which variables are statistically significant, using a two-sided alternative in each case. The 5% critical value is about 1.96, since the degrees of freedom ($141 - 4 = 137$) is large enough to use the standard normal approximation. The 1% critical value is about 2.58.

The $t$ statistic on $hsGPA$ is 4.38, which is significant at very small significance levels. Thus, we say that "$hsGPA$ is statistically significant at any conventional significance level." The $t$ statistic on $ACT$ is 1.36, which is not statistically significant at the 10% level against a two-sided alternative. The coefficient on $ACT$ is also practically small: a 10-point increase in $ACT$, which is large, is predicted to increase $colGPA$ by only .15 points. Thus, the variable $ACT$ is practically, as well as statistically, insignificant.

The coefficient on $skipped$ has a $t$ statistic of $-.083/.026 = -3.19$, so $skipped$ is statistically significant at the 1% significance level ($3.19 > 2.58$). This coefficient means that another lecture missed per week lowers predicted $colGPA$ by about .083. Thus, holding $hsGPA$ and $ACT$ fixed, the predicted difference in $colGPA$ between a student who misses no lectures per week and a student who misses five lectures per week is about .42. Remember that this says nothing about specific students; rather, .42 is the estimated average across a subpopulation of students.

In this example, for each variable in the model, we could argue that a one-sided alternative is appropriate. The variables $hsGPA$ and $skipped$ are very significant using a two-tailed test and have the signs that we expect, so there is no reason to do a one-tailed test. On the other hand, against a one-sided alternative ($\beta_3 > 0$), $ACT$ is significant at the 10% level but not at the 5% level. This does not change the fact that the coefficient on $ACT$ is pretty small.
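In practice, the $t$ statistics in an example like this come straight from regression software. The sketch below shows how the estimation could be run in Python with statsmodels; the CSV file name is an assumption (the GPA1 data ship with the text in several formats), and the column names follow the variable names used above.

```python
# Sketch: estimating the colGPA model and reading off t statistics.
import pandas as pd
import statsmodels.formula.api as smf

gpa = pd.read_csv("GPA1.csv")  # hypothetical CSV export of the GPA1 data
res = smf.ols("colGPA ~ hsGPA + ACT + skipped", data=gpa).fit()

# Each entry of res.tvalues tests H0: beta_j = 0; res.pvalues holds the
# two-sided p-values that regression packages report by default.
print(res.tvalues)   # hsGPA ~ 4.38, ACT ~ 1.36, skipped ~ -3.19
print(res.pvalues)
```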
4.2c Testing Other Hypotheses about $\beta_j$

Although $H_0\colon \beta_j = 0$ is the most common hypothesis, we sometimes want to test whether $\beta_j$ is equal to some other given constant. Two common examples are $\beta_j = 1$ and $\beta_j = -1$. Generally, if the null is stated as

$$H_0\colon \beta_j = a_j, \qquad (4.12)$$

where $a_j$ is our hypothesized value of $\beta_j$, then the appropriate $t$ statistic is

$$t = (\hat\beta_j - a_j)/\mathrm{se}(\hat\beta_j).$$

As before, $t$ measures how many estimated standard deviations $\hat\beta_j$ is away from the hypothesized value of $\beta_j$. The general $t$ statistic is usefully written as

$$t = \frac{\text{estimate} - \text{hypothesized value}}{\text{standard error}}. \qquad (4.13)$$

Under (4.12), this $t$ statistic is distributed as $t_{n-k-1}$ from Theorem 4.2. The usual $t$ statistic is obtained when $a_j = 0$.

We can use the general $t$ statistic to test against one-sided or two-sided alternatives. For example, if the null and alternative hypotheses are $H_0\colon \beta_j = 1$ and $H_1\colon \beta_j > 1$, then we find the critical value for a one-sided alternative exactly as before: the difference is in how we compute the $t$ statistic, not in how we obtain the appropriate $c$. We reject $H_0$ in favor of $H_1$ if $t > c$. In this case, we would say that "$\hat\beta_j$ is statistically greater than one" at the appropriate significance level.

Example 4.4 Campus Crime and Enrollment

Consider a simple model relating the annual number of crimes on college campuses ($crime$) to student enrollment ($enroll$):

$$\log(crime) = \beta_0 + \beta_1\log(enroll) + u.$$

This is a constant elasticity model, where $\beta_1$ is the elasticity of crime with respect to enrollment. It is not much use to test $H_0\colon \beta_1 = 0$, as we expect the total number of crimes to increase as the size of the campus increases. A more interesting hypothesis to test would be that the elasticity of crime with respect to enrollment is one: $H_0\colon \beta_1 = 1$. This means that a 1% increase in enrollment leads to, on average, a 1% increase in crime. A noteworthy alternative is $H_1\colon \beta_1 > 1$, which implies that a 1% increase in enrollment increases campus crime by more than 1%. If $\beta_1 > 1$, then, in a relative sense (not just an absolute sense), crime is more of a problem on larger campuses. One way to see this is to take the exponential of the equation:

$$crime = \exp(\beta_0)\,enroll^{\beta_1}\exp(u).$$

(See Appendix A for properties of the natural logarithm and exponential functions.) For $\beta_0 = 0$ and $u = 0$, this equation is graphed in Figure 4.5 for $\beta_1 < 1$, $\beta_1 = 1$, and $\beta_1 > 1$.

We test $\beta_1 = 1$ against $\beta_1 > 1$ using data on 97 colleges and universities in the United States for the year 1992, contained in the data file CAMPUS. The data come from the FBI's Uniform Crime Reports; the average number of campus crimes in the sample is about 394, while the average enrollment is about 16,076. The estimated equation, with estimates and standard errors rounded to two decimal places, is

$$\widehat{\log(crime)} = \underset{(1.03)}{-6.63} + \underset{(0.11)}{1.27}\,\log(enroll) \qquad (4.14)$$
$$n = 97,\quad R^2 = .585.$$

The estimated elasticity of crime with respect to $enroll$, 1.27, is in the direction of the alternative $\beta_1 > 1$. But is there enough evidence to conclude that $\beta_1 > 1$? We need to be careful in testing this hypothesis, especially because the statistical output of standard regression packages is much more complex than the simplified output reported in equation (4.14). Our first instinct might be to construct "the" $t$ statistic by taking the coefficient on $\log(enroll)$ and dividing it by its standard error, which is the $t$ statistic reported by a regression package. But this is the wrong statistic for testing $H_0\colon \beta_1 = 1$.
The correct $t$ statistic is obtained from (4.13): we subtract the hypothesized value, unity, from the estimate and divide the result by the standard error of $\hat\beta_1$:

$$t = (1.27 - 1)/.11 = .27/.11 \approx 2.45.$$

The one-sided 5% critical value for a $t$ distribution with $97 - 2 = 95$ df is about 1.66 (using df = 120), so we clearly reject $\beta_1 = 1$ in favor of $\beta_1 > 1$ at the 5% level. In fact, the 1% critical value is about 2.37, and so we reject the null in favor of the alternative at even the 1% level.

We should keep in mind that this analysis holds no other factors constant, so the elasticity of 1.27 is not necessarily a good estimate of the ceteris paribus effect. It could be that larger enrollments are correlated with other factors that cause higher crime: larger schools might be located in higher crime areas. We could control for this by collecting data on crime rates in the local city.

[Figure 4.5: Graph of $crime = enroll^{\beta_1}$ for $\beta_1 > 1$, $\beta_1 = 1$, and $\beta_1 < 1$.]

For a two-sided alternative, for example $H_0\colon \beta_j = -1$, $H_1\colon \beta_j \ne -1$, we still compute the $t$ statistic as in (4.13): $t = (\hat\beta_j + 1)/\mathrm{se}(\hat\beta_j)$ (notice how subtracting $-1$ means adding 1). The rejection rule is the usual one for a two-sided test: reject $H_0$ if $|t| > c$, where $c$ is a two-tailed critical value. If $H_0$ is rejected, we say that "$\hat\beta_j$ is statistically different from negative one" at the appropriate significance level.

Example 4.5 Housing Prices and Air Pollution

For a sample of 506 communities in the Boston area, we estimate a model relating median housing price ($price$) in the community to various community characteristics: $nox$ is the amount of nitrogen oxide in the air, in parts per million; $dist$ is a weighted distance of the community from five employment centers, in miles; $rooms$ is the average number of rooms in houses in the community; and $stratio$ is the average student-teacher ratio of schools in the community. The population model is

$$\log(price) = \beta_0 + \beta_1\log(nox) + \beta_2\log(dist) + \beta_3 rooms + \beta_4 stratio + u.$$

Thus, $\beta_1$ is the elasticity of $price$ with respect to $nox$. We wish to test $H_0\colon \beta_1 = -1$ against the alternative $H_1\colon \beta_1 \ne -1$. The $t$ statistic for doing this test is $t = (\hat\beta_1 + 1)/\mathrm{se}(\hat\beta_1)$.

Using the data in HPRICE2, the estimated model is

$$\widehat{\log(price)} = \underset{(0.32)}{11.08} - \underset{(.117)}{.954}\,\log(nox) - \underset{(.043)}{.134}\,\log(dist) + \underset{(.019)}{.255}\,rooms - \underset{(.006)}{.052}\,stratio$$
$$n = 506,\quad R^2 = .581.$$

The slope estimates all have the anticipated signs. Each coefficient is statistically different from zero at very small significance levels, including the coefficient on $\log(nox)$. But we do not want to test that $\beta_1 = 0$. The null hypothesis of interest is $H_0\colon \beta_1 = -1$, with corresponding $t$ statistic $(-.954 + 1)/.117 = .393$. There is little need to look in the $t$ table for a critical value when the $t$ statistic is this small: the estimated elasticity is not statistically different from $-1$ even at very large significance levels. Controlling for the factors we have included, there is little evidence that the elasticity is different from $-1$.
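Both tests above are instances of the general $t$ statistic (4.13). A short sketch with the numbers from equation (4.14) and the HPRICE2 regression:

```python
# Sketch: the general t statistic (estimate - hypothesized)/se, eq. (4.13).
from scipy import stats

# Example 4.4: H0: beta1 = 1 against H1: beta1 > 1.
t_crime = (1.27 - 1) / 0.11
print(t_crime)                    # ~2.45
print(stats.t.ppf(0.95, df=95))   # one-sided 5% critical value, ~1.66

# Example 4.5: H0: beta1 = -1 against H1: beta1 != -1.
t_nox = (-0.954 + 1) / 0.117
print(t_nox)                      # ~0.39: far below any two-tailed critical value
```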
4.2d Computing p-Values for t Tests

So far, we have talked about how to test hypotheses using a classical approach: after stating the alternative hypothesis, we choose a significance level, which then determines a critical value. Once the critical value has been identified, the value of the $t$ statistic is compared with the critical value, and the null is either rejected or not rejected at the given significance level.

Even after deciding on the appropriate alternative, there is a component of arbitrariness to the classical approach, which results from having to choose a significance level ahead of time. Different researchers prefer different significance levels, depending on the particular application. There is no "correct" significance level.

Committing to a significance level ahead of time can hide useful information about the outcome of a hypothesis test. For example, suppose that we wish to test the null hypothesis that a parameter is zero against a two-sided alternative, and with 40 degrees of freedom we obtain a $t$ statistic equal to 1.85. The null hypothesis is not rejected at the 5% level, since the $t$ statistic is less than the two-tailed critical value of $c = 2.021$. A researcher whose agenda is not to reject the null could simply report this outcome along with the estimate: "the null hypothesis is not rejected at the 5% level." Of course, if the $t$ statistic, or the coefficient and its standard error, are reported, then we can also determine that the null hypothesis would be rejected at the 10% level, since the 10% critical value is $c = 1.684$.

Rather than testing at different significance levels, it is more informative to answer the following question: given the observed value of the $t$ statistic, what is the smallest significance level at which the null hypothesis would be rejected? This level is known as the p-value for the test (see Appendix C). In the previous example, we know the p-value is greater than .05, since the null is not rejected at the 5% level, and we know that the p-value is less than .10, since the null is rejected at the 10% level. We obtain the actual p-value by computing the probability that a $t$ random variable, with 40 df, is larger than 1.85 in absolute value. That is, the p-value is the significance level of the test when we use the value of the test statistic, 1.85 in the above example, as the critical value for the test. This p-value is shown in Figure 4.6.

Because a p-value is a probability, its value is always between zero and one. In order to compute p-values, we either need extremely detailed printed tables of the $t$ distribution (which is not very practical) or a computer program that computes areas under the probability density function of the $t$ distribution. Most modern regression packages have this capability. Some packages compute p-values routinely with each OLS regression, but only for certain hypotheses. If a regression package reports a p-value along with the standard OLS output, it is almost certainly the p-value for testing the null hypothesis $H_0\colon \beta_j = 0$ against the two-sided alternative. The p-value in this case is

$$P(|T| > |t|), \qquad (4.15)$$

where, for clarity, we let $T$ denote a $t$ distributed random variable with $n - k - 1$ degrees of freedom and let $t$ denote the numerical value of the test statistic.
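Obtaining such a p-value takes one line in most environments; a minimal sketch with scipy:

```python
# Sketch: the two-sided p-value (4.15) for t = 1.85 with 40 df.
from scipy import stats

t, df = 1.85, 40
print(stats.t.sf(t, df))        # one tail area, ~0.0359
print(2 * stats.t.sf(t, df))    # p-value = P(|T| > 1.85) ~ 0.0718
```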
The p-value nicely summarizes the strength or weakness of the empirical evidence against the null hypothesis. Perhaps its most useful interpretation is the following: the p-value is the probability of observing a $t$ statistic as extreme as we did if the null hypothesis is true. This means that small p-values are evidence against the null; large p-values provide little evidence against $H_0$. For example, if the p-value = .50 (reported always as a decimal, not a percentage), then we would observe a value of the $t$ statistic as extreme as we did in 50% of all random samples when the null hypothesis is true; this is pretty weak evidence against $H_0$.

In the example with df = 40 and $t = 1.85$, the p-value is computed as

$$\text{p-value} = P(|T| > 1.85) = 2P(T > 1.85) = 2(.0359) = .0718,$$

where $P(T > 1.85)$ is the area to the right of 1.85 in a $t$ distribution with 40 df. (This value was computed using the econometrics package Stata; it is not available in Table G.2.) This means that, if the null hypothesis is true, we would observe an absolute value of the $t$ statistic as large as 1.85 about 7.2 percent of the time. This provides some evidence against the null hypothesis, but we would not reject the null at the 5% significance level.

[Figure 4.6: Obtaining the p-value against a two-sided alternative when $t = 1.85$ and df = 40; the area in each tail beyond $\pm 1.85$ is .0359, and the central area is .9282.]

The previous example illustrates that once the p-value has been computed, a classical test can be carried out at any desired level. If $\alpha$ denotes the significance level of the test (in decimal form), then $H_0$ is rejected if p-value $< \alpha$; otherwise, $H_0$ is not rejected at the $100\,\alpha$% level.

Computing p-values for one-sided alternatives is also quite simple. Suppose, for example, that we test $H_0\colon \beta_j = 0$ against $H_1\colon \beta_j > 0$. If $\hat\beta_j < 0$, then computing a p-value is not important: we know that the p-value is greater than .50, which will never cause us to reject $H_0$ in favor of $H_1$. If $\hat\beta_j > 0$, then $t > 0$ and the p-value is just the probability that a $t$ random variable with the appropriate df exceeds the value $t$. Some regression packages only compute p-values for two-sided alternatives. But it is simple to obtain the one-sided p-value: just divide the two-sided p-value by 2.

If the alternative is $H_1\colon \beta_j < 0$, it makes sense to compute a p-value if $\hat\beta_j < 0$ (and hence $t < 0$): p-value $= P(T < t) = P(T > |t|)$, because the $t$ distribution is symmetric about zero. Again, this can be obtained as one-half of the p-value for the two-tailed test.

Because you will quickly become familiar with the magnitudes of $t$ statistics that lead to statistical significance, especially for large sample sizes, it is not always crucial to report p-values for $t$ statistics. But it does not hurt to report them. Further, when we discuss F testing in Section 4-5, we will see that it is important to compute p-values, because critical values for F tests are not so easily memorized.
4.2e A Reminder on the Language of Classical Hypothesis Testing

When $H_0$ is not rejected, we prefer to use the language "we fail to reject $H_0$ at the $x$% level," rather than "$H_0$ is accepted at the $x$% level." We can use Example 4.5 to illustrate why the former statement is preferred. In this example, the estimated elasticity of $price$ with respect to $nox$ is $-.954$, and the $t$ statistic for testing $H_0\colon \beta_{nox} = -1$ is $t = .393$; therefore, we cannot reject $H_0$. But there are many other values for $\beta_{nox}$ (more than we can count) that cannot be rejected. For example, the $t$ statistic for $H_0\colon \beta_{nox} = -.9$ is $(-.954 + .9)/.117 = -.462$, and so this null is not rejected either. Clearly $\beta_{nox} = -1$ and $\beta_{nox} = -.9$ cannot both be true, so it makes no sense to say that we "accept" either of these hypotheses. All we can say is that the data do not allow us to reject either of these hypotheses at the 5% significance level.

4.2f Economic, or Practical, versus Statistical Significance

Because we have emphasized statistical significance throughout this section, now is a good time to remember that we should pay attention to the magnitude of the coefficient estimates in addition to the size of the $t$ statistics. The statistical significance of a variable $x_j$ is determined entirely by the size of $t_{\hat\beta_j}$, whereas the economic significance, or practical significance, of a variable is related to the size (and sign) of $\hat\beta_j$.

Recall that the $t$ statistic for testing $H_0\colon \beta_j = 0$ is defined by dividing the estimate by its standard error: $t_{\hat\beta_j} = \hat\beta_j/\mathrm{se}(\hat\beta_j)$. Thus, $t_{\hat\beta_j}$ can indicate statistical significance either because $\hat\beta_j$ is "large" or because $\mathrm{se}(\hat\beta_j)$ is "small." It is important in practice to distinguish between these reasons for statistically significant $t$ statistics. Too much focus on statistical significance can lead to the false conclusion that a variable is "important" for explaining $y$, even though its estimated effect is modest.

Exploring Further 4.3
Suppose you estimate a regression model and obtain $\hat\beta_1 = .56$ and p-value = .086 for testing $H_0\colon \beta_1 = 0$ against $H_1\colon \beta_1 \ne 0$. What is the p-value for testing $H_0\colon \beta_1 = 0$ against $H_1\colon \beta_1 > 0$?

Example 4.6 Participation Rates in 401(k) Plans

In Example 3.3, we used the data on 401(k) plans to estimate a model describing participation rates in terms of the firm's match rate and the age of the plan. We now include a measure of firm size, the total number of firm employees ($totemp$). The estimated equation is

$$\widehat{prate} = \underset{(0.78)}{80.29} + \underset{(0.52)}{5.44}\,mrate + \underset{(.045)}{.269}\,age - \underset{(.00004)}{.00013}\,totemp$$
$$n = 1{,}534,\quad R^2 = .100.$$

The smallest $t$ statistic in absolute value is that on the variable $totemp$: $t = -.00013/.00004 = -3.25$, and this is statistically significant at very small significance levels. (The two-tailed p-value for this $t$ statistic is about .001.) Thus, all of the variables are statistically significant at rather small significance levels.

How big, in a practical sense, is the coefficient on $totemp$? Holding $mrate$ and $age$ fixed, if a firm grows by 10,000 employees, the participation rate falls by $10{,}000(.00013) = 1.3$ percentage points. This is a huge increase in the number of employees with only a modest effect on the participation rate. Thus, although firm size does affect the participation rate, the effect is not practically very large.
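The distinction is easy to see numerically with the $totemp$ numbers from Example 4.6 (a sketch; the df is $n - k - 1 = 1{,}530$):

```python
# Sketch: statistically significant but practically small (Example 4.6).
from scipy import stats

b, se = -0.00013, 0.00004
t = b / se
print(t)                                  # ~-3.25
print(2 * stats.t.sf(abs(t), df=1530))    # two-sided p-value ~ .001

# Practical size: effect of 10,000 more employees on prate,
# in percentage points.
print(10000 * b)                          # -1.3
```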
The previous example shows that it is especially important to interpret the magnitude of the coefficient, in addition to looking at $t$ statistics, when working with large samples. With large sample sizes, parameters can be estimated very precisely: standard errors are often quite small relative to the coefficient estimates, which usually results in statistical significance.

Some researchers insist on using smaller significance levels as the sample size increases, partly as a way to offset the fact that standard errors are getting smaller. For example, if we feel comfortable with a 5% level when $n$ is a few hundred, we might use the 1% level when $n$ is a few thousand. Using a smaller significance level means that economic and statistical significance are more likely to coincide, but there are no guarantees: in the previous example, even if we use a significance level as small as .1% (one-tenth of 1%), we would still conclude that $totemp$ is statistically significant.

Many researchers are also willing to entertain larger significance levels in applications with small sample sizes, reflecting the fact that it is harder to find significance with smaller sample sizes. (Smaller sample sizes lead to less precise estimators, and the critical values are larger in magnitude, two factors that make it harder to find statistical significance.) Unfortunately, one's willingness to consider higher significance levels can depend on one's underlying agenda.

Example 4.7 Effect of Job Training on Firm Scrap Rates

The scrap rate for a manufacturing firm is the number of defective items, out of every 100 produced, that must be discarded. Thus, for a given number of items produced, a decrease in the scrap rate reflects higher worker productivity. We can use the scrap rate to measure the effect of worker training on productivity. Using the data in JTRAIN, but only for the year 1987 and for nonunionized firms, we obtain the following estimated equation:

$$\widehat{\log(scrap)} = \underset{(5.69)}{12.46} - \underset{(.023)}{.029}\,hrsemp - \underset{(.453)}{.962}\,\log(sales) + \underset{(.407)}{.761}\,\log(employ)$$
$$n = 29,\quad R^2 = .262.$$

The variable $hrsemp$ is annual hours of training per employee, $sales$ is annual firm sales (in dollars), and $employ$ is the number of firm employees. For 1987, the average scrap rate in the sample is about 4.6 and the average of $hrsemp$ is about 8.9.

The main variable of interest is $hrsemp$. One more hour of training per employee lowers $\log(scrap)$ by .029, which means the scrap rate is about 2.9% lower. Thus, if $hrsemp$ increases by 5 (each employee is trained 5 more hours per year), the scrap rate is estimated to fall by $5(2.9) = 14.5$%. This seems like a reasonably large effect, but whether the additional training is worthwhile to the firm depends on the cost of training and the benefits from a lower scrap rate. We do not have the numbers needed to do a cost-benefit analysis, but the estimated effect seems nontrivial.

What about the statistical significance of the training variable? The $t$ statistic on $hrsemp$ is $-.029/.023 = -1.26$, and now you probably recognize this as not being large enough in magnitude to conclude that $hrsemp$ is statistically significant at the 5% level. In fact, with $29 - 4 = 25$ degrees of freedom, for the one-sided alternative $H_1\colon \beta_{hrsemp} < 0$, the 5% critical value is about $-1.71$.
Thus, using a strict 5% level test, we must conclude that $hrsemp$ is not statistically significant, even using a one-sided alternative.

Because the sample size is pretty small, we might be more liberal with the significance level. The 10% critical value is $-1.32$, and so $hrsemp$ is almost significant against the one-sided alternative at the 10% level. The p-value is easily computed as $P(T_{25} < -1.26) = .110$. This may be a low enough p-value to conclude that the estimated effect of training is not just due to sampling error, but opinions would legitimately differ on whether a one-sided p-value of .11 is sufficiently small.

Remember that large standard errors can also be a result of multicollinearity (high correlation among some of the independent variables), even if the sample size seems fairly large. As we discussed in Section 3-4, there is not much we can do about this problem other than to collect more data or change the scope of the analysis by dropping or combining certain independent variables. As in the case of a small sample size, it can be hard to precisely estimate partial effects when some of the explanatory variables are highly correlated. (Section 4-5 contains an example.)

We end this section with some guidelines for discussing the economic and statistical significance of a variable in a multiple regression model:

1. Check for statistical significance. If the variable is statistically significant, discuss the magnitude of the coefficient to get an idea of its practical or economic importance. This latter step can require some care, depending on how the independent and dependent variables appear in the equation. (In particular, what are the units of measurement? Do the variables appear in logarithmic form?)

2. If a variable is not statistically significant at the usual levels (10%, 5%, or 1%), you might still ask if the variable has the expected effect on $y$ and whether that effect is practically large. If it is large, you should compute a p-value for the $t$ statistic. For small sample sizes, you can sometimes make a case for p-values as large as .20 (but there are no hard rules). With large p-values, that is, small $t$ statistics, we are treading on thin ice because the practically large estimates may be due to sampling error: a different random sample could result in a very different estimate. (A sketch of this p-value computation, using Example 4.7, follows this list.)

3. It is common to find variables with small $t$ statistics that have the "wrong" sign. For practical purposes, these can be ignored: we conclude that the variables are statistically insignificant. A significant variable that has the unexpected sign and a practically large effect is much more troubling and difficult to resolve. One must usually think more about the model and the nature of the data to solve such problems. Often, a counterintuitive, significant estimate results from the omission of a key variable or from one of the important problems we will discuss in Chapters 9 and 15.
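As promised in guideline 2, here is a minimal sketch of the one-sided p-value from Example 4.7, again assuming scipy:

```python
# Sketch: one-sided p-value P(T_25 < -1.26) and the critical values
# quoted in Example 4.7.
from scipy import stats

print(stats.t.cdf(-1.26, df=25))    # ~0.110
print(-stats.t.ppf(0.95, df=25))    # 5% one-sided critical value, ~-1.71
print(-stats.t.ppf(0.90, df=25))    # 10% one-sided critical value, ~-1.32
```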
4.3 Confidence Intervals

Under the CLM assumptions, we can easily construct a confidence interval (CI) for the population parameter $\beta_j$. Confidence intervals are also called interval estimates because they provide a range of likely values for the population parameter, and not just a point estimate.

Using the fact that $(\hat\beta_j - \beta_j)/\mathrm{se}(\hat\beta_j)$ has a $t$ distribution with $n - k - 1$ degrees of freedom [see (4.3)], simple manipulation leads to a CI for the unknown $\beta_j$: a 95% confidence interval is given by

$$\hat\beta_j \pm c \cdot \mathrm{se}(\hat\beta_j), \qquad (4.16)$$

where the constant $c$ is the 97.5th percentile in a $t_{n-k-1}$ distribution. More precisely, the lower and upper bounds of the confidence interval are given by $\underline{\beta}_j \equiv \hat\beta_j - c\cdot\mathrm{se}(\hat\beta_j)$ and $\bar\beta_j \equiv \hat\beta_j + c\cdot\mathrm{se}(\hat\beta_j)$, respectively.

At this point, it is useful to review the meaning of a confidence interval. If random samples were obtained over and over again, with $\underline{\beta}_j$ and $\bar\beta_j$ computed each time, then the (unknown) population value $\beta_j$ would lie in the interval $(\underline{\beta}_j, \bar\beta_j)$ for 95% of the samples. Unfortunately, for the single sample that we use to construct the CI, we do not know whether $\beta_j$ is actually contained in the interval. We hope we have obtained a sample that is one of the 95% of all samples where the interval estimate contains $\beta_j$, but we have no guarantee.

Constructing a confidence interval is very simple when using current computing technology. Three quantities are needed: $\hat\beta_j$, $\mathrm{se}(\hat\beta_j)$, and $c$. The coefficient estimate and its standard error are reported by any regression package. To obtain the value $c$, we must know the degrees of freedom, $n - k - 1$, and the level of confidence, 95% in this case. Then, the value for $c$ is obtained from the $t_{n-k-1}$ distribution. As an example, for df $= n - k - 1 = 25$, a 95% confidence interval for any $\beta_j$ is given by $[\hat\beta_j - 2.06\cdot\mathrm{se}(\hat\beta_j),\ \hat\beta_j + 2.06\cdot\mathrm{se}(\hat\beta_j)]$.

When $n - k - 1 > 120$, the $t_{n-k-1}$ distribution is close enough to normal to use the 97.5th percentile in a standard normal distribution for constructing a 95% CI: $\hat\beta_j \pm 1.96\cdot\mathrm{se}(\hat\beta_j)$. In fact, when $n - k - 1 > 50$, the value of $c$ is so close to 2 that we can use a simple rule of thumb for a 95% confidence interval: $\hat\beta_j$ plus or minus two of its standard errors. For small degrees of freedom, the exact percentiles should be obtained from the $t$ tables.

It is easy to construct confidence intervals for any other level of confidence. For example, a 90% CI is obtained by choosing $c$ to be the 95th percentile in the $t_{n-k-1}$ distribution. When df $= n - k - 1 = 25$, $c = 1.71$, and so the 90% CI is $\hat\beta_j \pm 1.71\cdot\mathrm{se}(\hat\beta_j)$, which is necessarily narrower than the 95% CI. For a 99% CI, $c$ is the 99.5th percentile in the $t_{25}$ distribution. When df = 25, the 99% CI is roughly $\hat\beta_j \pm 2.79\cdot\mathrm{se}(\hat\beta_j)$, which is inevitably wider than the 95% CI.

Many modern regression packages save us from doing any calculations by reporting a 95% CI along with each coefficient and its standard error. Once a confidence interval is constructed, it is easy to carry out two-tailed hypothesis tests. If the null hypothesis is $H_0\colon \beta_j = a_j$, then $H_0$ is rejected against $H_1\colon \beta_j \ne a_j$ at, say, the 5% significance level if, and only if, $a_j$ is not in the 95% confidence interval.

Example 4.8 Model of R&D Expenditures

Economists studying industrial organization are interested in the relationship between firm size, often measured by annual sales, and spending on research and development (R&D). Typically, a constant elasticity model is used. One might also be interested in the ceteris paribus effect of the profit margin, that is, profits as a percentage of sales, on R&D spending. Using the data in RDCHEM on 32 U.S. firms in the chemical industry, we estimate the following equation (with standard errors in parentheses below the coefficients):

$$\widehat{\log(rd)} = \underset{(.47)}{-4.38} + \underset{(.060)}{1.084}\,\log(sales) + \underset{(.0128)}{.0217}\,profmarg$$
$$n = 32,\quad R^2 = .918.$$
The estimated elasticity of R&D spending with respect to firm sales is 1.084, so that, holding profit margin fixed, a 1% increase in sales is associated with a 1.084% increase in R&D spending. (Incidentally, R&D and sales are both measured in millions of dollars, but their units of measurement have no effect on the elasticity estimate.) We can construct a 95% confidence interval for the sales elasticity once we note that the estimated model has $n - k - 1 = 32 - 2 - 1 = 29$ degrees of freedom. From Table G.2, we find the 97.5th percentile in a $t_{29}$ distribution: $c = 2.045$. Thus, the 95% confidence interval for $\beta_{\log(sales)}$ is $1.084 \pm .060(2.045)$, or about $(.961,\ 1.21)$. That zero is well outside this interval is hardly surprising: we expect R&D spending to increase with firm size. More interesting is that unity is included in the 95% confidence interval for $\beta_{\log(sales)}$, which means that we cannot reject $H_0\colon \beta_{\log(sales)} = 1$ against $H_1\colon \beta_{\log(sales)} \ne 1$ at the 5% significance level. In other words, the estimated R&D-sales elasticity is not statistically different from 1 at the 5% level. (The estimate is not practically different from 1, either.)

The estimated coefficient on $profmarg$ is also positive, and the 95% confidence interval for the population parameter, $\beta_{profmarg}$, is $.0217 \pm .0128(2.045)$, or about $(-.0045,\ .0479)$. In this case, zero is included in the 95% confidence interval, so we fail to reject $H_0\colon \beta_{profmarg} = 0$ against $H_1\colon \beta_{profmarg} \ne 0$ at the 5% level. Nevertheless, the $t$ statistic is about 1.70, which gives a two-sided p-value of about .10, and so we would conclude that $profmarg$ is statistically significant at the 10% level against the two-sided alternative, or at the 5% level against the one-sided alternative $H_1\colon \beta_{profmarg} > 0$. Plus, the economic size of the profit margin coefficient is not trivial: holding $sales$ fixed, a one percentage point increase in $profmarg$ is estimated to increase R&D spending by $100(.0217) \approx 2.2$%. A complete analysis of this example goes beyond simply stating whether a particular value, zero in this case, is or is not in the 95% confidence interval.

You should remember that a confidence interval is only as good as the underlying assumptions used to construct it. If we have omitted important factors that are correlated with the explanatory variables, then the coefficient estimates are not reliable: OLS is biased. If heteroskedasticity is present (for instance, in the previous example, if the variance of $\log(rd)$ depends on any of the explanatory variables), then the standard error is not valid as an estimate of $\mathrm{sd}(\hat\beta_j)$ (as we discussed in Section 3-4), and the confidence interval computed using these standard errors will not truly be a 95% CI. We have also used the normality assumption on the errors in obtaining these CIs, but, as we will see in Chapter 5, this is not as important for applications involving hundreds of observations.
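The interval arithmetic in Example 4.8 is a direct application of (4.16); a short sketch:

```python
# Sketch: 95% CIs for the RDCHEM coefficients in Example 4.8.
from scipy import stats

c = stats.t.ppf(0.975, df=29)   # ~2.045

b_sales, se_sales = 1.084, 0.060
print(b_sales - c * se_sales, b_sales + c * se_sales)   # ~(0.961, 1.207): unity inside

b_pm, se_pm = 0.0217, 0.0128
print(b_pm - c * se_pm, b_pm + c * se_pm)               # ~(-0.0045, 0.0479): zero inside
```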
4.4 Testing Hypotheses about a Single Linear Combination of the Parameters

The previous two sections have shown how to use classical hypothesis testing or confidence intervals to test hypotheses about a single $\beta_j$ at a time. In applications, we must often test hypotheses involving more than one of the population parameters. In this section, we show how to test a single hypothesis involving more than one of the $\beta_j$. Section 4-5 shows how to test multiple hypotheses.

To illustrate the general approach, we will consider a simple model to compare the returns to education at junior colleges and four-year colleges; for simplicity, we refer to the latter as "universities." [Kane and Rouse (1995) provide a detailed analysis of the returns to two- and four-year colleges.] The population includes working people with a high school degree, and the model is

$$\log(wage) = \beta_0 + \beta_1 jc + \beta_2 univ + \beta_3 exper + u, \qquad (4.17)$$

where $jc$ is the number of years attending a two-year college, $univ$ is the number of years at a four-year college, and $exper$ is months in the workforce. Note that any combination of junior college and four-year college is allowed, including $jc = 0$ and $univ = 0$.

The hypothesis of interest is whether one year at a junior college is worth one year at a university; this is stated as

$$H_0\colon \beta_1 = \beta_2. \qquad (4.18)$$

Under $H_0$, another year at a junior college and another year at a university lead to the same ceteris paribus percentage increase in wage. For the most part, the alternative of interest is one-sided: a year at a junior college is worth less than a year at a university. This is stated as

$$H_1\colon \beta_1 < \beta_2. \qquad (4.19)$$

The hypotheses in (4.18) and (4.19) concern two parameters, $\beta_1$ and $\beta_2$, a situation we have not faced yet. We cannot simply use the individual $t$ statistics for $\hat\beta_1$ and $\hat\beta_2$ to test $H_0$. However, conceptually, there is no difficulty in constructing a $t$ statistic for testing (4.18). To do so, we rewrite the null and alternative as $H_0\colon \beta_1 - \beta_2 = 0$ and $H_1\colon \beta_1 - \beta_2 < 0$, respectively. The $t$ statistic is based on whether the estimated difference $\hat\beta_1 - \hat\beta_2$ is sufficiently less than zero to warrant rejecting (4.18) in favor of (4.19). To account for the sampling error in our estimators, we standardize this difference by dividing by the standard error:

$$t = \frac{\hat\beta_1 - \hat\beta_2}{\mathrm{se}(\hat\beta_1 - \hat\beta_2)}. \qquad (4.20)$$

Once we have the $t$ statistic in (4.20), testing proceeds as before. We choose a significance level for the test and, based on the df, obtain a critical value. Because the alternative is of the form in (4.19), the rejection rule is of the form $t < -c$, where $c$ is a positive value chosen from the appropriate $t$ distribution. Or we compute the $t$ statistic and then compute the p-value (see Section 4-2).

The only thing that makes testing the equality of two different parameters more difficult than testing about a single $\beta_j$ is obtaining the standard error in the denominator of (4.20). Obtaining the numerator is trivial once we have performed the OLS regression. Using the data in TWOYEAR, which comes from Kane and Rouse (1995), we estimate equation (4.17):

$$\widehat{\log(wage)} = \underset{(.021)}{1.472} + \underset{(.0068)}{.0667}\,jc + \underset{(.0023)}{.0769}\,univ + \underset{(.0002)}{.0049}\,exper \qquad (4.21)$$
$$n = 6{,}763,\quad R^2 = .222.$$

It is clear from (4.21) that $jc$ and $univ$ have both economically and statistically significant effects on wage. This is certainly of interest, but we are more concerned about testing whether the estimated difference in the coefficients is statistically significant. The difference is estimated as $\hat\beta_1 - \hat\beta_2 = -.0102$, so the return to a year at a junior college is about one percentage point less than the return to a year at a university.
Economically, this is not a trivial difference. The difference of $-.0102$ is the numerator of the $t$ statistic in (4.20).

Unfortunately, the regression results in equation (4.21) do not contain enough information to obtain the standard error of $\hat\beta_1 - \hat\beta_2$. It might be tempting to claim that $\mathrm{se}(\hat\beta_1 - \hat\beta_2) = \mathrm{se}(\hat\beta_1) - \mathrm{se}(\hat\beta_2)$, but this is not true. In fact, if we reversed the roles of $\hat\beta_1$ and $\hat\beta_2$, we would wind up with a negative standard error of the difference using the difference in standard errors. Standard errors must always be positive because they are estimates of standard deviations. Although the standard error of the difference $\hat\beta_1 - \hat\beta_2$ certainly depends on $\mathrm{se}(\hat\beta_1)$ and $\mathrm{se}(\hat\beta_2)$, it does so in a somewhat complicated way. To find $\mathrm{se}(\hat\beta_1 - \hat\beta_2)$, we first obtain the variance of the difference. Using the results on variances in Appendix B, we have

$$\mathrm{Var}(\hat\beta_1 - \hat\beta_2) = \mathrm{Var}(\hat\beta_1) + \mathrm{Var}(\hat\beta_2) - 2\,\mathrm{Cov}(\hat\beta_1, \hat\beta_2). \qquad (4.22)$$

Observe carefully how the two variances are added together, and twice the covariance is then subtracted. The standard deviation of $\hat\beta_1 - \hat\beta_2$ is just the square root of (4.22), and, since $[\mathrm{se}(\hat\beta_1)]^2$ is an unbiased estimator of $\mathrm{Var}(\hat\beta_1)$, and similarly for $[\mathrm{se}(\hat\beta_2)]^2$, we have

$$\mathrm{se}(\hat\beta_1 - \hat\beta_2) = \left\{[\mathrm{se}(\hat\beta_1)]^2 + [\mathrm{se}(\hat\beta_2)]^2 - 2s_{12}\right\}^{1/2}, \qquad (4.23)$$

where $s_{12}$ denotes an estimate of $\mathrm{Cov}(\hat\beta_1, \hat\beta_2)$. We have not displayed a formula for $\mathrm{Cov}(\hat\beta_1, \hat\beta_2)$. Some regression packages have features that allow one to obtain $s_{12}$, in which case one can compute the standard error in (4.23) and then the $t$ statistic in (4.20). Appendix E shows how to use matrix algebra to obtain $s_{12}$.

Some of the more sophisticated econometrics programs include special commands that can be used for testing hypotheses about linear combinations. Here, we cover an approach that is simple to compute in virtually any statistical package. Rather than trying to compute $\mathrm{se}(\hat\beta_1 - \hat\beta_2)$ from (4.23), it is much easier to estimate a different model that directly delivers the standard error of interest. Define a new parameter as the difference between $\beta_1$ and $\beta_2$: $\theta_1 = \beta_1 - \beta_2$. Then, we want to test

$$H_0\colon \theta_1 = 0 \quad\text{against}\quad H_1\colon \theta_1 < 0. \qquad (4.24)$$

The $t$ statistic in (4.20), in terms of $\hat\theta_1$, is just $t = \hat\theta_1/\mathrm{se}(\hat\theta_1)$. The challenge is finding $\mathrm{se}(\hat\theta_1)$.

We can do this by rewriting the model so that $\theta_1$ appears directly on one of the independent variables. Because $\theta_1 = \beta_1 - \beta_2$, we can also write $\beta_1 = \theta_1 + \beta_2$. Plugging this into (4.17) and rearranging gives the equation

$$\log(wage) = \beta_0 + (\theta_1 + \beta_2)jc + \beta_2 univ + \beta_3 exper + u$$
$$= \beta_0 + \theta_1 jc + \beta_2(jc + univ) + \beta_3 exper + u. \qquad (4.25)$$

The key insight is that the parameter we are interested in testing hypotheses about, $\theta_1$, now multiplies the variable $jc$. The intercept is still $\beta_0$, and $exper$ still shows up as being multiplied by $\beta_3$. More importantly, there is a new variable multiplying $\beta_2$, namely $jc + univ$. Thus, if we want to directly estimate $\theta_1$ and obtain the standard error of $\hat\theta_1$, then we must construct the new variable $jc + univ$ and include it in the regression model in place of $univ$. In this example, the new variable has a natural interpretation: it is total years of college,
so define $totcoll = jc + univ$ and write (4.25) as

$$\log(wage) = \beta_0 + \theta_1 jc + \beta_2 totcoll + \beta_3 exper + u. \qquad (4.26)$$

The parameter $\beta_1$ has disappeared from the model, while $\theta_1$ appears explicitly. This model is really just a different way of writing the original model. The only reason we have defined this new model is that, when we estimate it, the coefficient on $jc$ is $\hat\theta_1$, and, more importantly, $\mathrm{se}(\hat\theta_1)$ is reported along with the estimate. The $t$ statistic that we want is the one reported by any regression package on the variable $jc$ (not the variable $totcoll$).

When we do this with the 6,763 observations used earlier, the result is

$$\widehat{\log(wage)} = \underset{(.021)}{1.472} - \underset{(.0069)}{.0102}\,jc + \underset{(.0023)}{.0769}\,totcoll + \underset{(.0002)}{.0049}\,exper \qquad (4.27)$$
$$n = 6{,}763,\quad R^2 = .222.$$

The only number in this equation that we could not get from (4.21) is the standard error for the estimate $-.0102$, which is .0069. The $t$ statistic for testing (4.18) is $-.0102/.0069 = -1.48$. Against the one-sided alternative (4.19), the p-value is about .070, so there is some, but not strong, evidence against (4.18).

The intercept and slope estimate on $exper$, along with their standard errors, are the same as in (4.21). This fact must be true, and it provides one way of checking whether the transformed equation has been properly estimated. The coefficient on the new variable, $totcoll$, is the same as the coefficient on $univ$ in (4.21), and the standard error is also the same. We know that this must happen by comparing (4.17) and (4.25).

It is quite simple to compute a 95% confidence interval for $\theta_1 = \beta_1 - \beta_2$. Using the standard normal approximation, the CI is obtained as usual: $\hat\theta_1 \pm 1.96\,\mathrm{se}(\hat\theta_1)$, which in this case leads to $-.0102 \pm .0135$.

The strategy of rewriting the model so that it contains the parameter of interest works in all cases and is easy to implement. (See Computer Exercises C1 and C3 for other examples.)
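The reparameterization is mechanical in any package. Below is a sketch with statsmodels; the file name and the column name lwage (for $\log(wage)$) are assumptions about how the TWOYEAR data are stored.

```python
# Sketch: estimating theta1 = beta1 - beta2 directly, as in (4.26).
import pandas as pd
import statsmodels.formula.api as smf

d = pd.read_csv("TWOYEAR.csv")        # hypothetical CSV of the TWOYEAR data
d["totcoll"] = d["jc"] + d["univ"]    # total years of college

res = smf.ols("lwage ~ jc + totcoll + exper", data=d).fit()
print(res.params["jc"], res.bse["jc"])   # theta1-hat ~ -0.0102, se ~ 0.0069

# Many packages can also test the linear combination on the original
# model, computing the covariance term in (4.23) internally:
orig = smf.ols("lwage ~ jc + univ + exper", data=d).fit()
print(orig.t_test("jc - univ = 0"))      # t ~ -1.48; halve the two-sided p for (4.19)
```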
4.5 Testing Multiple Linear Restrictions: The F Test

The $t$ statistic associated with any OLS coefficient can be used to test whether the corresponding unknown parameter in the population is equal to any given constant (which is usually, but not always, zero). We have just shown how to test hypotheses about a single linear combination of the $\beta_j$ by rearranging the equation and running a regression using transformed variables. But so far, we have only covered hypotheses involving a single restriction. Frequently, we wish to test multiple hypotheses about the underlying parameters $\beta_0, \beta_1, \ldots, \beta_k$. We begin with the leading case of testing whether a set of independent variables has no partial effect on a dependent variable.

4.5a Testing Exclusion Restrictions

We already know how to test whether a particular variable has no partial effect on the dependent variable: use the $t$ statistic. Now, we want to test whether a group of variables has no effect on the dependent variable. More precisely, the null hypothesis is that a set of variables has no effect on $y$, once another set of variables has been controlled.

As an illustration of why testing significance of a group of variables is useful, we consider the following model that explains major league baseball players' salaries:

$$\log(salary) = \beta_0 + \beta_1 years + \beta_2 gamesyr + \beta_3 bavg + \beta_4 hrunsyr + \beta_5 rbisyr + u, \qquad (4.28)$$

where $salary$ is the 1993 total salary, $years$ is years in the league, $gamesyr$ is average games played per year, $bavg$ is career batting average (for example, $bavg = 250$), $hrunsyr$ is home runs per year, and $rbisyr$ is runs batted in per year. Suppose we want to test the null hypothesis that, once years in the league and games per year have been controlled for, the statistics measuring performance ($bavg$, $hrunsyr$, and $rbisyr$) have no effect on salary. Essentially, the null hypothesis states that productivity as measured by baseball statistics has no effect on salary.

In terms of the parameters of the model, the null hypothesis is stated as

$$H_0\colon \beta_3 = 0,\ \beta_4 = 0,\ \beta_5 = 0. \qquad (4.29)$$

The null (4.29) constitutes three exclusion restrictions: if (4.29) is true, then $bavg$, $hrunsyr$, and $rbisyr$ have no effect on $\log(salary)$ after $years$ and $gamesyr$ have been controlled for and therefore should be excluded from the model. This is an example of a set of multiple restrictions, because we are putting more than one restriction on the parameters in (4.28); we will see more general examples of multiple restrictions later. A test of multiple restrictions is called a multiple hypotheses test or a joint hypotheses test.

What should be the alternative to (4.29)? If what we have in mind is that "performance statistics matter, even after controlling for years in the league and games per year," then the appropriate alternative is simply

$$H_1\colon H_0 \text{ is not true.} \qquad (4.30)$$

The alternative (4.30) holds if at least one of $\beta_3$, $\beta_4$, or $\beta_5$ is different from zero. (Any or all could be different from zero.) The test we study here is constructed to detect any violation of $H_0$. It is also valid when the alternative is something like $H_1\colon \beta_3 > 0$, or $\beta_4 > 0$, or $\beta_5 > 0$, but it will not be the best possible test under such alternatives. We do not have the space or statistical background necessary to cover tests that have more power under multiple one-sided alternatives.

How should we proceed in testing (4.29) against (4.30)? It is tempting to test (4.29) by using the $t$ statistics on the variables $bavg$, $hrunsyr$, and $rbisyr$ to determine whether each variable is individually significant. This option is not appropriate. A particular $t$ statistic tests a hypothesis that puts no restrictions on the other parameters. Besides, we would have three outcomes to contend with, one for each $t$ statistic. What would constitute rejection of (4.29) at, say, the 5% level? Should all three or only one of the three $t$ statistics be required to be significant at the 5% level? These are hard questions, and fortunately we do not have to answer them. Furthermore, using separate $t$ statistics to test a multiple hypothesis like (4.29) can be very misleading. We need a way to test the exclusion restrictions jointly.

To illustrate these issues, we estimate equation (4.28) using the data in MLB1. This gives

$$\widehat{\log(salary)} = \underset{(0.29)}{11.19} + \underset{(.0121)}{.0689}\,years + \underset{(.0026)}{.0126}\,gamesyr + \underset{(.00110)}{.00098}\,bavg + \underset{(.0161)}{.0144}\,hrunsyr + \underset{(.0072)}{.0108}\,rbisyr \qquad (4.31)$$
$$n = 353,\quad \mathrm{SSR} = 183.186,\quad R^2 = .6278,$$

where SSR is the sum of squared residuals. (We will use this later.)
We have left several digits after the decimal in SSR and R-squared to facilitate future comparisons. Equation (4.31) reveals that, whereas $years$ and $gamesyr$ are statistically significant, none of the variables $bavg$, $hrunsyr$, and $rbisyr$ has a statistically significant $t$ statistic against a two-sided alternative at the 5% significance level. (The $t$ statistic on $rbisyr$ is the closest to being significant; its two-sided p-value is .134.) Thus, based on the three $t$ statistics, it appears that we cannot reject $H_0$.

This conclusion turns out to be wrong. To see this, we must derive a test of multiple restrictions whose distribution is known and tabulated. The sum of squared residuals now turns out to provide a very convenient basis for testing multiple hypotheses. We will also show how the R-squared can be used in the special case of testing for exclusion restrictions.

Knowing the sum of squared residuals in (4.31) tells us nothing about the truth of the hypothesis in (4.29). However, the factor that will tell us something is how much the SSR increases when we drop the variables $bavg$, $hrunsyr$, and $rbisyr$ from the model. Remember that, because the OLS estimates are chosen to minimize the sum of squared residuals, the SSR always increases when variables are dropped from the model; this is an algebraic fact. The question is whether this increase is large enough, relative to the SSR in the model with all of the variables, to warrant rejecting the null hypothesis.

The model without the three variables in question is simply

$$\log(salary) = \beta_0 + \beta_1 years + \beta_2 gamesyr + u. \qquad (4.32)$$

In the context of hypothesis testing, equation (4.32) is the restricted model for testing (4.29); model (4.28) is called the unrestricted model. The restricted model always has fewer parameters than the unrestricted model.

When we estimate the restricted model using the data in MLB1, we obtain

$$\widehat{\log(salary)} = \underset{(.11)}{11.22} + \underset{(.0125)}{.0713}\,years + \underset{(.0013)}{.0202}\,gamesyr \qquad (4.33)$$
$$n = 353,\quad \mathrm{SSR} = 198.311,\quad R^2 = .5971.$$

As we surmised, the SSR from (4.33) is greater than the SSR from (4.31), and the R-squared from the restricted model is less than the R-squared from the unrestricted model. What we need to decide is whether the increase in the SSR in going from the unrestricted model to the restricted model (183.186 to 198.311) is large enough to warrant rejection of (4.29). As with all testing, the answer depends on the significance level of the test. But we cannot carry out the test at a chosen significance level until we have a statistic whose distribution is known, and can be tabulated, under $H_0$. Thus, we need a way to combine the information in the two SSRs to obtain a test statistic with a known distribution under $H_0$.

Because it is no more difficult, we might as well derive the test for the general case. Write the unrestricted model with $k$ independent variables as

$$y = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k + u; \qquad (4.34)$$

the number of parameters in the unrestricted model is $k + 1$. (Remember to add one for the intercept.) Suppose that we have $q$ exclusion restrictions to test: that is, the null hypothesis states that $q$ of the variables in (4.34) have zero coefficients.
But we cannot carry out the test at a chosen significance level until we have a statistic whose distribution is known, and can be tabulated, under H0. Thus, we need a way to combine the information in the two SSRs to obtain a test statistic with a known distribution under H0.

Because it is no more difficult, we might as well derive the test for the general case. Write the unrestricted model with k independent variables as

y = β0 + β1x1 + … + βkxk + u;    (4.34)

the number of parameters in the unrestricted model is k + 1. (Remember to add one for the intercept.) Suppose that we have q exclusion restrictions to test: that is, the null hypothesis states that q of the variables in (4.34) have zero coefficients. For notational simplicity, assume that it is the last q variables in the list of independent variables: x_{k−q+1}, …, x_k. (The order of the variables, of course, is arbitrary and unimportant.) The null hypothesis is stated as

H0: β_{k−q+1} = 0, …, β_k = 0,    (4.35)

which puts q exclusion restrictions on the model (4.34). The alternative to (4.35) is simply that it is false; this means that at least one of the parameters listed in (4.35) is different from zero. When we impose the restrictions under H0, we are left with the restricted model:

y = β0 + β1x1 + … + β_{k−q}x_{k−q} + u.    (4.36)

In this subsection, we assume that both the unrestricted and restricted models contain an intercept, since that is the case most widely encountered in practice.

Now, for the test statistic itself. Earlier, we suggested that looking at the relative increase in the SSR when moving from the unrestricted to the restricted model should be informative for testing the hypothesis (4.35). The F statistic (or F ratio) is defined by

F = [(SSR_r − SSR_ur)/q] / [SSR_ur/(n − k − 1)],    (4.37)

where SSR_r is the sum of squared residuals from the restricted model and SSR_ur is the sum of squared residuals from the unrestricted model.

You should immediately notice that, since SSR_r can be no smaller than SSR_ur, the F statistic is always nonnegative (and almost always strictly positive). Thus, if you compute a negative F statistic, then something is wrong; the order of the SSRs in the numerator of F has usually been reversed. Also, the SSR in the denominator of F is the SSR from the unrestricted model. The easiest way to remember where the SSRs appear is to think of F as measuring the relative increase in SSR when moving from the unrestricted to the restricted model.

The difference in SSRs in the numerator of F is divided by q, which is the number of restrictions imposed in moving from the unrestricted to the restricted model (q independent variables are dropped). Therefore, we can write

q = numerator degrees of freedom = df_r − df_ur,    (4.38)

which also shows that q is the difference in degrees of freedom between the restricted and unrestricted models. (Recall that df = number of observations − number of estimated parameters.)

Exploring Further 4.4
Consider relating individual performance on a standardized test, score, to a variety of other variables. School factors include average class size, per-student expenditures, average teacher compensation, and total school enrollment. Other variables specific to the student are family income, mother's education, father's education, and number of siblings. The model is

score = β0 + β1classize + β2expend + β3tchcomp + β4enroll + β5faminc + β6motheduc + β7fatheduc + β8siblings + u.

State the null hypothesis that student-specific variables have no effect on standardized test performance, once school-related factors have been controlled for. What are k and q for this example? Write down the restricted version of the model.

Since the restricted model has fewer parameters, and each model is estimated using the same n observations, df_r is always greater than df_ur.
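As a quick illustration, the F statistic in (4.37) is plain arithmetic once the two SSRs are in hand. Here is a small helper (pure Python, so it runs as written); the numbers anticipate the baseball salary calculation carried out later in the text:

```python
def f_stat(ssr_r, ssr_ur, q, df_ur):
    """F statistic in (4.37): relative increase in SSR, scaled by df."""
    return ((ssr_r - ssr_ur) / q) / (ssr_ur / df_ur)

# Baseball salary example: SSRs from (4.33) and (4.31), q = 3, df_ur = 347
print(f_stat(198.311, 183.186, 3, 347))   # about 9.55
```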
The SSR in the denominator of F is divided by the degrees of freedom in the unrestricted model:

n − k − 1 = denominator degrees of freedom = df_ur.    (4.39)

In fact, the denominator of F is just the unbiased estimator of σ² = Var(u) in the unrestricted model.

In a particular application, computing the F statistic is easier than wading through the somewhat cumbersome notation used to describe the general case. We first obtain the degrees of freedom in the unrestricted model, df_ur. Then, we count how many variables are excluded in the restricted model; this is q. The SSRs are reported with every OLS regression, and so forming the F statistic is simple.

In the major league baseball salary regression, n = 353, and the full model (4.28) contains six parameters. Thus, n − k − 1 = df_ur = 353 − 6 = 347. The restricted model (4.32) contains three fewer independent variables than (4.28), and so q = 3. Thus, we have all of the ingredients to compute the F statistic; we hold off doing so until we know what to do with it.

To use the F statistic, we must know its sampling distribution under the null in order to choose critical values and rejection rules. It can be shown that, under H0 (and assuming the CLM assumptions hold), F is distributed as an F random variable with (q, n − k − 1) degrees of freedom. We write this as F ~ F_{q, n−k−1}. The distribution of F_{q, n−k−1} is readily tabulated and available in statistical tables (see Table G.3) and, even more importantly, in statistical software.

We will not derive the F distribution because the mathematics is very involved. Basically, it can be shown that equation (4.37) is actually the ratio of two independent chi-square random variables, divided by their respective degrees of freedom. The numerator chi-square random variable has q degrees of freedom, and the chi-square in the denominator has n − k − 1 degrees of freedom. This is the definition of an F distributed random variable (see Appendix B).

It is pretty clear from the definition of F that we will reject H0 in favor of H1 when F is sufficiently large. How large depends on our chosen significance level. Suppose that we have decided on a 5% level test. Let c be the 95th percentile in the F_{q, n−k−1} distribution. This critical value depends on q (the numerator df) and n − k − 1 (the denominator df). It is important to keep the numerator and denominator degrees of freedom straight.

The 10%, 5%, and 1% critical values for the F distribution are given in Table G.3. The rejection rule is simple. Once c has been obtained, we reject H0 in favor of H1 at the chosen significance level if

F > c.    (4.40)

With a 5% significance level, q = 3, and n − k − 1 = 60, the critical value is c = 2.76. We would reject H0 at the 5% level if the computed value of the F statistic exceeds 2.76. The 5% critical value and rejection region are shown in Figure 4.7. For the same degrees of freedom, the 1% critical value is 4.13.

In most applications, the numerator degrees of freedom (q) will be notably smaller than the denominator degrees of freedom (n − k − 1). Applications where n − k − 1 is small are unlikely to be successful because the parameters in the unrestricted model will probably not be precisely estimated. When the denominator df reaches about 120, the F distribution is no longer sensitive to it. (This is entirely analogous to the t distribution being well approximated by the standard normal distribution as the df gets large.) Thus, there is an entry in the table for the denominator df = ∞, and this is what we use with large samples (because n − k − 1 is then large). A similar statement holds for a very large numerator df, but this rarely occurs in applications.
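In practice, critical values come from software rather than tables. A minimal sketch using scipy's F distribution (assuming scipy is installed):

```python
from scipy.stats import f

# 95th and 99th percentiles of F(3, 60): the critical values cited in the text
print(f.ppf(0.95, dfn=3, dfd=60))   # about 2.76
print(f.ppf(0.99, dfn=3, dfd=60))   # about 4.13

# With the baseball degrees of freedom (3, 347), the exact value is about
# 2.63, close to the large-sample table entry of 2.60 used in the text
print(f.ppf(0.95, dfn=3, dfd=347))
```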
If H0 is rejected, then we say that x_{k−q+1}, …, x_k are jointly statistically significant (or just jointly significant) at the appropriate significance level. This test alone does not allow us to say which of the variables has a partial effect on y; they may all affect y, or maybe only one affects y. If the null is not rejected, then the variables are jointly insignificant, which often justifies dropping them from the model.

[Figure 4.7: the 5% critical value, c = 2.76, and the rejection region (area = .05 to the right of 2.76; area = .95 to its left) in an F(3, 60) distribution.]

For the major league baseball example, with three numerator degrees of freedom and 347 denominator degrees of freedom, the 5% critical value is 2.60, and the 1% critical value is 3.78. We reject H0 at the 1% level if F is above 3.78; we reject at the 5% level if F is above 2.60.

We are now in a position to test the hypothesis that we began this section with: after controlling for years and gamesyr, the variables bavg, hrunsyr, and rbisyr have no effect on players' salaries. In practice, it is easiest to first compute (SSR_r − SSR_ur)/SSR_ur and to multiply the result by (n − k − 1)/q; the reason the formula is stated as in (4.37) is that it makes it easier to keep the numerator and denominator degrees of freedom straight. Using the SSRs in (4.31) and (4.33), we have

F = [(198.311 − 183.186)/183.186]·(347/3) ≈ 9.55.

This number is well above the 1% critical value in the F distribution with 3 and 347 degrees of freedom, and so we soundly reject the hypothesis that bavg, hrunsyr, and rbisyr have no effect on salary.

The outcome of the joint test may seem surprising in light of the insignificant t statistics for the three variables. What is happening is that the two variables hrunsyr and rbisyr are highly correlated, and this multicollinearity makes it difficult to uncover the partial effect of each variable; this is reflected in the individual t statistics. The F statistic tests whether these variables (including bavg) are jointly significant, and multicollinearity between hrunsyr and rbisyr is much less relevant for testing this hypothesis. In Problem 16, you are asked to reestimate the model while dropping rbisyr, in which case hrunsyr becomes very significant. The same is true for rbisyr when hrunsyr is dropped from the model.

The F statistic is often useful for testing exclusion of a group of variables when the variables in the group are highly correlated. For example, suppose we want to test whether firm performance affects the salaries of chief executive officers. There are many ways to measure firm performance, and it probably would not be clear ahead of time which measures would be most important.
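Rather than computing (4.37) by hand, the joint test can be done directly in statsmodels. A minimal sketch, again assuming a hypothetical mlb1.csv export of MLB1 with the variable names used in the text:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

mlb = pd.read_csv("mlb1.csv")            # hypothetical export of MLB1
mlb["lsalary"] = np.log(mlb["salary"])

ur = smf.ols("lsalary ~ years + gamesyr + bavg + hrunsyr + rbisyr", data=mlb).fit()
r = smf.ols("lsalary ~ years + gamesyr", data=mlb).fit()

# compare_f_test implements the SSR form in (4.37)
f_value, p_value, df_diff = ur.compare_f_test(r)
print(f_value, p_value, df_diff)         # F about 9.55, tiny p-value, q = 3

# An equivalent route: a Wald-type F test of the three exclusion restrictions
print(ur.f_test("bavg = 0, hrunsyr = 0, rbisyr = 0"))
```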
Since measures of firm performance are likely to be highly correlated, hoping to find individually significant measures might be asking too much due to multicollinearity. But an F test can be used to determine whether, as a group, the firm performance variables affect salary.

4.5b Relationship between F and t Statistics

We have seen in this section how the F statistic can be used to test whether a group of variables should be included in a model. What happens if we apply the F statistic to the case of testing significance of a single independent variable? This case is certainly not ruled out by the previous development. For example, we can take the null to be H0: βk = 0 and q = 1 (to test the single exclusion restriction that xk can be excluded from the model). From Section 4.2, we know that the t statistic on βk can be used to test this hypothesis. The question, then, is: do we have two separate ways of testing hypotheses about a single coefficient? The answer is no. It can be shown that the F statistic for testing exclusion of a single variable is equal to the square of the corresponding t statistic. Since t²_{n−k−1} has an F_{1, n−k−1} distribution, the two approaches lead to exactly the same outcome, provided that the alternative is two-sided. The t statistic is more flexible for testing a single hypothesis because it can be directly used to test against one-sided alternatives. Since t statistics are also easier to obtain than F statistics, there is really no reason to use an F statistic to test hypotheses about a single parameter.

We have already seen in the salary regressions for major league baseball players that two (or more) variables that each have insignificant t statistics can be jointly very significant. It is also possible that, in a group of several explanatory variables, one variable has a significant t statistic but the group of variables is jointly insignificant at the usual significance levels. What should we make of this kind of outcome? For concreteness, suppose that, in a model with many explanatory variables, we cannot reject the null hypothesis that β1, β2, β3, β4, and β5 are all equal to zero at the 5% level, yet the t statistic for β1 is significant at the 5% level. Logically, we cannot have β1 ≠ 0 but also have β1, β2, β3, β4, and β5 all equal to zero! But as a matter of testing, it is possible that we can group a bunch of insignificant variables with a significant variable and conclude that the entire set of variables is jointly insignificant. (Such possible conflicts between a t test and a joint F test give another example of why we should not "accept" null hypotheses; we should only fail to reject them.) The F statistic is intended to detect whether a set of coefficients is different from zero, but it is never the best test for determining whether a single coefficient is different from zero. The t test is best suited for testing a single hypothesis. (In statistical terms, an F statistic for joint restrictions including β1 = 0 will have less power for detecting β1 ≠ 0 than the usual t statistic. See Section C.6 in Appendix C for a discussion of the power of a test.)
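The equality t² = F for a single two-sided restriction is easy to verify numerically. A self-contained sketch with simulated data (no real data set needed; the true coefficients are made up for the demonstration):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 200
X = sm.add_constant(rng.normal(size=(n, 3)))             # intercept + 3 regressors
y = X @ np.array([1.0, 0.5, 0.0, -0.3]) + rng.normal(size=n)

res = sm.OLS(y, X).fit()

# statsmodels names array columns const, x1, x2, x3 by default
t_sq = res.tvalues[2] ** 2                 # squared t statistic on x2
f_one = res.f_test("x2 = 0").fvalue        # F test of the same single restriction
print(t_sq, f_one)                         # identical up to rounding
```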
Unfortunately, the fact that we can sometimes hide a statistically significant variable along with some insignificant variables could lead to abuse if regression results are not carefully reported. For example, suppose that, in a study of the determinants of loan-acceptance rates at the city level, x1 is the fraction of black households in the city. Suppose that the variables x2, x3, x4, and x5 are the fractions of households headed by different age groups. In explaining loan rates, we would include measures of income, wealth, credit ratings, and so on. Suppose that age of household head has no effect on loan approval rates, once other variables are controlled for. Even if race has a marginally significant effect, it is possible that the race and age variables could be jointly insignificant. Someone wanting to conclude that race is not a factor could simply report something like "Race and age variables were added to the equation, but they were jointly insignificant at the 5% level." Hopefully, peer review prevents these kinds of misleading conclusions, but you should be aware that such outcomes are possible.

Often, when a variable is very statistically significant and it is tested jointly with another set of variables, the set will be jointly significant. In such cases, there is no logical inconsistency in rejecting both null hypotheses.

4.5c The R-Squared Form of the F Statistic

For testing exclusion restrictions, it is often more convenient to have a form of the F statistic that can be computed using the R-squareds from the restricted and unrestricted models. One reason for this is that the R-squared is always between zero and one, whereas the SSRs can be very large depending on the unit of measurement of y, making the calculation based on the SSRs tedious. Using the fact that SSR_r = SST(1 − R²_r) and SSR_ur = SST(1 − R²_ur), we can substitute into (4.37) to obtain

F = [(R²_ur − R²_r)/q] / [(1 − R²_ur)/(n − k − 1)] = [(R²_ur − R²_r)/q] / [(1 − R²_ur)/df_ur]    (4.41)

(note that the SST terms cancel everywhere). This is called the R-squared form of the F statistic. (At this point, you should be cautioned that although equation (4.41) is very convenient for testing exclusion restrictions, it cannot be applied for testing all linear restrictions. As we will see when we discuss testing general linear restrictions, the sum of squared residuals form of the F statistic is sometimes needed.)

Because the R-squared is reported with almost all regressions (whereas the SSR is not), it is easy to use the R-squareds from the unrestricted and restricted models to test for exclusion of some variables. Particular attention should be paid to the order of the R-squareds in the numerator: the unrestricted R-squared comes first (contrast this with the SSRs in (4.37)). Because R²_ur > R²_r, this shows again that F will always be positive.

In using the R-squared form of the test for excluding a set of variables, it is important to not square the R-squared before plugging it into formula (4.41); the squaring has already been done. All regressions report R², and these numbers are plugged directly into (4.41). For the baseball salary example, we can use (4.41) to obtain the F statistic:

F = [(.6278 − .5971)/3] / [(1 − .6278)/347] ≈ 9.54,

which is very close to what we obtained before. (The difference is due to rounding error.)
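A quick check of (4.41) in plain Python, using the R-squareds reported in (4.31) and (4.33):

```python
def f_stat_rsq(r2_ur, r2_r, q, df_ur):
    """R-squared form of the F statistic in (4.41)."""
    return ((r2_ur - r2_r) / q) / ((1 - r2_ur) / df_ur)

print(f_stat_rsq(0.6278, 0.5971, 3, 347))   # about 9.54
```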
Example 4.9 Parents' Education in a Birth Weight Equation

As another example of computing an F statistic, consider the following model to explain child birth weight in terms of various factors:

bwght = β0 + β1cigs + β2parity + β3faminc + β4motheduc + β5fatheduc + u,    (4.42)

where
bwght = birth weight, in pounds;
cigs = average number of cigarettes the mother smoked per day during pregnancy;
parity = the birth order of this child;
faminc = annual family income;
motheduc = years of schooling for the mother;
fatheduc = years of schooling for the father.

Let us test the null hypothesis that, after controlling for cigs, parity, and faminc, parents' education has no effect on birth weight. This is stated as H0: β4 = 0, β5 = 0, and so there are q = 2 exclusion restrictions to be tested. There are k + 1 = 6 parameters in the unrestricted model (4.42), so the df in the unrestricted model is n − 6, where n is the sample size.

We will test this hypothesis using the data in BWGHT. This data set contains information on 1,388 births, but we must be careful in counting the observations used in testing the null hypothesis. It turns out that information on at least one of the variables motheduc and fatheduc is missing for 197 births in the sample; these observations cannot be included when estimating the unrestricted model. Thus, we really have n = 1,191 observations, and so there are 1,191 − 6 = 1,185 df in the unrestricted model. We must be sure to use these same 1,191 observations when estimating the restricted model (not the full 1,388 observations that are available). Generally, when estimating the restricted model to compute an F test, we must use the same observations used to estimate the unrestricted model; otherwise, the test is not valid. When there are no missing data, this will not be an issue.

The numerator df is 2, and the denominator df is 1,185; from Table G.3, the 5% critical value is c = 3.0. Rather than report the complete results, for brevity we present only the R-squareds. The R-squared for the full model turns out to be R²_ur = .0387. When motheduc and fatheduc are dropped from the regression, the R-squared falls to R²_r = .0364. Thus, the F statistic is F = [(.0387 − .0364)/(1 − .0387)](1,185/2) = 1.42; since this is well below the 5% critical value, we fail to reject H0. In other words, motheduc and fatheduc are jointly insignificant in the birth weight equation.

Most statistical packages these days have built-in commands for testing multiple hypotheses after OLS estimation, and so one need not worry about making the mistake of running the two regressions on different data sets. Typically, the commands are applied after estimation of the unrestricted model, which means the smaller subset of data is used whenever there are missing values on some variables. Formulas for computing the F statistic using matrix algebra (see Appendix E) do not require estimation of the restricted model.
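The missing-data warning in Example 4.9 is easy to respect in code: build the estimation sample once, then fit both models on it. A minimal sketch, assuming the BWGHT data sit in a hypothetical bwght.csv with the variable names above:

```python
import pandas as pd
import statsmodels.formula.api as smf

bwght = pd.read_csv("bwght.csv")   # hypothetical export of BWGHT

# Keep only rows with no missing values on any variable in the unrestricted
# model, so both regressions use the same n = 1,191 observations.
cols = ["bwght", "cigs", "parity", "faminc", "motheduc", "fatheduc"]
sample = bwght.dropna(subset=cols)

ur = smf.ols("bwght ~ cigs + parity + faminc + motheduc + fatheduc",
             data=sample).fit()
r = smf.ols("bwght ~ cigs + parity + faminc", data=sample).fit()

print(ur.compare_f_test(r))   # F about 1.42, p-value about .24, q = 2
```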
4.5d Computing p-Values for F Tests

For reporting the outcomes of F tests, p-values are especially useful. Since the F distribution depends on the numerator and denominator df, it is difficult to get a feel for how strong or weak the evidence is against the null hypothesis simply by looking at the value of the F statistic and one or two critical values.

In the F testing context, the p-value is defined as

p-value = P(ℱ > F),    (4.43)

where, for emphasis, we let ℱ denote an F random variable with (q, n − k − 1) degrees of freedom, and F is the actual value of the test statistic. The p-value still has the same interpretation as it did for t statistics: it is the probability of observing a value of ℱ at least as large as we did, given that the null hypothesis is true. A small p-value is evidence against H0. For example, p-value = .016 means that the chance of observing a value of F as large as we did when the null hypothesis was true is only 1.6%; we usually reject H0 in such cases. If the p-value = .314, then the chance of observing a value of the F statistic as large as we did under the null hypothesis is 31.4%. Most would find this to be pretty weak evidence against H0.

Exploring Further 4.5
The data in ATTEND were used to estimate the two equations

atndrte = 47.13 + 13.37 priGPA
          (2.87)  (1.09)
n = 680, R² = .183,

and

atndrte = 75.70 + 17.26 priGPA − 1.72 ACT
          (3.88)  (1.08)          ( )
n = 680, R² = .291,

where, as always, standard errors are in parentheses; the standard error for ACT is missing in the second equation. What is the t statistic for the coefficient on ACT? (Hint: First compute the F statistic for significance of ACT.)

As with t testing, once the p-value has been computed, the F test can be carried out at any significance level. For example, if the p-value = .024, we reject H0 at the 5% significance level, but not at the 1% level. The p-value for the F test in Example 4.9 is .238, and so the null hypothesis that βmotheduc and βfatheduc are both zero is not rejected at even the 20% significance level.

Many econometrics packages have a built-in feature for testing multiple exclusion restrictions. These packages have several advantages over calculating the statistics by hand: we will less likely make a mistake, p-values are computed automatically, and the problem of missing data (as in Example 4.9) is handled without any additional work on our part.
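A p-value as in (4.43) is one line with scipy's F distribution; the survival function sf gives the upper-tail probability:

```python
from scipy.stats import f

# p-value for Example 4.9: F = 1.42 with (2, 1185) degrees of freedom
print(f.sf(1.42, dfn=2, dfd=1185))   # about .24, matching the reported .238
                                     # up to rounding of the F statistic
```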
4.5e The F Statistic for Overall Significance of a Regression

A special set of exclusion restrictions is routinely tested by most regression packages. These restrictions have the same interpretation, regardless of the model. In the model with k independent variables, we can write the null hypothesis as

H0: x1, x2, …, xk do not help to explain y.

This null hypothesis is, in a way, very pessimistic. It states that none of the explanatory variables has an effect on y. Stated in terms of the parameters, the null is that all slope parameters are zero:

H0: β1 = β2 = … = βk = 0,    (4.44)

and the alternative is that at least one of the βj is different from zero. Another useful way of stating the null is that H0: E(y|x1, x2, …, xk) = E(y), so that knowing the values of x1, x2, …, xk does not affect the expected value of y.

There are k restrictions in (4.44), and when we impose them, we get the restricted model

y = β0 + u;    (4.45)

all independent variables have been dropped from the equation. Now, the R-squared from estimating (4.45) is zero; none of the variation in y is being explained, because there are no explanatory variables. Therefore, the F statistic for testing (4.44) can be written as

F = [R²/k] / [(1 − R²)/(n − k − 1)],    (4.46)

where R² is just the usual R-squared from the regression of y on x1, x2, …, xk.

Most regression packages report the F statistic in (4.46) automatically, which makes it tempting to use this statistic to test general exclusion restrictions. You must avoid this temptation. The F statistic in (4.41) is used for general exclusion restrictions; it depends on the R-squareds from the restricted and unrestricted models. The special form of (4.46) is valid only for testing joint exclusion of all independent variables. This is sometimes called determining the overall significance of the regression.

If we fail to reject (4.44), then there is no evidence that any of the independent variables help to explain y. This usually means that we must look for other variables to explain y. For Example 4.9, the F statistic for testing (4.44) is about 9.55 with k = 5 and n − k − 1 = 1,185 df. The p-value is zero to four places after the decimal point, so that (4.44) is rejected very strongly. Thus, we conclude that the variables in the bwght equation do explain some variation in bwght. The amount explained is not large: only 3.87%. But the seemingly small R-squared results in a highly significant F statistic. That is why we must compute the F statistic to test for joint significance and not just look at the size of the R-squared.

Occasionally, the F statistic for the hypothesis that all independent variables are jointly insignificant is the focus of a study. Problem 10 asks you to use stock return data to test whether stock returns over a four-year horizon are predictable based on information known only at the beginning of the period. Under the efficient markets hypothesis, the returns should not be predictable; the null hypothesis is precisely (4.44).
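The overall F statistic in (4.46) needs only the regression R-squared. Checking the Example 4.9 numbers in plain Python (in statsmodels, the same statistic is reported automatically as the fvalue attribute of a fitted model, with its p-value in f_pvalue):

```python
def f_overall(r2, k, n):
    """Overall-significance F statistic in (4.46)."""
    return (r2 / k) / ((1 - r2) / (n - k - 1))

print(f_overall(0.0387, 5, 1191))   # about 9.5, the "about 9.55" in the text,
                                    # up to rounding of the reported R-squared
```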
4.5f Testing General Linear Restrictions

Testing exclusion restrictions is by far the most important application of F statistics. Sometimes, however, the restrictions implied by a theory are more complicated than just excluding some independent variables. It is still straightforward to use the F statistic for testing.

As an example, consider the following equation:

log(price) = β0 + β1log(assess) + β2log(lotsize) + β3log(sqrft) + β4bdrms + u,    (4.47)

where
price = house price;
assess = the assessed housing value (before the house was sold);
lotsize = size of the lot, in square feet;
sqrft = square footage;
bdrms = number of bedrooms.

Now, suppose we would like to test whether the assessed housing price is a rational valuation. If this is the case, then a 1% change in assess should be associated with a 1% change in price; that is, β1 = 1. In addition, lotsize, sqrft, and bdrms should not help to explain log(price), once the assessed value has been controlled for. Together, these hypotheses can be stated as

H0: β1 = 1, β2 = 0, β3 = 0, β4 = 0.    (4.48)

Four restrictions have to be tested; three are exclusion restrictions, but β1 = 1 is not. How can we test this hypothesis using the F statistic?

As in the exclusion restriction case, we estimate the unrestricted model, (4.47) in this case, and then impose the restrictions in (4.48) to obtain the restricted model. It is the second step that can be a little tricky. But all we do is plug in the restrictions. If we write (4.47) as

y = β0 + β1x1 + β2x2 + β3x3 + β4x4 + u,    (4.49)

then the restricted model is y = β0 + x1 + u. Now, to impose the restriction that the coefficient on x1 is unity, we must estimate the following model:

y − x1 = β0 + u.    (4.50)

This is just a model with an intercept (β0) but with a different dependent variable than in (4.49). The procedure for computing the F statistic is the same: estimate (4.50), obtain the SSR (SSR_r), and use this with the unrestricted SSR from (4.49) in the F statistic (4.37). We are testing q = 4 restrictions, and there are n − 5 df in the unrestricted model. The F statistic is simply [(SSR_r − SSR_ur)/SSR_ur][(n − 5)/4].

Before illustrating this test using a data set, we must emphasize one point: we cannot use the R-squared form of the F statistic for this example, because the dependent variable in (4.50) is different from the one in (4.49). This means the total sum of squares from the two regressions will be different, and (4.41) is no longer equivalent to (4.37). As a general rule, the SSR form of the F statistic should be used if a different dependent variable is needed in running the restricted regression.

The estimated unrestricted model using the data in HPRICE1 is

log(price) = .264 + 1.043 log(assess) + .0074 log(lotsize) − .1032 log(sqrft) + .0338 bdrms
             (.570)  (.151)              (.0386)              (.1384)            (.0221)
n = 88, SSR = 1.822, R² = .773.

If we use separate t statistics to test each hypothesis in (4.48), we fail to reject each one. But rationality of the assessment is a joint hypothesis, so we should test the restrictions jointly. The SSR from the restricted model turns out to be SSR_r = 1.880, and so the F statistic is [(1.880 − 1.822)/1.822](83/4) ≈ .661. The 5% critical value in an F distribution with (4, 83) df is about 2.50, and so we fail to reject H0. There is essentially no evidence against the hypothesis that the assessed values are rational.
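A sketch of the SSR-form test of (4.48), assuming HPRICE1 is available as a hypothetical hprice1.csv with the variables above:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf
from scipy.stats import f

hp = pd.read_csv("hprice1.csv")              # hypothetical export of HPRICE1
hp["lprice"] = np.log(hp["price"])
hp["lassess"] = np.log(hp["assess"])
hp["llotsize"] = np.log(hp["lotsize"])
hp["lsqrft"] = np.log(hp["sqrft"])

# Unrestricted model (4.47)
ur = smf.ols("lprice ~ lassess + llotsize + lsqrft + bdrms", data=hp).fit()

# Restricted model (4.50): regress (y - x1) on an intercept only
y_new = hp["lprice"] - hp["lassess"]
r = sm.OLS(y_new, np.ones(len(y_new))).fit()

q, df_ur = 4, int(ur.df_resid)
F = ((r.ssr - ur.ssr) / q) / (ur.ssr / df_ur)
print(F, f.sf(F, q, df_ur))   # about .66 with a large p-value: fail to reject
```

The key design point is that the restricted regression has a different dependent variable, which is exactly why the code uses the SSR form of the F statistic rather than the R-squared form.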
4.6 Reporting Regression Results

We end this chapter by providing a few guidelines on how to report multiple regression results for relatively complicated empirical projects. This should help you to read published works in the applied social sciences, while also preparing you to write your own empirical papers. We will expand on this topic in the remainder of the text by reporting results from various examples, but many of the key points can be made now.

Naturally, the estimated OLS coefficients should always be reported. For the key variables in an analysis, you should interpret the estimated coefficients (which often requires knowing the units of measurement of the variables). For example, is an estimate an elasticity, or does it have some other interpretation that needs explanation? The economic or practical importance of the estimates of the key variables should be discussed.

The standard errors should always be included along with the estimated coefficients. Some authors prefer to report the t statistics rather than the standard errors (and sometimes just the absolute value of the t statistics). Although nothing is really wrong with this, there is some preference for reporting standard errors. First, it forces us to think carefully about the null hypothesis being tested; the null is not always that the population parameter is zero. Second, having standard errors makes it easier to compute confidence intervals.

The R-squared from the regression should always be included. We have seen that, in addition to providing a goodness-of-fit measure, it makes calculation of F statistics for exclusion restrictions simple. Reporting the sum of squared residuals and the standard error of the regression is sometimes a good idea, but it is not crucial. The number of observations used in estimating any equation should appear near the estimated equation.

If only a couple of models are being estimated, the results can be summarized in equation form, as we have done up to this point. However, in many papers, several equations are estimated with many different sets of independent variables. We may estimate the same equation for different groups of people, or even have equations explaining different dependent variables. In such cases, it is better to summarize the results in one or more tables. The dependent variable should be indicated clearly in the table, and the independent variables should be listed in the first column. Standard errors (or t statistics) can be put in parentheses below the estimates.

Example 4.10 Salary-Pension Tradeoff for Teachers

Let totcomp denote average total annual compensation for a teacher, including salary and all fringe benefits (pension, health insurance, and so on). Extending the standard wage equation, total compensation should be a function of productivity and perhaps other characteristics. As is standard, we use logarithmic form:

log(totcomp) = f(productivity characteristics, other factors),

where f(·) is some function (unspecified, for now). Write

totcomp = salary + benefits = salary·(1 + benefits/salary).
This equation shows that total compensation is the product of two terms: salary and 1 + b/s, where b/s is shorthand for the "benefits to salary ratio." Taking the log of this equation gives

log(totcomp) = log(salary) + log(1 + b/s).

Now, for small b/s, log(1 + b/s) ≈ b/s; we will use this approximation. This leads to the econometric model

log(salary) = β0 + β1(b/s) + other factors.

Testing the salary-benefits tradeoff then is the same as a test of H0: β1 = −1 against H1: β1 ≠ −1.

We use the data in MEAP93 to test this hypothesis. These data are averaged at the school level, and we do not observe very many other factors that could affect total compensation. We will include controls for size of the school (enroll), staff per thousand students (staff), and measures such as the school dropout and graduation rates. The average b/s in the sample is about .205, and the largest value is .450.

The estimated equations are given in Table 4.1, where standard errors are given in parentheses below the coefficient estimates. The key variable is b/s, the benefits-salary ratio.

From the first column in Table 4.1, we see that, without controlling for any other factors, the OLS coefficient for b/s is −.825. The t statistic for testing the null hypothesis H0: β1 = −1 is t = (−.825 + 1)/.200 = .875, and so the simple regression fails to reject H0. After adding controls for school size and staff size (which roughly captures the number of students taught by each teacher), the estimate of the b/s coefficient becomes −.605. Now, the test of β1 = −1 gives a t statistic of about 2.39; thus, H0 is rejected at the 5% level against a two-sided alternative. The variables log(enroll) and log(staff) are very statistically significant.

Exploring Further 4.6
How does adding droprate and gradrate affect the estimate of the salary-benefits tradeoff? Are these variables jointly significant at the 5% level? What about the 10% level?

Table 4.1 Testing the Salary-Benefits Tradeoff
Dependent Variable: log(salary)

Independent Variables      (1)          (2)          (3)
b/s                        −.825        −.605        −.589
                           (.200)       (.165)       (.165)
log(enroll)                             .0874        .0881
                                        (.0073)      (.0073)
log(staff)                              −.222        −.218
                                        (.050)       (.050)
droprate                                             −.00028
                                                     (.00161)
gradrate                                             .00097
                                                     (.00066)
intercept                  10.523       10.884       10.738
                           (0.042)      (0.252)      (0.258)
Observations               408          408          408
R-squared                  .040         .353         .361
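The nonzero-null t statistics in Example 4.10 are simple arithmetic. A short runnable check, with the estimates and standard errors taken from Table 4.1:

```python
def t_stat(bhat, se, null_value):
    """t statistic for H0: beta = null_value."""
    return (bhat - null_value) / se

print(t_stat(-0.825, 0.200, -1.0))   # about .875: fail to reject beta1 = -1
print(t_stat(-0.605, 0.165, -1.0))   # about 2.39: reject at the 5% level
```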
Summary

In this chapter, we have covered the very important topic of statistical inference, which allows us to infer something about the population model from a random sample. We summarize the main points:

1. Under the classical linear model assumptions MLR.1 through MLR.6, the OLS estimators are normally distributed.
2. Under the CLM assumptions, the t statistics have t distributions under the null hypothesis.
3. We use t statistics to test hypotheses about a single parameter against one- or two-sided alternatives, using one- or two-tailed tests, respectively. The most common null hypothesis is H0: βj = 0, but we sometimes want to test other values of βj under H0.
4. In classical hypothesis testing, we first choose a significance level, which, along with the df and alternative hypothesis, determines the critical value against which we compare the t statistic. It is more informative to compute the p-value for a t test (the smallest significance level for which the null hypothesis is rejected) so that the hypothesis can be tested at any significance level.
5. Under the CLM assumptions, confidence intervals can be constructed for each βj. These CIs can be used to test any null hypothesis concerning βj against a two-sided alternative.
6. Single hypothesis tests concerning more than one βj can always be tested by rewriting the model to contain the parameter of interest. Then, a standard t statistic can be used.
7. The F statistic is used to test multiple exclusion restrictions, and there are two equivalent forms of the test. One is based on the SSRs from the restricted and unrestricted models. A more convenient form is based on the R-squareds from the two models.
8. When computing an F statistic, the numerator df is the number of restrictions being tested, while the denominator df is the degrees of freedom in the unrestricted model.
9. The alternative for F testing is two-sided. In the classical approach, we specify a significance level which, along with the numerator df and the denominator df, determines the critical value. The null hypothesis is rejected when the statistic, F, exceeds the critical value, c. Alternatively, we can compute a p-value to summarize the evidence against H0.
10. General multiple linear restrictions can be tested using the sum of squared residuals form of the F statistic.
11. The F statistic for the overall significance of a regression tests the null hypothesis that all slope parameters are zero, with the intercept unrestricted. Under H0, the explanatory variables have no effect on the expected value of y.
12. When data are missing on one or more explanatory variables, one must be careful when computing F statistics "by hand," that is, using either the sum of squared residuals or R-squareds from the two regressions. Whenever possible, it is best to leave the calculations to statistical packages that have built-in commands, which work with or without missing data.

The Classical Linear Model Assumptions

Now is a good time to review the full set of classical linear model (CLM) assumptions for cross-sectional regression. Following each assumption is a comment about its role in multiple regression analysis.

Assumption MLR.1 (Linear in Parameters)
The model in the population can be written as

y = β0 + β1x1 + β2x2 + … + βkxk + u,

where β0, β1, …, βk are the unknown parameters (constants) of interest and u is an unobserved random error or disturbance term.

Assumption MLR.1 describes the population relationship we hope to estimate, and explicitly sets out the βj (the ceteris paribus population effects of the xj on y) as the parameters of interest.

Assumption MLR.2 (Random Sampling)
We have a random sample of n observations, {(xi1, xi2, …, xik, yi): i = 1, …, n}, following the population model in Assumption MLR.1.

This random sampling assumption means that we have data that can be used to estimate the βj, and that the data have been chosen to be representative of the population described in Assumption MLR.1.

Assumption MLR.3 (No Perfect Collinearity)
In the sample (and therefore in the population), none of the independent variables is constant, and there are no exact linear relationships among the independent variables.

Once we have a sample of data, we need to know that we can use the data to compute the OLS estimates, the β̂j. This is the role of Assumption MLR.3: if we have sample variation in each independent variable and no exact linear relationships among the independent variables, we can compute the β̂j.

Assumption MLR.4 (Zero Conditional Mean)
The error u has an expected value of zero given any values of the explanatory variables. In other words, E(u|x1, x2, …, xk) = 0.

As we discussed in the text, assuming that the unobserved factors are, on average, unrelated to the explanatory variables is key to deriving the first statistical property of each OLS estimator: its unbiasedness for the corresponding population parameter. Of course, all of the previous assumptions are used to show unbiasedness.

Assumption MLR.5 (Homoskedasticity)
The error u has the same variance given any values of the explanatory variables. In other words, Var(u|x1, x2, …, xk) = σ².

Compared with Assumption MLR.4, the homoskedasticity assumption is of secondary importance; in particular, Assumption MLR.5 has no bearing on the unbiasedness of the β̂j. Still, homoskedasticity has two important implications:
(1) we can derive formulas for the sampling variances whose components are easy to characterize; and (2) we can conclude, under the Gauss-Markov assumptions MLR.1 through MLR.5, that the OLS estimators have smallest variance among all linear unbiased estimators.

Assumption MLR.6 (Normality)
The population error u is independent of the explanatory variables x1, x2, …, xk and is normally distributed with zero mean and variance σ²: u ~ Normal(0, σ²).

In this chapter, we added Assumption MLR.6 to obtain the exact sampling distributions of t statistics and F statistics, so that we can carry out exact hypotheses tests. In the next chapter, we will see that MLR.6 can be dropped if we have a reasonably large sample size. Assumption MLR.6 does imply a stronger efficiency property of OLS: the OLS estimators have smallest variance among all unbiased estimators; the comparison group is no longer restricted to estimators linear in the {yi: i = 1, 2, …, n}.

Key Terms
Alternative Hypothesis; Classical Linear Model; Classical Linear Model (CLM) Assumptions; Confidence Interval (CI); Critical Value; Denominator Degrees of Freedom; Economic Significance; Exclusion Restrictions; F Statistic; Joint Hypotheses Test; Jointly Insignificant; Jointly Statistically Significant; Minimum Variance Unbiased Estimators; Multiple Hypotheses Test; Multiple Restrictions; Normality Assumption; Null Hypothesis; Numerator Degrees of Freedom; One-Sided Alternative; One-Tailed Test; Overall Significance of the Regression; p-Value; Practical Significance; R-squared Form of the F Statistic; Rejection Rule; Restricted Model; Significance Level; Statistically Insignificant; Statistically Significant; t Ratio; t Statistic; Two-Sided Alternative; Two-Tailed Test; Unrestricted Model

Problems

1. Which of the following can cause the usual OLS t statistics to be invalid (that is, not to have t distributions under H0)?
(i) Heteroskedasticity.
(ii) A sample correlation coefficient of .95 between two independent variables that are in the model.
(iii) Omitting an important explanatory variable.

2. Consider an equation to explain salaries of CEOs in terms of annual firm sales, return on equity (roe, in percentage form), and return on the firm's stock (ros, in percentage form):

log(salary) = β0 + β1log(sales) + β2roe + β3ros + u.

(i) In terms of the model parameters, state the null hypothesis that, after controlling for sales and roe, ros has no effect on CEO salary. State the alternative that better stock market performance increases a CEO's salary.
(ii) Using the data in CEOSAL1, the following equation was obtained by OLS:

log(salary) = 4.32 + .280 log(sales) + .0174 roe + .00024 ros
              (.32)  (.035)            (.0041)     (.00054)
n = 209, R² = .283.

By what percentage is salary predicted to increase if ros increases by 50 points? Does ros have a practically large effect on salary?
(iii) Test the null hypothesis that ros has no effect on salary against the alternative that ros has a positive effect. Carry out the test at the 10% significance level.
(iv) Would you include ros in a final model explaining CEO compensation in terms of firm performance? Explain.
3. The variable rdintens is expenditures on research and development (R&D) as a percentage of sales. Sales are measured in millions of dollars. The variable profmarg is profits as a percentage of sales. Using the data in RDCHEM for 32 firms in the chemical industry, the following equation is estimated:

rdintens = .472 + .321 log(sales) + .050 profmarg
           (1.369) (.216)           (.046)
n = 32, R² = .099.

(i) Interpret the coefficient on log(sales). In particular, if sales increases by 10%, what is the estimated percentage point change in rdintens? Is this an economically large effect?
(ii) Test the hypothesis that R&D intensity does not change with sales against the alternative that it does increase with sales. Do the test at the 5% and 10% levels.
(iii) Interpret the coefficient on profmarg. Is it economically large?
(iv) Does profmarg have a statistically significant effect on rdintens?

4. Are rent rates influenced by the student population in a college town? Let rent be the average monthly rent paid on rental units in a college town in the United States. Let pop denote the total city population, avginc the average city income, and pctstu the student population as a percentage of the total population. One model to test for a relationship is

log(rent) = β0 + β1log(pop) + β2log(avginc) + β3pctstu + u.
estimation is needed for the restricted model because both parameters are specified under H0 This turns out to yield SSR 5 20944899 Carry out the F test for the joint hypothesis iii Now test H0 b2 5 0 b3 5 0 and b4 5 0 in the model price 5 b0 1 b1assess 1 b2lotsize 1 b3sqrft 1 b4bdrms 1 u The Rsquared from estimating this model using the same 88 houses is 829 iv If the variance of price changes with assess lotsize sqrft or bdrms what can you say about the F test from part iii 7 In Example 47 we used data on nonunionized manufacturing firms to estimate the relationship between the scrap rate and other firm characteristics We now look at this example more closely and use all avail able firms i The population model estimated in Example 47 can be written as log1scrap2 5 b0 1 b1hrsemp 1 b2log1sales2 1 b3log1employ2 1 u Copyright 2016 Cengage Learning All Rights Reserved May not be copied scanned or duplicated in whole or in part Due to electronic rights some third party content may be suppressed from the eBook andor eChapters Editorial review has deemed that any suppressed content does not materially affect the overall learning experience Cengage Learning reserves the right to remove additional content at any time if subsequent rights restrictions require it CHAPTER 4 Multiple Regression Analysis Inference 143 Using the 43 observations available for 1987 the estimated equation is log1scrap2 5 1174 2 042 hrsemp 2 951 log1sales2 1 992 log1employ2 14572 10192 13702 13602 n 5 43 R2 5 310 Compare this equation to that estimated using only the 29 nonunionized firms in the sample ii Show that the population model can also be written as log1scrap2 5 b0 1 b1hrsemp 1 b2log1salesemploy2 1 u3log1employ2 1 u where u3 5 b2 1 b3 Hint Recall that log1x2x32 5 log1x22 2 log1x32 Interpret the hypothesis H0 u3 5 0 iii When the equation from part ii is estimated we obtain log1scrap2 5 1174 2 042 hrsemp 2 951 log1salesemploy2 1 041 log1employ2 14572 10192 13702 12052 n 5 43 R2 5 310 Controlling for worker training and for the salestoemployee ratio do bigger firms have larger statistically significant scrap rates iv Test the hypothesis that a 1 increase in salesemploy is associated with a 1 drop in the scrap rate 8 Consider the multiple regression model with three independent variables under the classical linear model assumptions MLR1 through MLR6 y 5 b0 1 b1x1 1 b2x2 1 b3x3 1 u You would like to test the null hypothesis H0 b1 2 3b2 5 1 i Let b 1 and b 2 denote the OLS estimators of b1 and b2 Find Var1b 1 2 3b 22 in terms of the variances of b 1 and b 2 and the covariance between them What is the standard error of b 1 2 3b 2 ii Write the t statistic for testing H0 b1 2 3b2 5 1 iii Define u1 5 b1 2 3b2 and u 1 5 b 1 2 3b 2 Write a regression equation involving b0 u1 b2 and b3 that allows you to directly obtain u 1 and its standard error 9 In Problem 3 in Chapter 3 we estimated the equation sleep 5 363825 2 148 totwrk 2 1113 educ 1 220 age 1112282 10172 15882 11452 n 5 706 R2 5 113 where we now report standard errors along with the estimates i Is either educ or age individually significant at the 5 level against a twosided alternative Show your work ii Dropping educ and age from the equation gives sleep 5 358638 2 151 totwrk 138912 10172 n 5 706 R2 5 103 Are educ and age jointly significant in the original equation at the 5 level Justify your answer iii Does including educ and age in the model greatly affect the estimated tradeoff between sleeping and working iv Suppose that the sleep equation contains 
heteroskedasticity What does this mean about the tests computed in parts i and ii Copyright 2016 Cengage Learning All Rights Reserved May not be copied scanned or duplicated in whole or in part Due to electronic rights some third party content may be suppressed from the eBook andor eChapters Editorial review has deemed that any suppressed content does not materially affect the overall learning experience Cengage Learning reserves the right to remove additional content at any time if subsequent rights restrictions require it 144 PART 1 Regression Analysis with CrossSectional Data 10 Regression analysis can be used to test whether the market efficiently uses information in valuing stocks For concreteness let return be the total return from holding a firms stock over the fouryear period from the end of 1990 to the end of 1994 The efficient markets hypothesis says that these returns should not be systematically related to information known in 1990 If firm characteristics known at the beginning of the period help to predict stock returns then we could use this information in choosing stocks For 1990 let dkr be a firms debt to capital ratio let eps denote the earnings per share let netinc denote net income and let salary denote total compensation for the CEO i Using the data in RETURN the following equation was estimated return 5 21437 1 321 dkr 1 043 eps 2 0051 nentinc 1 0035 salary 16892 12012 10782 100472 100222 n 5 142 R2 5 0395 Test whether the explanatory variables are jointly significant at the 5 level Is any explanatory variable individually significant ii Now reestimate the model using the log form for netinc and salary return 5 23630 1 327 dkr 1 069 eps 2 474 log1netinc2 1 724 log1salary2 139372 12032 10802 13392 16312 n 5 142 R2 5 0330 Do any of your conclusions from part i change iii In this sample some firms have zero debt and others have negative earnings Should we try to use logdkr or logeps in the model to see if these improve the fit Explain iv Overall is the evidence for predictability of stock returns strong or weak 11 The following table was created using the data in CEOSAL2 where standard errors are in parentheses below the coefficients Dependent Variable logsalary Independent Variables 1 2 3 logsales 224 027 158 040 188 040 logmktval 112 050 100 049 Profmarg 0023 0022 0022 0021 Ceoten 0171 0055 comten 0092 0033 intercept 494 020 462 025 457 025 Observations Rsquared 177 281 177 304 177 353 The variable mktval is market value of the firm profmarg is profit as a percentage of sales ceoten is years as CEO with the current company and comten is total years with the company i Comment on the effect of profmarg on CEO salary ii Does market value have a significant effect Explain Copyright 2016 Cengage Learning All Rights Reserved May not be copied scanned or duplicated in whole or in part Due to electronic rights some third party content may be suppressed from the eBook andor eChapters Editorial review has deemed that any suppressed content does not materially affect the overall learning experience Cengage Learning reserves the right to remove additional content at any time if subsequent rights restrictions require it CHAPTER 4 Multiple Regression Analysis Inference 145 iii Interpret the coefficients on ceoten and comten Are these explanatory variables statistically significant iv What do you make of the fact that longer tenure with the company holding the other factors fixed is associated with a lower salary 12 The following analysis was obtained using data in MEAP93 which 
contains schoollevel pass rates as a percent on a tenthgrade math test i The variable expend is expenditures per student in dollars and math10 is the pass rate on the exam The following simple regression relates math10 to lexpend logexpend math10 5 26934 1 1116 lexpend 125532 13172 n 5 408 R2 5 0297 Interpret the coefficient on lexpend In particular if expend increases by 10 what is the estimated percentage point change in math10 What do you make of the large negative intercept estimate The minimum value of lexpend is 811 and its average value is 837 ii Does the small Rsquared in part i imply that spending is correlated with other factors affecting math10 Explain Would you expect the Rsquared to be much higher if expenditures were randomly assigned to schoolsthat is independent of other school and student characteristicsrather than having the school districts determine spending iii When log of enrollment and the percent of students eligible for the federal free lunch program are included the estimated equation becomes math10 5 22314 1 775 lexpend 2 126 lenroll 2 324 lnchprg 124992 13042 10582 10362 n 5 408 R2 5 1893 Comment on what happens to the coefficient on lexpend Is the spending coefficient still statistically different from zero iv What do you make of the Rsquared in part iii What are some other factors that could be used to explain math10 at the school level 13 The data in MEAPSINGLE were used to estimate the following equations relating schoollevel perfor mance on a fourthgrade math test to socioeconomic characteristics of students attending school The vari able free measured at the school level is the percentage of students eligible for the federal free lunch program The variable medinc is median income in the ZIP code and pctsgle is percent of students not liv ing with two parents also measured at the ZIP code level See also Computer Exercise C11 in Chapter 3 math4 5 9677 2 833 pctsgle 11602 10712 n 5 299 R2 5 380 math4 5 9300 2 275 pctsgle 2 402 free 11632 11172 10702 n 5 299 R2 5 459 math4 5 2449 2 274 pctsgle 2 422 free 2 752 lmedinc 1 901 lexppp 159242 11612 10712 153582 14042 n 5 299 R2 5 472 math4 5 1752 2 259 pctsgle 2 420 free 1 880 lexppp 132252 11172 10702 13762 n 5 299 R2 5 472 Copyright 2016 Cengage Learning All Rights Reserved May not be copied scanned or duplicated in whole or in part Due to electronic rights some third party content may be suppressed from the eBook andor eChapters Editorial review has deemed that any suppressed content does not materially affect the overall learning experience Cengage Learning reserves the right to remove additional content at any time if subsequent rights restrictions require it 146 PART 1 Regression Analysis with CrossSectional Data i Interpret the coefficient on the variable pctsgle in the first equation Comment on what happens when free is added as an explanatory variable ii Does expenditure per pupil entered in logarithmic form have a statistically significant effect on performance How big is the estimated effect iii If you had to choose among the four equations as your best estimate of the effect of pctsgle and obtain a 95 confidence interval of bpctsgle which would you choose Why Computer Exercises C1 The following model can be used to study whether campaign expenditures affect election outcomes voteA 5 b0 1 b1log1expendA2 1 b2log1expendB2 1 b3prtystrA 1 u where voteA is the percentage of the vote received by Candidate A expendA and expendB are cam paign expenditures by Candidates A and B and prtystrA is a measure of party 
Computer Exercises

C1 The following model can be used to study whether campaign expenditures affect election outcomes:

$voteA = \beta_0 + \beta_1 \log(expendA) + \beta_2 \log(expendB) + \beta_3\,prtystrA + u,$

where voteA is the percentage of the vote received by Candidate A, expendA and expendB are campaign expenditures by Candidates A and B, and prtystrA is a measure of party strength for Candidate A (the percentage of the most recent presidential vote that went to A's party).
(i) What is the interpretation of $\beta_1$?
(ii) In terms of the parameters, state the null hypothesis that a 1% increase in A's expenditures is offset by a 1% increase in B's expenditures.
(iii) Estimate the given model using the data in VOTE1 and report the results in usual form. Do A's expenditures affect the outcome? What about B's expenditures? Can you use these results to test the hypothesis in part (ii)?
(iv) Estimate a model that directly gives the t statistic for testing the hypothesis in part (ii). What do you conclude? (Use a two-sided alternative.)

C2 Use the data in LAWSCH85 for this exercise.
(i) Using the same model as in Problem 4 in Chapter 3, state and test the null hypothesis that the rank of law schools has no ceteris paribus effect on median starting salary.
(ii) Are features of the incoming class of students, namely LSAT and GPA, individually or jointly significant for explaining salary? (Be sure to account for missing data on LSAT and GPA.)
(iii) Test whether the size of the entering class (clsize) or the size of the faculty (faculty) needs to be added to this equation; carry out a single test. (Be careful to account for missing data on clsize and faculty.)
(iv) What factors might influence the rank of the law school that are not included in the salary regression?

C3 Refer to Computer Exercise C2 in Chapter 3. Now, use the log of the housing price as the dependent variable:

$\log(price) = \beta_0 + \beta_1 sqrft + \beta_2 bdrms + u.$

(i) You are interested in estimating and obtaining a confidence interval for the percentage change in price when a 150-square-foot bedroom is added to a house. In decimal form, this is $\theta_1 = 150\beta_1 + \beta_2$. Use the data in HPRICE1 to estimate $\theta_1$.
(ii) Write $\beta_2$ in terms of $\theta_1$ and $\beta_1$, and plug this into the log(price) equation.
(iii) Use part (ii) to obtain a standard error for $\hat{\theta}_1$ and use this standard error to construct a 95% confidence interval.

C4 In Example 4.9, the restricted version of the model can be estimated using all 1,388 observations in the sample. Compute the R-squared from the regression of bwght on cigs, parity, and faminc using all observations. Compare this to the R-squared reported for the restricted model in Example 4.9.

C5 Use the data in MLB1 for this exercise.
(i) Use the model estimated in equation (4.31) and drop the variable rbisyr. What happens to the statistical significance of hrunsyr? What about the size of the coefficient on hrunsyr?
(ii) Add the variables runsyr (runs per year), fldperc (fielding percentage), and sbasesyr (stolen bases per year) to the model from part (i). Which of these factors are individually significant?
(iii) In the model from part (ii), test the joint significance of bavg, fldperc, and sbasesyr.

C6 Use the data in WAGE2 for this exercise.
(i) Consider the standard wage equation

$\log(wage) = \beta_0 + \beta_1 educ + \beta_2 exper + \beta_3 tenure + u.$

State the null hypothesis that another year of general workforce experience has the same effect on log(wage) as another year of tenure with the current employer.
(ii) Test the null hypothesis in part (i) against a two-sided alternative, at the 5% significance level, by constructing a 95% confidence interval. What do you conclude?
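Computer Exercises C3 and C6 both rely on the same device for getting a standard error for a single linear combination of parameters: substitute the combination into the model and re-estimate, so that the quantity of interest appears as its own coefficient. A minimal sketch for C3's $\theta_1 = 150\beta_1 + \beta_2$, assuming the HPRICE1 data sit in a file hprice1.csv with columns price, sqrft, and bdrms (our naming, not the text's):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical file/column names for the HPRICE1 data.
df = pd.read_csv("hprice1.csv")
df["lprice"] = np.log(df["price"])

# theta1 = 150*beta1 + beta2, so substitute beta2 = theta1 - 150*beta1:
# log(price) = b0 + b1*(sqrft - 150*bdrms) + theta1*bdrms + u.
df["sqrft_adj"] = df["sqrft"] - 150 * df["bdrms"]
res = smf.ols("lprice ~ sqrft_adj + bdrms", data=df).fit()

theta1_hat = res.params["bdrms"]   # estimate of theta1
se_theta1 = res.bse["bdrms"]       # its standard error, read off directly
ci = res.conf_int().loc["bdrms"]   # 95% confidence interval
print(theta1_hat, se_theta1, ci.values)
```

The same reparameterization idea answers C6: defining $\theta = \beta_2 - \beta_3$ and substituting turns the test of equal effects into an ordinary t test on a single coefficient.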
C7 Refer to the example used in Section 4.4. You will use the data set TWOYEAR.
(i) The variable phsrank is the person's high school percentile. (A higher number is better. For example, 90 means you are ranked better than 90 percent of your graduating class.) Find the smallest, largest, and average phsrank in the sample.
(ii) Add phsrank to equation (4.26) and report the OLS estimates in the usual form. Is phsrank statistically significant? How much is 10 percentage points of high school rank worth in terms of wage?
(iii) Does adding phsrank to (4.26) substantively change the conclusions on the returns to two- and four-year colleges? Explain.
(iv) The data set contains a variable called id. Explain why, if you add id to equation (4.17) or (4.26), you expect it to be statistically insignificant. What is the two-sided p-value?

C8 The data set 401KSUBS contains information on net financial wealth (nettfa), age of the survey respondent (age), annual family income (inc), family size (fsize), and participation in certain pension plans for people in the United States. The wealth and income variables are both recorded in thousands of dollars. For this question, use only the data for single-person households (so fsize = 1).
(i) How many single-person households are there in the data set?
(ii) Use OLS to estimate the model

$nettfa = \beta_0 + \beta_1 inc + \beta_2 age + u,$

and report the results using the usual format. Be sure to use only the single-person households in the sample. Interpret the slope coefficients. Are there any surprises in the slope estimates?
(iii) Does the intercept from the regression in part (ii) have an interesting meaning? Explain.
(iv) Find the p-value for the test $H_0: \beta_2 = 1$ against $H_1: \beta_2 \neq 1$. Do you reject $H_0$ at the 1% significance level?
(v) If you do a simple regression of nettfa on inc, is the estimated coefficient on inc much different from the estimate in part (ii)? Why or why not?

C9 Use the data in DISCRIM to answer this question. (See also Computer Exercise C8 in Chapter 3.)
(i) Use OLS to estimate the model

$\log(psoda) = \beta_0 + \beta_1 prpblck + \beta_2 \log(income) + \beta_3 prppov + u,$

and report the results in the usual form. Is $\hat{\beta}_1$ statistically different from zero at the 5% level against a two-sided alternative? What about at the 1% level?
(ii) What is the correlation between log(income) and prppov? Is each variable statistically significant in any case? Report the two-sided p-values.
(iii) To the regression in part (i), add the variable log(hseval). Interpret its coefficient and report the two-sided p-value for $H_0: \beta_{\log(hseval)} = 0$.
(iv) In the regression in part (iii), what happens to the individual statistical significance of log(income) and prppov? Are these variables jointly significant? (Compute a p-value.) What do you make of your answers?
(v) Given the results of the previous regressions, which one would you report as most reliable in determining whether the racial makeup of a ZIP code influences local fast-food prices?
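Part (iv) of C8 tests a null other than zero, so the t statistic printed by default regression output cannot be used as is; the statistic must be recentered at the hypothesized value. A sketch of the computation, again with hypothetical file and column names:

```python
import pandas as pd
import statsmodels.formula.api as smf
from scipy import stats

# Hypothetical file/column names for the 401KSUBS data.
df = pd.read_csv("401ksubs.csv")
single = df[df["fsize"] == 1]      # single-person households only

res = smf.ols("nettfa ~ inc + age", data=single).fit()

# t statistic for H0: beta_age = 1 (not the usual zero null).
t_stat = (res.params["age"] - 1) / res.bse["age"]
p_value = 2 * stats.t.sf(abs(t_stat), res.df_resid)  # two-sided p-value
print(t_stat, p_value)
```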
log of average teacher salary and bs is the ratio of average benefits to average salary (by school).
(i) Run the simple regression of lavgsal on bs. Is the estimated slope statistically different from zero? Is it statistically different from −1?
(ii) Add the variables lenrol and lstaff to the regression from part (i). What happens to the coefficient on bs? How does the situation compare with that in Table 4.1?
(iii) How come the standard error on the bs coefficient is smaller in part (ii) than in part (i)? (Hint: What happens to the error variance versus multicollinearity when lenrol and lstaff are added?)
(iv) How come the coefficient on lstaff is negative? Is it large in magnitude?
(v) Now add the variable lunch to the regression. Holding other factors fixed, are teachers being compensated for teaching students from disadvantaged backgrounds? Explain.
(vi) Overall, is the pattern of results that you find with ELEM94_95 consistent with the pattern in Table 4.1?

C11 Use the data in HTV to answer this question. See also Computer Exercise C10 in Chapter 3.
(i) Estimate the regression model

$educ = \beta_0 + \beta_1 motheduc + \beta_2 fatheduc + \beta_3 abil + \beta_4 abil^2 + u$

by OLS and report the results in the usual form. Test the null hypothesis that educ is linearly related to abil against the alternative that the relationship is quadratic.
(ii) Using the equation in part (i), test $H_0: \beta_1 = \beta_2$ against a two-sided alternative. What is the p-value of the test?
(iii) Add the two college tuition variables to the regression from part (i) and determine whether they are jointly statistically significant.
(iv) What is the correlation between tuit17 and tuit18? Explain why using the average of the tuition over the two years might be preferred to adding each separately. What happens when you do use the average?
(v) Do the findings for the average tuition variable in part (iv) make sense when interpreted causally? What might be going on?

C12 Use the data in ECONMATH to answer the following questions.
(i) Estimate a model explaining colgpa in terms of hsgpa, actmth, and acteng. Report the results in the usual form. Are all explanatory variables statistically significant?
(ii) Consider an increase in hsgpa of one standard deviation, about .343. By how much does $\widehat{colgpa}$ increase, holding actmth and acteng fixed? About how many standard deviations would actmth have to increase to change $\widehat{colgpa}$ by the same amount as a one standard deviation increase in hsgpa? Comment.
(iii) Test the null hypothesis that actmth and acteng have the same effect (in the population) against a two-sided alternative. Report the p-value and describe your conclusions.
(iv) Suppose the college admissions officer wants you to use the data on the variables in part (i) to create an equation that explains at least 50 percent of the variation in colgpa. What would you tell the officer?
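Part (ii) of C12 compares effects measured in standard deviation units. One way to organize that calculation, assuming the ECONMATH data are stored as econmath.csv with the column names used in the exercise (our assumption):

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical file/column names for the ECONMATH data.
df = pd.read_csv("econmath.csv")
res = smf.ols("colgpa ~ hsgpa + actmth + acteng", data=df).fit()

sd_hsgpa = df["hsgpa"].std()     # about .343 according to the exercise
sd_actmth = df["actmth"].std()

# Effect on colgpa of a one-sd increase in hsgpa, other regressors fixed.
effect_hsgpa = res.params["hsgpa"] * sd_hsgpa

# Number of sd's actmth must move to produce the same change in colgpa.
sd_needed = effect_hsgpa / (res.params["actmth"] * sd_actmth)
print(effect_hsgpa, sd_needed)
```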
Chapter 5
Multiple Regression Analysis: OLS Asymptotics

In Chapters 3 and 4, we covered what are called finite sample, small sample, or exact properties of the OLS estimators in the population model

$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_k x_k + u.$  (5.1)

For example, the unbiasedness of OLS (derived in Chapter 3) under the first four Gauss-Markov assumptions is a finite sample property because it holds for any sample size n (subject to the mild restriction that n must be at least as large as the total number of parameters in the regression model, k + 1). Similarly, the fact that OLS is the best linear unbiased estimator under the full set of Gauss-Markov assumptions (MLR.1 through MLR.5) is a finite sample property.

In Chapter 4, we added the classical linear model Assumption MLR.6, which states that the error term u is normally distributed and independent of the explanatory variables. This allowed us to derive the exact sampling distributions of the OLS estimators (conditional on the explanatory variables in the sample). In particular, Theorem 4.1 showed that the OLS estimators have normal sampling distributions, which led directly to the t and F distributions for t and F statistics. If the error is not normally distributed, the distribution of a t statistic is not exactly t, and an F statistic does not have an exact F distribution for any sample size.

In addition to finite sample properties, it is important to know the asymptotic properties or large sample properties of estimators and test statistics. These properties are not defined for a particular sample size; rather, they are defined as the sample size grows without bound. Fortunately, under the assumptions we have made, OLS has satisfactory large sample properties. One practically important finding is that even without the normality assumption (Assumption MLR.6), t and F statistics have approximately t and F distributions, at least in large sample sizes. We discuss this in more detail in Section 5.2, after we cover the consistency of OLS in Section 5.1.

Because the material in this chapter is more difficult to understand, and because one can conduct empirical work without a deep understanding of its contents, this chapter may be skipped. However, we will necessarily refer to large sample properties of OLS when we study discrete response variables in Chapter 7, relax the homoskedasticity assumption in Chapter 8, and delve into estimation with time series data in Part 2. Furthermore, virtually all advanced econometric methods derive their justification using large-sample analysis, so readers who will continue into Part 3 should be familiar with the contents of this chapter.

5.1 Consistency

Unbiasedness of estimators, although important, cannot always be obtained. For example, as we discussed in Chapter 3, the standard error of the regression, $\hat{\sigma}$, is not an unbiased estimator for $\sigma$, the standard deviation of the error u, in a multiple regression model. Although the OLS estimators are unbiased under MLR.1 through MLR.4, in Chapter 11 we will find that there are time series regressions where the OLS estimators are not unbiased. Further, in Part 3 of the text, we encounter several other estimators that are biased yet useful. Although not all useful estimators are unbiased, virtually all economists agree that consistency is a minimal requirement for an estimator. The Nobel Prize-winning econometrician Clive W. J. Granger once remarked, "If you can't get it right as n goes to infinity, you shouldn't be in this business."
The implication is that, if your estimator of a particular population parameter is not consistent, then you are wasting your time.

There are a few different ways to describe consistency. Formal definitions and results are given in Appendix C; here, we focus on an intuitive understanding. For concreteness, let $\hat{\beta}_j$ be the OLS estimator of $\beta_j$ for some j. For each n, $\hat{\beta}_j$ has a probability distribution (representing its possible values in different random samples of size n). Because $\hat{\beta}_j$ is unbiased under Assumptions MLR.1 through MLR.4, this distribution has mean value $\beta_j$. If this estimator is consistent, then the distribution of $\hat{\beta}_j$ becomes more and more tightly distributed around $\beta_j$ as the sample size grows. As n tends to infinity, the distribution of $\hat{\beta}_j$ collapses to the single point $\beta_j$. In effect, this means that we can make our estimator arbitrarily close to $\beta_j$ if we can collect as much data as we want. This convergence is illustrated in Figure 5.1.

Naturally, for any application we have a fixed sample size, which is a major reason an asymptotic property such as consistency can be difficult to grasp. Consistency involves a thought experiment about what would happen as the sample size gets large (while, at the same time, we obtain numerous random samples for each sample size). If obtaining more and more data does not generally get us closer to the parameter value of interest, then we are using a poor estimation procedure.

Conveniently, the same set of assumptions implies both unbiasedness and consistency of OLS. We summarize with a theorem.

Theorem 5.1 (Consistency of OLS): Under Assumptions MLR.1 through MLR.4, the OLS estimator $\hat{\beta}_j$ is consistent for $\beta_j$, for all $j = 0, 1, \dots, k$.

[Figure 5.1: Sampling distributions of $\hat{\beta}_1$ for sample sizes $n_1 < n_2 < n_3$.]

A general proof of this result is most easily developed using the matrix algebra methods described in Appendices D and E. But we can prove Theorem 5.1 without difficulty in the case of the simple regression model. We focus on the slope estimator, $\hat{\beta}_1$.

The proof starts out the same as the proof of unbiasedness: we write down the formula for $\hat{\beta}_1$, and then plug in $y_i = \beta_0 + \beta_1 x_{i1} + u_i$:

$\hat{\beta}_1 = \left(\sum_{i=1}^n (x_{i1} - \bar{x}_1)y_i\right) \Big/ \left(\sum_{i=1}^n (x_{i1} - \bar{x}_1)^2\right) = \beta_1 + \left(n^{-1}\sum_{i=1}^n (x_{i1} - \bar{x}_1)u_i\right) \Big/ \left(n^{-1}\sum_{i=1}^n (x_{i1} - \bar{x}_1)^2\right),$  (5.2)

where dividing both the numerator and denominator by n does not change the expression but allows us to directly apply the law of large numbers. When we apply the law of large numbers to the averages in the second part of equation (5.2), we conclude that the numerator and denominator converge in probability to the population quantities, $\mathrm{Cov}(x_1, u)$ and $\mathrm{Var}(x_1)$, respectively. Provided that $\mathrm{Var}(x_1) \neq 0$, which is assumed in MLR.3, we can use the properties of probability limits (see Appendix C) to get

$\mathrm{plim}\,\hat{\beta}_1 = \beta_1 + \mathrm{Cov}(x_1, u)/\mathrm{Var}(x_1) = \beta_1,$  (5.3)

because $\mathrm{Cov}(x_1, u) = 0$. We have used the fact, discussed in Chapters 2 and 3, that $E(u \mid x_1) = 0$ (Assumption MLR.4) implies that $x_1$ and u are uncorrelated (have zero covariance).
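The thought experiment behind Theorem 5.1 and Figure 5.1 is easy to mimic by simulation. The sketch below (our own illustration, not part of the text) draws many random samples at several sample sizes from a model satisfying MLR.1 through MLR.4 and shows the sampling distribution of $\hat{\beta}_1$ tightening around $\beta_1 = 1$:

```python
import numpy as np

rng = np.random.default_rng(12345)
beta0, beta1 = 2.0, 1.0

def ols_slope(n):
    """Slope estimate from one simulated sample of size n."""
    x = rng.normal(0, 1, n)
    u = rng.normal(0, 2, n)   # MLR.4 holds: u is independent of x
    y = beta0 + beta1 * x + u
    return np.sum((x - x.mean()) * y) / np.sum((x - x.mean()) ** 2)

for n in (25, 100, 400, 1600):
    draws = np.array([ols_slope(n) for _ in range(2000)])
    # The mean stays near beta1 (unbiasedness); the spread shrinks (consistency).
    print(f"n = {n:5d}: mean = {draws.mean():.4f}, sd = {draws.std():.4f}")
```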
As a technical matter, to ensure that the probability limits exist, we should assume that $\mathrm{Var}(x_1) < \infty$ and $\mathrm{Var}(u) < \infty$ (which means that their probability distributions are not too spread out), but we will not worry about cases where these assumptions might fail. Further, we could, and in an advanced treatment of econometrics we would, explicitly relax Assumption MLR.3 to rule out only perfect collinearity in the population. As stated, Assumption MLR.3 also disallows perfect collinearity among the regressors in the sample we have at hand. Technically, for the thought experiment we can show consistency with no perfect collinearity in the population, allowing for the unlucky possibility that we draw a data set that does exhibit perfect collinearity. From a practical perspective the distinction is unimportant, as we cannot compute the OLS estimates for our sample if MLR.3 fails.

The previous arguments, and equation (5.3) in particular, show that OLS is consistent in the simple regression case if we assume only zero correlation. This is also true in the general case. We now state this as an assumption.

Assumption MLR.4′ (Zero Mean and Zero Correlation): $E(u) = 0$ and $\mathrm{Cov}(x_j, u) = 0$, for $j = 1, 2, \dots, k$.

Assumption MLR.4′ is weaker than Assumption MLR.4 in the sense that the latter implies the former. One way to characterize the zero conditional mean assumption, $E(u \mid x_1, \dots, x_k) = 0$, is that any function of the explanatory variables is uncorrelated with u. Assumption MLR.4′ requires only that each $x_j$ is uncorrelated with u (and that u has a zero mean in the population). In Chapter 2, we actually motivated the OLS estimator for simple regression using Assumption MLR.4′, and the first order conditions for OLS in the multiple regression case, given in equation (3.13), are simply the sample analogs of the population zero correlation assumptions (and zero mean assumption). Therefore, in some ways, Assumption MLR.4′ is more natural an assumption because it leads directly to the OLS estimates. Further, when we think about violations of Assumption MLR.4, we usually think in terms of $\mathrm{Cov}(x_j, u) \neq 0$ for some j. So how come we have used Assumption MLR.4 until now? There are two reasons, both of which we have touched on earlier. First, OLS turns out to be biased (but consistent) under Assumption MLR.4′ if $E(u \mid x_1, \dots, x_k)$ depends on any of the $x_j$. Because we have previously focused on finite sample, or exact, sampling properties of the OLS estimators, we have needed the stronger zero conditional mean assumption.

Second, and probably more important, is that the zero conditional mean assumption means that we have properly modeled the population regression function (PRF). That is, under Assumption MLR.4 we can write

$E(y \mid x_1, \dots, x_k) = \beta_0 + \beta_1 x_1 + \dots + \beta_k x_k,$

and so we can obtain partial effects of the explanatory variables on the average, or expected, value of y. If we instead only assume Assumption MLR.4′, $\beta_0 + \beta_1 x_1 + \dots + \beta_k x_k$ need not represent the PRF, and we face the possibility that some nonlinear functions of the $x_j$, such as $x_j^2$, could be correlated with the error u. A situation like this means that we have neglected nonlinearities in the model that could help us better explain y; if we knew that, we would usually include such nonlinear functions. In other words, most of the time we hope to get a good estimate of the PRF, and so the zero conditional mean assumption is natural. Nevertheless, the weaker zero correlation assumption turns out to be useful in interpreting OLS estimation of a linear model as providing the best linear approximation to the PRF. It is also used in more advanced settings, such as in Chapter 15, where we have no interest in modeling a PRF. For further discussion of this somewhat subtle point, see Wooldridge (2010, Chapter 4).
5.1a Deriving the Inconsistency in OLS

Just as failure of $E(u \mid x_1, \dots, x_k) = 0$ causes bias in the OLS estimators, correlation between u and any of $x_1, x_2, \dots, x_k$ generally causes all of the OLS estimators to be inconsistent. This simple but important observation is often summarized as: if the error is correlated with any of the independent variables, then OLS is biased and inconsistent. This is very unfortunate because it means that any bias persists as the sample size grows.

In the simple regression case, we can obtain the inconsistency from the first part of equation (5.3), which holds whether or not u and $x_1$ are uncorrelated. The inconsistency in $\hat{\beta}_1$ (sometimes loosely called the asymptotic bias) is

$\mathrm{plim}\,\hat{\beta}_1 - \beta_1 = \mathrm{Cov}(x_1, u)/\mathrm{Var}(x_1).$  (5.4)

Because $\mathrm{Var}(x_1) > 0$, the inconsistency in $\hat{\beta}_1$ is positive if $x_1$ and u are positively correlated, and the inconsistency is negative if $x_1$ and u are negatively correlated. If the covariance between $x_1$ and u is small relative to the variance in $x_1$, the inconsistency can be negligible; unfortunately, we cannot even estimate how big the covariance is because u is unobserved.

We can use (5.4) to derive the asymptotic analog of the omitted variable bias (see Table 3.2 in Chapter 3). Suppose the true model,

$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + v,$

satisfies the first four Gauss-Markov assumptions. Then v has a zero mean and is uncorrelated with $x_1$ and $x_2$. If $\hat{\beta}_0$, $\hat{\beta}_1$, and $\hat{\beta}_2$ denote the OLS estimators from the regression of y on $x_1$ and $x_2$, then Theorem 5.1 implies that these estimators are consistent. If we omit $x_2$ from the regression and do the simple regression of y on $x_1$, then $u = \beta_2 x_2 + v$. Let $\tilde{\beta}_1$ denote the simple regression slope estimator. Then

$\mathrm{plim}\,\tilde{\beta}_1 = \beta_1 + \beta_2\delta_1,$  (5.5)

where

$\delta_1 = \mathrm{Cov}(x_1, x_2)/\mathrm{Var}(x_1).$  (5.6)

Thus, for practical purposes, we can view the inconsistency as being the same as the bias. The difference is that the inconsistency is expressed in terms of the population variance of $x_1$ and the population covariance between $x_1$ and $x_2$, while the bias is based on their sample counterparts (because we condition on the values of $x_1$ and $x_2$ in the sample).

If $x_1$ and $x_2$ are uncorrelated (in the population), then $\delta_1 = 0$, and $\tilde{\beta}_1$ is a consistent estimator of $\beta_1$ (although not necessarily unbiased). If $x_2$ has a positive partial effect on y, so that $\beta_2 > 0$, and $x_1$ and $x_2$ are positively correlated, so that $\delta_1 > 0$, then the inconsistency in $\tilde{\beta}_1$ is positive, and so on. We can obtain the direction of the inconsistency (or asymptotic bias) from Table 3.2. If the covariance between $x_1$ and $x_2$ is small relative to the variance of $x_1$, the inconsistency can be small.
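Equation (5.5) is easy to verify numerically. The sketch below (our illustration) builds a model where $\delta_1 = .8$ and $\beta_2 = .5$, so the short-regression slope should settle at $\beta_1 + \beta_2\delta_1 = 1.4$ rather than at $\beta_1 = 1$ as the sample grows:

```python
import numpy as np

rng = np.random.default_rng(0)
beta1, beta2 = 1.0, 0.5

# x2 = 0.8*x1 + e, so delta1 = Cov(x1, x2)/Var(x1) = 0.8 and the
# predicted plim of the short-regression slope is beta1 + beta2*0.8 = 1.4.
for n in (100, 10_000, 1_000_000):
    x1 = rng.normal(0, 1, n)
    x2 = 0.8 * x1 + rng.normal(0, 1, n)
    y = beta1 * x1 + beta2 * x2 + rng.normal(0, 1, n)
    slope = np.sum((x1 - x1.mean()) * y) / np.sum((x1 - x1.mean()) ** 2)
    print(f"n = {n:>9,}: short-regression slope = {slope:.4f}")
```

The estimates drift toward 1.4, not 1.0, as n grows, which is exactly the sense in which the bias does not go away with more data.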
Example 5.1: Housing Prices and Distance from an Incinerator

Let y denote the price of a house (price), let $x_1$ denote the distance from the house to a new trash incinerator (distance), and let $x_2$ denote the quality of the house (quality). The variable quality is left vague so that it can include things like size of the house and lot, number of bedrooms and bathrooms, and intangibles such as attractiveness of the neighborhood. If the incinerator depresses house prices, then $\beta_1$ should be positive: everything else being equal, a house that is farther away from the incinerator is worth more. By definition, $\beta_2$ is positive since higher quality houses sell for more, other factors being equal. If the incinerator was built farther away, on average, from better homes, then distance and quality are positively correlated, and so $\delta_1 > 0$. A simple regression of price on distance (or log(price) on log(distance)) will tend to overestimate the effect of the incinerator: $\beta_1 + \beta_2\delta_1 > \beta_1$.

An important point about inconsistency in OLS estimators is that, by definition, the problem does not go away by adding more observations to the sample. If anything, the problem gets worse with more data: the OLS estimator gets closer and closer to $\beta_1 + \beta_2\delta_1$ as the sample size grows.

Deriving the sign and magnitude of the inconsistency in the general k regressor case is harder, just as deriving the bias is more difficult. We need to remember that if we have the model in equation (5.1) where, say, $x_1$ is correlated with u but the other independent variables are uncorrelated with u, all of the OLS estimators are generally inconsistent. For example, in the k = 2 case, $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + u$, suppose that $x_2$ and u are uncorrelated but $x_1$ and u are correlated. Then the OLS estimators $\hat{\beta}_1$ and $\hat{\beta}_2$ will generally both be inconsistent. (The intercept will also be inconsistent.) The inconsistency in $\hat{\beta}_2$ arises when $x_1$ and $x_2$ are correlated, as is usually the case. If $x_1$ and $x_2$ are uncorrelated, then any correlation between $x_1$ and u does not result in the inconsistency of $\hat{\beta}_2$: $\mathrm{plim}\,\hat{\beta}_2 = \beta_2$. Further, the inconsistency in $\hat{\beta}_1$ is the same as in (5.4). The same statement holds in the general case: if $x_1$ is correlated with u, but $x_1$ and u are uncorrelated with the other independent variables, then only $\hat{\beta}_1$ is inconsistent, and the inconsistency is given by (5.4). The general case is very similar to the omitted variable case in Section 3A.4 of Appendix 3A.

5.2 Asymptotic Normality and Large Sample Inference

Consistency of an estimator is an important property, but it alone does not allow us to perform statistical inference. Simply knowing that the estimator is getting closer to the population value as the sample size grows does not allow us to test hypotheses about the parameters. For testing, we need the sampling distribution of the OLS estimators. Under the classical linear model assumptions MLR.1 through MLR.6, Theorem 4.1 shows that the sampling distributions are normal. This result is the basis for deriving the t and F distributions that we use so often in applied econometrics.
The exact normality of the OLS estimators hinges crucially on the normality of the distribution of the error, u, in the population. If the errors $u_1, u_2, \dots, u_n$ are random draws from some distribution other than the normal, the $\hat{\beta}_j$ will not be normally distributed, which means that the t statistics will not have t distributions and the F statistics will not have F distributions. This is a potentially serious problem because our inference hinges on being able to obtain critical values or p-values from the t or F distributions.

Recall that Assumption MLR.6 is equivalent to saying that the distribution of y given $x_1, x_2, \dots, x_k$ is normal. Because y is observed and u is not, in a particular application it is much easier to think about whether the distribution of y is likely to be normal. In fact, we have already seen a few examples where y definitely cannot have a conditional normal distribution. A normally distributed random variable is symmetrically distributed about its mean, it can take on any positive or negative value, and more than 95% of the area under the distribution is within two standard deviations.

Exploring Further 5.1: Suppose that the model $score = \beta_0 + \beta_1 skipped + \beta_2 priGPA + u$ satisfies the first four Gauss-Markov assumptions, where score is score on a final exam, skipped is number of classes skipped, and priGPA is GPA prior to the current semester. If $\tilde{\beta}_1$ is from the simple regression of score on skipped, what is the direction of the asymptotic bias in $\tilde{\beta}_1$?

In Example 3.5, we estimated a model explaining the number of arrests of young men during a particular year (narr86). In the population, most men are not arrested during the year, and the vast majority are arrested one time at the most. (In the sample of 2,725 men in the data set CRIME1, fewer than 8% were arrested more than once during 1986.) Because narr86 takes on only two values for 92% of the sample, it cannot be close to being normally distributed in the population.

In Example 4.6, we estimated a model explaining participation percentages (prate) in 401(k) pension plans. The frequency distribution (also called a histogram) in Figure 5.2 shows that the distribution of prate is heavily skewed to the right, rather than being normally distributed. In fact, over 40% of the observations on prate are at the value 100, indicating 100% participation. This violates the normality assumption even conditional on the explanatory variables.

We know that normality plays no role in the unbiasedness of OLS, nor does it affect the conclusion that OLS is the best linear unbiased estimator under the Gauss-Markov assumptions. But exact inference based on t and F statistics requires MLR.6. Does this mean that, in our prior analysis of prate in Example 4.6, we must abandon the t statistics for determining which variables are statistically significant? Fortunately, the answer to this question is no. Even though the $y_i$ are not from a normal distribution, we can use the central limit theorem from Appendix C to conclude that the OLS estimators satisfy asymptotic normality, which means they are approximately normally distributed in large enough sample sizes.
[Figure 5.2: Histogram of prate using the data in 401K. Horizontal axis: participation rate, in percentage form (0 to 100); vertical axis: proportion in cell.]

Theorem 5.2 (Asymptotic Normality of OLS): Under the Gauss-Markov Assumptions MLR.1 through MLR.5,
(i) $\sqrt{n}(\hat{\beta}_j - \beta_j) \overset{a}{\sim} \mathrm{Normal}(0, \sigma^2/a_j^2)$, where $\sigma^2/a_j^2 > 0$ is the asymptotic variance of $\sqrt{n}(\hat{\beta}_j - \beta_j)$; for the slope coefficients, $a_j^2 = \mathrm{plim}\left(n^{-1}\sum_{i=1}^n \hat{r}_{ij}^2\right)$, where the $\hat{r}_{ij}$ are the residuals from regressing $x_j$ on the other independent variables. We say that $\hat{\beta}_j$ is asymptotically normally distributed (see Appendix C);
(ii) $\hat{\sigma}^2$ is a consistent estimator of $\sigma^2 = \mathrm{Var}(u)$;
(iii) For each j,

$(\hat{\beta}_j - \beta_j)/\mathrm{sd}(\hat{\beta}_j) \overset{a}{\sim} \mathrm{Normal}(0,1)$ and $(\hat{\beta}_j - \beta_j)/\mathrm{se}(\hat{\beta}_j) \overset{a}{\sim} \mathrm{Normal}(0,1),$  (5.7)

where $\mathrm{se}(\hat{\beta}_j)$ is the usual OLS standard error.

The proof of asymptotic normality is somewhat complicated and is sketched in the appendix for the simple regression case. Part (ii) follows from the law of large numbers, and part (iii) follows from parts (i) and (ii) and the asymptotic properties discussed in Appendix C.

Theorem 5.2 is useful because the normality Assumption MLR.6 has been dropped; the only restriction on the distribution of the error is that it has finite variance, something we will always assume. We have also assumed zero conditional mean (MLR.4) and homoskedasticity of u (MLR.5).

In trying to understand the meaning of Theorem 5.2, it is important to keep separate the notions of the population distribution of the error term, u, and the sampling distributions of the $\hat{\beta}_j$ as the sample size grows. A common mistake is to think that something is happening to the distribution of u, namely, that it is getting closer to normal, as the sample size grows. But remember that the population distribution is immutable and has nothing to do with the sample size. For example, we previously discussed narr86, the number of times a young man is arrested during the year 1986. The nature of this variable (it takes on small, nonnegative integer values) is fixed in the population. Whether we sample 10 men or 1,000 men from this population obviously has no effect on the population distribution.

What Theorem 5.2 says is that, regardless of the population distribution of u, the OLS estimators, when properly standardized, have approximate standard normal distributions. This approximation comes about by the central limit theorem because the OLS estimators involve, in a complicated way, the use of sample averages. Effectively, the sequence of distributions of averages of the underlying errors is approaching normality for virtually any population distribution.

Notice how the standardized $\hat{\beta}_j$ has an asymptotic standard normal distribution whether we divide the difference $\hat{\beta}_j - \beta_j$ by $\mathrm{sd}(\hat{\beta}_j)$ (which we do not observe because it depends on $\sigma$) or by $\mathrm{se}(\hat{\beta}_j)$ (which we can compute from our data because it depends on $\hat{\sigma}$). In other words, from an asymptotic point of view it does not matter that we have to replace $\sigma$ with $\hat{\sigma}$. Of course, replacing $\sigma$ with $\hat{\sigma}$ affects the exact distribution of the standardized $\hat{\beta}_j$. We just saw in Chapter 4 that, under the classical linear model assumptions, $(\hat{\beta}_j - \beta_j)/\mathrm{sd}(\hat{\beta}_j)$ has an exact Normal(0,1) distribution and $(\hat{\beta}_j - \beta_j)/\mathrm{se}(\hat{\beta}_j)$ has an exact $t_{n-k-1}$ distribution.

How should we use the result in equation (5.7)? It may seem one consequence is that, if we are going to appeal to large-sample analysis, we should now use the standard normal distribution for inference rather than the t distribution. But from a practical perspective it is just as legitimate to write
$(\hat{\beta}_j - \beta_j)/\mathrm{se}(\hat{\beta}_j) \overset{a}{\sim} t_{n-k-1} = t_{df},$  (5.8)

because $t_{df}$ approaches the Normal(0,1) distribution as df gets large. Because we know under the CLM assumptions the $t_{n-k-1}$ distribution holds exactly, it makes sense to treat $(\hat{\beta}_j - \beta_j)/\mathrm{se}(\hat{\beta}_j)$ as a $t_{n-k-1}$ random variable generally, even when MLR.6 does not hold.

Equation (5.8) tells us that t testing and the construction of confidence intervals are carried out exactly as under the classical linear model assumptions. This means that our analysis of dependent variables like prate and narr86 does not have to change at all if the Gauss-Markov assumptions hold: in both cases, we have at least 1,500 observations, which is certainly enough to justify the approximation of the central limit theorem.

If the sample size is not very large, then the t distribution can be a poor approximation to the distribution of the t statistics when u is not normally distributed. Unfortunately, there are no general prescriptions on how big the sample size must be before the approximation is good enough. Some econometricians think that n = 30 is satisfactory, but this cannot be sufficient for all possible distributions of u. Depending on the distribution of u, more observations may be necessary before the central limit theorem delivers a useful approximation. Further, the quality of the approximation depends not just on n, but on the df, n − k − 1: with more independent variables in the model, a larger sample size is usually needed to use the t approximation. Methods for inference with small degrees of freedom and nonnormal errors are outside the scope of this text. We will simply use the t statistics, as we always have, without worrying about the normality assumption.

It is very important to see that Theorem 5.2 does require the homoskedasticity assumption (along with the zero conditional mean assumption). If Var(y|x) is not constant, the usual t statistics and confidence intervals are invalid no matter how large the sample size is; the central limit theorem does not bail us out when it comes to heteroskedasticity. For this reason, we devote all of Chapter 8 to discussing what can be done in the presence of heteroskedasticity.
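The practical content of (5.8) is easy to check by simulation: with heavily skewed, decidedly non-normal errors and a moderately large n, t statistics computed the usual way still reject a true null about 5% of the time at the 5% level. A self-contained sketch (our illustration, not from the text):

```python
import numpy as np

rng = np.random.default_rng(7)
beta1, n, reps = 0.0, 500, 5000
tstats = np.empty(reps)

for r in range(reps):
    x = rng.normal(0, 1, n)
    u = rng.exponential(1.0, n) - 1.0   # heavily skewed, mean-zero errors
    y = 1.0 + beta1 * x + u
    xd = x - x.mean()
    b1 = np.sum(xd * y) / np.sum(xd ** 2)
    resid = y - y.mean() - b1 * xd      # residuals from the fitted line
    sigma2 = np.sum(resid ** 2) / (n - 2)
    se = np.sqrt(sigma2 / np.sum(xd ** 2))
    tstats[r] = (b1 - beta1) / se

# Under H0, rejection at the 5% level should occur roughly 5% of the time
# even though u is far from normal.
print("rejection rate:", np.mean(np.abs(tstats) > 1.96))
```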
One conclusion of Theorem 5.2 is that $\hat{\sigma}^2$ is a consistent estimator of $\sigma^2$; we already know from Theorem 3.3 that $\hat{\sigma}^2$ is unbiased for $\sigma^2$ under the Gauss-Markov assumptions. The consistency implies that $\hat{\sigma}$ is a consistent estimator of $\sigma$, which is important in establishing the asymptotic normality result in equation (5.7).

Remember that $\hat{\sigma}$ appears in the standard error for each $\hat{\beta}_j$. In fact, the estimated variance of $\hat{\beta}_j$ is

$\widehat{\mathrm{Var}}(\hat{\beta}_j) = \dfrac{\hat{\sigma}^2}{\mathrm{SST}_j(1 - R_j^2)},$  (5.9)

where $\mathrm{SST}_j$ is the total sum of squares of $x_j$ in the sample, and $R_j^2$ is the R-squared from regressing $x_j$ on all of the other independent variables. In Section 3.4, we studied each component of (5.9), which we will now expound on in the context of asymptotic analysis. As the sample size grows, $\hat{\sigma}^2$ converges in probability to the constant $\sigma^2$. Further, $R_j^2$ approaches a number strictly between zero and unity (so that $1 - R_j^2$ converges to some number between zero and one). The sample variance of $x_j$ is $\mathrm{SST}_j/n$, and so $\mathrm{SST}_j/n$ converges to $\mathrm{Var}(x_j)$ as the sample size grows. This means that $\mathrm{SST}_j$ grows at approximately the same rate as the sample size: $\mathrm{SST}_j \approx n\sigma_j^2$, where $\sigma_j^2$ is the population variance of $x_j$. When we combine these facts, we find that $\widehat{\mathrm{Var}}(\hat{\beta}_j)$ shrinks to zero at the rate of 1/n; this is why larger sample sizes are better.

When u is not normally distributed, the square root of (5.9) is sometimes called the asymptotic standard error, and t statistics are called asymptotic t statistics. Because these are the same quantities we dealt with in Chapter 4, we will just call them standard errors and t statistics, with the understanding that sometimes they have only large-sample justification. A similar comment holds for an asymptotic confidence interval constructed from the asymptotic standard error.

Exploring Further 5.2: In a regression model with a large sample size, what is an approximate 95% confidence interval for $\beta_j$ under MLR.1 through MLR.5? We call this an asymptotic confidence interval.

Using the preceding argument about the estimated variance, we can write

$\mathrm{se}(\hat{\beta}_j) \approx c_j/\sqrt{n},$  (5.10)

where $c_j$ is a positive constant that does not depend on the sample size. In fact, the constant $c_j$ can be shown to be

$c_j = \dfrac{\sigma}{\sigma_j\sqrt{1 - \rho_j^2}},$

where $\sigma = \mathrm{sd}(u)$, $\sigma_j = \mathrm{sd}(x_j)$, and $\rho_j^2$ is the population R-squared from regressing $x_j$ on the other explanatory variables. Just like studying equation (5.9) to see which variables affect $\mathrm{Var}(\hat{\beta}_j)$ under the Gauss-Markov assumptions, we can use this expression for $c_j$ to study the impact of a larger error standard deviation ($\sigma$), more population variation in $x_j$ ($\sigma_j$), and multicollinearity in the population ($\rho_j^2$). Equation (5.10) is only an approximation, but it is a useful rule of thumb: standard errors can be expected to shrink at a rate that is the inverse of the square root of the sample size.

Example 5.2: Standard Errors in a Birth Weight Equation

We use the data in BWGHT to estimate a relationship where log of birth weight is the dependent variable, and cigarettes smoked per day (cigs) and log of family income are independent variables. The total number of observations is 1,388. Using the first half of the observations (694), the standard error for $\hat{\beta}_{cigs}$ is about .0013. The standard error using all of the observations is about .00086. The ratio of the latter standard error to the former is .00086/.0013 = .662. This is pretty close to $\sqrt{694/1388} = .707$, the ratio obtained from the approximation in (5.10). In other words, equation (5.10) implies that the standard error using the larger sample size should be about 70.7% of the standard error using the smaller sample. This percentage is pretty close to the 66.2% we actually compute from the ratio of the standard errors.
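The arithmetic in Example 5.2 amounts to comparing the realized ratio of standard errors with the $\sqrt{n_1/n_2}$ prediction of (5.10); the two lines below reproduce the numbers quoted in the example:

```python
import numpy as np

se_half, se_full = 0.0013, 0.00086   # standard errors from Example 5.2
print(se_full / se_half)             # about .662, the realized ratio
print(np.sqrt(694 / 1388))           # about .707, the rule-of-thumb ratio
```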
The asymptotic normality of the OLS estimators also implies that the F statistics have approximate F distributions in large sample sizes. Thus, for testing exclusion restrictions or other multiple hypotheses, nothing changes from what we have done before.

5.2a Other Large Sample Tests: The Lagrange Multiplier Statistic

Once we enter the realm of asymptotic analysis, other test statistics can be used for hypothesis testing. For most purposes, there is little reason to go beyond the usual t and F statistics: as we just saw, these statistics have large sample justification without the normality assumption. Nevertheless, sometimes it is useful to have other ways to test multiple exclusion restrictions, and we now cover the Lagrange multiplier (LM) statistic, which has achieved some popularity in modern econometrics.

The name "Lagrange multiplier statistic" comes from constrained optimization, a topic beyond the scope of this text. [See Davidson and MacKinnon (1993).] The name "score statistic", which also comes from optimization using calculus, is used as well. Fortunately, in the linear regression framework, it is simple to motivate the LM statistic without delving into complicated mathematics.

The form of the LM statistic we derive here relies on the Gauss-Markov assumptions, the same assumptions that justify the F statistic in large samples. We do not need the normality assumption.

To derive the LM statistic, consider the usual multiple regression model with k independent variables:

$y = \beta_0 + \beta_1 x_1 + \dots + \beta_k x_k + u.$  (5.11)

We would like to test whether, say, the last q of these variables all have zero population parameters: the null hypothesis is

$H_0: \beta_{k-q+1} = 0, \dots, \beta_k = 0,$  (5.12)

which puts q exclusion restrictions on the model (5.11). As with F testing, the alternative to (5.12) is that at least one of the parameters is different from zero.

The LM statistic requires estimation of the restricted model only. Thus, assume that we have run the regression

$y = \tilde{\beta}_0 + \tilde{\beta}_1 x_1 + \dots + \tilde{\beta}_{k-q} x_{k-q} + \tilde{u},$  (5.13)

where "~" indicates that the estimates are from the restricted model. In particular, $\tilde{u}$ indicates the residuals from the restricted model. (As always, this is just shorthand to indicate that we obtain the restricted residual for each observation in the sample.)

If the omitted variables $x_{k-q+1}$ through $x_k$ truly have zero population coefficients, then, at least approximately, $\tilde{u}$ should be uncorrelated with each of these variables in the sample. This suggests running a regression of these residuals on those independent variables excluded under $H_0$, which is almost what the LM test does. However, it turns out that, to get a usable test statistic, we must include all of the independent variables in the regression. (We must include all regressors because, in general, the omitted regressors in the restricted model are correlated with the regressors that appear in the restricted model.) Thus, we run the regression of

$\tilde{u}$ on $x_1, x_2, \dots, x_k.$  (5.14)

This is an example of an auxiliary regression, a regression that is used to compute a test statistic but whose coefficients are not of direct interest.
How can we use the regression output from (5.14) to test (5.12)? If (5.12) is true, the R-squared from (5.14) should be "close" to zero, subject to sampling error, because $\tilde{u}$ will be approximately uncorrelated with all the independent variables. The question, as always with hypothesis testing, is how to determine when the statistic is large enough to reject the null hypothesis at a chosen significance level. It turns out that, under the null hypothesis, the sample size multiplied by the usual R-squared from the auxiliary regression (5.14) is distributed asymptotically as a chi-square random variable with q degrees of freedom. This leads to a simple procedure for testing the joint significance of a set of q independent variables.

The Lagrange Multiplier Statistic for q Exclusion Restrictions:
(i) Regress y on the restricted set of independent variables and save the residuals, $\tilde{u}$.
(ii) Regress $\tilde{u}$ on all of the independent variables and obtain the R-squared, say, $R_u^2$ (to distinguish it from the R-squareds obtained with y as the dependent variable).
(iii) Compute $LM = nR_u^2$ (the sample size times the R-squared obtained from step (ii)).
(iv) Compare LM to the appropriate critical value, c, in a $\chi_q^2$ distribution; if LM > c, the null hypothesis is rejected. Even better, obtain the p-value as the probability that a $\chi_q^2$ random variable exceeds the value of the test statistic. If the p-value is less than the desired significance level, then $H_0$ is rejected. If not, we fail to reject $H_0$. The rejection rule is essentially the same as for F testing.

Because of its form, the LM statistic is sometimes referred to as the n-R-squared statistic. Unlike with the F statistic, the degrees of freedom in the unrestricted model plays no role in carrying out the LM test. All that matters is the number of restrictions being tested (q), the size of the auxiliary R-squared ($R_u^2$), and the sample size (n). The df in the unrestricted model plays no role because of the asymptotic nature of the LM statistic. But we must be sure to multiply $R_u^2$ by the sample size to obtain LM; a seemingly low value of the R-squared can still lead to joint significance if n is large.

Before giving an example, a word of caution is in order. If in step (i) we mistakenly regress y on all of the independent variables and obtain the residuals from this unrestricted regression to be used in step (ii), we do not get an interesting statistic: the resulting R-squared will be exactly zero! This is because OLS chooses the estimates so that the residuals are uncorrelated, in samples, with all included independent variables [see the equations in (3.13)]. Thus, we can only test (5.12) by regressing the restricted residuals on all of the independent variables. (Regressing the restricted residuals on the restricted set of independent variables will also produce $R^2 = 0$.)
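The four steps translate directly into code. Below is a minimal sketch using Python's statsmodels and scipy; the wrapper function is our own construction, and the file and column names in the usage comment are assumptions, not something fixed by the text. Example 5.3, which follows, carries out exactly this test.

```python
import pandas as pd
import statsmodels.formula.api as smf
from scipy import stats

def lm_test(df, y, restricted_xs, excluded_xs):
    """n-R-squared (LM) test of H0: the coefficients on excluded_xs are zero."""
    # Step (i): estimate the restricted model and save the residuals.
    res_r = smf.ols(f"{y} ~ " + " + ".join(restricted_xs), data=df).fit()
    data = df.copy()
    data["u_tilde"] = res_r.resid
    # Step (ii): regress the residuals on ALL of the independent variables.
    aux = smf.ols("u_tilde ~ " + " + ".join(restricted_xs + excluded_xs),
                  data=data).fit()
    # Steps (iii)-(iv): LM = n * R_u^2, compared with a chi-square(q).
    lm = aux.nobs * aux.rsquared
    q = len(excluded_xs)
    return lm, stats.chi2.sf(lm, q)

# Example 5.3's test would then be (hypothetical file/column names):
# df = pd.read_csv("crime1.csv")
# print(lm_test(df, "narr86", ["pcnv", "ptime86", "qemp86"],
#               ["avgsen", "tottime"]))
```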
Example 5.3: Economic Model of Crime

We illustrate the LM test by using a slight extension of the crime model from Example 3.5:

$narr86 = \beta_0 + \beta_1 pcnv + \beta_2 avgsen + \beta_3 tottime + \beta_4 ptime86 + \beta_5 qemp86 + u,$

where narr86 is the number of times a man was arrested, pcnv is the proportion of prior arrests leading to conviction, avgsen is average sentence served from past convictions, tottime is total time the man has spent in prison prior to 1986 since reaching the age of 18, ptime86 is months spent in prison in 1986, and qemp86 is the number of quarters in 1986 during which the man was legally employed.

We use the LM statistic to test the null hypothesis that avgsen and tottime have no effect on narr86 once the other factors have been controlled for. In step (i), we estimate the restricted model by regressing narr86 on pcnv, ptime86, and qemp86; the variables avgsen and tottime are excluded from this regression. We obtain the residuals $\tilde{u}$ from this regression, 2,725 of them. Next, we run the regression of

$\tilde{u}$ on pcnv, ptime86, qemp86, avgsen, and tottime;  (5.15)

as always, the order in which we list the independent variables is irrelevant. This second regression produces $R_u^2$, which turns out to be about .0015. This may seem small, but we must multiply it by n to get the LM statistic: $LM = 2{,}725(.0015) \approx 4.09$. The 10% critical value in a chi-square distribution with two degrees of freedom is about 4.61 (rounded to two decimal places; see Table G.4). Thus, we fail to reject the null hypothesis that $\beta_{avgsen} = 0$ and $\beta_{tottime} = 0$ at the 10% level. The p-value is $P(\chi_2^2 > 4.09) \approx .129$, so we would reject $H_0$ at the 15% level.

As a comparison, the F test for joint significance of avgsen and tottime yields a p-value of about .131, which is pretty close to that obtained using the LM statistic. This is not surprising since, asymptotically, the two statistics have the same probability of Type I error. (That is, they reject the null hypothesis with the same frequency when the null is true.)

As the previous example suggests, with a large sample, we rarely see important discrepancies between the outcomes of LM and F tests. We will use the F statistic for the most part because it is computed routinely by most regression packages. But you should be aware of the LM statistic as it is used in applied work.

One final comment on the LM statistic. As with the F statistic, we must be sure to use the same observations in steps (i) and (ii). If data are missing for some of the independent variables that are excluded under the null hypothesis, the residuals from step (i) should be obtained from a regression on the reduced data set.
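The chi-square quantities quoted in Example 5.3 can each be verified in one line with scipy (the inputs 2,725 and .0015 are taken from the example):

```python
from scipy import stats

lm = 2725 * 0.0015              # = 4.0875, the LM statistic
print(stats.chi2.ppf(0.90, 2))  # 10% critical value, about 4.61
print(stats.chi2.sf(lm, 2))     # p-value, about .129
```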
5.3 Asymptotic Efficiency of OLS

We know that, under the Gauss-Markov assumptions, the OLS estimators are best linear unbiased. OLS is also asymptotically efficient among a certain class of estimators under the Gauss-Markov assumptions. A general treatment requires matrix algebra and advanced asymptotic analysis. First, we describe the result in the simple regression case.

In the model

$y = \beta_0 + \beta_1 x + u,$  (5.16)

u has a zero conditional mean under MLR.4: $E(u \mid x) = 0$. This opens up a variety of consistent estimators for $\beta_0$ and $\beta_1$; as usual, we focus on the slope parameter, $\beta_1$. Let g(x) be any function of x; for example, $g(x) = x^2$ or $g(x) = 1/(1 + |x|)$. Then u is uncorrelated with g(x) (see Property CE.5 in Appendix B). Let $z_i = g(x_i)$ for all observations i. Then the estimator

$\tilde{\beta}_1 = \left(\sum_{i=1}^n (z_i - \bar{z})y_i\right) \Big/ \left(\sum_{i=1}^n (z_i - \bar{z})x_i\right)$  (5.17)

is consistent for $\beta_1$, provided g(x) and x are correlated. (Remember, it is possible that g(x) and x are uncorrelated because correlation measures linear dependence.) To see this, we can plug in $y_i = \beta_0 + \beta_1 x_i + u_i$ and write $\tilde{\beta}_1$ as

$\tilde{\beta}_1 = \beta_1 + \left(n^{-1}\sum_{i=1}^n (z_i - \bar{z})u_i\right) \Big/ \left(n^{-1}\sum_{i=1}^n (z_i - \bar{z})x_i\right).$  (5.18)

Now, we can apply the law of large numbers to the numerator and denominator, which converge in probability to Cov(z,u) and Cov(z,x), respectively. Provided that $\mathrm{Cov}(z,x) \neq 0$, so that z and x are correlated, we have $\mathrm{plim}\,\tilde{\beta}_1 = \beta_1 + \mathrm{Cov}(z,u)/\mathrm{Cov}(z,x) = \beta_1$, because Cov(z,u) = 0 under MLR.4.

It is more difficult to show that $\tilde{\beta}_1$ is asymptotically normal. Nevertheless, using arguments similar to those in the appendix, it can be shown that $\sqrt{n}(\tilde{\beta}_1 - \beta_1)$ is asymptotically normal with mean zero and asymptotic variance $\sigma^2\mathrm{Var}(z)/[\mathrm{Cov}(z,x)]^2$. The asymptotic variance of the OLS estimator is obtained when z = x, in which case Cov(z,x) = Cov(x,x) = Var(x). Therefore, the asymptotic variance of $\sqrt{n}(\hat{\beta}_1 - \beta_1)$, where $\hat{\beta}_1$ is the OLS estimator, is $\sigma^2\mathrm{Var}(x)/[\mathrm{Var}(x)]^2 = \sigma^2/\mathrm{Var}(x)$. Now, the Cauchy-Schwartz inequality (see Appendix B.4) implies that $[\mathrm{Cov}(z,x)]^2 \le \mathrm{Var}(z)\mathrm{Var}(x)$, which implies that the asymptotic variance of $\sqrt{n}(\hat{\beta}_1 - \beta_1)$ is no larger than that of $\sqrt{n}(\tilde{\beta}_1 - \beta_1)$. We have shown in the simple regression case that, under the Gauss-Markov assumptions, the OLS estimator has a smaller asymptotic variance than any estimator of the form (5.17). [The estimator in (5.17) is an example of an instrumental variables estimator, which we will study extensively in Chapter 15.] If the homoskedasticity assumption fails, then there are estimators of the form (5.17) that have a smaller asymptotic variance than OLS. We will see this in Chapter 8.

The general case is similar but much more difficult mathematically. In the k regressor case, the class of consistent estimators is obtained by generalizing the OLS first order conditions:

$\sum_{i=1}^n g_j(\mathbf{x}_i)(y_i - \tilde{\beta}_0 - \tilde{\beta}_1 x_{i1} - \dots - \tilde{\beta}_k x_{ik}) = 0, \quad j = 0, 1, \dots, k,$  (5.19)

where $g_j(\mathbf{x}_i)$ denotes any function of all explanatory variables for observation i. As can be seen by comparing (5.19) with the OLS first order conditions in (3.13), we obtain the OLS estimators when $g_0(\mathbf{x}_i) = 1$ and $g_j(\mathbf{x}_i) = x_{ij}$ for $j = 1, 2, \dots, k$. The class of estimators in (5.19) is infinite, because we can use any functions of the $x_{ij}$ that we want.

Proving consistency of the estimators in (5.19), let alone showing they are asymptotically normal, is mathematically difficult. [See Wooldridge (2010, Chapter 5).]

Theorem 5.3 (Asymptotic Efficiency of OLS): Under the Gauss-Markov assumptions, let $\tilde{\beta}_j$ denote estimators that solve equations of the form (5.19) and let $\hat{\beta}_j$ denote the OLS estimators. Then, for $j = 0, 1, 2, \dots, k$, the OLS estimators have the smallest asymptotic variances: $\mathrm{Avar}\,\sqrt{n}(\hat{\beta}_j - \beta_j) \le \mathrm{Avar}\,\sqrt{n}(\tilde{\beta}_j - \beta_j)$.
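Theorem 5.3 can be seen in a small simulation: both OLS and an estimator of the form (5.17) with $z = g(x) = x^2$ are centered near $\beta_1$, but the OLS draws are visibly less spread out. A sketch under the Gauss-Markov assumptions (our illustration, not part of the text):

```python
import numpy as np

rng = np.random.default_rng(42)
beta0, beta1, n, reps = 1.0, 2.0, 1000, 3000
ols_draws, z_draws = np.empty(reps), np.empty(reps)

for r in range(reps):
    x = rng.uniform(0, 2, n)
    u = rng.normal(0, 1, n)        # Gauss-Markov assumptions hold
    y = beta0 + beta1 * x + u
    z = x ** 2                     # z = g(x), correlated with x
    ols_draws[r] = np.sum((x - x.mean()) * y) / np.sum((x - x.mean()) ** 2)
    z_draws[r] = np.sum((z - z.mean()) * y) / np.sum((z - z.mean()) * x)

# Both are centered near beta1, but OLS has the smaller sampling variance,
# as Theorem 5.3 predicts.
print("OLS:  mean %.4f, sd %.5f" % (ols_draws.mean(), ols_draws.std()))
print("g(x): mean %.4f, sd %.5f" % (z_draws.mean(), z_draws.std()))
```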
the LM statistic can be used instead of the F sta tistic for testing exclusion restrictions Before leaving this chapter we should note that examples such as Example 53 may very well have problems that do require special attention For a variable such as narr86 which is zero or one for most men in the population a linear model may not be able to adequately capture the functional relationship between narr86 and the explanatory variables Moreover even if a linear model does describe the expected value of arrests heteroskedasticity might be a problem Problems such as these are not mitigated as the sample size grows and we will return to them in later chapters Key Terms Theorem 53 Asymptotic EfficiEncy of oLs Under the GaussMarkov assumptions let b j denote estimators that solve equations of the form 519 and let b j denote the OLS estimators Then for j 5 0 1 2 p k the OLS estimators have the smallest asymptotic variances Avarn1b j 2 bj2 Avarn1b j 2 bj2 Asymptotic Bias Asymptotic Confidence Interval Asymptotic Normality Asymptotic Properties Asymptotic Standard Error Asymptotic t Statistics Asymptotic Variance Asymptotically Efficient Auxiliary Regression Consistency Inconsistency Lagrange Multiplier LM Statistic Large Sample Properties nRSquared Statistic Score Statistic Problems 1 In the simple regression model under MLR1 through MLR4 we argued that the slope estimator b 1 is consistent for b1 Using b 0 5 y 2 b 1x1 show that plim b 0 5 b0 You need to use the consistency of b 1 and the law of large numbers along with the fact that b0 5 E1y2 2 b1E1x12 2 Suppose that the model pctstck 5 b0 1 b1funds 1 b2risktol 1 u satisfies the first four GaussMarkov assumptions where pctstck is the percentage of a workers pension invested in the stock market funds is the number of mutual funds that the worker can Copyright 2016 Cengage Learning All Rights Reserved May not be copied scanned or duplicated in whole or in part Due to electronic rights some third party content may be suppressed from the eBook andor eChapters Editorial review has deemed that any suppressed content does not materially affect the overall learning experience Cengage Learning reserves the right to remove additional content at any time if subsequent rights restrictions require it CHAPTER 5 Multiple Regression Analysis OLS Asymptotics 163 choose from and risktol is some measure of risk tolerance larger risktol means the person has a higher tolerance for risk If funds and risktol are positively correlated what is the inconsistency in b 1 the slope coefficient in the simple regression of pctstck on funds 3 The data set SMOKE contains information on smoking behavior and other variables for a random sample of single adults from the United States The variable cigs is the average number of cigarettes smoked per day Do you think cigs has a normal distribution in the US adult population Explain 4 In the simple regression model 516 under the first four GaussMarkov assumptions we showed that estimators of the form 517 are consistent for the slope b1 Given such an estimator define an esti mator of b0 by b 0 5 y 2 b 1x Show that plim b 0 5 b0 5 The following histogram was created using the variable score in the data file ECONMATH Thirty bins were used to create the histogram and the height of each cell is the proportion of observations falling within the corresponding interval The bestfitting normal distributionthat is using the sample mean and sample standard deviationhas been superimposed on the histogram 0 02 04 06 08 1 proportion in cell 20 40 60 
(i) If you use the normal distribution to estimate the probability that score exceeds 100, would the answer be zero? Why does your answer contradict the assumption of a normal distribution for score?
(ii) Explain what is happening in the left tail of the histogram. Does the normal distribution fit well in the left tail?

Computer Exercises

C1 Use the data in WAGE1 for this exercise.
(i) Estimate the equation

wage = β0 + β1educ + β2exper + β3tenure + u.

Save the residuals and plot a histogram.
(ii) Repeat part (i), but with log(wage) as the dependent variable.
(iii) Would you say that Assumption MLR.6 is closer to being satisfied for the level-level model or the log-level model?

C2 Use the data in GPA2 for this exercise.
(i) Using all 4,137 observations, estimate the equation

colgpa = β0 + β1hsperc + β2sat + u

and report the results in standard form.
(ii) Reestimate the equation in part (i), using the first 2,070 observations.
(iii) Find the ratio of the standard errors on hsperc from parts (i) and (ii). Compare this with the result from (5.10).

C3 In equation (4.42) of Chapter 4, using the data set BWGHT, compute the LM statistic for testing whether motheduc and fatheduc are jointly significant. In obtaining the residuals for the restricted model, be sure that the restricted model is estimated using only those observations for which all variables in the unrestricted model are available (see Example 4.9).

C4 Several statistics are commonly used to detect nonnormality in underlying population distributions. Here we will study one that measures the amount of skewness in a distribution. Recall that any normally distributed random variable is symmetric about its mean; therefore, if we standardize a symmetrically distributed random variable, say $z = (y - \mu_y)/\sigma_y$, where $\mu_y = \mathrm{E}(y)$ and $\sigma_y = \mathrm{sd}(y)$, then z has mean zero, variance one, and $\mathrm{E}(z^3) = 0$. Given a sample of data $\{y_i: i = 1, \ldots, n\}$, we can standardize $y_i$ in the sample by using $z_i = (y_i - \hat{\mu}_y)/\hat{\sigma}_y$, where $\hat{\mu}_y$ is the sample mean and $\hat{\sigma}_y$ is the sample standard deviation. (We ignore the fact that these are estimates based on the sample.) A sample statistic that measures skewness is $n^{-1}\sum_{i=1}^n z_i^3$, or where n is replaced with (n − 1) as a degrees-of-freedom adjustment. If y has a normal distribution in the population, the skewness measure in the sample for the standardized values should not differ significantly from zero. (A short computational sketch of this statistic follows this exercise.)
(i) First use the data set 401KSUBS, keeping only observations with fsize = 1. Find the skewness measure for inc. Do the same for log(inc). Which variable has more skewness and therefore seems less likely to be normally distributed?
(ii) Next use BWGHT2. Find the skewness measures for bwght and log(bwght). What do you conclude?
(iii) Evaluate the following statement: "The logarithmic transformation always makes a positive variable look more normally distributed."
(iv) If we are interested in the normality assumption in the context of regression, should we be evaluating the unconditional distributions of y and log(y)? Explain.
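Since the book's data files are not reproduced here, the following sketch computes the skewness statistic from C4 on simulated data (the function name and the log-normal example are our own choices):

```python
import numpy as np

# Skewness statistic from C4: standardize the sample, then average z^3.
def skewness(y):
    y = np.asarray(y, dtype=float)
    z = (y - y.mean()) / y.std(ddof=1)  # sample z-scores
    return (z ** 3).mean()              # use sum()/(n-1) for a df adjustment

rng = np.random.default_rng(1)
y = rng.lognormal(mean=0.0, sigma=1.0, size=5000)  # strongly right-skewed
print(skewness(y))          # clearly positive
print(skewness(np.log(y)))  # near zero: log(y) is normal here by construction
```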
C5 Consider the analysis in Computer Exercise C11 in Chapter 4, using the data in HTV, where educ is the dependent variable in a regression.
(i) How many different values are taken on by educ in the sample? Does educ have a continuous distribution?
(ii) Plot a histogram of educ with a normal distribution overlay. Does the distribution of educ appear anything close to normal?
(iii) Which of the CLM assumptions seems clearly violated in the model

educ = β0 + β1motheduc + β2fatheduc + β3abil + β4abil² + u?

How does this violation change the statistical inference procedures carried out in Computer Exercise C11 in Chapter 4?

C6 Use the data in ECONMATH to answer this question.
(i) Logically, what are the smallest and largest values that can be taken on by the variable score? What are the smallest and largest values in the sample?
(ii) Consider the linear model

score = β0 + β1colgpa + β2actmth + β3acteng + u.

Why cannot Assumption MLR.6 hold for the error term u? What consequences does this have for using the usual t statistic to test H0: β3 = 0?
(iii) Estimate the model from part (ii) and obtain the t statistic and associated p-value for testing H0: β3 = 0. How would you defend your findings to someone who makes the following statement: "You cannot trust that p-value, because clearly the error term in the equation cannot have a normal distribution."

APPENDIX 5A

Asymptotic Normality of OLS

We sketch a proof of the asymptotic normality of OLS [Theorem 5.2(i)] in the simple regression case. Write the simple regression model as in equation (5.16). Then, by the usual algebra of simple regression, we can write

$$\sqrt{n}(\hat{\beta}_1 - \beta_1) = (1/s_x^2)\left[ n^{-1/2}\sum_{i=1}^n (x_i - \bar{x})u_i \right],$$

where we use $s_x^2$ to denote the sample variance of $\{x_i: i = 1, 2, \ldots, n\}$. By the law of large numbers (see Appendix C), $s_x^2 \stackrel{p}{\to} \sigma_x^2 = \mathrm{Var}(x)$. Assumption MLR.3 rules out perfect collinearity, which means that Var(x) > 0 ($x_i$ varies in the sample, and therefore x is not constant in the population). Next,

$$n^{-1/2}\sum_{i=1}^n (x_i - \bar{x})u_i = n^{-1/2}\sum_{i=1}^n (x_i - \mu)u_i + (\mu - \bar{x})\left[ n^{-1/2}\sum_{i=1}^n u_i \right],$$

where $\mu = \mathrm{E}(x)$ is the population mean of x. Now, $\{u_i\}$ is a sequence of i.i.d. random variables with mean zero and variance $\sigma^2$, and so $n^{-1/2}\sum_{i=1}^n u_i$ converges to the Normal(0, σ²) distribution as $n \to \infty$; this is just the central limit theorem from Appendix C. By the law of large numbers, plim$(\mu - \bar{x}) = 0$. A standard result in asymptotic theory is that if plim$(w_n) = 0$ and $z_n$ has an asymptotic normal distribution, then plim$(w_n z_n) = 0$. [See Wooldridge (2010, Chapter 3) for more discussion.] This implies that $(\mu - \bar{x})[n^{-1/2}\sum_{i=1}^n u_i]$ has zero plim. Next, $\{(x_i - \mu)u_i: i = 1, 2, \ldots\}$ is an i.i.d. sequence with mean zero, because u and x are uncorrelated under Assumption MLR.4, and variance $\sigma^2\sigma_x^2$ by the homoskedasticity Assumption MLR.5. Therefore, $n^{-1/2}\sum_{i=1}^n (x_i - \mu)u_i$ has an asymptotic Normal(0, σ²σ_x²) distribution. We just showed that the difference between $n^{-1/2}\sum_{i=1}^n (x_i - \bar{x})u_i$ and $n^{-1/2}\sum_{i=1}^n (x_i - \mu)u_i$ has zero plim. A result in asymptotic theory is that if $z_n$ has an asymptotic normal distribution and plim$(v_n - z_n) = 0$, then $v_n$ has the same asymptotic normal distribution as $z_n$. It follows that $n^{-1/2}\sum_{i=1}^n (x_i - \bar{x})u_i$ also has an asymptotic Normal(0, σ²σ_x²) distribution.

Putting all of the pieces together gives

$$\sqrt{n}(\hat{\beta}_1 - \beta_1) = (1/\sigma_x^2)\left[ n^{-1/2}\sum_{i=1}^n (x_i - \bar{x})u_i \right] + \left[ (1/s_x^2) - (1/\sigma_x^2) \right]\left[ n^{-1/2}\sum_{i=1}^n (x_i - \bar{x})u_i \right],$$

and since plim$(1/s_x^2) = 1/\sigma_x^2$, the second term has zero plim. Therefore, the asymptotic distribution of $\sqrt{n}(\hat{\beta}_1 - \beta_1)$ is Normal$(0, \{\sigma^2\sigma_x^2\}/\{(\sigma_x^2)^2\})$ = Normal$(0, \sigma^2/\sigma_x^2)$. This completes the proof in the simple regression case, as $a_1^2 = \sigma_x^2$ in this case. See Wooldridge (2010, Chapter 4) for the general case.
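A quick way to see Theorem 5.2(i) at work is to simulate the sampling distribution of $\sqrt{n}(\hat{\beta}_1 - \beta_1)$ with deliberately non-normal x and u. The setup below is a made-up illustration under our own parameter choices, not part of the proof:

```python
import numpy as np

# Simulate sqrt(n)*(beta1_hat - beta1) and compare its spread with the
# asymptotic standard deviation sigma/sigma_x derived above.
rng = np.random.default_rng(2)
beta0, beta1, sigma, n, reps = 1.0, 2.0, 3.0, 500, 2000

draws = []
for _ in range(reps):
    x = rng.exponential(scale=1.0, size=n)                 # skewed regressor, Var(x) = 1
    u = sigma * np.sqrt(12) * (rng.uniform(size=n) - 0.5)  # uniform errors, sd = sigma
    y = beta0 + beta1 * x + u
    b1 = ((x - x.mean()) * y).sum() / ((x - x.mean()) ** 2).sum()
    draws.append(np.sqrt(n) * (b1 - beta1))

print("simulated sd:", np.std(draws))  # close to 3
print("theoretical :", sigma / 1.0)    # sigma/sigma_x, with sigma_x = 1 here
```

A histogram of the draws looks close to the Normal(0, σ²/σ_x²) density even though neither x nor u is normally distributed.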
Chapter 6

Multiple Regression Analysis: Further Issues

This chapter brings together several issues in multiple regression analysis that we could not conveniently cover in earlier chapters. These topics are not as fundamental as the material in Chapters 3 and 4, but they are important for applying multiple regression to a broad range of empirical problems.

6.1 Effects of Data Scaling on OLS Statistics

In Chapter 2 on bivariate regression, we briefly discussed the effects of changing the units of measurement on the OLS intercept and slope estimates. We also showed that changing the units of measurement did not affect R-squared. We now return to the issue of data scaling and examine the effects of rescaling the dependent or independent variables on standard errors, t statistics, F statistics, and confidence intervals.

We will discover that everything we expect to happen does happen: when variables are rescaled, the coefficients, standard errors, confidence intervals, t statistics, and F statistics change in ways that preserve all measured effects and testing outcomes. Although this is no great surprise (in fact, we would be very worried if it were not the case), it is useful to see what occurs explicitly. Often, data scaling is used for cosmetic purposes, such as to reduce the number of zeros after a decimal point in an estimated coefficient. By judiciously choosing units of measurement, we can improve the appearance of an estimated equation while changing nothing that is essential.

We could treat this problem in a general way, but it is much better illustrated with examples. Likewise, there is little value here in introducing an abstract notation.

We begin with an equation relating infant birth weight to cigarette smoking and family income:

$$\widehat{bwght} = \hat{\beta}_0 + \hat{\beta}_1\,cigs + \hat{\beta}_2\,faminc, \tag{6.1}$$

where
bwght = child birth weight, in ounces;
cigs = number of cigarettes smoked by the mother while pregnant, per day;
faminc = annual family income, in thousands of dollars.

The estimates of this equation, obtained using the data in BWGHT, are given in the first column of Table 6.1. Standard errors are listed in parentheses. The estimate on cigs says that if a woman smoked five more cigarettes per day, birth weight is predicted to be about .4634(5) = 2.317 ounces less.
The t statistic on cigs is −5.06, so the variable is very statistically significant.

Now, suppose that we decide to measure birth weight in pounds, rather than in ounces. Let bwghtlbs = bwght/16 be birth weight in pounds. What happens to our OLS statistics if we use this as the dependent variable in our equation? It is easy to find the effect on the coefficient estimates by simple manipulation of equation (6.1). Divide this entire equation by 16:

$$\widehat{bwght}/16 = \hat{\beta}_0/16 + (\hat{\beta}_1/16)\,cigs + (\hat{\beta}_2/16)\,faminc.$$

Since the left-hand side is birth weight in pounds, it follows that each new coefficient will be the corresponding old coefficient divided by 16. To verify this, the regression of bwghtlbs on cigs and faminc is reported in column (2) of Table 6.1. Up to the reported digits (and any digits beyond), the intercept and slopes in column (2) are just those in column (1) divided by 16. For example, the coefficient on cigs is now −.0289; this means that if cigs were higher by five, birth weight would be .0289(5) = .1445 pounds lower. In terms of ounces, we have .1445(16) = 2.312, which is slightly different from the 2.317 we obtained earlier due to rounding error. The point is, once the effects are transformed into the same units, we get exactly the same answer, regardless of how the dependent variable is measured.

What about statistical significance? As we expect, changing the dependent variable from ounces to pounds has no effect on how statistically important the independent variables are. The standard errors in column (2) are 16 times smaller than those in column (1).

TABLE 6.1 Effects of Data Scaling

Dependent Variable:     (1) bwght          (2) bwghtlbs       (3) bwght
Independent Variables
cigs                    -.4634  (.0916)    -.0289  (.0057)    ----
packs                   ----               ----               -9.268  (1.832)
faminc                  .0927   (.0292)    .0058   (.0018)    .0927   (.0292)
intercept               116.974 (1.049)    7.3109  (.0656)    116.974 (1.049)
Observations            1,388              1,388              1,388
R-Squared               .0298              .0298              .0298
SSR                     557,485.51         2,177.6778         557,485.51
SER                     20.063             1.2539             20.063

A few quick calculations show that the t statistics in column (2) are indeed identical to the t statistics in column (1). The endpoints for the confidence intervals in column (2) are just the endpoints in column (1) divided by 16. (This is because the CIs change by the same factor as the standard errors.) Remember that the 95% CI here is $\hat{\beta}_j \pm 1.96\,\mathrm{se}(\hat{\beta}_j)$.

In terms of goodness-of-fit, the R-squareds from the two regressions are identical, as should be the case. Notice that the sum of squared residuals, SSR, and the standard error of the regression, SER, do differ across equations. These differences are easily explained. Let $\hat{u}_i$ denote the residual for observation i in the original equation (6.1). Then the residual when bwghtlbs is the dependent variable is simply $\hat{u}_i/16$. Thus, the squared residual in the second equation is $(\hat{u}_i/16)^2 = \hat{u}_i^2/256$. This is why the sum of squared residuals in column (2) is equal to the SSR in column (1) divided by 256. Since SER $= \hat{\sigma} = \sqrt{\mathrm{SSR}/(n - k - 1)} = \sqrt{\mathrm{SSR}/1385}$, the SER in column (2) is 16 times smaller than that in column (1). Another way to think about this is that the error in the equation with bwghtlbs as the dependent variable has a standard deviation 16 times smaller than the standard deviation of the original error. This does not mean that we have reduced the error by changing how birth weight is measured; the smaller SER simply reflects a difference in units of measurement.
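The pattern in Table 6.1 is easy to reproduce numerically. The sketch below uses simulated data in place of the BWGHT file (coefficients and distributions are our own choices), regressing first on birth weight in ounces and then in pounds:

```python
import numpy as np

# Rescaling the dependent variable by 1/16 divides every coefficient and
# standard error by 16, leaving the t statistics unchanged.
rng = np.random.default_rng(3)
n = 1388
cigs = rng.poisson(2.0, size=n).astype(float)
faminc = rng.gamma(shape=2.0, scale=15.0, size=n)
bwght = 117 - 0.5 * cigs + 0.1 * faminc + rng.normal(scale=20, size=n)

def ols(y, cols):
    X = np.column_stack([np.ones(len(y))] + cols)
    b = np.linalg.lstsq(X, y, rcond=None)[0]
    resid = y - X @ b
    sigma2 = resid @ resid / (len(y) - X.shape[1])
    se = np.sqrt(sigma2 * np.diag(np.linalg.inv(X.T @ X)))
    return b, se

for y in (bwght, bwght / 16):  # ounces, then pounds
    b, se = ols(y, [cigs, faminc])
    print(np.round(b, 4), np.round(se, 4), np.round(b / se, 2))
```

The two printed lines differ by exactly a factor of 16 in the coefficients and standard errors, while the t statistics (the last array) match.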
Next, let us return the dependent variable to its original units: bwght is measured in ounces. Instead, let us change the unit of measurement of one of the independent variables, cigs. Define packs to be the number of packs of cigarettes smoked per day. Thus, packs = cigs/20. What happens to the coefficients and other OLS statistics now? Well, we can write

$$\widehat{bwght} = \hat{\beta}_0 + (20\hat{\beta}_1)(cigs/20) + \hat{\beta}_2\,faminc = \hat{\beta}_0 + (20\hat{\beta}_1)\,packs + \hat{\beta}_2\,faminc.$$

Thus, the intercept and slope coefficient on faminc are unchanged, but the coefficient on packs is 20 times that on cigs. This is intuitively appealing. The results from the regression of bwght on packs and faminc are in column (3) of Table 6.1. Incidentally, remember that it would make no sense to include both cigs and packs in the same equation; this would induce perfect multicollinearity and would have no interesting meaning.

Other than the coefficient on packs, there is one other statistic in column (3) that differs from that in column (1): the standard error on packs is 20 times larger than that on cigs in column (1). This means that the t statistic for testing the significance of cigarette smoking is the same whether we measure smoking in terms of cigarettes or packs. This is only natural.

The previous example spells out most of the possibilities that arise when the dependent and independent variables are rescaled. Rescaling is often done with dollar amounts in economics, especially when the dollar amounts are very large.

In Chapter 2, we argued that if the dependent variable appears in logarithmic form, changing the unit of measurement does not affect the slope coefficient. The same is true here: changing the unit of measurement of the dependent variable, when it appears in logarithmic form, does not affect any of the slope estimates. This follows from the simple fact that $\log(c_1 y_i) = \log(c_1) + \log(y_i)$ for any constant $c_1 > 0$. The new intercept will be $\log(c_1) + \hat{\beta}_0$. Similarly, changing the unit of measurement of any $x_j$, where $\log(x_j)$ appears in the regression, only affects the intercept. This corresponds to what we know about percentage changes and, in particular, elasticities: they are invariant to the units of measurement of either y or the $x_j$. For example, if we had specified the dependent variable in (6.1) to be log(bwght), estimated the equation, and then reestimated it with log(bwghtlbs) as the dependent variable, the coefficients on cigs and faminc would be the same in both regressions; only the intercept would be different.

Exploring Further 6.1

In the original birth weight equation (6.1), suppose that faminc is measured in dollars rather than in thousands of dollars. Thus, define the variable fincdol = 1,000·faminc. How will the OLS statistics change when fincdol is substituted for faminc? For the purpose of presenting the regression results, do you think it is better to measure income in dollars or in thousands of dollars?
6.1a Beta Coefficients

Sometimes, in econometric applications, a key variable is measured on a scale that is difficult to interpret. Labor economists often include test scores in wage equations, and the scale on which these tests are scored is often arbitrary and not easy to interpret (at least for economists!). In almost all cases, we are interested in how a particular individual's score compares with the population. Thus, instead of asking about the effect on hourly wage if, say, a test score is 10 points higher, it makes more sense to ask what happens when the test score is one standard deviation higher.

Nothing prevents us from seeing what happens to the dependent variable when an independent variable in an estimated model increases by a certain number of standard deviations, assuming that we have obtained the sample standard deviation of the independent variable (which is easy in most regression packages). This is often a good idea. So, for example, when we look at the effect of a standardized test score, such as the SAT score, on college GPA, we can find the standard deviation of SAT and see what happens when the SAT score increases by one or two standard deviations.

Sometimes, it is useful to obtain regression results when all variables involved, the dependent as well as all the independent variables, have been standardized. A variable is standardized in the sample by subtracting off its mean and dividing by its standard deviation (see Appendix C). This means that we compute the z-score for every variable in the sample. Then, we run a regression using the z-scores.

Why is standardization useful? It is easiest to start with the original OLS equation, with the variables in their original forms:

$$y_i = \hat{\beta}_0 + \hat{\beta}_1 x_{i1} + \hat{\beta}_2 x_{i2} + \cdots + \hat{\beta}_k x_{ik} + \hat{u}_i. \tag{6.2}$$

We have included the observation subscript i to emphasize that our standardization is applied to all sample values. Now, if we average (6.2), use the fact that the $\hat{u}_i$ have a zero sample average, and subtract the result from (6.2), we get

$$y_i - \bar{y} = \hat{\beta}_1(x_{i1} - \bar{x}_1) + \hat{\beta}_2(x_{i2} - \bar{x}_2) + \cdots + \hat{\beta}_k(x_{ik} - \bar{x}_k) + \hat{u}_i.$$

Now, let $\hat{\sigma}_y$ be the sample standard deviation for the dependent variable, let $\hat{\sigma}_1$ be the sample sd for $x_1$, let $\hat{\sigma}_2$ be the sample sd for $x_2$, and so on. Then, simple algebra gives the equation

$$(y_i - \bar{y})/\hat{\sigma}_y = (\hat{\sigma}_1/\hat{\sigma}_y)\hat{\beta}_1[(x_{i1} - \bar{x}_1)/\hat{\sigma}_1] + \cdots + (\hat{\sigma}_k/\hat{\sigma}_y)\hat{\beta}_k[(x_{ik} - \bar{x}_k)/\hat{\sigma}_k] + (\hat{u}_i/\hat{\sigma}_y). \tag{6.3}$$

Each variable in (6.3) has been standardized by replacing it with its z-score, and this has resulted in new slope coefficients. For example, the slope coefficient on $(x_{i1} - \bar{x}_1)/\hat{\sigma}_1$ is $(\hat{\sigma}_1/\hat{\sigma}_y)\hat{\beta}_1$. This is simply the original coefficient, $\hat{\beta}_1$, multiplied by the ratio of the standard deviation of $x_1$ to the standard deviation of y. The intercept has dropped out altogether.

It is useful to rewrite (6.3), dropping the i subscript, as

$$z_y = \hat{b}_1 z_1 + \hat{b}_2 z_2 + \cdots + \hat{b}_k z_k + \text{error}, \tag{6.4}$$

where $z_y$ denotes the z-score of y, $z_1$ is the z-score of $x_1$, and so on. The new coefficients are

$$\hat{b}_j = (\hat{\sigma}_j/\hat{\sigma}_y)\hat{\beta}_j \quad \text{for } j = 1, \ldots, k. \tag{6.5}$$

These $\hat{b}_j$ are traditionally called standardized coefficients or beta coefficients. (The latter name is more common, which is unfortunate because we have been using beta hat to denote the usual OLS estimates.)
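Equation (6.5) can be verified two ways, as the following sketch shows (simulated data, our own variable choices): rescale the usual OLS slopes, or regress z-scores on z-scores.

```python
import numpy as np

# Equation (6.5) two ways: rescale the OLS slopes by sd(x_j)/sd(y), or
# regress z-scores on z-scores; the results agree.
rng = np.random.default_rng(4)
n = 400
x1 = rng.normal(50, 10, size=n)
x2 = rng.normal(0, 2, size=n)
y = 3 + 0.2 * x1 - 1.5 * x2 + rng.normal(size=n)

X = np.column_stack([np.ones(n), x1, x2])
bhat = np.linalg.lstsq(X, y, rcond=None)[0]          # usual OLS estimates

sd = lambda v: v.std(ddof=1)
via_formula = [bhat[1] * sd(x1) / sd(y), bhat[2] * sd(x2) / sd(y)]

z = lambda v: (v - v.mean()) / sd(v)                 # sample z-scores
Z = np.column_stack([z(x1), z(x2)])                  # intercept would be zero
via_zscores = np.linalg.lstsq(Z, z(y), rcond=None)[0]

print(np.round(via_formula, 4), np.round(via_zscores, 4))  # identical
```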
Beta coefficients receive their interesting meaning from equation (6.4): if $x_1$ increases by one standard deviation, then $\hat{y}$ changes by $\hat{b}_1$ standard deviations. Thus, we are measuring effects not in terms of the original units of y or the $x_j$, but in standard deviation units. Because it makes the scale of the regressors irrelevant, this equation puts the explanatory variables on equal footing. In a standard OLS equation, it is not possible to simply look at the size of different coefficients and conclude that the explanatory variable with the largest coefficient is "the most important." We just saw that the magnitudes of coefficients can be changed at will by changing the units of measurement of the $x_j$. But when each $x_j$ has been standardized, comparing the magnitudes of the resulting beta coefficients is more compelling. When the regression equation has only a single explanatory variable, $x_1$, its standardized coefficient is simply the sample correlation coefficient between y and $x_1$, which means it must lie in the range −1 to 1.

Even in situations where the coefficients are easily interpretable, say, the dependent variable and independent variables of interest are in logarithmic form, so the OLS coefficients of interest are estimated elasticities, there is still room for computing beta coefficients. Although elasticities are free of units of measurement, a change in a particular explanatory variable by, say, 10% may represent a larger or smaller change over a variable's range than changing another explanatory variable by 10%. For example, in a state with wide income variation but relatively little variation in spending per student, it might not make much sense to compare performance elasticities with respect to income and spending. Comparing beta coefficient magnitudes can be helpful.

To obtain the beta coefficients, we can always standardize $y, x_1, \ldots, x_k$ and then run the OLS regression of the z-score of y on the z-scores of $x_1, \ldots, x_k$, where it is not necessary to include an intercept, as it will be zero. This can be tedious with many independent variables. Many regression packages provide beta coefficients via a simple command. The following example illustrates the use of beta coefficients.

Example 6.1 Effects of Pollution on Housing Prices

We use the data from Example 4.5 (in the file HPRICE2) to illustrate the use of beta coefficients. Recall that the key independent variable is nox, a measure of the nitrogen oxide in the air over each community. One way to understand the size of the pollution effect, without getting into the science underlying nitrogen oxide's effect on air quality, is to compute beta coefficients. (An alternative approach is contained in Example 4.5: we obtained a price elasticity with respect to nox by using price and nox in logarithmic form.)

The population equation is the level-level model

price = β0 + β1nox + β2crime + β3rooms + β4dist + β5stratio + u,

where all the variables except crime were defined in Example 4.5; crime is the number of reported crimes per capita. The beta coefficients are reported in the following equation (so each variable has been converted to its z-score):

$$\widehat{zprice} = -.340\,znox - .143\,zcrime + .514\,zrooms - .235\,zdist - .270\,zstratio.$$

This equation shows that a one standard deviation increase in nox decreases price by .34 standard deviation; a one standard deviation increase in crime reduces price by .14 standard deviation. Thus, the same relative movement of pollution in the population has a larger effect on housing prices than crime does.
Size of the house, as measured by number of rooms (rooms), has the largest standardized effect. If we want to know the effects of each independent variable on the dollar value of median house price, we should use the unstandardized variables.

Whether we use standardized or unstandardized variables does not affect statistical significance: the t statistics are the same in both cases.

6.2 More on Functional Form

In several previous examples, we have encountered the most popular device in econometrics for allowing nonlinear relationships between the explained and explanatory variables: using logarithms for the dependent or independent variables. We have also seen models containing quadratics in some explanatory variables, but we have yet to provide a systematic treatment of them. In this section, we cover some variations and extensions on functional forms that often arise in applied work.

6.2a More on Using Logarithmic Functional Forms

We begin by reviewing how to interpret the parameters in the model

$$\log(price) = \beta_0 + \beta_1\log(nox) + \beta_2\,rooms + u, \tag{6.6}$$

where these variables are taken from Example 4.5. Recall that throughout the text, log(x) is the natural log of x. The coefficient $\beta_1$ is the elasticity of price with respect to nox (pollution). The coefficient $\beta_2$ is the change in log(price) when Δrooms = 1; as we have seen many times, when multiplied by 100, this is the approximate percentage change in price. Recall that $100\cdot\beta_2$ is sometimes called the semi-elasticity of price with respect to rooms.

When estimated using the data in HPRICE2, we obtain

$$\widehat{\log(price)} = 9.23 - .718\,\log(nox) + .306\,rooms, \tag{6.7}$$

with standard errors (0.19), (.066), and (.019); n = 506, R² = .514. Thus, when nox increases by 1%, price falls by .718%, holding only rooms fixed. When rooms increases by one, price increases by approximately 100(.306) = 30.6%.

The estimate that one more room increases price by about 30.6% turns out to be somewhat inaccurate for this application. The approximation error occurs because, as the change in log(y) becomes larger and larger, the approximation $\%\Delta y \approx 100\cdot\Delta\log(y)$ becomes more and more inaccurate. Fortunately, a simple calculation is available to compute the exact percentage change.

To describe the procedure, we consider the general estimated model

$$\widehat{\log(y)} = \hat{\beta}_0 + \hat{\beta}_1\log(x_1) + \hat{\beta}_2 x_2.$$

(Adding additional independent variables does not change the procedure.) Now, fixing $x_1$, we have $\Delta\widehat{\log(y)} = \hat{\beta}_2\Delta x_2$. Using simple algebraic properties of the exponential and logarithmic functions gives the exact percentage change in the predicted y as

$$\%\Delta\hat{y} = 100\cdot[\exp(\hat{\beta}_2\Delta x_2) - 1], \tag{6.8}$$

where the multiplication by 100 turns the proportionate change into a percentage change. When $\Delta x_2 = 1$,

$$\%\Delta\hat{y} = 100\cdot[\exp(\hat{\beta}_2) - 1]. \tag{6.9}$$

Applied to the housing price example with $x_2 = rooms$ and $\hat{\beta}_2 = .306$, $\%\Delta\widehat{price} = 100[\exp(.306) - 1] = 35.8\%$, which is notably larger than the approximate percentage change, 30.6%, obtained directly from (6.7). Incidentally, this is not an unbiased estimator because exp(·) is a nonlinear function; it is, however, a consistent estimator of $100[\exp(\beta_2) - 1]$. This is because the probability limit passes through continuous functions, while the expected value operator does not. (See Appendix C.)
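A two-line calculation makes the gap between the approximation and formula (6.8) visible; the snippet below uses the coefficient .306 from (6.7):

```python
import math

# Exact percentage change, equation (6.8), versus the logarithmic
# approximation 100*b*dx, using b = .306 from equation (6.7).
b = 0.306
for dx in (1, -1, 5):
    approx = 100 * b * dx
    exact = 100 * (math.exp(b * dx) - 1)
    print(f"dx = {dx:+d}: approx {approx:+.1f}%, exact {exact:+.1f}%")
```

For dx = +1 the exact change is +35.8%, as in the text; for dx = −1 it is −26.4%; and the approximation, ±30.6%, lies between the two magnitudes, a point developed just below.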
The adjustment in equation (6.8) is not as crucial for small percentage changes. For example, when we include the student-teacher ratio in equation (6.7), its estimated coefficient is −.052, which means that if stratio increases by one, price decreases by approximately 5.2%. The exact proportionate change is $\exp(-.052) - 1 \approx -.051$, or −5.1%. On the other hand, if we increase stratio by five, then the approximate percentage change in price is −26%, while the exact change obtained from equation (6.8) is $100[\exp(-.26) - 1] \approx -22.9\%$.

The logarithmic approximation to percentage changes has an advantage that justifies its reporting even when the percentage change is large. To describe this advantage, consider again the effect on price of changing the number of rooms by one. The logarithmic approximation is just the coefficient on rooms in equation (6.7), multiplied by 100, namely, 30.6%. We also computed an estimate of the exact percentage change for increasing the number of rooms by one as 35.8%. But what if we want to estimate the percentage change for decreasing the number of rooms by one? In equation (6.8), we take $\Delta x_2 = -1$ and $\hat{\beta}_2 = .306$, and so $\%\Delta\widehat{price} = 100[\exp(-.306) - 1] = -26.4$, or a drop of 26.4%. Notice that the approximation based on using the coefficient on rooms is between 26.4 and 35.8, an outcome that always occurs. In other words, simply using the coefficient (multiplied by 100) gives us an estimate that is always between the absolute value of the estimates for an increase and a decrease. If we are specifically interested in an increase or a decrease, we can use the calculation based on equation (6.8).

The point just made about computing percentage changes is essentially the one made in introductory economics when it comes to computing, say, price elasticities of demand based on large price changes: the result depends on whether we use the beginning or ending price and quantity in computing the percentage changes. Using the logarithmic approximation is similar in spirit to calculating an arc elasticity of demand, where the averages of prices and quantities are used in the denominators in computing the percentage changes.

We have seen that using natural logs leads to coefficients with appealing interpretations, and we can be ignorant about the units of measurement of variables appearing in logarithmic form because the slope coefficients are invariant to rescalings. There are several other reasons logs are used so much in applied work. First, when y > 0, models using log(y) as the dependent variable often satisfy the CLM assumptions more closely than models using the level of y. Strictly positive variables often have conditional distributions that are heteroskedastic or skewed; taking the log can mitigate, if not eliminate, both problems.

Another potential benefit of using logs is that taking the log of a variable often narrows its range. This is particularly true of variables that can be large monetary values, such as firms' annual sales or baseball players' salaries. Population variables also tend to vary widely. Narrowing the range of the dependent and independent variables can make OLS estimates less sensitive to outlying (or extreme) values; we take up the issue of outlying observations in Chapter 9.
However, one must not indiscriminately use the logarithmic transformation, because in some cases it can actually create extreme values. An example is when a variable y is between zero and one (such as a proportion) and takes on values close to zero. In this case, log(y) (which is necessarily negative) can be very large in magnitude, whereas the original variable, y, is bounded between zero and one.

There are some standard rules of thumb for taking logs, although none is written in stone. When a variable is a positive dollar amount, the log is often taken. We have seen this for variables such as wages, salaries, firm sales, and firm market value. Variables such as population, total number of employees, and school enrollment often appear in logarithmic form; these have the common feature of being large integer values.

Variables that are measured in years, such as education, experience, tenure, age, and so on, usually appear in their original form. A variable that is a proportion or a percent, such as the unemployment rate, the participation rate in a pension plan, the percentage of students passing a standardized exam, or the arrest rate on reported crimes, can appear in either original or logarithmic form, although there is a tendency to use them in level forms. This is because any regression coefficient involving the original variable, whether it is the dependent or independent variable, will have a percentage point change interpretation. (See Appendix A for a review of the distinction between a percentage change and a percentage point change.) If we use, say, log(unem) in a regression, where unem is the percentage of unemployed individuals, we must be very careful to distinguish between a percentage point change and a percentage change. Remember, if unem goes from 8 to 9, this is an increase of one percentage point, but a 12.5% increase from the initial unemployment level. Using the log means that we are looking at the percentage change in the unemployment rate: $\log(9) - \log(8) \approx .118$, or 11.8%, which is the logarithmic approximation to the actual 12.5% increase.

One limitation of the log is that it cannot be used if a variable takes on zero or negative values. In cases where a variable y is nonnegative but can take on the value 0, log(1 + y) is sometimes used. The percentage change interpretations are often closely preserved, except for changes beginning at y = 0 (where the percentage change is not even defined). Generally, using log(1 + y) and then interpreting the estimates as if the variable were log(y) is acceptable when the data on y contain relatively few zeros. An example might be where y is hours of training per employee, for the population of manufacturing firms, if a large fraction of firms provides training to at least one worker. Technically, however, log(1 + y) cannot be normally distributed (although it might be less heteroskedastic than y). Useful, albeit more advanced, alternatives are the Tobit and Poisson models in Chapter 17.
One drawback to using a dependent variable in logarithmic form is that it is more difficult to predict the original variable. The original model allows us to predict log(y), not y. Nevertheless, it is fairly easy to turn a prediction for log(y) into a prediction for y (see Section 6.4). A related point is that it is not legitimate to compare R-squareds from models where y is the dependent variable in one case and log(y) is the dependent variable in the other. These measures explain variations in different variables. We discuss how to compute comparable goodness-of-fit measures in Section 6.4.

6.2b Models with Quadratics

Quadratic functions are also used quite often in applied economics to capture decreasing or increasing marginal effects. (You may want to review properties of quadratic functions in Appendix A.) In the simplest case, y depends on a single observed factor x, but it does so in a quadratic fashion:

$$y = \beta_0 + \beta_1 x + \beta_2 x^2 + u.$$

For example, take y = wage and x = exper. As we discussed in Chapter 3, this model falls outside of simple regression analysis but is easily handled with multiple regression.

It is important to remember that $\beta_1$ does not measure the change in y with respect to x; it makes no sense to hold $x^2$ fixed while changing x. If we write the estimated equation as

$$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x + \hat{\beta}_2 x^2, \tag{6.10}$$

then we have the approximation

$$\Delta\hat{y} \approx (\hat{\beta}_1 + 2\hat{\beta}_2 x)\Delta x, \quad\text{so}\quad \Delta\hat{y}/\Delta x \approx \hat{\beta}_1 + 2\hat{\beta}_2 x. \tag{6.11}$$

Exploring Further 6.2

Suppose that the annual number of drunk driving arrests is determined by

log(arrests) = β0 + β1log(pop) + β2age16_25 + (other factors),

where age16_25 is the proportion of the population between 16 and 25 years of age. Show that β2 has the following (ceteris paribus) interpretation: it is the percentage change in arrests when the percentage of the people aged 16 to 25 increases by one percentage point.

Equation (6.11) says that the slope of the relationship between x and y depends on the value of x; the estimated slope is $\hat{\beta}_1 + 2\hat{\beta}_2 x$. If we plug in x = 0, we see that $\hat{\beta}_1$ can be interpreted as the approximate slope in going from x = 0 to x = 1. After that, the second term, $2\hat{\beta}_2 x$, must be accounted for.

If we are only interested in computing the predicted change in y given a starting value for x and a change in x, we could use (6.10) directly: there is no reason to use the calculus approximation at all. However, we are usually more interested in quickly summarizing the effect of x on y, and the interpretation of $\hat{\beta}_1$ and $\hat{\beta}_2$ in equation (6.11) provides that summary. Typically, we might plug in the average value of x in the sample, or some other interesting values, such as the median or the lower and upper quartile values.

In many applications, $\hat{\beta}_1$ is positive and $\hat{\beta}_2$ is negative. For example, using the wage data in WAGE1, we obtain

$$\widehat{wage} = 3.73 + .298\,exper - .0061\,exper^2, \tag{6.12}$$

with standard errors (.35), (.041), and (.0009); n = 526, R² = .093. This estimated equation implies that exper has a diminishing effect on wage. The first year of experience is worth roughly 30¢ per hour (.298). The second year of experience is worth less: about $.298 - 2(.0061)(1) \approx .286$, or 28.6¢, according to the approximation in (6.11) with x = 1.
In going from 10 to 11 years of experience, wage is predicted to increase by about $.298 - 2(.0061)(10) = .176$, or 17.6¢. And so on.

When the coefficient on x is positive and the coefficient on $x^2$ is negative, the quadratic has a parabolic shape. There is always a positive value of x where the effect of x on y is zero; before this point, x has a positive effect on y; after this point, x has a negative effect on y. In practice, it can be important to know where this turning point is.

In the estimated equation (6.10) with $\hat{\beta}_1 > 0$ and $\hat{\beta}_2 < 0$, the turning point (or maximum of the function) is always achieved at the coefficient on x over twice the absolute value of the coefficient on $x^2$:

$$x^* = |\hat{\beta}_1/(2\hat{\beta}_2)|. \tag{6.13}$$

In the wage example, $x^* = exper^*$ is $.298/[2(.0061)] \approx 24.4$. (Note how we just drop the minus sign on −.0061 in doing this calculation.) This quadratic relationship is illustrated in Figure 6.1.

[Figure 6.1: Quadratic relationship between $\widehat{wage}$ and exper; the fitted wage rises from 3.73 at exper = 0 to a maximum of 7.37 at exper = 24.4.]

In the wage equation (6.12), the return to experience becomes zero at about 24.4 years. What should we make of this? There are at least three possible explanations. First, it may be that few people in the sample have more than 24 years of experience, and so the part of the curve to the right of 24 can be ignored. The cost of using a quadratic to capture diminishing effects is that the quadratic must eventually turn around. If this point is beyond all but a small percentage of the people in the sample, then this is not of much concern. But in the data set WAGE1, about 28% of the people in the sample have more than 24 years of experience; this is too high a percentage to ignore.

It is possible that the return to exper really becomes negative at some point, but it is hard to believe that this happens at 24 years of experience. A more likely possibility is that the estimated effect of exper on wage is biased, because we have controlled for no other factors, or because the functional relationship between wage and exper in equation (6.12) is not entirely correct. Computer Exercise C2 asks you to explore this possibility by controlling for education, in addition to using log(wage) as the dependent variable.

When a model has a dependent variable in logarithmic form and an explanatory variable entering as a quadratic, some care is needed in reporting the partial effects. The following example also shows that the quadratic can have a U-shape, rather than a parabolic shape. A U-shape arises in equation (6.10) when $\hat{\beta}_1$ is negative and $\hat{\beta}_2$ is positive; this captures an increasing effect of x on y.
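The turning-point formula (6.13) and the changing slope in (6.11) are easy to tabulate; the short sketch below uses the estimates from (6.12):

```python
# Turning point (6.13) and the changing slope (6.11), using the wage
# equation estimates from (6.12).
b1, b2 = 0.298, -0.0061

turning_point = abs(b1 / (2 * b2))
print(f"exper* = {turning_point:.1f} years")  # about 24.4

for exper in (1, 10, 20, 30):
    slope = b1 + 2 * b2 * exper               # marginal effect, dollars per hour
    print(f"exper = {exper:2d}: slope = {slope:+.3f}")
```

The slope falls from +.286 at one year of experience to +.176 at ten, and turns negative past exper*, exactly the pattern discussed above.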
Example 6.2 Effects of Pollution on Housing Prices

We modify the housing price model from Example 4.5 to include a quadratic term in rooms:

$$\log(price) = \beta_0 + \beta_1\log(nox) + \beta_2\log(dist) + \beta_3\,rooms + \beta_4\,rooms^2 + \beta_5\,stratio + u. \tag{6.14}$$

The model estimated using the data in HPRICE2 is

$$\widehat{\log(price)} = 13.39 - .902\,\log(nox) - .087\,\log(dist) - .545\,rooms + .062\,rooms^2 - .048\,stratio,$$

with standard errors (.57), (.115), (.043), (.165), (.013), and (.006); n = 506, R² = .603. The quadratic term rooms² has a t statistic of about 4.77, and so it is very statistically significant.

But what about interpreting the effect of rooms on log(price)? Initially, the effect appears to be strange. Because the coefficient on rooms is negative and the coefficient on rooms² is positive, this equation literally implies that, at low values of rooms, an additional room has a negative effect on log(price). At some point, the effect becomes positive, and the quadratic shape means that the semi-elasticity of price with respect to rooms is increasing as rooms increases. This situation is shown in Figure 6.2.

We obtain the turnaround value of rooms using equation (6.13) (even though $\hat{\beta}_1$ is negative and $\hat{\beta}_2$ is positive). The absolute value of the coefficient on rooms, .545, divided by twice the coefficient on rooms², .062, gives $rooms^* = .545/[2(.062)] \approx 4.4$; this point is labeled in Figure 6.2.

Do we really believe that, starting at three rooms and increasing to four rooms, a house's expected value actually falls? Probably not. It turns out that only five of the 506 communities in the sample have houses averaging 4.4 rooms or less, about 1% of the sample.
This is so small that the quadratic to the left of 4.4 can, for practical purposes, be ignored. To the right of 4.4, we see that adding another room has an increasing effect on the percentage change in price:

$$\Delta\widehat{\log(price)} \approx [-.545 + 2(.062)rooms]\Delta rooms,$$

and so

$$\%\Delta\widehat{price} \approx 100\cdot[-.545 + 2(.062)rooms]\Delta rooms = (-54.5 + 12.4\,rooms)\Delta rooms.$$

Thus, an increase in rooms from, say, five to six increases price by about $-54.5 + 12.4(5) = 7.5\%$; the increase from six to seven increases price by roughly $-54.5 + 12.4(6) = 19.9\%$. This is a very strong increasing effect.

[Figure 6.2: log(price) as a quadratic function of rooms; the fitted relationship is U-shaped, with its minimum at rooms = 4.4.]

The strong increasing effect of rooms on log(price) in this example illustrates an important lesson: one cannot simply look at the coefficient on the quadratic term (in this case, .062) and declare that it is too small to bother with, based only on its magnitude. In many applications with quadratics, the coefficient on the squared variable has one or more zeros after the decimal point; after all, this coefficient measures how the slope is changing as x (rooms) changes. A seemingly small coefficient can have practically important consequences, as we just saw. As a general rule, one must compute the partial effect and see how it varies with x to determine if the quadratic term is practically important. In doing so, it is useful to compare the changing slope implied by the quadratic model with the constant slope obtained from the model with only a linear term. If we drop rooms² from the equation, the coefficient on rooms becomes about .255, which implies that each additional room, starting from any number of rooms, increases median price by about 25.5%. This is very different from the quadratic model, where the effect becomes 25.5% at rooms = 6.45 but changes rapidly as rooms gets smaller or larger. For example, at rooms = 7, the return to the next room is about 32.3%.
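The contrast between the constant and changing semi-elasticity is easy to reproduce from the estimates in Example 6.2:

```python
# Changing semi-elasticity of price with respect to rooms implied by the
# quadratic model in Example 6.2: 100*(b3 + 2*b4*rooms).
b3, b4 = -0.545, 0.062

print(f"turning point: {abs(b3 / (2 * b4)):.1f} rooms")  # about 4.4
for rooms in (4, 5, 6, 7, 8):
    effect = 100 * (b3 + 2 * b4 * rooms)  # percent change per extra room
    print(f"rooms = {rooms}: {effect:+.1f}% per additional room")
```

The effect is near zero around the turning point and grows to about +32% at seven rooms, very different from the constant 25.5% implied by the purely linear specification.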
suppressed content does not materially affect the overall learning experience Cengage Learning reserves the right to remove additional content at any time if subsequent rights restrictions require it PART 1 Regression Analysis with CrossSectional Data 178 of sqrft such as the mean value or the lower and upper quartiles in the sample Whether or not b3 is zero is something we can easily test The parameters on the original variables can be tricky to interpret when we include an interac tion term For example in the previous housing price equation equation 617 shows that b2 is the effect of bdrms on price for a home with zero square feet This effect is clearly not of much interest Instead we must be careful to put interesting values of sqrft such as the mean or median values in the sample into the estimated version of equation 617 Often it is useful to reparameterize a model so that the coefficients on the original variables have an interesting meaning Consider a model with two explanatory variables and an interaction y 5 b0 1 b1x1 1 b2x2 1 b3x1x2 1 u As just mentioned b2 is the partial effect of x2 on y when x1 5 0 Often this is not of interest Instead we can reparameterize the model as y 5 a0 1 d1x1 1 d2x2 1 b31x1 2 m12 1x2 2 m22 1 u where m1 is the population mean of x1 and m2 is the population mean of x2 We can easily see that now the coefficient on x2 d2 is the partial effect of x2 on y at the mean value of x1 By multiplying out the interaction in the second equation and comparing the coefficients we can easily show that d2 5 b2 1 b3m1 The parameter d1 has a similar interpretation Therefore if we subtract the means of the variablesin practice these would typically be the sample meansbefore creating the interac tion term the coefficients on the original variables have a useful interpretation Plus we immediately obtain standard errors for the partial effects at the mean values Nothing prevents us from replacing m1 or m2 with other values of the explanatory variables that may be of interest The following example illustrates how we can use interaction terms ExamplE 63 Effects of attendance on Final Exam performance A model to explain the standardized outcome on a final exam stndfnl in terms of percentage of classes attended prior college grade point average and ACT score is stndfnl 5 b0 1 b1atndrte 1 b2priGPA 1 b3ACT 1 b4priGPA2 1 b5ACT2 1 b6priGPAatndrte 1 u 618 We use the standardized exam score for the reasons discussed in Section 61 it is easier to inter pret a students performance relative to the rest of the class In addition to quadratics in priGPA and ACT this model includes an interaction between priGPA and the attendance rate The idea is that class attendance might have a different effect for students who have performed differently in the past as measured by priGPA We are interested in the effects of attendance on final exam score DstndfnlDatndrte 5 b1 1 b6priGPA Using the 680 observations in ATTEND for students in a course on microeconomic principles the estimated equation is stndfnl 5 2 05 2 0067 atndrte 2 1 63 priGPA 2 128 ACT 11362 101022 1482 10982 1 296 priGPA2 1 0045 ACT2 1 0056 priGPA atndrte 619 11012 100222 100432 n 5 680 R2 5 229 R2 5 222 Copyright 2016 Cengage Learning All Rights Reserved May not be copied scanned or duplicated in whole or in part Due to electronic rights some third party content may be suppressed from the eBook andor eChapters Editorial review has deemed that any suppressed content does not materially affect the overall learning experience Cengage 
Learning reserves the right to remove additional content at any time if subsequent rights restrictions require it CHAPTER 6 Multiple Regression Analysis Further Issues 179 We must interpret this equation with extreme care If we simply look at the coefficient on atndrte we will incorrectly conclude that attendance has a negative effect on final exam score But this coefficient supposedly measures the effect when priGPA 5 0 which is not interesting in this sample the small est prior GPA is about 86 We must also take care not to look separately at the estimates of b1 and b6 and conclude that because each t statistic is insignificant we cannot reject H0 b1 5 0 b6 5 0 In fact the pvalue for the F test of this joint hypothesis is 014 so we certainly reject H0 at the 5 level This is a good example of where looking at separate t statistics when testing a joint hypothesis can lead one far astray How should we estimate the partial effect of atndrte on stndfnl We must plug in interesting values of priGPA to obtain the partial effect The mean value of priGPA in the sample is 259 so at the mean priGPA the effect of atndrte on stndfnl is 20067 1 005612592 0078 What does this mean Because atndrte is measured as a percentage it means that a 10 percentage point increase in atndrte increases stndfnl by 078 standard deviations from the mean final exam score How can we tell whether the estimate 0078 is statistically different from zero We need to rerun the regression where we replace priGPAatndrte with priGPA 259atndrte This gives as the new coefficient on atndrte the estimated effect at priGPA 5 259 along with its standard error noth ing else in the regression changes We described this device in Section 44 Running this new regres sion gives the standard error of b 1 1 b 612592 5 0078 as 0026 which yields t 5 00780026 5 3 Therefore at the average priGPA we conclude that attendance has a statistically significant positive effect on final exam score Things are even more complicated for finding the effect of priGPA on stndfnl because of the quadratic term priGPA2 To find the effect at the mean value of priGPA and the mean attend ance rate 82 we would replace priGPA2 with 1priGPA 2 2592 2 and priGPAatndrte with priGPAatndrte 82 The coefficient on priGPA becomes the partial effect at the mean values and we would have its standard error See Computer Exercise C7 62d Computing Average Partial Effects The hallmark of models with quadratics interactions and other nonlinear functional forms is that the partial effects depend on the values of one or more explanatory variables For example we just saw in Example 63 that the effect of atndrte depends on the value of priGPA It is easy to see that the partial effect of priGPA in equation 618 is b2 1 2b4priGPA 1 b6atndrte something that can be verified with simple calculus or just by combining the quadratic and interac tion formulas The embellishments in equation 618 can be useful for seeing how the strength of associations between stndfnl and each explanatory variable changes with the values of all explanatory variables The flexibility afforded by a model such as 618 does have a cost it is tricky to describe the partial effects of the explanatory variables on stndfnl with a single number Often one wants a single value to describe the relationship between the dependent variable y and each explanatory variable One popular summary measure is the average partial effect APE also called the average marginal effect The idea behind the APE is simple for models such as 618 After 
computing the partial effect and plugging in the estimated parameters we average the partial effects for each unit across the sample So the estimated partial effect of atndrte on stndfnl is b 1 1 b 6 priGPAi If we add the term b7 ACTatndrte to equation 618 what is the partial effect of atndrte on stndfnl Exploring FurthEr 63 Copyright 2016 Cengage Learning All Rights Reserved May not be copied scanned or duplicated in whole or in part Due to electronic rights some third party content may be suppressed from the eBook andor eChapters Editorial review has deemed that any suppressed content does not materially affect the overall learning experience Cengage Learning reserves the right to remove additional content at any time if subsequent rights restrictions require it PART 1 Regression Analysis with CrossSectional Data 180 We do not want to report this partial effect for each of the 680 students in our sample Instead we average these partial effects to obtain APEstndfnl 5 b 1 1 b 6 priGPA where priGPA is the sample average of priGPA The single number APEstndfnl is the estimated APE The APE of priGPA is only a little more complicated APEpriGPA 5 b 2 1 2b 4priGPA 1 b6atndrte Both APEstndfnl and APEpriGPA tell us the size of the partial effects on average The centering of explanatory variables about their sample averages before creating quadratics or interactions forces the coefficient on the levels to be the APEs This can be cumbersome in compli cated models Fortunately some commonly used regression packages compute APEs with a simple command after OLS estimation Just as importantly proper standard errors are computed using the fact that an APE is a linear combination of the OLS coefficients For example the APEs and their standard errors for models with both quadratics and interactions as in Example 63 are easy to obtain APEs are also useful in models that are inherently nonlinear in parameters which we treat in Chapter 17 At that point we will revisit the definition and calculation of APEs 63 More on GoodnessofFit and Selection of Regressors Until now we have not focused much on the size of R2 in evaluating our regression models primarily because beginning students tend to put too much weight on Rsquared As we will see shortly choos ing a set of explanatory variables based on the size of the Rsquared can lead to nonsensical models In Chapter 10 we will discover that Rsquareds obtained from time series regressions can be artifi cially high and can result in misleading conclusions Nothing about the classical linear model assumptions requires that R2 be above any particular value R2 is simply an estimate of how much variation in y is explained by x1 x2 p xk in the popula tion We have seen several regressions that have had pretty small Rsquareds Although this means that we have not accounted for several factors that affect y this does not mean that the factors in u are correlated with the independent variables The zero conditional mean assumption MLR4 is what determines whether we get unbiased estimators of the ceteris paribus effects of the independent vari ables and the size of the Rsquared has no direct bearing on this A small Rsquared does imply that the error variance is large relative to the variance of y which means we may have a hard time precisely estimating the bj But remember we saw in Section 34 that a large error variance can be offset by a large sample size if we have enough data we may be able to precisely estimate the partial effects even though we have not controlled for many 
6.3 More on Goodness-of-Fit and Selection of Regressors

Until now, we have not focused much on the size of R² in evaluating our regression models, primarily because beginning students tend to put too much weight on R-squared. As we will see shortly, choosing a set of explanatory variables based on the size of the R-squared can lead to nonsensical models. In Chapter 10, we will discover that R-squareds obtained from time series regressions can be artificially high and can result in misleading conclusions.

Nothing about the classical linear model assumptions requires that R² be above any particular value; R² is simply an estimate of how much variation in y is explained by x1, x2, …, xk in the population. We have seen several regressions that have had pretty small R-squareds. Although this means that we have not accounted for several factors that affect y, it does not mean that the factors in u are correlated with the independent variables. The zero conditional mean assumption MLR.4 is what determines whether we get unbiased estimators of the ceteris paribus effects of the independent variables, and the size of the R-squared has no direct bearing on this.

A small R-squared does imply that the error variance is large relative to the variance of y, which means we may have a hard time precisely estimating the βj. But remember, we saw in Section 3.4 that a large error variance can be offset by a large sample size: if we have enough data, we may be able to precisely estimate the partial effects even though we have not controlled for many unobserved factors. Whether or not we can get precise enough estimates depends on the application. For example, suppose that some incoming students at a large university are randomly given grants to buy computer equipment. If the amount of the grant is truly randomly determined, we can estimate the ceteris paribus effect of the grant amount on subsequent college grade point average by using simple regression analysis. Because of random assignment, all of the other factors that affect GPA would be uncorrelated with the amount of the grant. It seems likely that the grant amount would explain little of the variation in GPA, so the R-squared from such a regression would probably be very small. But if we have a large sample size, we still might get a reasonably precise estimate of the effect of the grant.

Another good illustration of how poor explanatory power has nothing to do with unbiased estimation of the βj is given by analyzing the data set APPLE. Unlike the other data sets we have used, the key explanatory variables in APPLE were set experimentally, that is, without regard to other factors that might affect the dependent variable. The variable we would like to explain, ecolbs, is the (hypothetical) pounds of ecologically friendly ("ecolabeled") apples a family would demand. Each family (actually, family head) was presented with a description of ecolabeled apples, along with prices of regular apples (regprc) and prices of the hypothetical ecolabeled apples (ecoprc). Because the price pairs were randomly assigned to each family, they are unrelated to other observed factors (such as family income) and unobserved factors (such as desire for a clean environment). Therefore, the regression of ecolbs on ecoprc, regprc (across all samples generated in this way) produces unbiased estimators of the price effects. Nevertheless, the R-squared from the regression is only .0364: the price variables explain only about 3.6% of the total variation in ecolbs. So here is a case where we explain very little of the variation in y, yet we are in the rare situation of knowing that the data have been generated so that unbiased estimation of the βj is possible. (Incidentally, adding observed family characteristics has a very small effect on explanatory power. See Computer Exercise C11.)

Remember, though, that the relative change in the R-squared when variables are added to an equation is very useful: the F statistic in (4.41) for testing joint significance crucially depends on the difference in R-squareds between the unrestricted and restricted models.

As we will see in Section 6.4, an important consequence of a low R-squared is that prediction is difficult. Because most of the variation in y is explained by unobserved factors (or at least factors we do not include in our model), we will generally have a hard time using the OLS equation to predict individual future outcomes on y given a set of values for the explanatory variables. In fact, the low R-squared means that we would have a hard time predicting y even if we knew the βj, the population coefficients. Fundamentally, most of the factors that explain y are unaccounted for in the explanatory variables, making prediction difficult.
6.3a Adjusted R-Squared

Most regression packages will report, along with the R-squared, a statistic called the adjusted R-squared. Because the adjusted R-squared is reported in much applied work, and because it has some useful features, we cover it in this subsection.

To see how the usual R-squared might be adjusted, it is usefully written as

\[ R^2 = 1 - \frac{SSR/n}{SST/n}, \tag{6.20} \]

where SSR is the sum of squared residuals and SST is the total sum of squares; compared with equation (3.28), all we have done is divide both SSR and SST by n. This expression reveals what R² is actually estimating. Define σ²_y as the population variance of y and let σ²_u denote the population variance of the error term u. (Until now, we have used σ² to denote σ²_u, but it is helpful to be more specific here.) The population R-squared is defined as ρ² = 1 − σ²_u/σ²_y; this is the proportion of the variation in y in the population explained by the independent variables. This is what R² is supposed to be estimating.

R² estimates σ²_u by SSR/n, which we know to be biased. So why not replace SSR/n with SSR/(n − k − 1)? Also, we can use SST/(n − 1) in place of SST/n, as the former is the unbiased estimator of σ²_y. Using these estimators, we arrive at the adjusted R-squared:

\[ \bar{R}^2 = 1 - \frac{SSR/(n-k-1)}{SST/(n-1)} = 1 - \frac{\hat\sigma^2}{SST/(n-1)}, \tag{6.21} \]

because σ̂² = SSR/(n − k − 1). Because of the notation used to denote the adjusted R-squared, it is sometimes called R-bar squared.

The adjusted R-squared is sometimes called the corrected R-squared, but this is not a good name because it implies that R̄² is somehow better than R² as an estimator of the population R-squared. Unfortunately, R̄² is not generally known to be a better estimator. It is tempting to think that R̄² corrects the bias in R² for estimating the population R-squared, ρ², but it does not: the ratio of two unbiased estimators is not an unbiased estimator.

The primary attractiveness of R̄² is that it imposes a penalty for adding additional independent variables to a model. We know that R² can never fall when a new independent variable is added to a regression equation: this is because SSR never goes up (and usually falls) as more independent variables are added, assuming we use the same set of observations. But the formula for R̄² shows that it depends explicitly on k, the number of independent variables. If an independent variable is added to a regression, SSR falls, but so does the df in the regression, n − k − 1. SSR/(n − k − 1) can go up or down when a new independent variable is added to a regression.

An interesting algebraic fact is the following: if we add a new independent variable to a regression equation, R̄² increases if, and only if, the t statistic on the new variable is greater than one in absolute value. (An extension of this is that R̄² increases when a group of variables is added to a regression if, and only if, the F statistic for joint significance of the new variables is greater than unity.) Thus, we see immediately that using R̄² to decide whether a certain independent variable (or set of variables) belongs in a model gives us a different answer than standard t or F testing, because a t or F statistic of unity is not statistically significant at traditional significance levels.
It is sometimes useful to have a formula for R̄² in terms of R². Simple algebra gives

\[ \bar{R}^2 = 1 - (1 - R^2)\,\frac{n-1}{n-k-1}. \tag{6.22} \]

For example, if R² = .30, n = 51, and k = 10, then R̄² = 1 − .70(50/40) = .125. Thus, for small n and large k, R̄² can be substantially below R². In fact, if the usual R-squared is small and n − k − 1 is small, R̄² can actually be negative! For example, you can plug in R² = .10, n = 51, and k = 10 to verify that R̄² = −.125. A negative R̄² indicates a very poor model fit relative to the number of degrees of freedom.

The adjusted R-squared is sometimes reported along with the usual R-squared in regressions, and sometimes R̄² is reported in place of R². It is important to remember that it is R², not R̄², that appears in the F statistic in (4.41); the same formula with R̄²_r and R̄²_ur is not valid.
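Equation (6.22) is simple enough to check directly. A small sketch in Python reproduces the two numerical examples just given, including the negative value:

```python
def adj_r2(r2, n, k):
    """Adjusted R-squared from equation (6.22): 1 - (1 - R^2)(n - 1)/(n - k - 1)."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)

print(adj_r2(0.30, 51, 10))   # 0.125: well below R^2 = .30
print(adj_r2(0.10, 51, 10))   # -0.125: the adjusted R-squared can be negative
```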
6.3b Using Adjusted R-Squared to Choose between Nonnested Models

In Section 4.5, we learned how to compute an F statistic for testing the joint significance of a group of variables: this allows us to decide, at a particular significance level, whether at least one variable in the group affects the dependent variable. This test does not allow us to decide which of the variables has an effect. In some cases, we want to choose a model without redundant independent variables, and the adjusted R-squared can help with this.

In the major league baseball salary example in Section 4.5, we saw that neither hrunsyr nor rbisyr was individually significant. These two variables are highly correlated, so we might want to choose between the models

\[ \log(salary) = \beta_0 + \beta_1\, years + \beta_2\, gamesyr + \beta_3\, bavg + \beta_4\, hrunsyr + u \]

and

\[ \log(salary) = \beta_0 + \beta_1\, years + \beta_2\, gamesyr + \beta_3\, bavg + \beta_4\, rbisyr + u. \]

These two equations are nonnested models, because neither equation is a special case of the other. The F statistics we studied in Chapter 4 only allow us to test nested models: one model (the restricted model) is a special case of the other model (the unrestricted model). See equations (4.32) and (4.28) for examples of restricted and unrestricted models. One possibility is to create a composite model that contains all explanatory variables from the original models and then to test each model against the general model using the F test. The problem with this process is that either both models might be rejected, or neither model might be rejected (as happens with the major league baseball salary example in Section 4.5). Thus, it does not always provide a way to distinguish between models with nonnested regressors.

In the baseball player salary regression, using the data in MLB1, R̄² for the regression containing hrunsyr is .6211, and R̄² for the regression containing rbisyr is .6226. Thus, based on the adjusted R-squared, there is a very slight preference for the model with rbisyr. But the difference is practically very small, and we might obtain a different answer by controlling for some of the variables in Computer Exercise C5 in Chapter 4. (Because both nonnested models contain five parameters, the usual R-squared can be used to draw the same conclusion.)

Comparing R̄² to choose among different nonnested sets of independent variables can be valuable when these variables represent different functional forms. Consider two models relating R&D intensity to firm sales:

\[ rdintens = \beta_0 + \beta_1 \log(sales) + u \tag{6.23} \]

\[ rdintens = \beta_0 + \beta_1\, sales + \beta_2\, sales^2 + u. \tag{6.24} \]

The first model captures a diminishing return by including sales in logarithmic form; the second model does this by using a quadratic. Thus, the second model contains one more parameter than the first.

When equation (6.23) is estimated using the 32 observations on chemical firms in RDCHEM, R² is .061, and R² for equation (6.24) is .148. Therefore, it appears that the quadratic fits much better. But a comparison of the usual R-squareds is unfair to the first model because it contains one fewer parameter than (6.24). That is, (6.23) is a more parsimonious model than (6.24). Everything else being equal, simpler models are better. Since the usual R-squared does not penalize more complicated models, it is better to use R̄². The R̄² for (6.23) is .030, while R̄² for (6.24) is .090. Thus, even after adjusting for the difference in degrees of freedom, the quadratic model wins out. (The quadratic model is also preferred when profit margin is added to each regression.)

There is an important limitation in using R̄² to choose between nonnested models: we cannot use it to choose between different functional forms for the dependent variable. This is unfortunate, because we often want to decide whether y or log(y) (or maybe some other transformation) should be used as the dependent variable based on goodness-of-fit. But neither R² nor R̄² can be used for this purpose. The reason is simple: these R-squareds measure the explained proportion of the total variation in whatever dependent variable we are using in the regression, and different nonlinear functions of the dependent variable will have different amounts of variation to explain. For example, the total variations in y and log(y) are not the same and are often very different. Comparing the adjusted R-squareds from regressions with these different forms of the dependent variable does not tell us anything about which model fits better; they are fitting two separate dependent variables.

Example 6.4 CEO Compensation and Firm Performance

Consider two estimated models relating CEO compensation to firm performance:

\[ \widehat{salary} = 830.63 + .0163\, sales + 19.63\, roe \tag{6.25} \]
(standard errors: 223.90, .0089, 11.08); n = 209, R² = .029, R̄² = .020,

and

\[ \widehat{lsalary} = 4.36 + .275\, lsales + .0179\, roe \tag{6.26} \]
(standard errors: .29, .033, .0040); n = 209, R² = .282, R̄² = .275,

where roe is the return on equity discussed in Chapter 2 and, for simplicity, lsalary and lsales denote the natural logs of salary and sales. We already know how to interpret these different estimated equations. But can we say that one model fits better than the other?

[Exploring Further 6.4: Explain why choosing a model by maximizing R̄² or minimizing σ̂ (the standard error of the regression) is the same thing.]
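The mechanics of the nonnested comparison are easy to see in code. The sketch below fits two models that share a dependent variable but swap one regressor for a highly correlated alternative, mirroring the hrunsyr/rbisyr choice; the variable names follow the baseball example, but the data are simulated, not the MLB1 data, so the adjusted R-squareds will not match the text's values.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 353
years   = rng.uniform(1, 20, n)
gamesyr = rng.uniform(10, 160, n)
bavg    = rng.normal(270, 30, n)
hrunsyr = rng.gamma(2, 4, n)
rbisyr  = 3.0*hrunsyr + rng.normal(0, 5, n)    # highly correlated with hrunsyr
lsalary = (11 + .07*years + .012*gamesyr + .0010*bavg
           + .015*rbisyr + rng.normal(0, .6, n))

# fit each nonnested model and compare adjusted R-squareds
for last in (hrunsyr, rbisyr):
    X = sm.add_constant(np.column_stack([years, gamesyr, bavg, last]))
    print(sm.OLS(lsalary, X).fit().rsquared_adj)
```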
The R-squared for equation (6.25) shows that sales and roe explain only about 2.9% of the variation in CEO salary in the sample. Both sales and roe have marginal statistical significance. Equation (6.26) shows that log(sales) and roe explain about 28.2% of the variation in log(salary). In terms of goodness-of-fit, this much higher R-squared would seem to imply that model (6.26) is much better, but this is not necessarily the case. The total sum of squares for salary in the sample is 391,732,982, while the total sum of squares for log(salary) is only 66.72. Thus, there is much less variation in log(salary) that needs to be explained.

At this point, we can use features other than R² or R̄² to decide between these models. For example, log(sales) and roe are much more statistically significant in (6.26) than are sales and roe in (6.25), and the coefficients in (6.26) are probably of more interest. To be sure, however, we will need to make a valid goodness-of-fit comparison. In Section 6.4, we will offer a goodness-of-fit measure that does allow us to compare models where y appears in both level and log form.

6.3c Controlling for Too Many Factors in Regression Analysis

In many of the examples we have covered, and certainly in our discussion of omitted variables bias in Chapter 3, we have worried about omitting important factors from a model that might be correlated with the independent variables. It is also possible to control for too many variables in a regression analysis.

If we overemphasize goodness-of-fit, we open ourselves to controlling for factors in a regression model that should not be controlled for. To avoid this mistake, we need to remember the ceteris paribus interpretation of multiple regression models.

To illustrate this issue, suppose we are doing a study to assess the impact of state beer taxes on traffic fatalities. The idea is that a higher tax on beer will reduce alcohol consumption, and likewise drunk driving, resulting in fewer traffic fatalities. To measure the ceteris paribus effect of taxes on fatalities, we can model fatalities as a function of several factors, including the beer tax:

\[ fatalities = \beta_0 + \beta_1\, tax + \beta_2\, miles + \beta_3\, percmale + \beta_4\, perc16\_21 + \cdots, \]

where miles is total miles driven, percmale is the percentage of the state population that is male, perc16_21 is the percentage of the population between ages 16 and 21, and so on. Notice how we have not included a variable measuring per capita beer consumption. Are we committing an omitted variables error? The answer is no. If we control for beer consumption in this equation, then how would beer taxes affect traffic fatalities? In the equation

\[ fatalities = \beta_0 + \beta_1\, tax + \beta_2\, beercons + \cdots, \]

β1 measures the difference in fatalities due to a one percentage point increase in tax, holding beercons fixed. It is difficult to understand why this would be interesting. We should not be controlling for differences in beercons across states, unless we want to test for some sort of indirect effect of beer taxes. Other factors, such as gender and age distribution, should be controlled for.

As a second example, suppose that, for a developing country, we want to estimate the effects of pesticide usage among farmers on family health expenditures.
In addition to pesticide usage amounts, should we include the number of doctor visits as an explanatory variable? No. Health expenditures include doctor visits, and we would like to pick up all effects of pesticide use on health expenditures. If we include the number of doctor visits as an explanatory variable, then we are only measuring the effects of pesticide use on health expenditures other than doctor visits. It makes more sense to use the number of doctor visits as a dependent variable in a separate regression on pesticide amounts.

The previous examples are what can be called over controlling for factors in multiple regression. Often this results from nervousness about potential biases that might arise by leaving out an important explanatory variable. But it is important to remember the ceteris paribus nature of multiple regression: in some cases, it makes no sense to hold some factors fixed precisely because they should be allowed to change when a policy variable changes.

Unfortunately, the issue of whether or not to control for certain factors is not always clear-cut. For example, Betts (1995) studies the effect of high school quality on subsequent earnings. He points out that, if better school quality results in more education, then controlling for education in the regression along with measures of quality will underestimate the return to quality. Betts does the analysis with and without years of education in the equation to get a range of estimated effects for quality of schooling.

To see explicitly how pursuing high R-squareds can lead to trouble, consider the housing price example from Section 4.5 that illustrates the testing of multiple hypotheses. In that case, we wanted to test the rationality of housing price assessments. We regressed log(price) on log(assess), log(lotsize), log(sqrft), and bdrms and tested whether the latter three variables had zero population coefficients while log(assess) had a coefficient of unity. But what if we change the purpose of the analysis and estimate a hedonic price model, which allows us to obtain the marginal values of various housing attributes? Should we include log(assess) in the equation? The adjusted R-squared from the regression with log(assess) is .762, while the adjusted R-squared without it is .630. Based on goodness-of-fit only, we should include log(assess). But this is incorrect if our goal is to determine the effects of lot size, square footage, and number of bedrooms on housing values. Including log(assess) in the equation amounts to holding one measure of value fixed and then asking how much an additional bedroom would change another measure of value. This makes no sense for valuing housing attributes.

If we remember that different models serve different purposes, and we focus on the ceteris paribus interpretation of regression, then we will not include the wrong factors in a regression model.

6.3d Adding Regressors to Reduce the Error Variance

We have just seen some examples of where certain independent variables should not be included in a regression model, even though they are correlated with the dependent variable. From Chapter 3, we know that adding a new independent variable to a regression can exacerbate the multicollinearity problem. On the other hand, since we are taking something out of the error term, adding a variable generally reduces the error variance. Generally, we cannot know which effect will dominate.

However, there is one case that is clear: we should always include independent variables that affect y and are uncorrelated with all of the independent variables of interest. Why? Because adding such a variable does not induce multicollinearity in the population (and therefore multicollinearity in the sample should be negligible), but it will reduce the error variance. In large sample sizes, the standard errors of all OLS estimators will be reduced.
As an example, consider estimating the individual demand for beer as a function of the average county beer price. It may be reasonable to assume that individual characteristics are uncorrelated with county-level prices, and so a simple regression of beer consumption on county price would suffice for estimating the effect of price on individual demand. But it is possible to get a more precise estimate of the price elasticity of beer demand by including individual characteristics, such as age and amount of education. If these factors affect demand and are uncorrelated with price, then the standard error of the price coefficient will be smaller, at least in large samples.

As a second example, consider the grants for computer equipment given at the beginning of Section 6.3. If, in addition to the grant variable, we control for other factors that can explain college GPA, we can probably get a more precise estimate of the effect of the grant. Measures of high school grade point average and rank, SAT and ACT scores, and family background variables are good candidates. Because the grant amounts are randomly assigned, all additional control variables are uncorrelated with the grant amount in the sample; multicollinearity between the grant amount and the other independent variables should be minimal. But adding the extra controls might significantly reduce the error variance, leading to a more precise estimate of the grant effect. Remember, the issue is not unbiasedness here: we obtain an unbiased and consistent estimator whether or not we add the high school performance and family background variables. The issue is getting an estimator with a smaller sampling variance.

A related point is that, when we have random assignment of a policy, we need not worry about whether some of our explanatory variables are endogenous, provided these variables themselves are not affected by the policy. For example, in studying the effect of hours in a job training program on labor earnings, we can include the amount of education reported prior to the job training program. We need not worry that schooling might be correlated with omitted factors, such as ability, because we are not trying to estimate the return to schooling. We are trying to estimate the effect of the job training program, and we can include any controls that are not themselves affected by job training without biasing the job training effect. What we must avoid is including a variable such as the amount of education after the job training program, as some people may decide to get more education because of how many hours they were assigned to the job training program.
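The grant example is easy to verify by simulation. The sketch below (synthetic data; the variable names echo the example but no real data set is used) generates a randomly assigned grant and a covariate, hsgpa, that affects colGPA but is uncorrelated with the grant. Adding hsgpa leaves the grant coefficient unbiased either way but shrinks its standard error, because the residual variance falls while multicollinearity stays negligible.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
n = 2000
grant = rng.uniform(0, 1000, n)      # randomly assigned, so uncorrelated with hsgpa
hsgpa = rng.normal(3.0, 0.4, n)      # affects colGPA but not the grant
colgpa = 2.0 + 0.0004*grant + 0.5*hsgpa + rng.normal(0, 0.5, n)

# standard error of the grant coefficient, with and without the extra control
se = lambda X: sm.OLS(colgpa, sm.add_constant(X)).fit().bse[1]
print("se(grant), grant only:  ", se(grant))
print("se(grant), adding hsgpa:", se(np.column_stack([grant, hsgpa])))
```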
Unfortunately, cases where we have information on additional explanatory variables that are uncorrelated with the explanatory variables of interest are somewhat rare in the social sciences. But it is worth remembering that, when these variables are available, they can be included in a model to reduce the error variance without inducing multicollinearity.

6.4 Prediction and Residual Analysis

In Chapter 3, we defined the OLS predicted or fitted values and the OLS residuals. Predictions are certainly useful, but they are subject to sampling variation, because they are obtained using the OLS estimators. Thus, in this section, we show how to obtain confidence intervals for a prediction from the OLS regression line.

From Chapters 3 and 4, we know that the residuals are used to obtain the sum of squared residuals and the R-squared, so they are important for goodness-of-fit and testing. Sometimes, economists study the residuals for particular observations to learn about individuals (or firms, houses, etc.) in the sample.

6.4a Confidence Intervals for Predictions

Suppose we have estimated the equation

\[ \hat{y} = \hat\beta_0 + \hat\beta_1 x_1 + \hat\beta_2 x_2 + \cdots + \hat\beta_k x_k. \tag{6.27} \]

When we plug in particular values of the independent variables, we obtain a prediction for y, which is an estimate of the expected value of y given the particular values for the explanatory variables. For emphasis, let c1, c2, …, ck denote particular values for each of the k independent variables; these may or may not correspond to an actual data point in our sample. The parameter we would like to estimate is

\[ \theta_0 = \beta_0 + \beta_1 c_1 + \beta_2 c_2 + \cdots + \beta_k c_k = E(y \mid x_1 = c_1,\, x_2 = c_2,\, \ldots,\, x_k = c_k). \tag{6.28} \]

The estimator of θ0 is

\[ \hat\theta_0 = \hat\beta_0 + \hat\beta_1 c_1 + \hat\beta_2 c_2 + \cdots + \hat\beta_k c_k. \tag{6.29} \]

In practice, this is easy to compute. But what if we want some measure of the uncertainty in this predicted value? It is natural to construct a confidence interval for θ0, which is centered at θ̂0.

To obtain a confidence interval for θ0, we need a standard error for θ̂0. Then, with a large df, we can construct a 95% confidence interval using the rule of thumb θ̂0 ± 2·se(θ̂0). (As always, we can use the exact percentiles in a t distribution.)

How do we obtain the standard error of θ̂0? This is the same problem we encountered in Section 4.4: we need a standard error for a linear combination of the OLS estimators. Here, the problem is even more complicated, because all of the OLS estimators generally appear in θ̂0 (unless some cj are zero). Nevertheless, the same trick that we used in Section 4.4 will work here. Write β0 = θ0 − β1c1 − ⋯ − βkck and plug this into the equation y = β0 + β1x1 + ⋯ + βkxk + u to obtain

\[ y = \theta_0 + \beta_1(x_1 - c_1) + \beta_2(x_2 - c_2) + \cdots + \beta_k(x_k - c_k) + u. \tag{6.30} \]

In other words, we subtract the value cj from each observation on xj, and then we run the regression of

\[ y_i \ \text{on}\ (x_{i1} - c_1),\ \ldots,\ (x_{ik} - c_k), \quad i = 1, 2, \ldots, n. \tag{6.31} \]

The predicted value in (6.29) and, more importantly, its standard error are obtained from the intercept (or constant) in regression (6.31).

As an example, we obtain a confidence interval for a prediction from a college GPA regression, where we use high school information.
Example 6.5 Confidence Interval for Predicted College GPA

Using the data in GPA2, we obtain the following equation for predicting college GPA:

\[ \widehat{colgpa} = 1.493 + .00149\, sat - .01386\, hsperc - .06088\, hsize + .00546\, hsize^2 \tag{6.32} \]
(standard errors: .075, .00007, .00056, .01650, .00227); n = 4,137, R² = .278, R̄² = .277, σ̂ = .560,

where we have reported estimates to several digits to reduce round-off error. What is predicted college GPA when sat = 1,200, hsperc = 30, and hsize = 5 (which means 500)? This is easy to get by plugging these values into equation (6.32): colgpa-hat = 2.70 (rounded to two digits). Unfortunately, we cannot use equation (6.32) directly to get a confidence interval for the expected colgpa at the given values of the independent variables. One simple way to obtain a confidence interval is to define a new set of independent variables: sat0 = sat − 1,200, hsperc0 = hsperc − 30, hsize0 = hsize − 5, and hsizesq0 = hsize² − 25. When we regress colgpa on these new independent variables, we get

\[ \widehat{colgpa} = 2.700 + .00149\, sat0 - .01386\, hsperc0 - .06088\, hsize0 + .00546\, hsizesq0 \]
(standard errors: .020, .00007, .00056, .01650, .00227); n = 4,137, R² = .278, R̄² = .277, σ̂ = .560.

The only difference between this regression and (6.32) is the intercept, which is the prediction we want, along with its standard error, .020. It is not an accident that the slope coefficients, their standard errors, the R-squared, and so on are the same as before; this provides a way to check that the proper transformations were done. We can easily construct a 95% confidence interval for the expected college GPA: 2.70 ± 1.96(.020), or about 2.66 to 2.74. This confidence interval is rather narrow due to the very large sample size.

Because the variance of the intercept estimator is smallest when each explanatory variable has zero sample mean (see Question 2.5 for the simple regression case), it follows from regression (6.31) that the variance of the prediction is smallest at the mean values of the xj (that is, cj = x̄j for all j). This result is not too surprising, since we have the most faith in our regression line near the middle of the data. As the values of the cj get farther away from the x̄j, Var(ŷ) gets larger and larger.
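The centering device of Example 6.5 takes only a few lines of code. The sketch below uses synthetic data whose variable names mirror GPA2 (the actual data set is not used, so the estimates will only roughly resemble those in the text); the intercept and its standard error from the centered regression give the prediction and its confidence interval directly.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
n = 4137
sat    = rng.normal(1030, 140, n)
hsperc = rng.uniform(0, 100, n)
hsize  = rng.uniform(0.3, 9, n)
colgpa = (1.493 + .00149*sat - .01386*hsperc - .06088*hsize
          + .00546*hsize**2 + rng.normal(0, .56, n))

# center every regressor at the prediction point (sat=1200, hsperc=30, hsize=5);
# the intercept is then theta0-hat, and its se is the se of the prediction
c = dict(sat=1200, hsperc=30, hsize=5)
X0 = np.column_stack([sat - c['sat'], hsperc - c['hsperc'],
                      hsize - c['hsize'], hsize**2 - c['hsize']**2])
res = sm.OLS(colgpa, sm.add_constant(X0)).fit()
theta0, se0 = res.params[0], res.bse[0]
print(f"prediction {theta0:.3f}, 95% CI {theta0 - 1.96*se0:.3f} to {theta0 + 1.96*se0:.3f}")
```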
The previous method allows us to put a confidence interval around the OLS estimate of E(y | x1, …, xk) for any values of the explanatory variables. In other words, we obtain a confidence interval for the average value of y for the subpopulation with a given set of covariates. But a confidence interval for the average person in the subpopulation is not the same as a confidence interval for a particular unit (individual, family, firm, and so on) from the population. In forming a confidence interval for an unknown outcome on y, we must account for another very important source of variation: the variance in the unobserved error, which measures our ignorance of the unobserved factors that affect y.

Let y⁰ denote the value for which we would like to construct a confidence interval, which we sometimes call a prediction interval. For example, y⁰ could represent a person or firm not in our original sample. Let x1⁰, …, xk⁰ be the new values of the independent variables, which we assume we observe, and let u⁰ be the unobserved error. Therefore, we have

\[ y^0 = \beta_0 + \beta_1 x_1^0 + \beta_2 x_2^0 + \cdots + \beta_k x_k^0 + u^0. \tag{6.33} \]

As before, our best prediction of y⁰ is the expected value of y⁰ given the explanatory variables, which we estimate from the OLS regression line: ŷ⁰ = β̂0 + β̂1x1⁰ + β̂2x2⁰ + ⋯ + β̂kxk⁰. The prediction error in using ŷ⁰ to predict y⁰ is

\[ \hat{e}^0 = y^0 - \hat{y}^0 = (\beta_0 + \beta_1 x_1^0 + \cdots + \beta_k x_k^0) + u^0 - \hat{y}^0. \tag{6.34} \]

Now, E(ŷ⁰) = E(β̂0) + E(β̂1)x1⁰ + ⋯ + E(β̂k)xk⁰ = β0 + β1x1⁰ + ⋯ + βkxk⁰, because the β̂j are unbiased. (As before, these expectations are all conditional on the sample values of the independent variables.) Because u⁰ has zero mean, E(ê⁰) = 0. We have shown that the expected prediction error is zero.

In finding the variance of ê⁰, note that u⁰ is uncorrelated with each β̂j, because u⁰ is uncorrelated with the errors in the sample used to obtain the β̂j. By basic properties of covariance (see Appendix B), u⁰ and ŷ⁰ are uncorrelated. Therefore, the variance of the prediction error (conditional on all in-sample values of the independent variables) is the sum of the variances:

\[ \operatorname{Var}(\hat{e}^0) = \operatorname{Var}(\hat{y}^0) + \operatorname{Var}(u^0) = \operatorname{Var}(\hat{y}^0) + \sigma^2, \tag{6.35} \]

where σ² = Var(u⁰) is the error variance. There are two sources of variation in ê⁰. The first is the sampling error in ŷ⁰, which arises because we have estimated the βj. Because each β̂j has a variance proportional to 1/n, where n is the sample size, Var(ŷ⁰) is proportional to 1/n. This means that, for large samples, Var(ŷ⁰) can be very small. By contrast, σ² is the variance of the error in the population; it does not change with the sample size. In many examples, σ² will be the dominant term in (6.35).

Under the classical linear model assumptions, the β̂j and u⁰ are normally distributed, and so ê⁰ is also normally distributed (conditional on all sample values of the explanatory variables). Earlier, we described how to obtain an unbiased estimator of Var(ŷ⁰), and we obtained our unbiased estimator of σ² in Chapter 3. By using these estimators, we can define the standard error of ê⁰ as

\[ \operatorname{se}(\hat{e}^0) = \left\{ [\operatorname{se}(\hat{y}^0)]^2 + \hat\sigma^2 \right\}^{1/2}. \tag{6.36} \]

Using the same reasoning as for the t statistics of the β̂j, ê⁰/se(ê⁰) has a t distribution with n − (k + 1) degrees of freedom. Therefore,

\[ P[-t_{.025} \le \hat{e}^0/\operatorname{se}(\hat{e}^0) \le t_{.025}] = .95, \]

where t.025 is the 97.5th percentile in the t(n−k−1) distribution. For large n − k − 1, remember that t.025 ≈ 1.96. Plugging in ê⁰ = y⁰ − ŷ⁰ and rearranging gives a 95% prediction interval for y⁰:

\[ \hat{y}^0 \pm t_{.025}\cdot \operatorname{se}(\hat{e}^0); \tag{6.37} \]

as usual, except for small df, a good rule of thumb is ŷ⁰ ± 2·se(ê⁰). This interval is wider than the confidence interval for ŷ⁰ itself, because of σ̂² in (6.36); it often is much wider, to reflect the factors in u⁰ that we have not accounted for.
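A compact sketch makes the contrast between the two intervals concrete. Using simulated data and a simple regression (numpy, statsmodels, and scipy assumed available), it computes both the confidence interval for E(y | x = c) and the much wider prediction interval from equations (6.36) and (6.37):

```python
import numpy as np
import statsmodels.api as sm
from scipy import stats

rng = np.random.default_rng(4)
n = 200
x = rng.normal(0, 1, n)
y = 1.0 + 2.0*x + rng.normal(0, 1.5, n)

c = 0.8                                          # prediction point
res = sm.OLS(y, sm.add_constant(x - c)).fit()    # the centering trick again
y0_hat, se_y0 = res.params[0], res.bse[0]        # theta0-hat and se(y0-hat)
se_e0 = np.sqrt(se_y0**2 + res.mse_resid)        # equation (6.36): add sigma^2-hat
t025 = stats.t.ppf(0.975, res.df_resid)
print("95% CI for E(y|x=c):   ", y0_hat - t025*se_y0, y0_hat + t025*se_y0)
print("95% prediction interval:", y0_hat - t025*se_e0, y0_hat + t025*se_e0)
```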
Example 6.6 Confidence Interval for Future College GPA

Suppose we want a 95% CI for the future college GPA of a high school student with sat = 1,200, hsperc = 30, and hsize = 5. In Example 6.5, we obtained a 95% confidence interval for the average college GPA among all students with these particular characteristics. Now, we want a 95% confidence interval for any particular student with these characteristics. The 95% prediction interval must account for the variation in the individual, unobserved characteristics that affect college performance. We have everything we need to obtain a CI for colgpa: se(ŷ⁰) = .020 and σ̂ = .560, and so, from (6.36), se(ê⁰) = [(.020)² + (.560)²]^{1/2} ≈ .560. Notice how small se(ŷ⁰) is relative to σ̂: virtually all of the variation in ê⁰ comes from the variation in u⁰. The 95% CI is 2.70 ± 1.96(.560), or about 1.60 to 3.80. This is a wide confidence interval and shows that, based on the factors we included in the regression, we cannot accurately pin down an individual's future college grade point average. (In one sense, this is good news, as it means that high school rank and performance on the SAT do not preordain one's performance in college.) Evidently, the unobserved characteristics that affect college GPA vary widely among individuals with the same observed SAT score and high school rank.

6.4b Residual Analysis

Sometimes, it is useful to examine individual observations to see whether the actual value of the dependent variable is above or below the predicted value; that is, to examine the residuals for the individual observations. This process is called residual analysis. Economists have been known to examine the residuals from a regression in order to aid in the purchase of a home. The following housing price example illustrates residual analysis. Housing price is related to various observable characteristics of the house. We can list all of the characteristics that we find important, such as size, number of bedrooms, number of bathrooms, and so on. We can use a sample of houses to estimate a relationship between price and attributes, where we end up with a predicted value and an actual value for each house. Then, we can construct the residuals, ûᵢ = yᵢ − ŷᵢ. The house with the most negative residual is, at least based on the factors we have controlled for, the most underpriced one relative to its observed characteristics. Of course, a selling price substantially below its predicted price could indicate some undesirable feature of the house that we have failed to account for, and which is therefore contained in the unobserved error. In addition to obtaining the prediction and residual, it also makes sense to compute a confidence interval for what the future selling price of the home could be, using the method described in equation (6.37).

Using the data in HPRICE1, we run a regression of price on lotsize, sqrft, and bdrms. In the sample of 88 homes, the most negative residual is −120.206, for the 81st house. Therefore, the asking price for this house is $120,206 below its predicted price.
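Mechanically, residual analysis is just a regression followed by a sort. The sketch below simulates data in the spirit of HPRICE1 (the variable names and rough magnitudes follow the text, but the numbers are synthetic, so the flagged house and residual will not match) and reports the most underpriced observation:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(5)
n = 88
lotsize = rng.uniform(1000, 90000, n)
sqrft   = rng.uniform(1100, 3600, n)
bdrms   = rng.integers(2, 7, n).astype(float)
price   = -21.8 + .002*lotsize + .123*sqrft + 13.9*bdrms + rng.normal(0, 55, n)

X = sm.add_constant(np.column_stack([lotsize, sqrft, bdrms]))
res = sm.OLS(price, X).fit()
i = int(np.argmin(res.resid))            # index of the most negative residual
print(f"house {i}: actual {price[i]:.1f}, predicted {res.fittedvalues[i]:.1f}, "
      f"residual {res.resid[i]:.1f}")
```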
There are many other uses of residual analysis. One way to rank law schools is to regress median starting salary on a variety of student characteristics (such as median LSAT scores of the entering class, median college GPA of the entering class, and so on) and to obtain a predicted value and residual for each law school. The law school with the largest residual has the highest predicted value added. (Of course, there is still much uncertainty about how an individual's starting salary would compare with the median for a law school overall.) These residuals can be used, along with the costs of attending each law school, to determine the best value; this would require an appropriate discounting of future earnings.

Residual analysis also plays a role in legal decisions. A New York Times article entitled "Judge Says Pupil's Poverty, Not Segregation, Hurts Scores" (6/28/95) describes an important legal case. The issue was whether the poor performance on standardized tests in the Hartford School District, relative to performance in surrounding suburbs, was due to poor school quality at the highly segregated schools. The judge concluded that the "disparity in test scores does not indicate that Hartford is doing an inadequate or poor job in educating its students or that its schools are failing, because the predicted scores based upon the relevant socioeconomic factors are about at the levels that one would expect." This conclusion is based on a regression analysis of average or median scores on socioeconomic characteristics of various school districts in Connecticut. The judge's conclusion suggests that, given the poverty levels of students at Hartford schools, the actual test scores were similar to those predicted from a regression analysis: the residual for Hartford was not sufficiently negative to conclude that the schools themselves were the cause of low test scores.

[Exploring Further 6.5: How would you use residual analysis to determine which professional athletes are overpaid or underpaid relative to their performance?]

6.4c Predicting y When log(y) Is the Dependent Variable

Because the natural log transformation is used so often for the dependent variable in empirical economics, we devote this subsection to the issue of predicting y when log(y) is the dependent variable. As a byproduct, we will obtain a goodness-of-fit measure for the log model that can be compared with the R-squared from the level model.

To obtain a prediction, it is useful to define logy = log(y); this emphasizes that it is the log of y that is predicted in the model

\[ logy = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k + u. \tag{6.38} \]

In this equation, the xj might be transformations of other variables; for example, we could have x1 = log(sales), x2 = log(mktval), and x3 = ceoten in the CEO salary example.

Given the OLS estimators, we know how to predict logy for any values of the independent variables:

\[ \widehat{logy} = \hat\beta_0 + \hat\beta_1 x_1 + \hat\beta_2 x_2 + \cdots + \hat\beta_k x_k. \tag{6.39} \]

Now, since the exponential undoes the log, our first guess for predicting y is to simply exponentiate the predicted value for log(y): ŷ = exp(logy-hat). This does not work; in fact, it will systematically underestimate the expected value of y. If model (6.38) follows the CLM assumptions MLR.1 through MLR.6, it can be shown that

\[ E(y \mid \mathbf{x}) = \exp(\sigma^2/2)\cdot \exp(\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k), \]

where x denotes the independent variables and σ² is the variance of u. [If u ~ Normal(0, σ²), then the expected value of exp(u) is exp(σ²/2).] This equation shows that a simple adjustment is needed to predict y:

\[ \hat{y} = \exp(\hat\sigma^2/2)\exp(\widehat{logy}), \tag{6.40} \]

where σ̂² is simply the unbiased estimator of σ². Because σ̂, the standard error of the regression, is always reported, obtaining predicted values for y is easy. Because σ̂² > 0, exp(σ̂²/2) > 1. For large σ̂², this adjustment factor can be substantially larger than unity.
The prediction in (6.40) is not unbiased, but it is consistent. (There are no unbiased predictions of y.) In many cases, (6.40) works well. However, it does rely on the normality of the error term, u. In Chapter 5, we showed that OLS has desirable properties even when u is not normally distributed. Therefore, it is useful to have a prediction that does not rely on normality. If we just assume that u is independent of the explanatory variables, then we have

\[ E(y \mid \mathbf{x}) = \alpha_0 \exp(\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k), \tag{6.41} \]

where α0 is the expected value of exp(u), which must be greater than unity.

Given an estimate α̂0, we can predict y as

\[ \hat{y} = \hat\alpha_0 \exp(\widehat{logy}), \tag{6.42} \]

which again simply requires exponentiating the predicted value from the log model and multiplying the result by α̂0.

Two approaches suggest themselves for estimating α0 without the normality assumption. The first is based on α0 = E[exp(u)]. To estimate α0, we replace the population expectation with a sample average and then replace the unobserved errors uᵢ with the OLS residuals, ûᵢ = log(yᵢ) − β̂0 − β̂1x(i1) − ⋯ − β̂k x(ik). This leads to the method of moments estimator (see Appendix C):

\[ \hat\alpha_0 = n^{-1} \sum_{i=1}^{n} \exp(\hat{u}_i). \tag{6.43} \]

Not surprisingly, α̂0 is a consistent estimator of α0, but it is not unbiased, because we have replaced uᵢ with ûᵢ inside a nonlinear function. This version of α̂0 is a special case of what Duan (1983) called a smearing estimate. Because the OLS residuals have a zero sample average, it can be shown that, for any data set, α̂0 > 1. (Technically, α̂0 would equal one if all the OLS residuals were zero, but this never happens in any interesting application.) That α̂0 is necessarily greater than one is convenient, because it must be that α0 > 1.

A different estimate of α0 is based on a simple regression through the origin. To see how it works, define mᵢ = exp(β0 + β1x(i1) + ⋯ + βk x(ik)), so that, from equation (6.41), E(yᵢ | mᵢ) = α0 mᵢ. If we could observe the mᵢ, we could obtain an unbiased estimator of α0 from the regression of yᵢ on mᵢ without an intercept. Instead, we replace the βj with their OLS estimates and obtain m̂ᵢ = exp(logy-hatᵢ), where, of course, the logy-hatᵢ are the fitted values from the regression of logyᵢ on x(i1), …, x(ik) (with an intercept). Then α̌0 [to distinguish it from α̂0 in equation (6.43)] is the OLS slope estimate from the simple regression of yᵢ on m̂ᵢ (no intercept):

\[ \check\alpha_0 = \left( \sum_{i=1}^{n} \hat{m}_i^2 \right)^{-1} \left( \sum_{i=1}^{n} \hat{m}_i\, y_i \right). \tag{6.44} \]

We will call α̌0 the regression estimate of α0. Like α̂0, α̌0 is consistent but not unbiased. Interestingly, α̌0 is not guaranteed to be greater than one, although it will be in most applications. If α̌0 is less than one, and especially if it is much less than one, it is likely that the assumption of independence between u and the xj is violated. If α̌0 < 1, one possibility is to just use the estimate in (6.43), although this may simply be masking a problem with the linear model for logy. We summarize the steps:
Predicting y When the Dependent Variable Is log(y):

1. Obtain the fitted values logy-hatᵢ and residuals ûᵢ from the regression of logy on x1, …, xk.
2. Obtain α̂0 as in equation (6.43) or α̌0 as in equation (6.44).
3. For given values of x1, …, xk, obtain logy-hat from (6.39).
4. Obtain the prediction ŷ from (6.42), with α̂0 or α̌0.

We now show how to predict CEO salaries using this procedure.

Example 6.7 Predicting CEO Salaries

The model of interest is

\[ \log(salary) = \beta_0 + \beta_1 \log(sales) + \beta_2 \log(mktval) + \beta_3\, ceoten + u, \]

so that β1 and β2 are elasticities and 100·β3 is a semi-elasticity. The estimated equation using CEOSAL2 is

\[ \widehat{lsalary} = 4.504 + .163\, lsales + .109\, lmktval + .0117\, ceoten \tag{6.45} \]
(standard errors: .257, .039, .050, .0053); n = 177, R² = .318,

where, for clarity, lsalary denotes the log of salary, and similarly for lsales and lmktval. Next, we obtain m̂ᵢ = exp(lsalary-hatᵢ) for each observation in the sample. The Duan smearing estimate from (6.43) is about α̂0 = 1.136, and the regression estimate from (6.44) is α̌0 = 1.117. We can use either estimate to predict salary for any values of sales, mktval, and ceoten.

Let us find the prediction for sales = 5,000 (which means $5 billion, because sales is in millions), mktval = 10,000 (or $10 billion), and ceoten = 10. From (6.45), the prediction for lsalary is 4.504 + .163·log(5,000) + .109·log(10,000) + .0117(10) ≈ 7.013, and exp(7.013) ≈ 1,110.983. Using the estimate of α0 from (6.43), the predicted salary is about 1,262.077, or $1,262,077. Using the estimate from (6.44) gives an estimated salary of about $1,240,968. These differ from each other by much less than each differs from the naive prediction of $1,110,983.
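The four-step procedure, together with all three estimates of α0, fits in a few lines. The sketch below simulates data whose names mirror CEOSAL2 (the real data are not used, so the estimates will only be in the neighborhood of the text's values) and then predicts salary at sales = 5,000, mktval = 10,000, ceoten = 10:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(6)
n = 177
lsales  = rng.normal(7.2, 1.4, n)
lmktval = rng.normal(7.6, 1.1, n)
ceoten  = rng.exponential(8, n)
lsalary = 4.5 + .16*lsales + .11*lmktval + .012*ceoten + rng.normal(0, .5, n)
salary  = np.exp(lsalary)

# step 1: regression of log(salary) on the regressors
X = sm.add_constant(np.column_stack([lsales, lmktval, ceoten]))
res = sm.OLS(lsalary, X).fit()

# step 2: three estimates of alpha0
m_hat   = np.exp(res.fittedvalues)              # m_i-hat = exp(logy_i-hat)
a_smear = np.mean(np.exp(res.resid))            # Duan smearing estimate, eq. (6.43)
a_reg   = (m_hat @ salary) / (m_hat @ m_hat)    # regression through the origin, eq. (6.44)
a_norm  = np.exp(res.mse_resid / 2)             # normality-based adjustment, eq. (6.40)

# steps 3 and 4: predict at sales = 5,000, mktval = 10,000, ceoten = 10
x0 = np.array([1.0, np.log(5000), np.log(10000), 10.0])
ly0 = float(x0 @ res.params)
print("naive prediction:", np.exp(ly0))
for name, a in [("smearing", a_smear), ("regression", a_reg), ("normal", a_norm)]:
    print(name, a * np.exp(ly0))
```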
We can use the previous method of obtaining predictions to determine how well the model with logy as the dependent variable explains y. We already have measures for models when y is the dependent variable: the R-squared and the adjusted R-squared. The goal is to find a goodness-of-fit measure in the logy model that can be compared with an R-squared from a model where y is the dependent variable.

There are different ways to define a goodness-of-fit measure after retransforming a model for logy to predict y. Here we present two approaches that are easy to implement. The first gives the same goodness-of-fit measure whether we estimate α0 as in (6.40), (6.43), or (6.44). To motivate the measure, recall that, in the linear regression equation estimated by OLS,

\[ \hat{y} = \hat\beta_0 + \hat\beta_1 x_1 + \cdots + \hat\beta_k x_k, \tag{6.46} \]

the usual R-squared is simply the square of the correlation between yᵢ and ŷᵢ (see Section 3.2). Now, if instead we compute fitted values from (6.42), that is, ŷᵢ = α̂0 m̂ᵢ for all observations i, then it makes sense to use the square of the correlation between yᵢ and these fitted values as an R-squared. Because correlation is unaffected if we multiply by a constant, it does not matter which estimate of α0 we use. In fact, this R-squared measure for y (not logy) is just the squared correlation between yᵢ and m̂ᵢ. We can compare this directly with the R-squared from equation (6.46). The squared correlation measure does not depend on how we estimate α0.

A second approach is to compute an R-squared for y based on a sum of squared residuals. For concreteness, suppose we use equation (6.43) to estimate α0. Then the residual for predicting yᵢ is

\[ \hat{r}_i = y_i - \hat\alpha_0 \exp(\widehat{logy}_i), \tag{6.47} \]

and we can use these residuals to compute a sum of squared residuals. Using the formula for R-squared from linear regression, we are led to

\[ 1 - \frac{\sum_{i=1}^{n} \hat{r}_i^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2} \tag{6.48} \]

as an alternative goodness-of-fit measure that can be compared with the R-squared from the linear model for y. Notice that we can compute such a measure for the alternative estimates of α0 in equations (6.40) and (6.44) by inserting those estimates in place of α̂0 in (6.47). Unlike the squared correlation between yᵢ and m̂ᵢ, the R-squared in (6.48) will depend on how we estimate α0. The estimate that minimizes the sum of the r̂ᵢ² is that in equation (6.44), but that does not mean we should prefer it, and certainly not if α̌0 < 1. We are not really trying to choose among the different estimates of α0; rather, we are finding goodness-of-fit measures that can be compared with the linear model for y.

Example 6.8 Predicting CEO Salaries

After we obtain the m̂ᵢ, we just obtain the correlation between salaryᵢ and m̂ᵢ; it is .493. The square of it is about .243, and this is a measure of how well the log model explains the variation in salary, not log(salary). [The R² from (6.45), .318, tells us that the log model explains about 31.8% of the variation in log(salary).] As a competing linear model, suppose we estimate a model with all variables in levels:

\[ salary = \beta_0 + \beta_1\, sales + \beta_2\, mktval + \beta_3\, ceoten + u. \tag{6.49} \]

The key is that the dependent variable is salary. (We could use logs of sales or mktval on the right-hand side, but it makes more sense to have all dollar values in levels if one, salary, appears as a level.) The R-squared from estimating this equation, using the same 177 observations, is .201. Thus, the log model explains more of the variation in salary, and so we prefer it to (6.49) on goodness-of-fit grounds. The log model is also preferred because it seems more realistic and because its parameters are easier to interpret.
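Both goodness-of-fit measures follow directly from quantities already computed. The fragment below continues the previous sketch (it reuses salary, m_hat, and a_smear from that block, so run them together):

```python
# squared correlation between y_i and m_i-hat: does not depend on the alpha0 estimate
r_corr = np.corrcoef(salary, m_hat)[0, 1]
print("squared correlation for y:", r_corr**2)

# alternative: R-squared built from r_i = y_i - a0-hat*m_i-hat, equation (6.48)
r = salary - a_smear * m_hat
r2_level = 1 - np.sum(r**2) / np.sum((salary - salary.mean())**2)
print("SSR-based R-squared for y:", r2_level)
```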
If we maintain the full set of classical linear model assumptions in the model (6.38), we can easily obtain prediction intervals for y⁰ = exp(β0 + β1x1⁰ + ⋯ + βkxk⁰ + u⁰) when we have estimated a linear model for logy. Recall that x1⁰, x2⁰, …, xk⁰ are known values and u⁰ is the unobserved error that partly determines y⁰. From equation (6.37), a 95% prediction interval for logy⁰ = log(y⁰) is simply logy-hat⁰ ± t.025·se(ê⁰), where se(ê⁰) is obtained from the regression of logy on x1, …, xk using the original n observations. Let cl = logy-hat⁰ − t.025·se(ê⁰) and cu = logy-hat⁰ + t.025·se(ê⁰) be the lower and upper bounds of the prediction interval for logy⁰; that is, P(cl ≤ logy⁰ ≤ cu) = .95. Because the exponential function is strictly increasing, it is also true that P[exp(cl) ≤ exp(logy⁰) ≤ exp(cu)] = .95, that is, P[exp(cl) ≤ y⁰ ≤ exp(cu)] = .95. Therefore, we can take exp(cl) and exp(cu) as the lower and upper bounds, respectively, of a 95% prediction interval for y⁰. For large n, t.025 = 1.96, and so a 95% prediction interval for y⁰ runs from exp[−1.96·se(ê⁰)]·exp(β̂0 + x⁰β̂) to exp[1.96·se(ê⁰)]·exp(β̂0 + x⁰β̂), where x⁰β̂ is shorthand for β̂1x1⁰ + ⋯ + β̂kxk⁰. Remember, the β̂j and se(ê⁰) are obtained from the regression with logy as the dependent variable. Because we assume normality of u in (6.38), we probably would use (6.40) to obtain a point prediction for y⁰. Unlike in equation (6.37), this point prediction will not lie halfway between the lower and upper bounds exp(cl) and exp(cu). One can obtain different 95% prediction intervals by choosing different quantiles in the t(n−k−1) distribution: if q(α1) and q(α2) are quantiles with α2 − α1 = .95, then we can choose cl = logy-hat⁰ + q(α1)·se(ê⁰) and cu = logy-hat⁰ + q(α2)·se(ê⁰).

As an example, consider the CEO salary regression, where we make the prediction at the same values of sales, mktval, and ceoten as in Example 6.7. The standard error of the regression for (6.45) is about .505, and the standard error of logy-hat⁰ is about .075. Therefore, using equation (6.36), se(ê⁰) ≈ .511; as in the GPA example, the error variance swamps the estimation error in the parameters, even though here the sample size is only 177. A 95% prediction interval for salary⁰ is exp[−1.96(.511)]·exp(7.013) to exp[1.96(.511)]·exp(7.013), or about 408.071 to 3,024.678, that is, $408,071 to $3,024,678. This very wide 95% prediction interval for CEO salary, at the given sales, market value, and tenure values, shows that there is much else that we have not included in the regression that determines salary. Incidentally, the point prediction for salary, using (6.40), is about 1,262.075, higher than the predictions using the other estimates of α0 and closer to the lower bound than the upper bound of the 95% prediction interval.
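The exp-bounds construction translates directly into code. The sketch below, on simulated data for a one-regressor log model, forms the interval on the log scale via equations (6.36) and (6.37) and then exponentiates the endpoints:

```python
import numpy as np
import statsmodels.api as sm
from scipy import stats

rng = np.random.default_rng(7)
n = 177
x = rng.normal(7, 1.2, n)
ly = 4.5 + .2*x + rng.normal(0, .5, n)           # model for log(y)
y = np.exp(ly)

x0 = 8.5                                          # prediction point
res = sm.OLS(ly, sm.add_constant(x - x0)).fit()   # centering trick on the log scale
ly0, se_ly0 = res.params[0], res.bse[0]
se_e0 = np.sqrt(se_ly0**2 + res.mse_resid)        # eq. (6.36) applied to log(y)
t025 = stats.t.ppf(0.975, res.df_resid)
cl, cu = ly0 - t025*se_e0, ly0 + t025*se_e0       # bounds for log(y0)
print("95% prediction interval for y0:", np.exp(cl), "to", np.exp(cu))
```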
Summary

In this chapter, we have covered some important multiple regression analysis topics.

Section 6.1 showed that a change in the units of measurement of an independent variable changes the OLS coefficient in the expected manner: if xj is multiplied by c, its coefficient is divided by c. If the dependent variable is multiplied by c, all OLS coefficients are multiplied by c. Neither t nor F statistics are affected by changing the units of measurement of any variables.

We discussed beta coefficients, which measure the effects of the independent variables on the dependent variable in standard deviation units. The beta coefficients are obtained from a standard OLS regression after the dependent and independent variables have been transformed into z-scores.

We provided a detailed discussion of functional form, including the logarithmic transformation, quadratics, and interaction terms. It is helpful to summarize some of our conclusions.

Considerations When Using Logarithms

1. The coefficients have percentage change interpretations. We can be ignorant of the units of measurement of any variable that appears in logarithmic form, and changing units from, say, dollars to thousands of dollars has no effect on a variable's coefficient when that variable appears in logarithmic form.
2. Logs are often used for dollar amounts that are always positive, as well as for variables such as population, especially when there is a lot of variation. They are used less often for variables measured in years, such as schooling, age, and experience. Logs are used infrequently for variables that are already percents or proportions, such as an unemployment rate or a pass rate on a test.
3. Models with log(y) as the dependent variable often more closely satisfy the classical linear model assumptions. For example, the model has a better chance of being linear, homoskedasticity is more likely to hold, and normality is often more plausible.
4. In many cases, taking the log greatly reduces the variation of a variable, making OLS estimates less prone to outlier influence. However, in cases where y is a fraction and close to zero for many observations, log(yᵢ) can have much more variability than yᵢ. For values yᵢ very close to zero, log(yᵢ) is a negative number very large in magnitude.
5. If y ≥ 0 but y = 0 is possible, we cannot use log(y). Sometimes log(1 + y) is used, but interpretation of the coefficients is difficult.
6. For large changes in an explanatory variable, we can compute a more accurate estimate of the percentage change effect.
7. It is harder (but possible) to predict y when we have estimated a model for log(y).

Considerations When Using Quadratics

1. A quadratic function in an explanatory variable allows for an increasing or decreasing effect.
2. The turning point of a quadratic is easily calculated, and it should be calculated to see whether it makes sense.
3. Quadratic functions where the coefficients have opposite signs have a strictly positive turning point; if the signs of the coefficients are the same, the turning point is at a negative value of x.
4. A seemingly small coefficient on the square of a variable can be practically important in what it implies about a changing slope. One can use a t test to see whether the quadratic is statistically significant and compute the slope at various values of x to see whether it is practically important.
5. For a model quadratic in a variable x, the coefficient on x measures the partial effect starting from x = 0, as can be seen in equation (6.11). If zero is not a possible or interesting value of x, one can center x about a more interesting value, such as the average in the sample, before computing the square. This is the same as computing the average partial effect. (Computer Exercise C12 provides an example.)

Considerations When Using Interactions

1. Interaction terms allow the partial effect of an explanatory variable, say x1, to depend on the level of another variable, say x2, and vice versa.
2. Interpreting models with interactions can be tricky. The coefficient on x1, say β1, measures the partial effect of x1 on y when x2 = 0, which may be impossible or uninteresting. Centering x1 and x2 around interesting values before constructing the interaction term typically leads to an equation that is visually more appealing. When the variables are centered about their sample averages, the coefficients on the levels become estimated average partial effects.
3. A standard t test can be used to determine whether an interaction term is statistically significant. Computing the partial effects at different values of the explanatory variables can be used to determine the practical importance of interactions.

We introduced the adjusted R-squared, R̄², as an alternative to the usual R-squared for measuring goodness-of-fit. Whereas R² can never fall when another variable is added to a regression, R̄² penalizes the number of regressors and can drop when an independent variable is added. This makes R̄² preferable for choosing between nonnested models with different numbers of explanatory variables. Neither R² nor R̄² can be used to compare models with different dependent variables. Nevertheless, it is fairly easy to obtain goodness-of-fit measures for choosing between y and log(y) as the dependent variable, as shown in Section 6.4.
Key Terms

Adjusted R-Squared; Average Partial Effect (APE); Beta Coefficients; Bootstrap; Bootstrap Standard Error; Interaction Effect; Nonnested Models; Over Controlling; Population R-Squared; Prediction Error; Prediction Interval; Predictions; Quadratic Functions; Resampling Method; Residual Analysis; Smearing Estimate; Standardized Coefficients; Variance of the Prediction Error.
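The problems that follow can be worked by hand from the reported results; the computer exercises require the text's data sets (CEOSAL1, WAGE1, and so on) and any regression package. As one option, the sketch below shows a minimal Python session. It assumes the third-party wooldridge package, whose data() function loads the text's data sets by name; that package is an assumption of this sketch, not something the text itself uses. The model shown is the one from Computer Exercise C2(i).

import numpy as np
import statsmodels.formula.api as smf
import wooldridge  # third-party package bundling the text's data sets

df = wooldridge.data("wage1")  # e.g., the WAGE1 data

# "Report the results in the usual form": the summary shows the
# coefficients, standard errors, n, and R-squared.
res = smf.ols("np.log(wage) ~ educ + exper + I(exper**2)", data=df).fit()
print(res.summary())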
Problems

1  The following equation was estimated using the data in CEOSAL1:

    log(salary) = 4.322 + .276 log(sales) + .0215 roe − .00008 roe²
                  (.324)  (.033)            (.0129)     (.00026)
    n = 209, R² = .282.

This equation allows roe to have a diminishing effect on log(salary). Is this generality necessary? Explain why or why not.

2  Let β̂0, β̂1, …, β̂k be the OLS estimates from the regression of yi on xi1, …, xik, i = 1, 2, …, n. For nonzero constants c1, …, ck, argue that the OLS intercept and slopes from the regression of c0·yi on c1·xi1, …, ck·xik, i = 1, 2, …, n, are given by β̃0 = c0·β̂0, β̃1 = (c0/c1)·β̂1, …, β̃k = (c0/ck)·β̂k, where the β̃j denote the estimates from the rescaled regression. [Hint: Use the fact that the β̂j solve the first order conditions in (3.13), and the β̃j must solve the first order conditions involving the rescaled dependent and independent variables.]

3  Using the data in RDCHEM, the following equation was obtained by OLS:

    rdintens = 2.613 + .00030 sales − .0000000070 sales²
               (.429)  (.00014)       (.0000000037)
    n = 32, R² = .1484.

(i) At what point does the marginal effect of sales on rdintens become negative?
(ii) Would you keep the quadratic term in the model? Explain.
(iii) Define salesbil as sales measured in billions of dollars: salesbil = sales/1,000. Rewrite the estimated equation with salesbil and salesbil² as the independent variables. Be sure to report standard errors and the R-squared. [Hint: Note that salesbil² = sales²/(1,000)².]
(iv) For the purpose of reporting the results, which equation do you prefer?

4  The following model allows the return to education to depend upon the total amount of both parents' education, called pareduc:

    log(wage) = β0 + β1 educ + β2 educ·pareduc + β3 exper + β4 tenure + u.

(i) Show that, in decimal form, the return to another year of education in this model is Δlog(wage)/Δeduc = β1 + β2 pareduc. What sign do you expect for β2? Why?
(ii) Using the data in WAGE2, the estimated equation is

    log(wage) = 5.65 + .047 educ + .00078 educ·pareduc + .019 exper + .010 tenure
                (.13)  (.010)      (.00021)               (.004)       (.003)
    n = 722, R² = .169.

(Only 722 observations contain full information on parents' education.) Interpret the coefficient on the interaction term. It might help to choose two specific values for pareduc (for example, pareduc = 32 if both parents have a college education, or pareduc = 24 if both parents have a high school education) and to compare the estimated return to educ.
(iii) When pareduc is added as a separate variable to the equation, we get

    log(wage) = 4.94 + .097 educ + .033 pareduc − .0016 educ·pareduc + .020 exper + .010 tenure
                (.38)  (.027)      (.017)          (.0012)              (.004)       (.003)
    n = 722, R² = .174.

Does the estimated return to education now depend positively on parent education? Test the null hypothesis that the return to education does not depend on parent education.

5  In Example 4.2, where the percentage of students receiving a passing score on a tenth-grade math exam (math10) is the dependent variable, does it make sense to include sci11, the percentage of eleventh graders passing a science exam, as an additional explanatory variable?

6  When atndrte² and ACT·atndrte are added to the equation estimated in (6.19), the R-squared becomes .232. Are these additional terms jointly significant at the 10% level? Would you include them in the model?

7  The following three equations were estimated using the 1,534 observations in 401K:

    prate = 80.29 + 5.44 mrate + .269 age − .00013 totemp
            (.78)   (.52)        (.045)     (.00004)
    R² = .100, R̄² = .098.

    prate = 97.32 + 5.02 mrate + .314 age − 2.66 log(totemp)
            (1.95)  (0.51)       (.044)     (.28)
    R² = .144, R̄² = .142.

    prate = 80.62 + 5.34 mrate + .290 age − .00043 totemp + .0000000039 totemp²
            (.78)   (.52)        (.045)     (.00009)        (.0000000010)
    R² = .108, R̄² = .106.

Which of these three models do you prefer? Why?

8  Suppose we want to estimate the effects of alcohol consumption (alcohol) on college grade point average (colGPA). In addition to collecting information on grade point averages and alcohol usage, we also obtain attendance information (say, percentage of lectures attended, called attend). A standardized test score (say, SAT) and high school GPA (hsGPA) are also available.
(i) Should we include attend along with alcohol as explanatory variables in a multiple regression model? (Think about how you would interpret the coefficient on alcohol.)
(ii) Should SAT and hsGPA be included as explanatory variables? Explain.

9  If we start with (6.38) under the CLM assumptions, assume large n, and ignore the estimation error in the β̂j, a 95% prediction interval for y⁰ is [exp(−1.96σ̂)·exp(l̂⁰), exp(1.96σ̂)·exp(l̂⁰)], where l̂⁰ denotes the predicted value of log(y⁰). The point prediction for y⁰ is ŷ⁰ = exp(σ̂²/2)·exp(l̂⁰).
(i) For what values of σ̂ will the point prediction be in the 95% prediction interval? Does this condition seem likely to hold in most applications?
(ii) Verify that the condition from part (i) is satisfied in the CEO salary example.

10  The following two equations were estimated using the data in MEAPSINGLE. The key explanatory variable is lexppp, the log of expenditures per student at the school level.

    math4 = 24.49 + 9.01 lexppp − .422 free − .752 lmedinc − .274 pctsgle
            (59.24) (4.04)        (.071)      (5.358)        (.161)
    n = 229, R² = .472, R̄² = .462.

    math4 = 149.38 + 1.93 lexppp − .060 free − 10.78 lmedinc − .397 pctsgle + .667 read4
            (41.70)  (2.82)        (.054)      (3.76)           (.111)         (.042)
    n = 229, R² = .749, R̄² = .743.

(i) If you are a policy maker trying to estimate the causal effect of per-student spending on math test performance, explain why the first equation is more relevant than the second. What is the estimated effect of a 10% increase in expenditures per student?
(ii) Does adding read4 to the regression have strange effects on coefficients and statistical significance other than that of lexppp?
(iii) How would you explain to someone with only basic knowledge of regression why, in this case, you prefer the equation with the smaller adjusted R-squared?

Computer Exercises

C1  Use the data in KIELMC, only for the year 1981, to answer the following questions. The data are for houses that sold during 1981 in North Andover, Massachusetts; 1981 was the year construction began on a local garbage incinerator.
(i) To study the effects of the incinerator location on housing price, consider the simple regression model

    log(price) = β0 + β1 log(dist) + u,

where price is housing price in dollars and dist is distance from the house to the incinerator measured in feet. Interpreting this equation causally, what sign do you expect for β1 if the presence of the incinerator depresses housing prices? Estimate this equation and interpret the results.
(ii) To the simple regression model in part (i), add the variables log(intst), log(area), log(land), rooms, baths, and age, where intst is distance from the home to the interstate, area is square footage of the house, land is the lot size in square feet, rooms is total number of rooms, baths is number of bathrooms, and age is age of the house in years. Now, what do you conclude about the effects of the incinerator? Explain why (i) and (ii) give conflicting results.
(iii) Add [log(intst)]² to the model from part (ii). Now what happens? What do you conclude about the importance of functional form?
(iv) Is the square of log(dist) significant when you add it to the model from part (iii)?

C2  Use the data in WAGE1 for this exercise.
(i) Use OLS to estimate the equation

    log(wage) = β0 + β1 educ + β2 exper + β3 exper² + u

and report the results using the usual format.
(ii) Is exper² statistically significant at the 1% level?
(iii) Using the approximation %Δwage ≈ 100(β̂2 + 2β̂3 exper)Δexper, find the approximate return to the fifth year of experience. What is the approximate return to the twentieth year of experience?
(iv) At what value of exper does additional experience actually lower predicted log(wage)? How many people have more experience in this sample?

C3  Consider a model where the return to education depends upon the amount of work experience (and vice versa):

    log(wage) = β0 + β1 educ + β2 exper + β3 educ·exper + u.

(i) Show that the return to another year of education (in decimal form), holding exper fixed, is β1 + β3 exper.
(ii) State the null hypothesis that the return to education does not depend on the level of exper. What do you think is the appropriate alternative?
(iii) Use the data in WAGE2 to test the null hypothesis in (ii) against your stated alternative.
(iv) Let θ1 denote the return to education (in decimal form) when exper = 10: θ1 = β1 + 10β3. Obtain θ̂1 and a 95% confidence interval for θ1. [Hint: Write β1 = θ1 − 10β3 and plug this into the equation; then rearrange. This gives the regression for obtaining the confidence interval for θ1.]

C4  Use the data in GPA2 for this exercise.
(i) Estimate the model

    sat = β0 + β1 hsize + β2 hsize² + u,

where hsize is the size of the graduating class (in hundreds), and write the results in the usual form. Is the quadratic term statistically significant?
(ii) Using the estimated equation from part (i), what is the "optimal" high school size? Justify your answer.
(iii) Is this analysis representative of the academic performance of all high school seniors? Explain.
(iv) Find the estimated optimal high school size using log(sat) as the dependent variable. Is it much different from what you obtained in part (ii)?

C5  Use the housing price data in HPRICE1 for this exercise.
(i) Estimate the model

    log(price) = β0 + β1 log(lotsize) + β2 log(sqrft) + β3 bdrms + u

and report the results in the usual OLS format.
(ii) Find the predicted value of log(price), when lotsize = 20,000, sqrft = 2,500, and bdrms = 4. Using the methods in Section 6.4, find the predicted value of price at the same values of the explanatory variables.
(iii) For explaining variation in price, decide whether you prefer the model from part (i) or the model

    price = β0 + β1 lotsize + β2 sqrft + β3 bdrms + u.

C6  Use the data in VOTE1 for this exercise.
(i) Consider a model with an interaction between expenditures:

    voteA = β0 + β1 prtystrA + β2 expendA + β3 expendB + β4 expendA·expendB + u.

What is the partial effect of expendB on voteA, holding prtystrA and expendA fixed? What is the partial effect of expendA on voteA? Is the expected sign for β4 obvious?
(ii) Estimate the equation in part (i) and report the results in the usual form. Is the interaction term statistically significant?
(iii) Find the average of expendA in the sample. Fix expendA at 300 (for $300,000). What is the estimated effect of another $100,000 spent by Candidate B on voteA? Is this a large effect?
(iv) Now fix expendB at 100. What is the estimated effect of ΔexpendA = 100 on voteA? Does this make sense?
(v) Now, estimate a model that replaces the interaction with shareA, Candidate A's percentage share of total campaign expenditures. Does it make sense to hold both expendA and expendB fixed, while changing shareA?
(vi) (Requires calculus.) In the model from part (v), find the partial effect of expendB on voteA, holding prtystrA and expendA fixed. Evaluate this at expendA = 300 and expendB = 0 and comment on the results.

C7  Use the data in ATTEND for this exercise.
(i) In the model of Example 6.3, argue that Δstndfnl/ΔpriGPA ≈ β2 + 2β4 priGPA + β6 atndrte. Use equation (6.19) to estimate the partial effect when priGPA = 2.59 and atndrte = 82. Interpret your estimate.
(ii) Show that the equation can be written as

    stndfnl = θ0 + β1 atndrte + θ2 priGPA + β3 ACT + β4(priGPA − 2.59)² + β5 ACT² + β6 priGPA·(atndrte − 82) + u,

where θ2 = β2 + 2β4(2.59) + β6(82). (Note that the intercept has changed, but this is unimportant.) Use this to obtain the standard error of θ̂2 from part (i).
(iii) Suppose that, in place of priGPA·(atndrte − 82), you put (priGPA − 2.59)·(atndrte − 82). Now how do you interpret the coefficients on atndrte and priGPA?

C8  Use the data in HPRICE1 for this exercise.
(i) Estimate the model

    price = β0 + β1 lotsize + β2 sqrft + β3 bdrms + u

and report the results in the usual form, including the standard error of the regression. Obtain the predicted price when we plug in lotsize = 10,000, sqrft = 2,300, and bdrms = 4; round this price to the nearest dollar.
(ii) Run a regression that allows you to put a 95% confidence interval around the predicted value in part (i). Note that your prediction will differ somewhat due to rounding error.
(iii) Let price⁰ be the unknown future selling price of the house with the characteristics used in parts (i) and (ii). Find a 95% CI for price⁰ and comment on the width of this confidence interval.

C9  The data set NBASAL contains salary information and career statistics for 269 players in the National Basketball Association (NBA).
(i) Estimate a model relating points-per-game (points) to years in the league (exper), age, and years played in college (coll). Include a quadratic in exper; the other variables should appear in level form. Report the results in the usual way.
(ii) Holding college years and age fixed, at what value of experience does the next year of experience actually reduce points-per-game? Does this make sense?
(iii) Why do you think coll has a negative and statistically significant coefficient? (Hint: NBA players can be drafted before finishing their college careers and even directly out of high school.)
(iv) Add a quadratic in age to the equation. Is it needed? What does this appear to imply about the effects of age, once experience and education are controlled for?
(v) Now regress log(wage) on points, exper, exper², age, and coll. Report the results in the usual format.
(vi) Test whether age and coll are jointly significant in the regression from part (v). What does this imply about whether age and education have separate effects on wage, once productivity and seniority are accounted for?

C10  Use the data in BWGHT2 for this exercise.
(i) Estimate the equation

    log(bwght) = β0 + β1 npvis + β2 npvis² + u

by OLS, and report the results in the usual way. Is the quadratic term significant?
(ii) Show that, based on the equation from part (i), the number of prenatal visits that maximizes log(bwght) is estimated to be about 22. How many women had at least 22 prenatal visits in the sample?
(iii) Does it make sense that birth weight is actually predicted to decline after 22 prenatal visits? Explain.
(iv) Add mother's age to the equation, using a quadratic functional form. Holding npvis fixed, at what mother's age is the birth weight of the child maximized? What fraction of women in the sample are older than the "optimal" age?
(v) Would you say that mother's age and number of prenatal visits explain a lot of the variation in log(bwght)?
(vi) Using quadratics for both npvis and age, decide whether using the natural log or the level of bwght is better for predicting bwght.

C11  Use APPLE to verify some of the claims made in Section 6.3.
(i) Run the regression of ecolbs on ecoprc, regprc and report the results in the usual form, including the R-squared and adjusted R-squared. Interpret the coefficients on the price variables and comment on their signs and magnitudes.
(ii) Are the price variables statistically significant? Report the p-values for the individual t tests.
(iii) What is the range of fitted values for ecolbs? What fraction of the sample reports ecolbs = 0? Comment.
(iv) Do you think the price variables together do a good job of explaining variation in ecolbs? Explain.
(v) Add the variables faminc, hhsize (household size), educ, and age to the regression from part (i). Find the p-value for their joint significance. What do you conclude?
(vi) Run separate simple regressions of ecolbs on ecoprc and then ecolbs on regprc. How do the simple regression coefficients compare with the multiple regression from part (i)? Find the correlation coefficient between ecoprc and regprc to help explain your findings.
(vii) Consider a model that adds family income and the quantity demanded for regular apples:

    ecolbs = β0 + β1 ecoprc + β2 regprc + β3 faminc + β4 reglbs + u.

From basic economic theory, which explanatory variable does not belong to the equation? When you drop the variables one at a time, do the sizes of the adjusted R-squareds affect your answer?

C12  Use the subset of 401KSUBS with fsize = 1; this restricts the analysis to single-person households. (See also Computer Exercise C8 in Chapter 4.)
(i) What is the youngest age of people in this sample? How many people are at that age?
(ii) In the model

    nettfa = β0 + β1 inc + β2 age + β3 age² + u,

what is the literal interpretation of β2? By itself, is it of much interest?
(iii) Estimate the model from part (ii) and report the results in standard form. Are you concerned that the coefficient on age is negative? Explain.
(iv) Because the youngest people in the sample are 25, it makes sense to think that, for a given level of income, the lowest average amount of net total financial assets is at age 25. Recall that the partial effect of age on nettfa is β2 + 2β3 age, so the partial effect at age 25 is β2 + 2β3(25) = β2 + 50β3; call this θ2. Find θ̂2 and obtain the two-sided p-value for testing H0: θ2 = 0. You should conclude that θ̂2 is small and very statistically insignificant. [Hint: One way to do this is to estimate the model nettfa = α0 + β1 inc + θ2 age + β3(age − 25)² + u, where the intercept, α0, is different from β0. There are other ways, too.]
(v) Because the evidence against H0: θ2 = 0 is very weak, set it to zero and estimate the model

    nettfa = α0 + β1 inc + β3(age − 25)² + u.

In terms of goodness-of-fit, does this model fit better than that in part (ii)?
(vi) For the estimated equation in part (v), set inc = 30 (roughly, the average value) and graph the relationship between nettfa and age, but only for age ≥ 25. Describe what you see.
(vii) Check to see whether including a quadratic in inc is necessary.

C13  Use the data in MEAP00 to answer this question.
(i) Estimate the model

    math4 = β0 + β1 lexppp + β2 lenroll + β3 lunch + u

by OLS, and report the results in the usual form. Is each explanatory variable statistically significant at the 5% level?
(ii) Obtain the fitted values from the regression in part (i). What is the range of fitted values? How does it compare with the range of the actual data on math4?
(iii) Obtain the residuals from the regression in part (i). What is the building code of the school that has the largest (positive) residual? Provide an interpretation of this residual.
(iv) Add quadratics of all explanatory variables to the equation, and test them for joint significance. Would you leave them in the model?
(v) Returning to the model in part (i), divide the dependent variable and each explanatory variable by its sample standard deviation, and rerun the regression. (Include an intercept, unless you also first subtract the mean from each variable.) In terms of standard deviation units, which explanatory variable has the largest effect on the math pass rate?

C14  Use the data in BENEFITS to answer this question. It is a school-level data set at the K-5 level on average teacher salary and benefits. (See Example 4.10 for background.)
(i) Regress lavgsal on bs and report the results in the usual form. Can you reject H0: βbs = 0 against a two-sided alternative? Can you reject H0: βbs = −1 against H1: βbs > −1? Report the p-values for both tests.
(ii) Define lbs = log(bs). Find the range of values for lbs and find its standard deviation. How do these compare to the range and standard deviation for bs?
(iii) Regress lavgsal on lbs. Does this fit better than the regression from part (i)?
(iv) Estimate the equation

    lavgsal = β0 + β1 bs + β2 lenroll + β3 lstaff + β4 lunch + u

and report the results in the usual form. What happens to the coefficient on bs? Is it now statistically different from zero?
(v) Interpret the coefficient on lstaff. Why do you think it is negative?
(vi) Add lunch² to the equation from part (iv). Is it statistically significant? Compute the turning point (minimum value) in the quadratic, and show that it is within the range of the observed data on lunch. How many values of lunch are higher than the calculated turning point?
(vii) Based on the findings from part (vi), describe how teacher salaries relate to school poverty rates. In terms of teacher salary, and holding other factors fixed, is it better to teach at a school with
lunch = 0 (no poverty), lunch = 50, or lunch = 100 (all kids eligible for the free lunch program)?

Appendix 6A: A Brief Introduction to Bootstrapping

In many cases where formulas for standard errors are hard to obtain mathematically, or where they are thought not to be very good approximations to the true sampling variation of an estimator, we can rely on a resampling method. The general idea is to treat the observed data as a population that we can draw samples from. The most common resampling method is the bootstrap. There are actually several versions of the bootstrap, but the most general, and most easily applied, is called the nonparametric bootstrap, and that is what we describe here.

Suppose we have an estimate, θ̂, of a population parameter, θ. We obtained this estimate, which could be a function of OLS estimates (or of estimates that we cover in later chapters), from a random sample of size n. We would like to obtain a standard error for θ̂ that can be used for constructing t statistics or confidence intervals. Remarkably, we can obtain a valid standard error by computing the estimate from different random samples drawn from the original data.

Implementation is easy. If we list our observations from 1 through n, we draw n numbers randomly, with replacement, from this list. This produces a new data set (of size n) that consists of the original data, but with many observations appearing multiple times (except in the rather unusual case that we resample the original data). Each time we randomly sample from the original data, we can estimate θ using the same procedure that we used on the original data. Let θ̂⁽ᵇ⁾ denote the estimate from bootstrap sample b. Now, if we repeat the resampling and estimation m times, we have m new estimates, {θ̂⁽ᵇ⁾: b = 1, 2, …, m}. The bootstrap standard error of θ̂ is just the sample standard deviation of the θ̂⁽ᵇ⁾, namely,

$$bse(\hat\theta) = \Big[(m-1)^{-1}\sum_{b=1}^{m}\big(\hat\theta^{(b)} - \bar\theta\big)^2\Big]^{1/2}, \qquad (6.50)$$

where θ̄ is the average of the bootstrap estimates.

If obtaining an estimate of θ on a sample of size n requires little computational time, as in the case of OLS (and all the other estimators we encounter in this text), we can afford to choose m, the number of bootstrap replications, to be large. A typical value is m = 1,000, but even m = 500, or a somewhat smaller value, can produce a reliable standard error. Note that the size of m (the number of times we resample the original data) has nothing to do with the sample size, n. (For certain estimation problems beyond the scope of this text, a large n can force one to do fewer bootstrap replications.) Many statistics and econometrics packages have built-in bootstrap commands, and this makes the calculation of bootstrap standard errors simple, especially compared with the work often required to obtain an analytical formula for an asymptotic standard error.

One can actually do better in most cases by using the bootstrap samples to compute p-values for t statistics and F statistics, or for obtaining confidence intervals, rather than obtaining a bootstrap standard error to be used in the construction of t statistics or confidence intervals. See Horowitz (2001) for a comprehensive treatment.
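Equation (6.50) is simple to implement directly. The sketch below (Python with NumPy; the data set, the estimator, and m are placeholders to be supplied by the user) resamples rows with replacement and returns the bootstrap standard error. Because rows are resampled as whole observations, the same function works for any scalar estimator computed from the sample, including a single OLS coefficient.

import numpy as np

rng = np.random.default_rng(0)

def bootstrap_se(data, estimator, m=1000):
    """Nonparametric bootstrap standard error of a scalar estimator.

    data: (n, k) array of observations (rows are resampled as units).
    estimator: function mapping a resampled data set to an estimate.
    Implements equation (6.50).
    """
    n = data.shape[0]
    thetas = np.empty(m)
    for b in range(m):
        idx = rng.integers(0, n, size=n)   # draw n rows with replacement
        thetas[b] = estimator(data[idx])
    return np.sqrt(((thetas - thetas.mean()) ** 2).sum() / (m - 1))

# Example: standard error of a sample median, an estimator without a
# simple analytical formula (simulated data, for illustration only).
x = rng.standard_normal((500, 1))
print(bootstrap_se(x, np.median))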
Chapter 7

Multiple Regression Analysis with Qualitative Information: Binary (or Dummy) Variables

In previous chapters, the dependent and independent variables in our multiple regression models have had quantitative meaning. Just a few examples include hourly wage rate, years of education, college grade point average, amount of air pollution, level of firm sales, and number of arrests. In each case, the magnitude of the variable conveys useful information. In empirical work, we must also incorporate qualitative factors into regression models. The gender or race of an individual, the industry of a firm (manufacturing, retail, and so on), and the region in the United States where a city is located (South, North, West, and so on) are all considered to be qualitative factors.

Most of this chapter is dedicated to qualitative independent variables. After we discuss the appropriate ways to describe qualitative information in Section 7.1, we show how qualitative explanatory variables can be easily incorporated into multiple regression models in Sections 7.2, 7.3, and 7.4. These sections cover almost all of the popular ways that qualitative independent variables are used in cross-sectional regression analysis.

In Section 7.5, we discuss a binary dependent variable, which is a particular kind of qualitative dependent variable. The multiple regression model has an interesting interpretation in this case and is called the linear probability model. While much maligned by some econometricians, the simplicity of the linear probability model makes it useful in many empirical contexts. We will describe its drawbacks in Section 7.5, but they are often secondary in empirical work.

7.1 Describing Qualitative Information

Qualitative factors often come in the form of binary information: a person is female or male; a person does or does not own a personal computer; a firm offers a certain kind of employee pension plan or it does not; a state administers capital punishment or it does not. In all of these examples, the relevant information can be captured by defining a binary variable or a zero-one variable. In econometrics, binary variables are most commonly called dummy variables, although this name is not especially descriptive.

In defining a dummy variable, we must decide which event is assigned the value one and which is assigned the value zero. For example, in a study of individual wage determination, we might define female to be a binary variable taking on the value one for females and the value zero for males. The name in this case indicates the event with the value one. The same information is captured by defining male to be one if the person is male and zero if the person is female. Either of these is better than using gender, because this name does not make it clear when the dummy variable is one: does gender = 1 correspond to male or female? What we call our variables is unimportant for getting regression results, but it always helps to choose names that clarify equations and expositions.

[Exploring Further 7.1  Suppose that, in a study comparing election outcomes between Democratic and Republican candidates, you wish to indicate the party of each candidate. Is a name such as party a wise choice for a binary variable in this case? What would be a better name?]

Suppose in the wage example that we have chosen the name female to indicate gender. Further, we define a binary variable married to equal one if a person is married and zero otherwise. Table 7.1 gives a partial listing of a wage data set that might result. We see that Person 1 is female and not married, Person 2 is female and married, Person 3 is male and not married, and so on.

Table 7.1  A Partial Listing of the Data in WAGE1

    person   wage    educ   exper   female   married
       1      3.10    11      2       1        0
       2      3.24    12     22       1        1
       3      3.00    11      2       0        0
       4      6.00     8     44       0        1
       5      5.30    12      7       0        1
       …       …       …      …       …        …
     525     11.56    16      5       0        1
     526      3.50    14      5       1        0

Why do we use the values zero and one to describe qualitative information? In a sense, these values are arbitrary: any two different values would do. The real benefit of capturing qualitative information using zero-one variables is that it leads to regression models where the parameters have very natural interpretations, as we will see now.
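In practice, qualitative information often arrives as strings or category labels rather than as zeros and ones. A small sketch (Python with pandas; the column names gender and marstat are hypothetical, since WAGE1 already stores female and married as 0/1 variables) shows one way to construct the dummies just described.

import pandas as pd

# Hypothetical raw data with qualitative information stored as strings
df = pd.DataFrame({"gender": ["female", "female", "male", "male"],
                   "marstat": ["single", "married", "single", "married"]})

# Define zero-one variables; the names indicate the event coded as one
df["female"] = (df["gender"] == "female").astype(int)
df["married"] = (df["marstat"] == "married").astype(int)
print(df)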
7.2 A Single Dummy Independent Variable

How do we incorporate binary information into regression models? In the simplest case, with only a single dummy explanatory variable, we just add it as an independent variable in the equation. For example, consider the following simple model of hourly wage determination:

    wage = β0 + δ0 female + β1 educ + u.   (7.1)

We use δ0 as the parameter on female in order to highlight the interpretation of the parameters multiplying dummy variables; later, we will use whatever notation is most convenient.

In model (7.1), only two observed factors affect wage: gender and education. Because female = 1 when the person is female, and female = 0 when the person is male, the parameter δ0 has the following interpretation: δ0 is the difference in hourly wage between females and males, given the same amount of education (and the same error term u). Thus, the coefficient δ0 determines whether there is discrimination against women: if δ0 < 0, then, for the same level of other factors, women earn less than men on average.

In terms of expectations, if we assume the zero conditional mean assumption E(u | female, educ) = 0, then

    δ0 = E(wage | female = 1, educ) − E(wage | female = 0, educ).

Because female = 1 corresponds to females and female = 0 corresponds to males, we can write this more simply as

    δ0 = E(wage | female, educ) − E(wage | male, educ).   (7.2)

The key here is that the level of education is the same in both expectations; the difference, δ0, is due to gender only.
The situation can be depicted graphically as an intercept shift between males and females. In Figure 7.1, the case δ0 < 0 is shown, so that men earn a fixed amount more per hour than women. The difference does not depend on the amount of education, and this explains why the wage-education profiles for women and men are parallel.

[Figure 7.1  Graph of wage = β0 + δ0 female + β1 educ for δ0 < 0: two parallel lines with common slope β1, wage = β0 + β1 educ for men and wage = (β0 + δ0) + β1 educ for women.]

At this point, you may wonder why we do not also include in (7.1) a dummy variable, say male, which is one for males and zero for females. This would be redundant. In (7.1), the intercept for males is β0, and the intercept for females is β0 + δ0. Because there are just two groups, we only need two different intercepts. This means that, in addition to β0, we need to use only one dummy variable; we have chosen to include the dummy variable for females. Using two dummy variables would introduce perfect collinearity, because female + male = 1, which means that male is a perfect linear function of female. Including dummy variables for both genders is the simplest example of the so-called dummy variable trap, which arises when too many dummy variables describe a given number of groups. We will discuss this problem in detail later.

In (7.1), we have chosen males to be the base group or benchmark group, that is, the group against which comparisons are made. This is why β0 is the intercept for males, and δ0 is the difference in intercepts between females and males. We could choose females as the base group by writing the model as

    wage = α0 + γ0 male + β1 educ + u,

where the intercept for females is α0 and the intercept for males is α0 + γ0; this implies that α0 = β0 + δ0 and α0 + γ0 = β0. In any application, it does not matter how we choose the base group, but it is important to keep track of which group is the base group.

Some researchers prefer to drop the overall intercept in the model and to include dummy variables for each group. The equation would then be wage = β0 male + α0 female + β1 educ + u, where the intercept for men is β0 and the intercept for women is α0. There is no dummy variable trap in this case, because we do not have an overall intercept. However, this formulation has little to offer, since testing for a difference in the intercepts is more difficult, and there is no generally agreed upon way to compute R-squared in regressions without an intercept. Therefore, we will always include an overall intercept for the base group.

Nothing much changes when more explanatory variables are involved. Taking males as the base group, a model that controls for experience and tenure, in addition to education, is

    wage = β0 + δ0 female + β1 educ + β2 exper + β3 tenure + u.   (7.3)

If educ, exper, and tenure are all relevant productivity characteristics, the null hypothesis of no difference between men and women is H0: δ0 = 0. The alternative that there is discrimination against women is H1: δ0 < 0.

How can we actually test for wage discrimination? The answer is simple: just estimate the model by OLS, exactly as before, and use the usual t statistic. Nothing changes about the mechanics of OLS or the statistical theory when some of the independent variables are defined as dummy variables. The only difference with what we have done up until now is in the interpretation of the coefficient on the dummy variable.
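The dummy variable trap is easy to see numerically. In the sketch below (Python with NumPy; a small made-up indicator, for illustration only), adding both female and male alongside an intercept produces a design matrix with linearly dependent columns, so X'X cannot be inverted.

import numpy as np

female = np.array([1, 0, 0, 1, 1, 0, 1, 0])  # illustrative 0/1 indicator
male = 1 - female                            # the redundant second dummy

# Columns: intercept, female, male; note that intercept = female + male
X = np.column_stack([np.ones(len(female)), female, male])

print(np.linalg.matrix_rank(X))   # prints 2, not 3: perfect collinearity
try:
    np.linalg.inv(X.T @ X)
except np.linalg.LinAlgError as err:
    print("X'X is singular:", err)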
Example 7.1  Hourly Wage Equation

Using the data in WAGE1, we estimate model (7.3). For now, we use wage, rather than log(wage), as the dependent variable:

    wage = −1.57 − 1.81 female + .572 educ + .025 exper + .141 tenure
           (.72)   (.26)          (.049)      (.012)       (.021)
    n = 526, R² = .364.   (7.4)

The negative intercept (the intercept for men, in this case) is not very meaningful, because no one has zero values for all of educ, exper, and tenure in the sample. The coefficient on female is interesting, because it measures the average difference in hourly wage between a man and a woman who have the same levels of educ, exper, and tenure. If we take a woman and a man with the same levels of education, experience, and tenure, the woman earns, on average, $1.81 less per hour than the man. (Recall that these are 1976 wages.)

It is important to remember that, because we have performed multiple regression and controlled for educ, exper, and tenure, the $1.81 wage differential cannot be explained by different average levels of education, experience, or tenure between men and women. We can conclude that the differential of $1.81 is due to gender, or factors associated with gender, that we have not controlled for in the regression. (In 2013 dollars, the wage differential is about 4.09($1.81) ≈ $7.40.)

It is informative to compare the coefficient on female in equation (7.4) to the estimate we get when all other explanatory variables are dropped from the equation:

    wage = 7.10 − 2.51 female
           (.21)  (.30)
    n = 526, R² = .116.   (7.5)

The coefficients in (7.5) have a simple interpretation. The intercept is the average wage for men in the sample (set female = 0), so men earn $7.10 per hour on average. The coefficient on female is the difference in the average wage between women and men. Thus, the average wage for women in the sample is 7.10 − 2.51 = 4.59, or $4.59 per hour. (Incidentally, there are 274 men and 252 women in the sample.)

Equation (7.5) provides a simple way to carry out a comparison-of-means test between the two groups, which in this case are men and women. The estimated difference, −2.51, has a t statistic of −8.37, which is very statistically significant (and, of course, $2.51 is economically large as well). Generally, simple regression on a constant and a dummy variable is a straightforward way to compare the means of two groups. For the usual t test to be valid, we must assume that the homoskedasticity assumption holds, which means that the population variance in wages for men is the same as that for women.

The estimated wage differential between men and women is larger in (7.5) than in (7.4) because (7.5) does not control for differences in education, experience, and tenure, and these are lower, on average, for women than for men in this sample. Equation (7.4) gives a more reliable estimate of the ceteris paribus gender wage gap; it still indicates a very large differential.
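Both regressions in Example 7.1 take one line each in most packages. A sketch (Python with statsmodels, again assuming WAGE1 is available as a DataFrame, here loaded via the third-party wooldridge package used earlier) also confirms that equation (7.5) reproduces the comparison of group means.

import statsmodels.formula.api as smf
import wooldridge

df = wooldridge.data("wage1")

# Equation (7.4): the gender differential controlling for productivity
res_74 = smf.ols("wage ~ female + educ + exper + tenure", data=df).fit()
print(res_74.params["female"])              # about -1.81

# Equation (7.5): intercept = mean wage for men; coefficient = difference
res_75 = smf.ols("wage ~ female", data=df).fit()
print(res_75.params)
print(df.groupby("female")["wage"].mean())  # matches 7.10 and 4.59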
large differential In many cases dummy independent variables reflect choices of individuals or other economic units as opposed to something predetermined such as gender In such situations the matter of cau sality is again a central issue In the following example we would like to know whether personal computer ownership causes a higher college grade point average ExamplE 72 Effects of Computer Ownership on College Gpa In order to determine the effects of computer ownership on college grade point average we estimate the model colGPA 5 b0 1 d0 PC 1 b1hsGPA 1 b2 ACT 1 u where the dummy variable PC equals one if a student owns a personal computer and zero otherwise There are various reasons PC ownership might have an effect on colGPA A students schoolwork might be of higher quality if it is done on a computer and time can be saved by not having to wait at a computer lab Of course a student might be more inclined to play computer games or surf the Internet if he or she owns a PC so it is not obvious that d0 is positive The variables hsGPA high school GPA and ACT achievement test score are used as controls it could be that stronger students as measured Copyright 2016 Cengage Learning All Rights Reserved May not be copied scanned or duplicated in whole or in part Due to electronic rights some third party content may be suppressed from the eBook andor eChapters Editorial review has deemed that any suppressed content does not materially affect the overall learning experience Cengage Learning reserves the right to remove additional content at any time if subsequent rights restrictions require it PART 1 Regression Analysis with CrossSectional Data 210 by high school GPA and ACT scores are more likely to own computers We control for these factors because we would like to know the average effect on colGPA if a student is picked at random and given a personal computer Using the data in GPA1 we obtain colGPA 5 126 1 157 PC 1 447 hsGPA 1 0087 ACT 1332 10572 10942 101052 76 n 5 141 R2 5 219 This equation implies that a student who owns a PC has a predicted GPA about 16 points higher than a comparable student without a PC remember both colGPA and hsGPA are on a fourpoint scale The effect is also very statistically significant with tPC 5 157057 275 What happens if we drop hsGPA and ACT from the equation Clearly dropping the latter vari able should have very little effect as its coefficient and t statistic are very small But hsGPA is very significant and so dropping it could affect the estimate of bPC Regressing colGPA on PC gives an estimate on PC equal to about 170 with a standard error of 063 in this case b PC and its t statistic do not change by much In the exercises at the end of the chapter you will be asked to control for other factors in the equation to see if the computer ownership effect disappears or if it at least gets notably smaller Each of the previous examples can be viewed as having relevance for policy analysis In the first example we were interested in gender discrimination in the workforce In the second example we were concerned with the effect of computer ownership on college performance A special case of policy analysis is program evaluation where we would like to know the effect of economic or social programs on individuals firms neighborhoods cities and so on In the simplest case there are two groups of subjects The control group does not participate in the program The experimental group or treatment group does take part in the program These names come from literature in the 
experimental sciences and they should not be taken literally Except in rare cases the choice of the control and treatment groups is not random However in some cases multiple regression analysis can be used to control for enough other factors in order to estimate the causal effect of the program ExamplE 73 Effects of Training Grants on Hours of Training Using the 1988 data for Michigan manufacturing firms in JTRAIN we obtain the following estimated equation hrsemp 5 4667 1 2625 grant 2 98 log1sales2 2 607 log1employ2 143412 15592 13542 13882 77 n 5 105 R2 5 237 The dependent variable is hours of training per employee at the firm level The variable grant is a dummy variable equal to one if the firm received a job training grant for 1988 and zero otherwise The variables sales and employ represent annual sales and number of employees respectively We cannot enter hrsemp in logarithmic form because hrsemp is zero for 29 of the 105 firms used in the regression The variable grant is very statistically significant with tgrant 5 470 Controlling for sales and employment firms that received a grant trained each worker on average 2625 hours more Because Copyright 2016 Cengage Learning All Rights Reserved May not be copied scanned or duplicated in whole or in part Due to electronic rights some third party content may be suppressed from the eBook andor eChapters Editorial review has deemed that any suppressed content does not materially affect the overall learning experience Cengage Learning reserves the right to remove additional content at any time if subsequent rights restrictions require it CHAPTeR 7 Multiple Regression Analysis with Qualitative Information 211 the average number of hours of per worker training in the sample is about 17 with a maximum value of 164 grant has a large effect on training as is expected The coefficient on logsales is small and very insignificant The coefficient on logemploy means that if a firm is 10 larger it trains its workers about 61 hour less Its t statistic is 2156 which is only marginally statistically significant As with any other independent variable we should ask whether the measured effect of a qualita tive variable is causal In equation 77 is the difference in training between firms that receive grants and those that do not due to the grant or is grant receipt simply an indicator of something else It might be that the firms receiving grants would have on average trained their workers more even in the absence of a grant Nothing in this analysis tells us whether we have estimated a causal effect we must know how the firms receiving grants were determined We can only hope we have controlled for as many factors as possible that might be related to whether a firm received a grant and to its levels of training We will return to policy analysis with dummy variables in Section 76 as well as in later chapters 72a Interpreting Coefficients on Dummy Explanatory Variables When the Dependent Variable Is logy A common specification in applied work has the dependent variable appearing in logarithmic form with one or more dummy variables appearing as independent variables How do we interpret the dummy variable coefficients in this case Not surprisingly the coefficients have a percentage interpretation ExamplE 74 Housing price Regression Using the data in HPRICE1 we obtain the equation log1price2 5 2135 1 168 log1lotsize2 1 707 log1sqrft2 1652 10382 10932 1 027 bdrms 1 054 colonial 78 10292 10452 n 5 88 R2 5 649 All the variables are selfexplanatory except colonial which is a 
binary variable equal to one if the house is of the colonial style What does the coefficient on colonial mean For given levels of lotsize sqrft and bdrms the difference in log1price2 between a house of colonial style and that of another style is 054 This means that a colonialstyle house is predicted to sell for about 54 more holding other factors fixed This example shows that when logy is the dependent variable in a model the coefficient on a dummy variable when multiplied by 100 is interpreted as the percentage difference in y holding all other factors fixed When the coefficient on a dummy variable suggests a large proportionate change in y the exact percentage difference can be obtained exactly as with the semielasticity calculation in Section 62 Copyright 2016 Cengage Learning All Rights Reserved May not be copied scanned or duplicated in whole or in part Due to electronic rights some third party content may be suppressed from the eBook andor eChapters Editorial review has deemed that any suppressed content does not materially affect the overall learning experience Cengage Learning reserves the right to remove additional content at any time if subsequent rights restrictions require it PART 1 Regression Analysis with CrossSectional Data 212 ExamplE 75 log Hourly Wage Equation Let us reestimate the wage equation from Example 71 using logwage as the dependent variable and adding quadratics in exper and tenure log1wage2 5 417 2 297 female 1 080 educ 1 029 exper 10992 10362 10072 10052 2 00058 exper2 1 032 tenure 2 00059 tenure2 79 1000102 10072 1000232 n 5 526 R2 5 441 Using the same approximation as in Example 74 the coefficient on female implies that for the same levels of educ exper and tenure women earn about 100297 5 297 less than men We can do better than this by computing the exact percentage difference in predicted wages What we want is the proportionate difference in wages between females and males holding other factors fixed 1wageF 2 wageM2wageM What we have from 79 is log1wageF2 2 log1wageM2 5 2297 Exponentiating and subtracting one gives 1wageF 2 wageM2 wageM 5 exp122972 2 1 2257 This more accurate estimate implies that a womans wage is on average 257 below a comparable mans wage If we had made the same correction in Example 74 we would have obtained exp10542 2 1 0555 or about 56 The correction has a smaller effect in Example 74 than in the wage example because the magnitude of the coefficient on the dummy variable is much smaller in 78 than in 79 Generally if b 1 is the coefficient on a dummy variable say x1 when logy is the dependent vari able the exact percentage difference in the predicted y when x1 5 1 versus when x1 5 0 is 100 3exp1b 12 2 14 710 The estimate b 1 can be positive or negative and it is important to preserve its sign in computing 710 The logarithmic approximation has the advantage of providing an estimate between the mag nitudes obtained by using each group as the base group In particular although equation 710 gives us a better estimate than 100 b 1 of the percentage by which y for x1 5 1 is greater than y for x1 5 0 710 is not a good estimate if we switch the base group In Example 75 we can estimate the percentage by which a mans wage exceeds a comparable womans wage and this estimate is 100 3exp12b 12 2 14 5 100 3exp12972 2 14 346 The approximation based on 100 b 1 297 is between 257 and 346 and close to the middle Therefore it makes sense to report that the differ ence in predicted wages between men and women is about 297 without having to take a stand on 
which is the base group 73 Using Dummy Variables for Multiple Categories We can use several dummy independent variables in the same equation For example we could add the dummy variable married to equation 79 The coefficient on married gives the approximate proportional differential in wages between those who are and are not married holding gender educ exper and tenure fixed When we estimate this model the coefficient on married with standard error Copyright 2016 Cengage Learning All Rights Reserved May not be copied scanned or duplicated in whole or in part Due to electronic rights some third party content may be suppressed from the eBook andor eChapters Editorial review has deemed that any suppressed content does not materially affect the overall learning experience Cengage Learning reserves the right to remove additional content at any time if subsequent rights restrictions require it CHAPTeR 7 Multiple Regression Analysis with Qualitative Information 213 in parentheses is 053 041 and the coefficient on female becomes 229010362 Thus the mar riage premium is estimated to be about 53 but it is not statistically different from zero 1t 5 1292 An important limitation of this model is that the marriage premium is assumed to be the same for men and women this is relaxed in the following example ExamplE 76 log Hourly Wage Equation Let us estimate a model that allows for wage differences among four groups married men married women single men and single women To do this we must select a base group we choose single men Then we must define dummy variables for each of the remaining groups Call these marrmale marrfem and singfem Putting these three variables into 79 and of course dropping female since it is now redundant gives log1wage2 5 321 1 213 marrmale 2 198 marrfem 11002 10552 10582 2 110 singfem 1 079 educ 1 027 exper 2 00054 exper2 10562 10072 10052 1000112 711 1 029 tenure 2 00053 tenure2 10072 1000232 n 5 526 R2 5 461 All of the coefficients with the exception of singfem have t statistics well above two in absolute value The t statistic for singfem is about 2196 which is just significant at the 5 level against a twosided alternative To interpret the coefficients on the dummy variables we must remember that the base group is single males Thus the estimates on the three dummy variables measure the proportionate difference in wage relative to single males For example married men are estimated to earn about 213 more than single men holding levels of education experience and tenure fixed The more precise estimate from 710 is about 237 A married woman on the other hand earns a predicted 198 less than a single man with the same levels of the other variables Because the base group is represented by the intercept in 711 we have included dummy vari ables for only three of the four groups If we were to add a dummy variable for single males to 711 we would fall into the dummy variable trap by introducing perfect collinearity Some regression pack ages will automatically correct this mistake for you while others will just tell you there is perfect collinearity It is best to carefully specify the dummy variables because then we are forced to properly interpret the final model Even though single men is the base group in 711 we can use this equation to obtain the esti mated difference between any two groups Because the overall intercept is common to all groups we can ignore that in finding differences Thus the estimated proportionate difference between single and married women is 2110 2 121982 5 088 which 
means that single women earn about 88 more than married women Unfortunately we cannot use equation 711 for testing whether the esti mated difference between single and married women is statistically significant Knowing the standard errors on marrfem and singfem is not enough to carry out the test see Section 44 The easiest thing to do is to choose one of these groups to be the base group and to reestimate the equation Nothing substantive changes but we get the needed estimate and its standard error directly When we use mar ried women as the base group we obtain log1wage2 5 123 1 411 marrmale 1 198 singmale 1 088 singfem 1 p 11062 10562 10582 10522 Copyright 2016 Cengage Learning All Rights Reserved May not be copied scanned or duplicated in whole or in part Due to electronic rights some third party content may be suppressed from the eBook andor eChapters Editorial review has deemed that any suppressed content does not materially affect the overall learning experience Cengage Learning reserves the right to remove additional content at any time if subsequent rights restrictions require it PART 1 Regression Analysis with CrossSectional Data 214 where of course none of the unreported coefficients or standard errors have changed The estimate on singfem is as expected 088 Now we have a standard error to go along with this estimate The t statistic for the null that there is no difference in the population between married and single women is tsingfem 5 088052 169 This is marginal evidence against the null hypothesis We also see that the estimated difference between married men and married women is very statistically significant 1tmarrmale 5 7342 The previous example illustrates a general principle for including dummy variables to indicate different groups if the regression model is to have different intercepts for say g groups or catego ries we need to include g 2 1 dummy variables in the model along with an intercept The inter cept for the base group is the overall intercept in the model and the dummy variable coefficient for a particular group represents the estimated differ ence in intercepts between that group and the base group Including g dummy variables along with an intercept will result in the dummy variable trap An alternative is to include g dummy variables and to exclude an overall intercept Including g dummies without an overall intercept is sometimes useful but it has two practical drawbacks First it makes it more cumbersome to test for differences relative to a base group Second regression packages usually change the way Rsquared is computed when an overall intercept is not included In particular in the formula R2 5 1 2 SSRSST the total sum of squares SST is replaced with a total sum of squares that does not center yi about its mean say SST0 5 g n i51y2 i The resulting Rsquared say R2 0 5 1 2 SSRSST0 is sometimes called the uncentered Rsquared Unfortunately R2 0 is rarely suitable as a goodnessof fit measure It is always true that SST0 SST with equality only if y 5 0 Often SST0 is much larger that SST which means that R2 0 is much larger than R2 For example if in the previous example we regress logwage on marrmale singmale marrfem singfem and the other explanatory variables without an interceptthe reported Rsquared from Stata which is R2 0 is 948 This high Rsquared is an artifact of not centering the total sum of squares in the calculation The correct Rsquared is given in equation 711 as 461 Some regression packages including Stata have an option to force calcu lation of the 
The previous example illustrates a general principle for including dummy variables to indicate different groups: if the regression model is to have different intercepts for, say, g groups or categories, we need to include g − 1 dummy variables in the model along with an intercept. The intercept for the base group is the overall intercept in the model, and the dummy variable coefficient for a particular group represents the estimated difference in intercepts between that group and the base group. Including g dummy variables along with an intercept will result in the dummy variable trap. An alternative is to include g dummy variables and to exclude an overall intercept.

[Exploring Further 7.2: In the baseball salary data found in MLB1, players are given one of six positions: frstbase, scndbase, thrdbase, shrtstop, outfield, or catcher. To allow for salary differentials across position, with outfielders as the base group, which dummy variables would you include as independent variables?]

Including g dummies without an overall intercept is sometimes useful, but it has two practical drawbacks. First, it makes it more cumbersome to test for differences relative to a base group. Second, regression packages usually change the way R-squared is computed when an overall intercept is not included. In particular, in the formula R² = 1 − SSR/SST, the total sum of squares, SST, is replaced with a total sum of squares that does not center the yᵢ about their mean, say SST₀ = Σᵢ yᵢ². The resulting R-squared, say R₀² = 1 − SSR/SST₀, is sometimes called the uncentered R-squared. Unfortunately, R₀² is rarely suitable as a goodness-of-fit measure. It is always true that SST₀ ≥ SST, with equality only if ȳ = 0. Often, SST₀ is much larger than SST, which means that R₀² is much larger than R². For example, if in the previous example we regress log(wage) on marrmale, singmale, marrfem, singfem, and the other explanatory variables, without an intercept, the reported R-squared from Stata, which is R₀², is .948. This high R-squared is an artifact of not centering the total sum of squares in the calculation. The correct R-squared is given in equation (7.11) as .461. Some regression packages, including Stata, have an option to force calculation of the centered R-squared even though an overall intercept has not been included, and using this option is generally a good idea. In the vast majority of cases, any R-squared based on comparing an SSR and SST should have the SST computed by centering the yᵢ about ȳ. We can think of this SST as the sum of squared residuals obtained if we just use the sample average, ȳ, to predict each yᵢ. Surely we are setting the bar pretty low for any model if all we measure is its fit relative to using a constant predictor. For a model without an intercept that fits poorly, it is possible that SSR > SST, which means R² would be negative. The uncentered R-squared will always be between zero and one, which likely explains why it is usually the default when an intercept is not estimated in regression models.

7.3a Incorporating Ordinal Information by Using Dummy Variables

Suppose that we would like to estimate the effect of city credit ratings on the municipal bond interest rate (MBR). Several financial companies, such as Moody's Investors Service and Standard & Poor's, rate the quality of debt for local governments, where the ratings depend on things like probability of default. (Local governments prefer lower interest rates in order to reduce their costs of borrowing.) For simplicity, suppose that rankings take on the integer values {0, 1, 2, 3, 4}, with zero being the worst credit rating and four being the best. This is an example of an ordinal variable. Call this variable CR for concreteness. The question we need to address is: How do we incorporate the variable CR into a model to explain MBR?

One possibility is to just include CR as we would include any other explanatory variable:

MBR = β₀ + β₁CR + other factors,

where we do not explicitly show what other factors are in the model. Then β₁ is the percentage point change in MBR when CR increases by one unit, holding other factors fixed. Unfortunately, it is rather hard to interpret a one-unit increase in CR. We know the quantitative meaning of another year of education, or another dollar spent per student, but things like credit ratings typically have only ordinal meaning. We know that a CR of four is better than a CR of three, but is the difference between four and three the same as the difference between one and zero? If not, then it might not make sense to assume that a one-unit increase in CR has a constant effect on MBR.

A better approach, which we can implement because CR takes on relatively few values, is to define dummy variables for each value of CR. Thus, let CR1 = 1 if CR = 1, and CR1 = 0 otherwise; CR2 = 1 if CR = 2, and CR2 = 0 otherwise; and so on. Effectively, we take the single credit rating and turn it into five categories. Then, we can estimate the model

MBR = β₀ + δ₁CR1 + δ₂CR2 + δ₃CR3 + δ₄CR4 + other factors.        (7.12)
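Turning an ordinal variable into a set of dummies is a one-liner in most software. A minimal sketch with simulated ratings (the MBR application in the text is illustrative, so no data file is assumed):

    import numpy as np
    import pandas as pd

    rng = np.random.default_rng(0)
    CR = rng.integers(0, 5, size=200)   # ordinal rating taking values 0, 1, 2, 3, 4

    # One dummy per category; dropping CR_0 makes a rating of zero the base group
    dummies = pd.get_dummies(CR, prefix='CR').drop(columns='CR_0')
    print(dummies.head())
    # The columns CR_1, ..., CR_4 enter the regression exactly as in equation (7.12)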
Following our rule for including dummy variables in a model, we include four dummy variables because we have five categories. The omitted category here is a credit rating of zero, and so it is the base group. (This is why we do not need to define a dummy variable for this category.) The coefficients are easy to interpret: δ₁ is the difference in MBR (other factors fixed) between a municipality with a credit rating of one and a municipality with a credit rating of zero; δ₂ is the difference in MBR between a municipality with a credit rating of two and a municipality with a credit rating of zero; and so on. The movement between each credit rating is allowed to have a different effect, so using (7.12) is much more flexible than simply putting CR in as a single variable. Once the dummy variables are defined, estimating (7.12) is straightforward.

[Exploring Further 7.3: In model (7.12), how would you test the null hypothesis that credit rating has no effect on MBR?]

Equation (7.12) contains the model with a constant partial effect as a special case. One way to write the three restrictions that imply a constant partial effect is δ₂ = 2δ₁, δ₃ = 3δ₁, and δ₄ = 4δ₁. When we plug these into equation (7.12) and rearrange, we get

MBR = β₀ + δ₁(CR1 + 2CR2 + 3CR3 + 4CR4) + other factors.

Now, the term multiplying δ₁ is simply the original credit rating variable, CR. To obtain the F statistic for testing the constant partial effect restrictions, we obtain the unrestricted R-squared from (7.12) and the restricted R-squared from the regression of MBR on CR and the other factors we have controlled for. The F statistic is obtained as in equation (4.41) with q = 3.
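The R-squared form of that F statistic is simple to code once the two fits are in hand. A small helper, with the p-value from scipy; the names res_ur and res_r in the comment are placeholders for the unrestricted and restricted statsmodels fits:

    import scipy.stats as st

    def f_test_rsq(r2_ur, r2_r, q, df_ur):
        """R-squared form of the F statistic, as in equation (4.41)."""
        F = ((r2_ur - r2_r) / q) / ((1 - r2_ur) / df_ur)
        return F, st.f.sf(F, q, df_ur)

    # With a fit res_ur for (7.12) and a fit res_r with CR entered linearly,
    # the test of the three restrictions would be:
    # F, p = f_test_rsq(res_ur.rsquared, res_r.rsquared, q=3, df_ur=res_ur.df_resid)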
Example 7.7 Effects of Physical Attractiveness on Wage

Hamermesh and Biddle (1994) used measures of physical attractiveness in a wage equation. (The file BEAUTY contains fewer variables but more observations than used by Hamermesh and Biddle. See Computer Exercise C12.) Each person in the sample was ranked by an interviewer for physical attractiveness, using five categories (homely, quite plain, average, good looking, and strikingly beautiful or handsome). Because there are so few people at the two extremes, the authors put people into one of three groups for the regression analysis: average, below average, and above average, where the base group is average. Using data from the 1977 Quality of Employment Survey, after controlling for the usual productivity characteristics, Hamermesh and Biddle estimated an equation for men,

log(wage) = β̂₀ − .164 belavg + .016 abvavg + other factors
(se: .046, .033)
n = 700, R² = .403,

and an equation for women,

log(wage) = β̂₀ − .124 belavg + .035 abvavg + other factors
(se: .066, .049)
n = 409, R² = .330.

The other factors controlled for in the regressions include education, experience, tenure, marital status, and race; see Table 3 in Hamermesh and Biddle's paper for a more complete list. In order to save space, the coefficients on the other variables are not reported in the paper, and neither is the intercept.

For men, those with below average looks are estimated to earn about 16.4% less than an average-looking man who is the same in other respects (including education, experience, tenure, marital status, and race). The effect is statistically different from zero, with t = −3.57. Men with above average looks are estimated to earn only 1.6% more than men with average looks, and the effect is not statistically significant (t ≈ .5). A woman with below average looks earns about 12.4% less than an otherwise comparable average-looking woman, with t = −1.88. As was the case for men, the estimate on abvavg is much smaller in magnitude and not statistically different from zero.

In related work, Biddle and Hamermesh (1998) revisit the effects of looks on earnings using a more homogeneous group: graduates of a particular law school. The authors continue to find that physical appearance has an effect on annual earnings, something that is perhaps not too surprising among people practicing law.

In some cases, the ordinal variable takes on too many values, so that a dummy variable cannot be included for each value. For example, the file LAWSCH85 contains data on median starting salaries for law school graduates. One of the key explanatory variables is the rank of the law school. Because each law school has a different rank, we clearly cannot include a dummy variable for each rank. If we do not wish to put the rank directly in the equation, we can break it down into categories. The following example shows how this is done.

Example 7.8 Effects of Law School Rankings on Starting Salaries

Define the dummy variables top10, r11_25, r26_40, r41_60, and r61_100 to take on the value unity when the variable rank falls into the appropriate range. We let schools ranked below 100 be the base group. The estimated equation is

log(salary) = 9.17 + .700 top10 + .594 r11_25 + .375 r26_40 + .263 r41_60
              + .132 r61_100 + .0057 LSAT + .041 GPA + .036 log(libvol)
              + .0008 log(cost)        (7.13)
(se: .41, .053, .039, .034, .028, .021, .0031, .074, .026, .0251)
n = 136, R² = .911, adjusted R² = .905.
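Binning a many-valued rank into the categories of (7.13) is mechanical with pd.cut. A sketch, assuming a hypothetical lawsch85.csv export of the LAWSCH85 data; the column names follow the text and should be adjusted to the actual file:

    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    df = pd.read_csv('lawsch85.csv')   # hypothetical export of LAWSCH85

    # Right-inclusive bins: (0,10], (10,25], ..., (100, max]; above 100 is the base group
    df['rankcat'] = pd.cut(df['rank'],
                           bins=[0, 10, 25, 40, 60, 100, df['rank'].max()],
                           labels=['top10', 'r11_25', 'r26_40', 'r41_60',
                                   'r61_100', 'below100'])

    res = smf.ols('np.log(salary) ~ C(rankcat, Treatment(reference="below100"))'
                  ' + LSAT + GPA + np.log(libvol) + np.log(cost)', data=df).fit()
    print(res.summary())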
We see immediately that all of the dummy variables defining the different ranks are very statistically significant. The estimate on r61_100 means that, holding LSAT, GPA, libvol, and cost fixed, the median salary at a law school ranked between 61 and 100 is about 13.2% higher than that at a law school ranked below 100. The difference between a top 10 school and a below 100 school is quite large. Using the exact calculation given in equation (7.10) gives exp(.700) − 1 ≈ 1.014, and so the predicted median salary is more than 100% higher at a top 10 school than it is at a below 100 school.

As an indication of whether breaking the rank into different groups is an improvement, we can compare the adjusted R-squared in (7.13) with the adjusted R-squared from including rank as a single variable: the former is .905 and the latter is .836, so the additional flexibility of (7.13) is warranted.

Interestingly, once the rank is put into the (admittedly somewhat arbitrary) given categories, all of the other variables become insignificant. In fact, a test for joint significance of LSAT, GPA, log(libvol), and log(cost) gives a p-value of .055, which is borderline significant. When rank is included in its original form, the p-value for joint significance is zero to four decimal places.

One final comment about this example. In deriving the properties of ordinary least squares, we assumed that we had a random sample. The current application violates that assumption because of the way rank is defined: a school's rank necessarily depends on the rank of the other schools in the sample, and so the data cannot represent independent draws from the population of all law schools. This does not cause any serious problems provided the error term is uncorrelated with the explanatory variables.

7.4 Interactions Involving Dummy Variables

7.4a Interactions among Dummy Variables

Just as variables with quantitative meaning can be interacted in regression models, so can dummy variables. We have effectively seen an example of this in Example 7.6, where we defined four categories based on marital status and gender. In fact, we can recast that model by adding an interaction term between female and married to the model where female and married appear separately. This allows the marriage premium to depend on gender, just as it did in equation (7.11). For purposes of comparison, the estimated model with the female-married interaction term is

log(wage) = .321 − .110 female + .213 married − .301 female·married + ⋯        (7.14)
(se: .100, .056, .055, .072)

where the rest of the regression is necessarily identical to (7.11). Equation (7.14) shows explicitly that there is a statistically significant interaction between gender and marital status. This model also allows us to obtain the estimated wage differential among all four groups, but here we must be careful to plug in the correct combination of zeros and ones. Setting female = 0 and married = 0 corresponds to the group single men, which is the base group, since this eliminates female, married, and female·married. We can find the intercept for married men by setting female = 0 and married = 1 in (7.14); this gives an intercept of .321 + .213 = .534, and so on.

Equation (7.14) is just a different way of finding wage differentials across all gender-marital status combinations. It allows us to easily test the null hypothesis that the gender differential does not depend on marital status (equivalently, that the marriage differential does not depend on gender). Equation (7.11) is more convenient for testing for wage differentials between any group and the base group of single men.
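In formula syntax, the interaction form (7.14) is the product term female*married, which expands to the two dummies plus their interaction. A sketch, again from the hypothetical wage1.csv export:

    import pandas as pd
    import statsmodels.formula.api as smf

    df = pd.read_csv('wage1.csv')   # hypothetical export of the WAGE1 data

    # female*married expands to female + married + female:married
    res = smf.ols('lwage ~ female*married + educ + exper + I(exper**2)'
                  ' + tenure + I(tenure**2)', data=df).fit()

    # t statistic on the interaction: does the marriage premium differ by gender?
    print(res.tvalues['female:married'])
    # Intercept for married men: overall intercept plus the married coefficient
    print(res.params['Intercept'] + res.params['married'])   # .321 + .213 = .534 in the text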
Example 7.9 Effects of Computer Usage on Wages

Krueger (1993) estimates the effects of computer usage on wages. He defines a dummy variable, which we call compwork, equal to one if an individual uses a computer at work. Another dummy variable, comphome, equals one if the person uses a computer at home. Using 13,379 people from the 1989 Current Population Survey, Krueger (1993, Table 4) obtains

log(wage) = β̂₀ + .177 compwork + .070 comphome
            + .017 compwork·comphome + other factors        (7.15)
(se: .009, .019, .023)

(The other factors are the standard ones for wage regressions, including education, experience, gender, and marital status; see Krueger's paper for the exact list.) Krueger does not report the intercept because it is not of any importance; all we need to know is that the base group consists of people who do not use a computer at home or at work. It is worth noticing that the estimated return to using a computer at work (but not at home) is about 17.7%. (The more precise estimate is 19.4%.) Similarly, people who use computers at home but not at work have about a 7% wage premium over those who do not use a computer at all. The differential between those who use a computer at both places, relative to those who use a computer in neither place, is about 26.4% (obtained by adding all three coefficients and multiplying by 100), or the more precise estimate of 30.2% obtained from equation (7.10).

The interaction term in (7.15) is not statistically significant, nor is it very big economically. But it is causing little harm by being in the equation.

7.4b Allowing for Different Slopes

We have now seen several examples of how to allow different intercepts for any number of groups in a multiple regression model. There are also occasions for interacting dummy variables with explanatory variables that are not dummy variables to allow for a difference in slopes. Continuing with the wage example, suppose that we wish to test whether the return to education is the same for men and women, allowing for a constant wage differential between men and women (a differential for which we have already found evidence). For simplicity, we include only education and gender in the model. What kind of model allows for different returns to education? Consider the model

log(wage) = (β₀ + δ₀ female) + (β₁ + δ₁ female)educ + u.        (7.16)

If we plug female = 0 into (7.16), then we find that the intercept for males is β₀ and the slope on education for males is β₁. For females, we plug in female = 1; thus, the intercept for females is β₀ + δ₀ and the slope is β₁ + δ₁. Therefore, δ₀ measures the difference in intercepts between women and men, and δ₁ measures the difference in the return to education between women and men. Two of the four cases for the signs of δ₀ and δ₁ are presented in Figure 7.2.

Graph (a) shows the case where the intercept for women is below that for men, and the slope of the line is smaller for women than for men. This means that women earn less than men at all levels of education, and the gap increases as educ gets larger. In graph (b), the intercept for women is below that for men, but the slope on education is larger for women. This means that women earn less than men at low levels of education, but the gap narrows as education increases. At some point, a woman earns more than a man with the same level of education, and this amount of education is easily found once we have the estimated equation.

[Figure 7.2: Graphs of equation (7.16), plotting log(wage) against educ for men and women: (a) δ₀ < 0, δ₁ < 0; (b) δ₀ < 0, δ₁ > 0.]

How can we estimate model (7.16)? To apply OLS, we must write the model with an interaction between female and educ:

log(wage) = β₀ + δ₀ female + β₁ educ + δ₁ female·educ + u.        (7.17)

The parameters can now be estimated from the regression of log(wage) on female, educ, and female·educ. Obtaining the interaction term is easy in any regression package.
Do not be daunted by the odd nature of female·educ, which is zero for any man in the sample and equal to the level of education for any woman in the sample.

An important hypothesis is that the return to education is the same for women and men. In terms of model (7.17), this is stated as H₀: δ₁ = 0, which means that the slope of log(wage) with respect to educ is the same for men and women. Note that this hypothesis puts no restrictions on the difference in intercepts, δ₀. A wage differential between men and women is allowed under this null, but it must be the same at all levels of education. This situation is described by Figure 7.1.

We are also interested in the hypothesis that average wages are identical for men and women who have the same levels of education. This means that δ₀ and δ₁ must both be zero under the null hypothesis. In equation (7.17), we must use an F test to test H₀: δ₀ = 0, δ₁ = 0. In the model with just an intercept difference, we reject this hypothesis because H₀: δ₀ = 0 is soundly rejected against H₁: δ₀ < 0.

Example 7.10 Log Hourly Wage Equation

We add quadratics in experience and tenure to (7.17):

log(wage) = .389 − .227 female + .082 educ − .0056 female·educ + .029 exper
            − .00058 exper² + .032 tenure − .00059 tenure²        (7.18)
(se: .119, .168, .008, .0131, .005, .00011, .007, .00024)
n = 526, R² = .441.
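A sketch for estimating (7.18), once more from the hypothetical wage1.csv export; the coefficient named female:educ is the estimate of δ₁, and a linear-combination test gives the women's return to education directly:

    import pandas as pd
    import statsmodels.formula.api as smf

    df = pd.read_csv('wage1.csv')   # hypothetical export of the WAGE1 data
    res = smf.ols('lwage ~ female*educ + exper + I(exper**2)'
                  ' + tenure + I(tenure**2)', data=df).fit()

    print(res.params['educ'])                              # return for men (.082)
    print(res.params['educ'] + res.params['female:educ'])  # return for women (about .076)
    # Estimate, standard error, and t statistic for the women's return in one step
    print(res.t_test('educ + female:educ = 0'))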
The estimated return to education for men in this equation is .082, or 8.2%. For women, it is .082 − .0056 = .0764, or about 7.6%. The difference, .0056, or just over one-half a percentage point less for women, is not economically large nor statistically significant: the t statistic is −.0056/.0131 ≈ −.43. Thus, we conclude that there is no evidence against the hypothesis that the return to education is the same for men and women.

The coefficient on female, while remaining economically large, is no longer significant at conventional levels (t = −1.35). Its coefficient and t statistic in the equation without the interaction were −.297 and −8.25, respectively [see equation (7.9)]. Should we now conclude that there is no statistically significant evidence of lower pay for women at the same levels of educ, exper, and tenure? This would be a serious error. Because we have added the interaction female·educ to the equation, the coefficient on female is now estimated much less precisely than it was in equation (7.9): the standard error has increased by almost fivefold (.168/.036 ≈ 4.67). This occurs because female and female·educ are highly correlated in the sample. In this example, there is a useful way to think about the multicollinearity: in equation (7.17), and the more general equation estimated in (7.18), δ₀ measures the wage differential between women and men when educ = 0. Very few people in the sample have very low levels of education, so it is not surprising that we have a difficult time estimating the differential at educ = 0 (nor is the differential at zero years of education very informative). More interesting would be to estimate the gender differential at, say, the average education level in the sample, about 12.5. To do this, we would replace female·educ with female·(educ − 12.5) and rerun the regression; this only changes the coefficient on female and its standard error. (See Computer Exercise C7.)
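The re-centering just described is a one-line change in formula syntax. A sketch (the value 12.5 is the approximate sample average of educ quoted above):

    import pandas as pd
    import statsmodels.formula.api as smf

    df = pd.read_csv('wage1.csv')   # hypothetical export of the WAGE1 data

    # Interact female with education centered at 12.5 years
    res = smf.ols('lwage ~ female + educ + female:I(educ - 12.5)'
                  ' + exper + I(exper**2) + tenure + I(tenure**2)', data=df).fit()

    # The female coefficient is now the estimated gender gap at educ = 12.5;
    # only this coefficient and its standard error differ from equation (7.18)
    print(res.params['female'], res.bse['female'])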
If we compute the F statistic for H₀: δ₀ = 0, δ₁ = 0, we obtain F = 34.33, which is a huge value for an F random variable with numerator df = 2 and denominator df = 518: the p-value is zero to four decimal places. In the end, we prefer model (7.9), which allows for a constant wage differential between women and men.

[Exploring Further 7.4: How would you augment the model estimated in (7.18) to allow the return to tenure to differ by gender?]

As a more complicated example involving interactions, we now look at the effects of race and city racial composition on major league baseball player salaries.

Example 7.11 Effects of Race on Baseball Player Salaries

Using MLB1, the following equation is estimated for the 330 major league baseball players for which city racial composition statistics are available. The variables black and hispan are binary indicators for the individual players. (The base group is white players.) The variable percblck is the percentage of the team's city that is black, and perchisp is the percentage of Hispanics. The other variables measure aspects of player productivity and longevity. Here, we are interested in race effects after controlling for these other factors. In addition to including black and hispan in the equation, we add the interactions black·percblck and hispan·perchisp. The estimated equation is

log(salary) = 10.34 + .0673 years + .0089 gamesyr + .00095 bavg + .0146 hrunsyr
              + .0045 rbisyr + .0072 runsyr + .0011 fldperc + .0075 allstar
              − .198 black − .190 hispan + .0125 black·percblck
              + .0201 hispan·perchisp        (7.19)
(se: 2.18, .0129, .0034, .00151, .0164, .0076, .0046, .0021, .0029, .125, .153, .0050, .0098)
n = 330, R² = .638.

First, we should test whether the four race variables, black, hispan, black·percblck, and hispan·perchisp, are jointly significant. Using the same 330 players, the R-squared when the four race variables are dropped is .626. Since there are four restrictions and df = 330 − 13 in the unrestricted model, the F statistic is about 2.63, which yields a p-value of .034. Thus, these variables are jointly significant at the 5% level (though not at the 1% level).

How do we interpret the coefficients on the race variables? In the following discussion, all productivity factors are held fixed. First, consider what happens for black players, holding perchisp fixed. The coefficient −.198 on black literally means that, if a black player is in a city with no blacks (percblck = 0), then the black player earns about 19.8% less than a comparable white player. As percblck increases (which means the white population decreases, since perchisp is held fixed), the salary of blacks increases relative to that for whites. In a city with 10% blacks, log(salary) for blacks compared to that for whites is −.198 + .0125(10) = −.073, so salary is about 7.3% less for blacks than for whites in such a city. When percblck = 20%, blacks earn about 5.2% more than whites. The largest percentage of blacks in a city is about 74% (Detroit).

Similarly, Hispanics earn less than whites in cities with a low percentage of Hispanics. But we can easily find the value of perchisp that makes the differential between whites and Hispanics equal zero: it must make −.190 + .0201 perchisp = 0, which gives perchisp ≈ 9.45. For cities in which the percentage of Hispanics is less than 9.45%, Hispanics are predicted to earn less than whites (for a given black population), and the opposite is true if the percentage of Hispanics is above 9.45%. Twelve of the 22 cities represented in the sample have Hispanic populations that are less than 9.45% of the total population. The largest percentage of Hispanics is about 31%.

How do we interpret these findings? We cannot simply claim discrimination exists against blacks and Hispanics, because the estimates imply that whites earn less than blacks and Hispanics in cities heavily populated by minorities. The importance of city composition on salaries might be due to player preferences: perhaps the best black players live disproportionately in cities with more blacks, and the best Hispanic players tend to be in cities with more Hispanics. The estimates in (7.19) allow us to determine that some relationship is present, but we cannot distinguish between these two hypotheses.

7.4c Testing for Differences in Regression Functions across Groups

The previous examples illustrate that interacting dummy variables with other independent variables can be a powerful tool. Sometimes, we wish to test the null hypothesis that two populations or groups follow the same regression function, against the alternative that one or more of the slopes differ across the groups. We will also see examples of this in Chapter 13, when we discuss pooling different cross sections over time.

Suppose we want to test whether the same regression model describes college grade point averages for male and female college athletes. The equation is

cumgpa = β₀ + β₁sat + β₂hsperc + β₃tothrs + u,

where sat is SAT score, hsperc is high school rank percentile, and tothrs is total hours of college courses. We know that, to allow for an intercept difference, we can include a dummy variable for either males or females. If we want any of the slopes to depend on gender, we simply interact the appropriate variable with, say, female, and include it in the equation.

If we are interested in testing whether there is any difference between men and women, then we must allow a model where the intercept and all slopes can be different across the two groups:

cumgpa = β₀ + δ₀ female + β₁sat + δ₁ female·sat + β₂hsperc
         + δ₂ female·hsperc + β₃tothrs + δ₃ female·tothrs + u.        (7.20)

The parameter δ₀ is the difference in the intercept between women and men, δ₁ is the slope difference with respect to sat between women and men, and so on. The null hypothesis that cumgpa follows the same model for males and females is stated as

H₀: δ₀ = 0, δ₁ = 0, δ₂ = 0, δ₃ = 0.        (7.21)

If one of the δⱼ is different from zero, then the model is different for men and women.
Using the spring semester data from the file GPA3, the full model is estimated as

cumgpa = 1.48 − .353 female + .0011 sat + .00075 female·sat − .0085 hsperc
         − .00055 female·hsperc + .0023 tothrs − .00012 female·tothrs        (7.22)
(se: .21, .411, .0002, .00039, .0014, .00316, .0009, .00163)
n = 366, R² = .406, adjusted R² = .394.

None of the four terms involving the female dummy variable is very statistically significant; only the female·sat interaction has a t statistic close to two. But we know better than to rely on the individual t statistics for testing a joint hypothesis such as (7.21). To compute the F statistic, we must estimate the restricted model, which results from dropping female and all of the interactions; this gives an R² (the restricted R²) of about .352, so the F statistic is about 8.14; the p-value is zero to five decimal places, which causes us to soundly reject (7.21). Thus, men and women athletes do follow different GPA models, even though each term in (7.22) that allows women and men to be different is individually insignificant at the 5% level.

The large standard errors on female and the interaction terms make it difficult to tell exactly how men and women differ. We must be very careful in interpreting equation (7.22) because, in obtaining differences between women and men, the interaction terms must be taken into account. If we look only at the female variable, we would wrongly conclude that cumgpa is about .353 less for women than for men, holding other factors fixed. This is the estimated difference only when sat, hsperc, and tothrs are all set to zero, which is not close to being a possible scenario. At sat = 1,100, hsperc = 10, and tothrs = 50, the predicted difference between a woman and a man is −.353 + .00075(1,100) − .00055(10) − .00012(50) ≈ .461. That is, the female athlete is predicted to have a GPA that is almost one-half a point higher than the comparable male athlete.

In a model with three variables, sat, hsperc, and tothrs, it is pretty simple to add all of the interactions to test for group differences. In some cases, many more explanatory variables are involved, and then it is convenient to have a different way to compute the statistic. It turns out that the sum of squared residuals form of the F statistic can be computed easily even when many independent variables are involved.

In the general model with k explanatory variables and an intercept, suppose we have two groups; call them g = 1 and g = 2. We would like to test whether the intercept and all slopes are the same across the two groups. Write the model as

y = β_{g,0} + β_{g,1}x₁ + β_{g,2}x₂ + ⋯ + β_{g,k}xₖ + u        (7.23)

for g = 1 and g = 2. The hypothesis that each beta in (7.23) is the same across the two groups involves k + 1 restrictions (in the GPA example, k + 1 = 4). The unrestricted model, which we can think of as having a group dummy variable and k interaction terms in addition to the intercept and variables themselves, has n − 2(k + 1) degrees of freedom. [In the GPA example, n − 2(k + 1) = 366 − 2(4) = 358.] So far, there is nothing new. The key insight is that the sum of squared residuals from the unrestricted model can be obtained from two separate regressions, one for each group.
Let SSR₁ be the sum of squared residuals obtained estimating (7.23) for the first group; this involves n₁ observations. Let SSR₂ be the sum of squared residuals obtained from estimating the model using the second group (n₂ observations). In the previous example, if group 1 is females, then n₁ = 90 and n₂ = 276. Now, the sum of squared residuals for the unrestricted model is simply SSRᵤᵣ = SSR₁ + SSR₂. The restricted sum of squared residuals is just the SSR from pooling the groups and estimating a single equation, say SSR_P. Once we have these, we compute the F statistic as usual:

F = [SSR_P − (SSR₁ + SSR₂)]/(SSR₁ + SSR₂) · [n − 2(k + 1)]/(k + 1),        (7.24)

where n is the total number of observations. This particular F statistic is usually called the Chow statistic in econometrics. Because the Chow test is just an F test, it is only valid under homoskedasticity. In particular, under the null hypothesis, the error variances for the two groups must be equal. As usual, normality is not needed for asymptotic analysis.

To apply the Chow statistic to the GPA example, we need the SSR from the regression that pooled the groups together: this is SSR_P = 85.515. The SSR for the 90 women in the sample is SSR₁ = 19.603, and the SSR for the men is SSR₂ = 58.752. Thus, SSRᵤᵣ = 19.603 + 58.752 = 78.355. The F statistic is [(85.515 − 78.355)/78.355](358/4) ≈ 8.18; of course, subject to rounding error, this is what we get using the R-squared form of the test in the models with and without the interaction terms. (A word of caution: there is no simple R-squared form of the test if separate regressions have been estimated for each group; the R-squared form of the test can be used only if interactions have been included to create the unrestricted model.)

One important limitation of the traditional Chow test, regardless of the method used to implement it, is that the null hypothesis allows for no differences at all between the groups. In many cases, it is more interesting to allow for an intercept difference between the groups and then to test for slope differences; we saw one example of this in the wage equation in Example 7.10. There are two ways to allow the intercepts to differ under the null hypothesis. One is to include the group dummy and all interaction terms, as in equation (7.22), but then test joint significance of the interaction terms only. The second approach, which produces an identical statistic, is to form a sum-of-squared-residuals F statistic, as in equation (7.24), but where the restricted SSR, called SSR_P in equation (7.24), is obtained using a regression that contains only an intercept shift. Because we are testing k restrictions, rather than k + 1, the F statistic becomes

F = [SSR_P − (SSR₁ + SSR₂)]/(SSR₁ + SSR₂) · [n − 2(k + 1)]/k.

Using this approach in the GPA example, SSR_P is obtained from the regression cumgpa on female, sat, hsperc, and tothrs, using the data for both male and female student-athletes.

Because there are relatively few explanatory variables in the GPA example, it is easy to estimate (7.20) and test H₀: δ₁ = 0, δ₂ = 0, δ₃ = 0 (with δ₀ unrestricted under the null). The F statistic for the three exclusion restrictions gives a p-value equal to .205, and so we do not reject the null hypothesis at even the 20% significance level.
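The Chow computation in (7.24) is a few lines of arithmetic once the three SSRs are in hand. A minimal sketch, checked against the GPA numbers from the text:

    import scipy.stats as st

    def chow_stat(ssr_pooled, ssr1, ssr2, n, k):
        """Chow F statistic, equation (7.24): k slopes plus an intercept per group."""
        ssr_ur = ssr1 + ssr2
        F = ((ssr_pooled - ssr_ur) / ssr_ur) * ((n - 2 * (k + 1)) / (k + 1))
        pval = st.f.sf(F, k + 1, n - 2 * (k + 1))
        return F, pval

    # GPA example: SSR_P = 85.515, SSR1 = 19.603 (women), SSR2 = 58.752 (men)
    print(chow_stat(85.515, 19.603, 58.752, n=366, k=3))   # F is about 8.18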
Failure to reject the hypothesis that the parameters multiplying the interaction terms are all zero suggests that the best model allows for an intercept difference only:

cumgpa = 1.39 + .310 female + .0012 sat − .0084 hsperc + .0025 tothrs        (7.25)
(se: .18, .059, .0002, .0012, .0007)
n = 366, R² = .398, adjusted R² = .392.

The slope coefficients in (7.25) are close to those for the base group (males) in (7.22); dropping the interactions changes very little. However, female in (7.25) is highly significant: its t statistic is over 5, and the estimate implies that, at given levels of sat, hsperc, and tothrs, a female athlete has a predicted GPA that is .31 point higher than that of a male athlete. This is a practically important difference.

7.5 A Binary Dependent Variable: The Linear Probability Model

By now, we have learned much about the properties and applicability of the multiple linear regression model. In the last several sections, we studied how, through the use of binary independent variables, we can incorporate qualitative information as explanatory variables in a multiple regression model. In all of the models up until now, the dependent variable y has had quantitative meaning (for example, y is a dollar amount, a test score, a percentage, or the logs of these). What happens if we want to use multiple regression to explain a qualitative event?

In the simplest case, and one that often arises in practice, the event we would like to explain is a binary outcome. In other words, our dependent variable, y, takes on only two values: zero and one. For example, y can be defined to indicate whether an adult has a high school education; y can indicate whether a college student used illegal drugs during a given school year; or y can indicate whether a firm was taken over by another firm during a given year. In each of these examples, we can let y = 1 denote one of the outcomes and y = 0 the other outcome.

What does it mean to write down a multiple regression model, such as

y = β₀ + β₁x₁ + ⋯ + βₖxₖ + u,        (7.26)

when y is a binary variable? Because y can take on only two values, βⱼ cannot be interpreted as the change in y given a one-unit increase in xⱼ, holding all other factors fixed: y either changes from zero to one or from one to zero (or does not change). Nevertheless, the βⱼ still have useful interpretations. If we assume that the zero conditional mean assumption MLR.4 holds, that is, E(u|x₁, …, xₖ) = 0, then we have, as always,

E(y|x) = β₀ + β₁x₁ + ⋯ + βₖxₖ,

where x is shorthand for all of the explanatory variables. The key point is that, when y is a binary variable taking on the values zero and one, it is always true that P(y = 1|x) = E(y|x): the probability of "success", that is, the probability that y = 1, is the same as the expected value of y. Thus, we have the important equation

P(y = 1|x) = β₀ + β₁x₁ + ⋯ + βₖxₖ,        (7.27)

which says that the probability of success, say p(x) = P(y = 1|x), is a linear function of the xⱼ. Equation (7.27) is an example of a binary response model, and P(y = 1|x) is also called the response probability.
We will cover other binary response models in Chapter 17. Because probabilities must sum to one, P(y = 0|x) = 1 − P(y = 1|x) is also a linear function of the xⱼ.

The multiple linear regression model with a binary dependent variable is called the linear probability model (LPM) because the response probability is linear in the parameters βⱼ. In the LPM, βⱼ measures the change in the probability of success when xⱼ changes, holding other factors fixed:

ΔP(y = 1|x) = βⱼΔxⱼ.        (7.28)

With this in mind, the multiple regression model can allow us to estimate the effect of various explanatory variables on qualitative events. The mechanics of OLS are the same as before. If we write the estimated equation as

ŷ = β̂₀ + β̂₁x₁ + ⋯ + β̂ₖxₖ,

we must now remember that ŷ is the predicted probability of success. Therefore, β̂₀ is the predicted probability of success when each xⱼ is set to zero, which may or may not be interesting. The slope coefficient β̂₁ measures the predicted change in the probability of success when x₁ increases by one unit.

To correctly interpret a linear probability model, we must know what constitutes a "success". Thus, it is a good idea to give the dependent variable a name that describes the event y = 1. As an example, let inlf ("in the labor force") be a binary variable indicating labor force participation by a married woman during 1975: inlf = 1 if the woman reports working for a wage outside the home at some point during the year, and zero otherwise. We assume that labor force participation depends on other sources of income, including husband's earnings (nwifeinc, measured in thousands of dollars), years of education (educ), past years of labor market experience (exper), age, number of children less than six years old (kidslt6), and number of kids between 6 and 18 years of age (kidsge6). Using the data in MROZ from Mroz (1987), we estimate the following linear probability model, where 428 of the 753 women in the sample report being in the labor force at some point during 1975:

inlf = .586 − .0034 nwifeinc + .038 educ + .039 exper − .00060 exper²
       − .016 age − .262 kidslt6 + .013 kidsge6        (7.29)
(se: .154, .0014, .007, .006, .00018, .002, .034, .013)
n = 753, R² = .264.

Using the usual t statistics, all variables in (7.29) except kidsge6 are statistically significant, and all of the significant variables have the effects we would expect based on economic theory (or common sense).

To interpret the estimates, we must remember that a change in the independent variable changes the probability that inlf = 1. For example, the coefficient on educ means that, everything else in (7.29) held fixed, another year of education increases the probability of labor force participation by .038. If we take this equation literally, 10 more years of education increases the probability of being in the labor force by .038(10) = .38, which is a pretty large increase in a probability. The relationship between the probability of labor force participation and educ is plotted in Figure 7.3. The other independent variables are fixed at the values nwifeinc = 50, exper = 5, age = 30, kidslt6 = 1, and kidsge6 = 0 for illustration purposes.

[Figure 7.3: Estimated relationship between the probability of being in the labor force and years of education, with other explanatory variables fixed. The fitted line has slope .038, starts at −.146 when educ = 0, and crosses zero at educ = 3.84.]

The predicted probability is negative until education equals 3.84 years. This should not cause too much concern because, in this sample, no woman has less than five years of education. The largest reported education is 17 years, and this leads to a predicted probability of .5. If we set the other independent variables at different values, the range of predicted probabilities would change. But the marginal effect of another year of education on the probability of labor force participation is always .038.
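Estimating an LPM is just OLS with a binary dependent variable. A sketch, assuming a hypothetical mroz.csv export of the MROZ data; the final line previews the fitted-value problem discussed next:

    import pandas as pd
    import statsmodels.formula.api as smf

    df = pd.read_csv('mroz.csv')   # hypothetical export of the MROZ data
    res = smf.ols('inlf ~ nwifeinc + educ + exper + I(exper**2) + age'
                  ' + kidslt6 + kidsge6', data=df).fit()
    print(res.summary())

    # Fitted values are predicted probabilities, but nothing forces them into [0, 1]
    print((res.fittedvalues < 0).sum(), (res.fittedvalues > 1).sum())   # 16 and 17 in the text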
The coefficient on nwifeinc implies that, if Δnwifeinc = 10 (which means an increase of $10,000), the probability that a woman is in the labor force falls by .034. This is not an especially large effect, given that an increase in income of $10,000 is substantial in terms of 1975 dollars. Experience has been entered as a quadratic to allow the effect of past experience to have a diminishing effect on the labor force participation probability. Holding other factors fixed, the estimated change in the probability is approximated as .039 − 2(.0006)exper = .039 − .0012 exper. The point at which past experience has no effect on the probability of labor force participation is .039/.0012 = 32.5, which is a high level of experience: only 13 of the 753 women in the sample have more than 32 years of experience.

Unlike the number of older children, the number of young children has a huge impact on labor force participation. Having one additional child less than six years old reduces the probability of participation by .262, at given levels of the other variables. In the sample, just under 20% of the women have at least one young child.

This example illustrates how easy linear probability models are to estimate and interpret, but it also highlights some shortcomings of the LPM. First, it is easy to see that, if we plug certain combinations of values for the independent variables into (7.29), we can get predictions either less than zero or greater than one. Since these are predicted probabilities, and probabilities must be between zero and one, this can be a little embarrassing. For example, what would it mean to predict that a woman is in the labor force with a probability of −.10? In fact, of the 753 women in the sample, 16 of the fitted values from (7.29) are less than zero, and 17 of the fitted values are greater than one.

A related problem is that a probability cannot be linearly related to the independent variables for all their possible values. For example, (7.29) predicts that the effect of going from zero children to one young child reduces the probability of working by .262. This is also the predicted drop if the woman goes from having one young child to two. It seems more realistic that the first small child would reduce the probability by a large amount, but subsequent children would have a smaller marginal effect.
In fact, when taken to the extreme, (7.29) implies that going from zero to four young children reduces the probability of working by Δinlf = .262(Δkidslt6) = .262(4) = 1.048, which is impossible.

Even with these problems, the linear probability model is useful and often applied in economics. It usually works well for values of the independent variables that are near the averages in the sample. In the labor force participation example, no women in the sample have four young children; in fact, only three women have three young children. Over 96% of the women have either no young children or one small child, and so we should probably restrict attention to this case when interpreting the estimated equation.

Predicted probabilities outside the unit interval are a little troubling when we want to make predictions. Still, there are ways to use the estimated probabilities (even if some are negative or greater than one) to predict a zero-one outcome. As before, let ŷᵢ denote the fitted values, which may not be bounded between zero and one. Define a predicted value as ỹᵢ = 1 if ŷᵢ ≥ .5 and ỹᵢ = 0 if ŷᵢ < .5. Now, we have a set of predicted values, ỹᵢ, i = 1, …, n, that, like the yᵢ, are either zero or one. We can use the data on yᵢ and ỹᵢ to obtain the frequencies with which we correctly predict yᵢ = 1 and yᵢ = 0, as well as the proportion of overall correct predictions. The latter measure, when turned into a percentage, is a widely used goodness-of-fit measure for binary dependent variables: the percent correctly predicted. An example is given in Computer Exercise C9(v), and further discussion, in the context of more advanced models, can be found in Section 17.1.

Due to the binary nature of y, the linear probability model does violate one of the Gauss-Markov assumptions. When y is a binary variable, its variance, conditional on x, is

Var(y|x) = p(x)[1 − p(x)],        (7.30)

where p(x) is shorthand for the probability of success: p(x) = β₀ + β₁x₁ + ⋯ + βₖxₖ. This means that, except in the case where the probability does not depend on any of the independent variables, there must be heteroskedasticity in a linear probability model. We know from Chapter 3 that this does not cause bias in the OLS estimators of the βⱼ. But we also know from Chapters 4 and 5 that homoskedasticity is crucial for justifying the usual t and F statistics, even in large samples. Because the standard errors in (7.29) are not generally valid, we should use them with caution. We will show how to correct the standard errors for heteroskedasticity in Chapter 8. It turns out that, in many applications, the usual OLS statistics are not far off, and it is still acceptable in applied work to present a standard OLS analysis of a linear probability model.
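The percent correctly predicted takes only a few lines once the LPM has been fit. A sketch continuing the hypothetical mroz.csv example (as an aside, heteroskedasticity-robust standard errors of the kind previewed for Chapter 8 are available in statsmodels by fitting with cov_type='HC1'):

    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    df = pd.read_csv('mroz.csv')   # hypothetical export of the MROZ data
    res = smf.ols('inlf ~ nwifeinc + educ + exper + I(exper**2) + age'
                  ' + kidslt6 + kidsge6', data=df).fit()

    ytilde = (res.fittedvalues >= 0.5).astype(int)   # predicted zero-one outcomes
    print(100 * np.mean(ytilde == df['inlf']))       # percent correctly predicted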
Example 7.12 A Linear Probability Model of Arrests

Let arr86 be a binary variable equal to unity if a man was arrested during 1986, and zero otherwise. The population is a group of young men in California born in 1960 or 1961 who have at least one arrest prior to 1986. A linear probability model for describing arr86 is

arr86 = β₀ + β₁pcnv + β₂avgsen + β₃tottime + β₄ptime86 + β₅qemp86 + u,

where
pcnv = the proportion of prior arrests that led to a conviction,
avgsen = the average sentence served from prior convictions (in months),
tottime = months spent in prison since age 18 prior to 1986,
ptime86 = months spent in prison in 1986, and
qemp86 = the number of quarters (0 to 4) that the man was legally employed in 1986.

The data we use are in CRIME1, the same data set used for Example 3.5. Here, we use a binary dependent variable because only 7.2% of the men in the sample were arrested more than once. About 27.7% of the men were arrested at least once during 1986. The estimated equation is

arr86 = .441 − .162 pcnv + .0061 avgsen − .0023 tottime
        − .022 ptime86 − .043 qemp86        (7.31)
(se: .017, .021, .0065, .0050, .005, .005)
n = 2,725, R² = .0474.

The intercept, .441, is the predicted probability of arrest for someone who has not been convicted (and so pcnv and avgsen are both zero), has spent no time in prison since age 18, spent no time in prison in 1986, and was unemployed during the entire year. The variables avgsen and tottime are insignificant, both individually and jointly (the F test gives p-value = .347), and avgsen has a counterintuitive sign if longer sentences are supposed to deter crime. Grogger (1991), using a superset of these data and different econometric methods, found that tottime has a statistically significant positive effect on arrests and concluded that tottime is a measure of human capital built up in criminal activity.

Increasing the probability of conviction does lower the probability of arrest, but we must be careful when interpreting the magnitude of the coefficient. The variable pcnv is a proportion between zero and one; thus, changing pcnv from zero to one essentially means a change from no chance of being convicted to being convicted with certainty. Even this large change reduces the probability of arrest only by .162; increasing pcnv by .5 decreases the probability of arrest by .081.

The incarcerative effect is given by the coefficient on ptime86. If a man is in prison, he cannot be arrested. Since ptime86 is measured in months, six more months in prison reduces the probability of arrest by .022(6) = .132. Equation (7.31) gives another example of where the linear probability model cannot be true over all ranges of the independent variables. If a man is in prison all 12 months of 1986, he cannot be arrested in 1986. Setting all other variables equal to zero, the predicted probability of arrest when ptime86 = 12 is .441 − .022(12) = .177, which is not zero. Nevertheless, if we start from the unconditional probability of arrest, .277, 12 months in prison reduces the probability to essentially zero: .277 − .022(12) = .013.

Finally, employment reduces the probability of arrest in a significant way. All other factors fixed, a man employed in all four quarters is .172 less likely to be arrested than a man who is not employed at all.

We can also include dummy independent variables in models with dummy dependent variables. The coefficient measures the predicted difference in probability relative to the base group. For example, if we add two race dummies, black and hispan, to the arrest equation, we obtain

arr86 = .380 − .152 pcnv + .0046 avgsen − .0026 tottime − .024 ptime86
        − .038 qemp86 + .170 black + .096 hispan        (7.32)
(se: .019, .021, .0064, .0049, .005, .005, .024, .021)
n = 2,725, R² = .0682.
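A sketch for (7.31), assuming a hypothetical crime1.csv export of CRIME1; the binary indicator is built from narr86, the arrest count used in Example 3.5, and the joint test on avgsen and tottime reproduces the F test reported above:

    import pandas as pd
    import statsmodels.formula.api as smf

    df = pd.read_csv('crime1.csv')                 # hypothetical export of CRIME1
    df['arr86'] = (df['narr86'] > 0).astype(int)   # 1 if arrested at least once in 1986

    res = smf.ols('arr86 ~ pcnv + avgsen + tottime + ptime86 + qemp86', data=df).fit()
    print(res.f_test('avgsen = 0, tottime = 0'))   # p-value about .35 in the text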
The coefficient on black means that, all other factors being equal, a black man has a .17 higher chance of being arrested than a white man (the base group). Another way to say this is that the probability of arrest is 17 percentage points higher for blacks than for whites. The difference is statistically significant as well. Similarly, Hispanic men have a .096 higher chance of being arrested than white men.

[Exploring Further 7.5: What is the predicted probability of arrest for a black man with no prior convictions (so that pcnv, avgsen, tottime, and ptime86 are all zero) who was employed all four quarters in 1986? Does this seem reasonable?]

7.6 More on Policy Analysis and Program Evaluation

We have seen some examples of models containing dummy variables that can be useful for evaluating policy. Example 7.3 gave an example of program evaluation, where some firms received job training grants and others did not.

As we mentioned earlier, we must be careful when evaluating programs because in most examples in the social sciences the control and treatment groups are not randomly assigned. Consider again the Holzer et al. (1993) study, where we are now interested in the effect of the job training grants on worker productivity (as opposed to amount of job training). The equation of interest is

log(scrap) = β₀ + β₁grant + β₂log(sales) + β₃log(employ) + u,

where scrap is the firm's scrap rate, and the latter two variables are included as controls. The binary variable grant indicates whether the firm received a grant in 1988 for job training.

Before we look at the estimates, we might be worried that the unobserved factors affecting worker productivity, such as average levels of education, ability, experience, and tenure, might be correlated with whether the firm receives a grant. Holzer et al. point out that grants were given on a first-come, first-served basis. But this is not the same as giving out grants randomly. It might be that firms with less productive workers saw an opportunity to improve productivity and therefore were more diligent in applying for the grants.

Using the data in JTRAIN for 1988, when firms actually were eligible to receive the grants, we obtain

log(scrap) = 4.99 − .052 grant − .455 log(sales) + .639 log(employ)        (7.33)
(se: 4.66, .431, .373, .365)
n = 50, R² = .072.

Seventeen out of the 50 firms received a training grant, and the average scrap rate is 3.47 across all firms. The point estimate of −.052 on grant means that, for given sales and employ, firms receiving a grant have scrap rates about 5.2% lower than firms without grants. This is the direction of the expected effect if the training grants are effective, but the t statistic is very small. Thus, from this cross-sectional analysis, we must conclude that the grants had no effect on firm productivity. We will return to this example in Chapter 9 and show how adding information from a prior year leads to a much different conclusion.

Even in cases where the policy analysis does not involve assigning units to a control group and a treatment group, we must be careful to include factors that might be systematically related to the binary independent variable of interest. A good example of this is testing for racial discrimination. Race is something that is not determined by an individual or by government administrators.
In fact, race would appear to be the perfect example of an exogenous explanatory variable, given that it is determined at birth. However, for historical reasons, race is often related to other relevant factors: there are systematic differences in backgrounds across race, and these differences can be important in testing for current discrimination.

As an example, consider testing for discrimination in loan approvals. If we can collect data on, say, individual mortgage applications, then we can define the dummy dependent variable approved as equal to one if a mortgage application was approved, and zero otherwise. A systematic difference in approval rates across races is an indication of discrimination. However, since approval depends on many other factors, including income, wealth, credit ratings, and a general ability to pay back the loan, we must control for them if there are systematic differences in these factors across race. A linear probability model to test for discrimination might look like the following:

approved = β₀ + β₁nonwhite + β₂income + β₃wealth + β₄credrate + other factors.

Discrimination against minorities is indicated by a rejection of H₀: β₁ = 0 in favor of H₁: β₁ < 0, because β₁ is the amount by which the probability of a nonwhite getting an approval differs from the probability of a white getting an approval, given the same levels of other variables in the equation. If income, wealth, and so on are systematically different across races, then it is important to control for these factors in a multiple regression analysis.

Another problem that often arises in policy and program evaluation is that individuals (or firms or cities) choose whether or not to participate in certain behaviors or programs. For example, individuals choose to use illegal drugs or drink alcohol. If we want to examine the effects of such behaviors on unemployment status, earnings, or criminal behavior, we should be concerned that drug usage might be correlated with other factors that can affect employment and criminal outcomes. Children eligible for programs such as Head Start participate based on parental decisions. Since family background plays a role in Head Start decisions and affects student outcomes, we should control for these factors when examining the effects of Head Start [see, for example, Currie and Thomas (1995)]. Individuals selected by employers or government agencies to participate in job training programs can participate or not, and this decision is unlikely to be random [see, for example, Lynch (1992)]. Cities and states choose whether to implement certain gun control laws, and it is likely that this decision is systematically related to other factors that affect violent crime [see, for example, Kleck and Patterson (1993)].

The previous paragraph gives examples of what are generally known as self-selection problems in economics. Literally, the term comes from the fact that individuals self-select into certain behaviors or programs: participation is not randomly determined. The term is used generally when a binary indicator of participation might be systematically related to unobserved factors.
Another problem that often arises in policy and program evaluation is that individuals (or firms or cities) choose whether or not to participate in certain behaviors or programs. For example, individuals choose to use illegal drugs or drink alcohol. If we want to examine the effects of such behaviors on unemployment status, earnings, or criminal behavior, we should be concerned that drug usage might be correlated with other factors that can affect employment and criminal outcomes. Children eligible for programs such as Head Start participate based on parental decisions. Since family background plays a role in Head Start decisions and affects student outcomes, we should control for these factors when examining the effects of Head Start [see, for example, Currie and Thomas (1995)]. Individuals selected by employers or government agencies to participate in job training programs can participate or not, and this decision is unlikely to be random [see, for example, Lynch (1992)]. Cities and states choose whether to implement certain gun control laws, and it is likely that this decision is systematically related to other factors that affect violent crime [see, for example, Kleck and Patterson (1993)].

The previous paragraph gives examples of what are generally known as self-selection problems in economics. Literally, the term comes from the fact that individuals self-select into certain behaviors or programs: participation is not randomly determined. The term is used generally when a binary indicator of participation might be systematically related to unobserved factors. Thus, if we write the simple model

$$y = \beta_0 + \beta_1 partic + u, \qquad (7.34)$$

where y is an outcome variable and partic is a binary variable equal to unity if the individual, firm, or city participates in a behavior or a program, or has a certain kind of law, then we are worried that the average value of u depends on participation: $\mathrm{E}(u \mid partic = 1) \ne \mathrm{E}(u \mid partic = 0)$. As we know, this causes the simple regression estimator of $\beta_1$ to be biased, and so we will not uncover the true effect of participation. Thus, the self-selection problem is another way that an explanatory variable (partic in this case) can be endogenous.

By now we know that multiple regression analysis can, to some degree, alleviate the self-selection problem. Factors in the error term in (7.34) that are correlated with partic can be included in a multiple regression equation, assuming, of course, that we can collect data on these factors. Unfortunately, in many cases, we are worried that unobserved factors are related to participation, in which case multiple regression produces biased estimators.

With standard multiple regression analysis using cross-sectional data, we must be aware of finding spurious effects of programs on outcome variables due to the self-selection problem. A good example of this is contained in Currie and Cole (1993). These authors examine the effect of AFDC (Aid to Families with Dependent Children) participation on the birth weight of a child. Even after controlling for a variety of family and background characteristics, the authors obtain OLS estimates that imply participation in AFDC lowers birth weight. As the authors point out, it is hard to believe that AFDC participation itself causes lower birth weight. [See Currie (1995) for additional examples.] Using a different econometric method that we will discuss in Chapter 15, Currie and Cole find evidence for either no effect or a positive effect of AFDC participation on birth weight.

When the self-selection problem causes standard multiple regression analysis to be biased due to a lack of sufficient control variables, the more advanced methods covered in Chapters 13, 14, and 15 can be used instead.
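To see the mechanics of the bias in equation (7.34), here is a small simulation sketch (the design and all numbers are invented for illustration): participation is driven by an unobserved factor that also raises y, so $\mathrm{E}(u \mid partic = 1) > \mathrm{E}(u \mid partic = 0)$, and simple regression badly overstates a true effect of zero.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 5000

ability = rng.normal(size=n)                 # unobserved factor, part of u
partic = (ability + rng.normal(size=n) > 0).astype(float)  # self-selection
y = 1.0 + 0.0 * partic + ability + rng.normal(size=n)      # true effect: 0

res = sm.OLS(y, sm.add_constant(partic)).fit()
print(res.params[1])   # noticeably above 0: the self-selection bias
```

Including `ability` as a regressor would remove the bias here; the practical difficulty is that such factors are usually unobserved.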
7-7 Interpreting Regression Results with Discrete Dependent Variables

A binary response is the most extreme form of a discrete random variable: it takes on only two values, zero and one. As we discussed in Section 7-5, the parameters in a linear probability model can be interpreted as measuring the change in the probability that y = 1 due to a one-unit increase in an explanatory variable. We also discussed that, because y is a zero-one outcome, $\mathrm{P}(y = 1) = \mathrm{E}(y)$, and this equality continues to hold when we condition on explanatory variables.

Other discrete dependent variables arise in practice, and we have already seen some examples, such as the number of times someone is arrested in a given year (Example 3.5). Studies on factors affecting fertility often use the number of living children as the dependent variable in a regression analysis. As with number of arrests, the number of living children takes on a small set of integer values, and zero is a common value. The data in FERTIL2, which contains information on a large sample of women in Botswana, is one such example. Often demographers are interested in the effects of education on fertility, with special attention to trying to determine whether education has a causal effect on fertility. Such examples raise a question about how one interprets regression coefficients: after all, one cannot have a fraction of a child.

To illustrate the issues, the regression below uses the data in FERTIL2:

$$\widehat{children} = -1.997 + .175\,age - .090\,educ \qquad (7.35)$$
$$(.094)\qquad (.003)\qquad (.006)$$
$$n = 4{,}361,\quad R^2 = .560.$$

At this time, we ignore the issue of whether this regression adequately controls for all factors that affect fertility. Instead, we focus on interpreting the regression coefficients.

Consider the main coefficient of interest, $\hat\beta_{educ} = -.090$. If we take this estimate literally, it says that each additional year of education reduces the estimated number of children by .090, something obviously impossible for any particular woman. A similar problem arises when trying to interpret $\hat\beta_{age} = .175$. How can we make sense of these coefficients?

To interpret regression results generally, even in cases where y is discrete and takes on a small number of values, it is useful to remember the interpretation of OLS as estimating the effects of the $x_j$ on the expected (or average) value of y. Generally, under Assumptions MLR.1 and MLR.4,

$$\mathrm{E}(y \mid x_1, x_2, \ldots, x_k) = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k. \qquad (7.36)$$

Therefore, $\beta_j$ is the effect of a ceteris paribus increase of $x_j$ on the expected value of y. As we discussed in Section 6-4, for a given set of $x_j$ values, we interpret the predicted value, $\hat\beta_0 + \hat\beta_1 x_1 + \cdots + \hat\beta_k x_k$, as an estimate of $\mathrm{E}(y \mid x_1, x_2, \ldots, x_k)$. Therefore, $\hat\beta_j$ is our estimate of how the average of y changes when $\Delta x_j = 1$ (keeping other factors fixed).
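Equation (7.35), and the extension with an electricity dummy reported as equation (7.37) below, are plain OLS fits and are easy to reproduce. The sketch assumes the FERTIL2 data are available as a pandas DataFrame, here loaded through the community-maintained `wooldridge` package; that package and its `data()` loader are an assumption about your setup, not part of the text.

```python
import wooldridge                      # pip install wooldridge (assumed)
import statsmodels.formula.api as smf

fertil2 = wooldridge.data("fertil2")   # women in Botswana

eq_735 = smf.ols("children ~ age + educ", data=fertil2).fit()
eq_737 = smf.ols("children ~ age + educ + electric", data=fertil2).fit()

print(eq_735.params)   # should be close to equation (7.35)
print(eq_737.params)   # should be close to equation (7.37) below
```

Rows with a missing electricity indicator are dropped automatically by the formula interface, which is why (7.37) uses slightly fewer observations than (7.35).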
Seen in this light, we can now provide meaning to regression results as in equation (7.35). The coefficient $\hat\beta_{educ} = -.090$ means that we estimate that average fertility falls by .09 children given one more year of education. A nice way to summarize this interpretation is that if each woman in a group of 100 obtains another year of education, we estimate there will be nine fewer children among them.

Adding dummy variables to regressions when y is itself discrete causes no problems when we interpret the estimated effect in terms of average values. Using the data in FERTIL2, we get

$$\widehat{children} = -2.071 + .177\,age - .079\,educ - .362\,electric \qquad (7.37)$$
$$(.095)\qquad (.003)\qquad (.006)\qquad (.068)$$
$$n = 4{,}358,\quad R^2 = .562,$$

where electric is a dummy variable equal to one if the woman lives in a home with electricity. Of course, it cannot be true that a particular woman who has electricity has .362 fewer children than an otherwise comparable woman who does not. But we can say that when comparing 100 women with electricity to 100 women without, at the same age and level of education, we estimate the former group to have about 36 fewer children.

Incidentally, when y is discrete, the linear model does not always provide the best estimates of partial effects on $\mathrm{E}(y \mid x_1, x_2, \ldots, x_k)$. Chapter 17 contains more advanced models and estimation methods that tend to fit the data better when the range of y is limited in some substantive way. Nevertheless, a linear model estimated by OLS often provides a good approximation to the true partial effects, at least on average.

Summary

In this chapter, we have learned how to use qualitative information in regression analysis. In the simplest case, a dummy variable is defined to distinguish between two groups, and the coefficient estimate on the dummy variable estimates the ceteris paribus difference between the two groups. Allowing for more than two groups is accomplished by defining a set of dummy variables: if there are g groups, then g − 1 dummy variables are included in the model. All estimates on the dummy variables are interpreted relative to the base or benchmark group (the group for which no dummy variable is included in the model).

Dummy variables are also useful for incorporating ordinal information, such as a credit or a beauty rating, in regression models. We simply define a set of dummy variables representing different outcomes of the ordinal variable, allowing one of the categories to be the base group.

Dummy variables can be interacted with quantitative variables to allow slope differences across different groups. In the extreme case, we can allow each group to have its own slope on every variable, as well as its own intercept. The Chow test can be used to detect whether there are any differences across groups. In many cases, it is more interesting to test whether, after allowing for an intercept difference, the slopes for two different groups are the same. A standard F test can be used for this purpose, in an unrestricted model that includes interactions between the group dummy and all variables.

The linear probability model, which is simply estimated by OLS, allows us to explain a binary response using regression analysis. The OLS estimates are now interpreted as changes in the probability of "success" (y = 1), given a one-unit increase in the corresponding explanatory variable. The LPM does have some drawbacks: it can produce predicted probabilities that are less than zero or greater than one, it implies a constant marginal effect of each explanatory variable that appears in its original form, and it contains heteroskedasticity. The first two problems are often not serious when we are obtaining estimates of the partial effects of the explanatory variables for the middle ranges of the data. Heteroskedasticity does invalidate the usual OLS standard errors and test statistics, but, as we will see in the next chapter, this is easily fixed in large enough samples.

Section 7-6 provides a discussion of how binary variables are used to evaluate policies and programs. As in all regression analysis, we must remember that program participation, or some other binary regressor with policy implications, might be correlated with unobserved factors that affect the dependent variable, resulting in the usual omitted variables bias.

We ended this chapter with a general discussion of how to interpret regression equations when the dependent variable is discrete. The key is to remember that the coefficients can be interpreted as the effects on the expected value of the dependent variable.
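As a quick illustration of the first LPM drawback listed in the summary, the fitted values of any estimated linear probability model can be inspected directly, since nothing constrains them to the unit interval. In the sketch below, `lpm` stands for any fitted statsmodels LPM (for instance, the hypothetical loan-approval model sketched in Section 7-6).

```python
# `lpm` is any fitted statsmodels LPM; its fitted values are estimated
# probabilities, but OLS does not force them into [0, 1].
fitted = lpm.fittedvalues
outside = (fitted < 0) | (fitted > 1)
print(f"{outside.mean():.1%} of fitted probabilities lie outside [0, 1]")
```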
Key Terms

Base Group; Benchmark Group; Binary Variable; Chow Statistic; Control Group; Difference in Slopes; Dummy Variable Trap; Dummy Variables; Experimental Group; Interaction Term; Intercept Shift; Linear Probability Model (LPM); Ordinal Variable; Percent Correctly Predicted; Policy Analysis; Program Evaluation; Response Probability; Self-Selection; Treatment Group; Uncentered R-Squared; Zero-One Variable

Problems

1. Using the data in SLEEP75 (see also Problem 3 in Chapter 3), we obtain the estimated equation

$$\widehat{sleep} = 3{,}840.83 - .163\,totwrk - 11.71\,educ - 8.70\,age + .128\,age^2 + 87.75\,male$$
$$(235.11)\quad (.018)\quad (5.86)\quad (11.21)\quad (.134)\quad (34.33)$$
$$n = 706,\quad R^2 = .123,\quad \bar R^2 = .117.$$

The variable sleep is total minutes per week spent sleeping at night, totwrk is total weekly minutes spent working, educ and age are measured in years, and male is a gender dummy.
(i) All other factors being equal, is there evidence that men sleep more than women? How strong is the evidence?
(ii) Is there a statistically significant tradeoff between working and sleeping? What is the estimated tradeoff?
(iii) What other regression do you need to run to test the null hypothesis that, holding other factors fixed, age has no effect on sleeping?

2. The following equations were estimated using the data in BWGHT:

$$\widehat{\log(bwght)} = 4.66 - .0044\,cigs + .0093\,\log(faminc) + .016\,parity + .027\,male + .055\,white$$
$$(.22)\quad (.0009)\quad (.0059)\quad (.006)\quad (.010)\quad (.013)$$
$$n = 1{,}388,\quad R^2 = .0472$$
size ii Holding hsize fixed what is the estimated difference in SAT score between nonblack females and nonblack males How statistically significant is this estimated difference iii What is the estimated difference in SAT score between nonblack males and black males Test the null hypothesis that there is no difference between their scores against the alternative that there is a difference iv What is the estimated difference in SAT score between black females and nonblack females What would you need to do to test whether the difference is statistically significant 4 An equation explaining chief executive officer salary is log1salary2 5 459 1 257 log1sales2 1 011 roe 1 158 finance 1302 10322 10042 10892 1 181 consprod 2 283 utility 10852 10992 n 5 209 R2 5 357 The data used are in CEOSAL1 where finance consprod and utility are binary variables indicating the financial consumer products and utilities industries The omitted industry is transportation Copyright 2016 Cengage Learning All Rights Reserved May not be copied scanned or duplicated in whole or in part Due to electronic rights some third party content may be suppressed from the eBook andor eChapters Editorial review has deemed that any suppressed content does not materially affect the overall learning experience Cengage Learning reserves the right to remove additional content at any time if subsequent rights restrictions require it CHAPTeR 7 Multiple Regression Analysis with Qualitative Information 235 i Compute the approximate percentage difference in estimated salary between the utility and transportation industries holding sales and roe fixed Is the difference statistically significant at the 1 level ii Use equation 710 to obtain the exact percentage difference in estimated salary between the utility and transportation industries and compare this with the answer obtained in part i iii What is the approximate percentage difference in estimated salary between the consumer products and finance industries Write an equation that would allow you to test whether the difference is statistically significant 5 In Example 72 let noPC be a dummy variable equal to one if the student does not own a PC and zero otherwise i If noPC is used in place of PC in equation 76 what happens to the intercept in the estimated equation What will be the coefficient on noPC Hint Write PC 5 1 2 noPC and plug this into the equation colGPA 5 b 0 1 d 0PC 1 b 1hsGPA 1 b 2ACT ii What will happen to the Rsquared if noPC is used in place of PC iii Should PC and noPC both be included as independent variables in the model Explain 6 To test the effectiveness of a job training program on the subsequent wages of workers we specify the model log1wage2 5 b0 1 b1train 1 b2educ 1 b3exper 1 u where train is a binary variable equal to unity if a worker participated in the program Think of the error term u as containing unobserved worker ability If less able workers have a greater chance of being selected for the program and you use an OLS analysis what can you say about the likely bias in the OLS estimator of b1 Hint Refer back to Chapter 3 7 In the example in equation 729 suppose that we define outlf to be one if the woman is out of the labor force and zero otherwise i If we regress outlf on all of the independent variables in equation 729 what will happen to the intercept and slope estimates Hint inlf 5 1 2 outlf Plug this into the population equation inlf 5 b0 1 b1nwifeinc 1 b2educ 1 and rearrange ii What will happen to the standard errors on the intercept and slope estimates 
iii What will happen to the Rsquared 8 Suppose you collect data from a survey on wages education experience and gender In addition you ask for information about marijuana usage The original question is On how many separate occasions last month did you smoke marijuana i Write an equation that would allow you to estimate the effects of marijuana usage on wage while controlling for other factors You should be able to make statements such as Smoking marijuana five more times per month is estimated to change wage by x ii Write a model that would allow you to test whether drug usage has different effects on wages for men and women How would you test that there are no differences in the effects of drug usage for men and women iii Suppose you think it is better to measure marijuana usage by putting people into one of four categories nonuser light user 1 to 5 times per month moderate user 6 to 10 times per month and heavy user more than 10 times per month Now write a model that allows you to estimate the effects of marijuana usage on wage iv Using the model in part iii explain in detail how to test the null hypothesis that marijuana usage has no effect on wage Be very specific and include a careful listing of degrees of freedom v What are some potential problems with drawing causal inference using the survey data that you collected Copyright 2016 Cengage Learning All Rights Reserved May not be copied scanned or duplicated in whole or in part Due to electronic rights some third party content may be suppressed from the eBook andor eChapters Editorial review has deemed that any suppressed content does not materially affect the overall learning experience Cengage Learning reserves the right to remove additional content at any time if subsequent rights restrictions require it 236 PART 1 Regression Analysis with CrossSectional Data 9 Let d be a dummy binary variable and let z be a quantitative variable Consider the model y 5 b0 1 d0d 1 b1z 1 d1d z 1 u this is a general version of a model with an interaction between a dummy variable and a quantitative variable An example is in equation 717 i Since it changes nothing important set the error to zero u 5 0 Then when d 5 0 we can write the relationship between y and z as the function f01z2 5 b0 1 b1z Write the same relationship when d 5 1 where you should use f11z2 on the lefthand side to denote the linear function of z ii Assuming that d1 2 0 which means the two lines are not parallel show that the value of z such that f01z2 5 f11z2 is z 5 2d0d1 This is the point at which the two lines intersect as in Figure 72 b Argue that z is positive if and only if d0 and d1 have opposite signs iii Using the data in TWOYEAR the following equation can be estimated log1wage2 5 2289 2 357 female 1 50 totcoll 1 030 female totcoll 100112 10152 10032 10052 n 5 6763 R2 5 202 where all coefficients and standard errors have been rounded to three decimal places Using this equation find the value of totcoll such that the predicted values of logwage are the same for men and women iv Based on the equation in part iii can women realistically get enough years of college so that their earnings catch up to those of men Explain 10 For a child i living in a particular school district let voucheri be a dummy variable equal to one if a child is selected to participate in a school voucher program and let scorei be that childs score on a subsequent standardized exam Suppose that the participation variable voucheri is completely ran domized in the sense that it is independent of both observed 
and unobserved factors that can affect the test score i If you run a simple regression scorei on voucheri using a random sample of size n does the OLS estimator provide an unbiased estimator of the effect of the voucher program ii Suppose you can collect additional background information such as family income family structure eg whether the child lives with both parents and parents education levels Do you need to control for these factors to obtain an unbiased estimator of the effects of the voucher program Explain iii Why should you include the family background variables in the regression Is there a situation in which you would not include the background variables 11 The following equations were estimated using the data in ECONMATH with standard errors reported under coefficients The average class score measured as a percentage is about 722 exactly 50 of the students are male and the average of colgpa grade point average at the start of the term is about 281 score 5 3231 1 1432 colgpa 12002 10702 n 5 856 R2 5 329 R2 5 328 score 5 2966 1 383 male 1 1457 colgpa 12042 10742 10692 n 5 856 R2 5 349 R2 5 348 Copyright 2016 Cengage Learning All Rights Reserved May not be copied scanned or duplicated in whole or in part Due to electronic rights some third party content may be suppressed from the eBook andor eChapters Editorial review has deemed that any suppressed content does not materially affect the overall learning experience Cengage Learning reserves the right to remove additional content at any time if subsequent rights restrictions require it CHAPTER 7 Multiple Regression Analysis with Qualitative Information 237 score 5 3036 1 247 male 1 1433 colgpa 1 0479 male colgpa 12862 13962 10982 113832 n 5 856 R2 5 349 R2 5 347 score 5 3036 1 382 male 1 1433 colgpa 1 0479 male 1colgpa 2 2812 12862 10742 10982 113832 n 5 856 R2 5 349 R2 5 347 i Interpret the coefficient on male in the second equation and construct a 95 confidence interval for bmale Does the confidence interval exclude zero ii In the second equation how come the estimate on male is so imprecise Should we now conclude that there are no gender differences in score after controlling for colgpa Hint You might want to compute an F statistic for the null hypothesis that there is no gender difference in the model with the interaction iii Compared with the third equation how come the coefficient on male in the last equation is so much closer to that in the second equation and just as precisely estimated Computer Exercises C1 Use the data in GPA1 for this exercise i Add the variables mothcoll and fathcoll to the equation estimated in 76 and report the results in the usual form What happens to the estimated effect of PC ownership Is PC still statistically significant ii Test for joint significance of mothcoll and fathcoll in the equation from part i and be sure to report the pvalue iii Add hsGPA2 to the model from part i and decide whether this generalization is needed C2 Use the data in WAGE2 for this exercise i Estimate the model log1wage2 5 b0 1 b1educ 1 b2exper 1 b3tenure 1 b4married 1 b5black 1 b6south 1 b7urban 1 u and report the results in the usual form Holding other factors fixed what is the approximate difference in monthly salary between blacks and nonblacks Is this difference statistically significant ii Add the variables exper2 and tenure2 to the equation and show that they are jointly insignificant at even the 20 level iii Extend the original model to allow the return to education to depend on race and test whether the return to 
education does depend on race iv Again start with the original model but now allow wages to differ across four groups of people married and black married and nonblack single and black and single and nonblack What is the estimated wage differential between married blacks and married nonblacks C3 A model that allows major league baseball player salary to differ by position is log1salary2 5 b0 1 b1 years 1 b2 gamesyr 1 b3 bavg 1 b4 hrunsyr 1 b5 rbisyr 1 b6 runsyr 1 b7 fldperc 1 b8 allstar 1 b9 frstbase 1 b10scndbase 1 b11thrdbase 1 b12 shrtstop 1 b13 catcher 1 u where outfield is the base group Copyright 2016 Cengage Learning All Rights Reserved May not be copied scanned or duplicated in whole or in part Due to electronic rights some third party content may be suppressed from the eBook andor eChapters Editorial review has deemed that any suppressed content does not materially affect the overall learning experience Cengage Learning reserves the right to remove additional content at any time if subsequent rights restrictions require it 238 PART 1 Regression Analysis with CrossSectional Data i State the null hypothesis that controlling for other factors catchers and outfielders earn on average the same amount Test this hypothesis using the data in MLB1 and comment on the size of the estimated salary differential ii State and test the null hypothesis that there is no difference in average salary across positions once other factors have been controlled for iii Are the results from parts i and ii consistent If not explain what is happening C4 Use the data in GPA2 for this exercise i Consider the equation colgpa 5 b0 1 b1hsize 1 b2hsize2 1 b3hsperc 1 b4sat 1 b5female 1 b6athlete 1 u where colgpa is cumulative college grade point average hsize is size of high school graduating class in hundreds hsperc is academic percentile in graduating class sat is combined SAT score female is a binary gender variable and athlete is a binary variable which is one for studentathletes What are your expectations for the coefficients in this equation Which ones are you unsure about ii Estimate the equation in part i and report the results in the usual form What is the estimated GPA differential between athletes and nonathletes Is it statistically significant iii Drop sat from the model and reestimate the equation Now what is the estimated effect of being an athlete Discuss why the estimate is different than that obtained in part ii iv In the model from part i allow the effect of being an athlete to differ by gender and test the null hypothesis that there is no ceteris paribus difference between women athletes and women nonathletes v Does the effect of sat on colgpa differ by gender Justify your answer C5 In Problem 2 in Chapter 4 we added the return on the firms stock ros to a model explaining CEO salary ros turned out to be insignificant Now define a dummy variable rosneg which is equal to one if ros 0 and equal to zero if ros 0 Use CEOSAL1 to estimate the model log1salary2 5 b0 1 b1log1sales2 1 b2roe 1 b3rosneg 1 u Discuss the interpretation and statistical significance of b 3 C6 Use the data in SLEEP75 for this exercise The equation of interest is sleep 5 b0 1 b1totwrk 1 b2educ 1 b3age 1 b4age2 1 b5yngkid 1 u i Estimate this equation separately for men and women and report the results in the usual form Are there notable differences in the two estimated equations ii Compute the Chow test for equality of the parameters in the sleep equation for men and women Use the form of the test that adds male and the interaction 
terms maletotwrk maleyngkid and uses the full set of observations What are the relevant df for the test Should you reject the null at the 5 level iii Now allow for a different intercept for males and females and determine whether the interaction terms involving male are jointly significant iv Given the results from parts ii and iii what would be your final model C7 Use the data in WAGE1 for this exercise i Use equation 718 to estimate the gender differential when educ 5 125 Compare this with the estimated differential when educ 5 0 Copyright 2016 Cengage Learning All Rights Reserved May not be copied scanned or duplicated in whole or in part Due to electronic rights some third party content may be suppressed from the eBook andor eChapters Editorial review has deemed that any suppressed content does not materially affect the overall learning experience Cengage Learning reserves the right to remove additional content at any time if subsequent rights restrictions require it CHAPTeR 7 Multiple Regression Analysis with Qualitative Information 239 ii Run the regression used to obtain 718 but with female 1educ 2 1252 replacing femaleeduc How do you interpret the coefficient on female now iii Is the coefficient on female in part ii statistically significant Compare this with 718 and comment C8 Use the data in LOANAPP for this exercise The binary variable to be explained is approve which is equal to one if a mortgage loan to an individual was approved The key explanatory variable is white a dummy variable equal to one if the applicant was white The other applicants in the data set are black and Hispanic To test for discrimination in the mortgage loan market a linear probability model can be used approve 5 b0 1 b1white 1 other factors i If there is discrimination against minorities and the appropriate factors have been controlled for what is the sign of b1 ii Regress approve on white and report the results in the usual form Interpret the coefficient on white Is it statistically significant Is it practically large iii As controls add the variables hrat obrat loanprc unem male married dep sch cosign chist pubrec mortlat1 mortlat2 and vr What happens to the coefficient on white Is there still evidence of discrimination against nonwhites iv Now allow the effect of race to interact with the variable measuring other obligations as a percentage of income obrat Is the interaction term significant v Using the model from part iv what is the effect of being white on the probability of approval when obrat 5 32 which is roughly the mean value in the sample Obtain a 95 confidence interval for this effect C9 There has been much interest in whether the presence of 401k pension plans available to many US workers increases net savings The data set 401KSUBS contains information on net financial assets nettfa family income inc a binary variable for eligibility in a 401k plan e401k and several other variables i What fraction of the families in the sample are eligible for participation in a 401k plan ii Estimate a linear probability model explaining 401k eligibility in terms of income age and gender Include income and age in quadratic form and report the results in the usual form iii Would you say that 401k eligibility is independent of income and age What about gender Explain iv Obtain the fitted values from the linear probability model estimated in part ii Are any fitted values negative or greater than one v Using the fitted values e401ki from part iv define e401ki 5 1 if e401ki 5 and e401ki 5 0 if e401ki 5 Out of 9275 
families how many are predicted to be eligible for a 401k plan vi For the 5638 families not eligible for a 401k what percentage of these are predicted not to have a 401k using the predictor e401ki For the 3637 families eligible for a 401k plan what percentage are predicted to have one It is helpful if your econometrics package has a tabulate command vii The overall percent correctly predicted is about 649 Do you think this is a complete description of how well the model does given your answers in part vi viii Add the variable pira as an explanatory variable to the linear probability model Other things equal if a family has someone with an individual retirement account how much higher is the estimated probability that the family is eligible for a 401k plan Is it statistically different from zero at the 10 level Copyright 2016 Cengage Learning All Rights Reserved May not be copied scanned or duplicated in whole or in part Due to electronic rights some third party content may be suppressed from the eBook andor eChapters Editorial review has deemed that any suppressed content does not materially affect the overall learning experience Cengage Learning reserves the right to remove additional content at any time if subsequent rights restrictions require it 240 PART 1 Regression Analysis with CrossSectional Data C10 Use the data in NBASAL for this exercise i Estimate a linear regression model relating points per game to experience in the league and position guard forward or center Include experience in quadratic form and use centers as the base group Report the results in the usual form ii Why do you not include all three position dummy variables in part i iii Holding experience fixed does a guard score more than a center How much more Is the difference statistically significant iv Now add marital status to the equation Holding position and experience fixed are married players more productive based on points per game v Add interactions of marital status with both experience variables In this expanded model is there strong evidence that marital status affects points per game vi Estimate the model from part iv but use assists per game as the dependent variable Are there any notable differences from part iv Discuss C11 Use the data in 401KSUBS for this exercise i Compute the average standard deviation minimum and maximum values of nettfa in the sample ii Test the hypothesis that average nettfa does not differ by 401k eligibility status use a two sided alternative What is the dollar amount of the estimated difference iii From part ii of Computer Exercise C9 it is clear that e401k is not exogenous in a simple regression model at a minimum it changes by income and age Estimate a multiple linear regression model for nettfa that includes income age and e401k as explanatory variables The income and age variables should appear as quadratics Now what is the estimated dollar effect of 401k eligibility iv To the model estimated in part iii add the interactions e401k 1age 2 412 and e401k 1age 2 412 2 Note that the average age in the sample is about 41 so that in the new model the coefficient on e401k is the estimated effect of 401k eligibility at the average age Which interaction term is significant v Comparing the estimates from parts iii and iv do the estimated effects of 401k eligibility at age 41 differ much Explain vi Now drop the interaction terms from the model but define five family size dummy variables fsize1 fsize2 fsize3 fsize4 and fsize5 The variable fsize5 is unity for families with five or more 
members Include the family size dummies in the model estimated from part iii be sure to choose a base group Are the family dummies significant at the 1 level vii Now do a Chow test for the model nettfa 5 b0 1 b1inc 1 b2inc2 1 b3age 1 b4age2 1 b5e401k 1 u across the five family size categories allowing for intercept differences The restricted sum of squared residuals SSRr is obtained from part vi because that regression assumes all slopes are the same The unrestricted sum of squared residuals is SSRur 5 SSR1 1 SSR2 1 p 1 SSR5 where SSRf is the sum of squared residuals for the equation estimated using only family size f You should convince yourself that there are 30 parameters in the unrestricted model 5 intercepts plus 25 slopes and 10 parameters in the restricted model 5 intercepts plus 5 slopes Therefore the number of restrictions being tested is q 5 20 and the df for the unrestricted model is 9275 2 30 5 9245 C12 Use the data set in BEAUTY which contains a subset of the variables but more usable observations than in the regressions reported by Hamermesh and Biddle 1994 i Find the separate fractions of men and women that are classified as having above average looks Are more people rated as having above average or below average looks Copyright 2016 Cengage Learning All Rights Reserved May not be copied scanned or duplicated in whole or in part Due to electronic rights some third party content may be suppressed from the eBook andor eChapters Editorial review has deemed that any suppressed content does not materially affect the overall learning experience Cengage Learning reserves the right to remove additional content at any time if subsequent rights restrictions require it CHAPTeR 7 Multiple Regression Analysis with Qualitative Information 241 ii Test the null hypothesis that the population fractions of aboveaveragelooking women and men are the same Report the onesided pvalue that the fraction is higher for women Hint Estimating a simple linear probability model is easiest iii Now estimate the model log1wage2 5 b0 1 b1belavg 1 b2abvavg 1 u separately for men and women and report the results in the usual form In both cases interpret the coefficient on belavg Explain in words what the hypothesis H0 b1 5 0 against H1 b1 0 means and find the pvalues for men and women iv Is there convincing evidence that women with above average looks earn more than women with average looks Explain v For both men and women add the explanatory variables educ exper exper2 union goodhlth black married south bigcity smllcity and service Do the effects of the looks variables change in important ways vi Use the SSR form of the Chow F statistic to test whether the slopes of the regression functions in part v differ across men and women Be sure to allow for an intercept shift under the null C13 Use the data in APPLE to answer this question i Define a binary variable as ecobuy 5 1 if ecolbs 0 and ecobuy 5 0 if ecolbs 5 0 In other words ecobuy indicates whether at the prices given a family would buy any ecologically friendly apples What fraction of families claim they would buy ecolabeled apples ii Estimate the linear probability model ecobuy 5 b0 1 b1ecoprc 1 b2 regprc 1 b3 faminc 1 b4 hhsize 1 b5 educ 1 b6 age 1 u and report the results in the usual form Carefully interpret the coefficients on the price variables iii Are the nonprice variables jointly significant in the LPM Use the usual F statistic even though it is not valid when there is heteroskedasticity Which explanatory variable other than the price variables 
seems to have the most important effect on the decision to buy ecolabeled apples Does this make sense to you iv In the model from part ii replace faminc with logfaminc Which model fits the data better using faminc or logfaminc Interpret the coefficient on logfaminc v In the estimation in part iv how many estimated probabilities are negative How many are bigger than one Should you be concerned vi For the estimation in part iv compute the percent correctly predicted for each outcome ecobuy 5 0 and ecobuy 5 1 Which outcome is best predicted by the model C14 Use the data in CHARITY to answer this question The variable respond is a dummy variable equal to one if a person responded with a contribution on the most recent mailing sent by a charitable organiza tion The variable resplast is a dummy variable equal to one if the person responded to the previous mailing avggift is the average of past gifts in Dutch guilders and propresp is the proportion of times the person has responded to past mailings i Estimate a linear probability model relating respond to resplast and avggift Report the results in the usual form and interpret the coefficient on resplast ii Does the average value of past gifts seem to affect the probability of responding iii Add the variable propresp to the model and interpret its coefficient Be careful here an increase of one in propresp is the largest possible change iv What happened to the coefficient on resplast when propresp was added to the regression Does this make sense v Add mailsyear the number of mailings per year to the model How big is its estimated effect Why might this not be a good estimate of the causal effect of mailings on responding Copyright 2016 Cengage Learning All Rights Reserved May not be copied scanned or duplicated in whole or in part Due to electronic rights some third party content may be suppressed from the eBook andor eChapters Editorial review has deemed that any suppressed content does not materially affect the overall learning experience Cengage Learning reserves the right to remove additional content at any time if subsequent rights restrictions require it 242 PART 1 Regression Analysis with CrossSectional Data C15 Use the data in FERTIL2 to answer this question i Find the smallest and largest values of children in the sample What is the average of children Does any woman have exactly the average number of children ii What percentage of women have electricity in the home iii Compute the average of children for those without electricity and do the same for those with electricity Comment on what you find Test whether the population means are the same using a simple regression iv From part iii can you infer that having electricity causes women to have fewer children Explain v Estimate a multiple regression model of the kind reported in equation 737 but add age2 urban and the three religious affiliation dummies How does the estimated effect of having electricity compare with that in part iii Is it still statistically significant vi To the equation in part v add an interaction between electric and educ Is its coefficient statistically significant What happens to the coefficient on electric vii The median and mode value for educ is 7 In the equation from part vi use the centered interaction term electric 1educ 2 72 in place of electric educ What happens to the coef ficient on electric compared with part vi Why How does the coefficient on electric compare with that in part v C16 Use the data in CATHOLIC to answer this question i In the entire sample 
what percentage of the students attend a Catholic high school What is the average of math12 in the entire sample ii Run a simple regression of math12 on cathhs and report the results in the usual way Interpret what you have found iii Now add the variables lfaminc motheduc and fatheduc to the regression from part ii How many observations are used in the regression What happens to the coefficient on cathhs along with its statistical significance iv Return to the simple regression of math12 on cathhs but restrict the regression to observations used in the multiple regression from part iii Do any important conclusions change v To the multiple regression in part iii add interactions between cathhs and each of the other explanatory variables Are the interaction terms individually or jointly significant vi What happens to the coefficient on cathhs in the regression from part v Explain why this coefficient is not very interesting vii Compute the average partial effect of cathhs in the model estimated in part v How does it compare with the coefficients on cathhs in parts iii and v Copyright 2016 Cengage Learning All Rights Reserved May not be copied scanned or duplicated in whole or in part Due to electronic rights some third party content may be suppressed from the eBook andor eChapters Editorial review has deemed that any suppressed content does not materially affect the overall learning experience Cengage Learning reserves the right to remove additional content at any time if subsequent rights restrictions require it 243 T he homoskedasticity assumption introduced in Chapter 3 for multiple regression states that the variance of the unobserved error u conditional on the explanatory variables is constant Homoskedasticity fails whenever the variance of the unobserved factors changes across dif ferent segments of the population where the segments are determined by the different values of the explanatory variables For example in a savings equation heteroskedasticity is present if the variance of the unobserved factors affecting savings increases with income In Chapters 4 and 5 we saw that homoskedasticity is needed to justify the usual t tests F tests and confidence intervals for OLS estimation of the linear regression model even with large sample sizes In this chapter we discuss the available remedies when heteroskedasticity occurs and we also show how to test for its presence We begin by briefly reviewing the consequences of heteroskedastic ity for ordinary least squares estimation 81 Consequences of Heteroskedasticity for OLS Consider again the multiple linear regression model y 5 b0 1 b1x1 1 b2x2 1 p 1 bkxk 1 u 81 In Chapter 3 we proved unbiasedness of the OLS estimators b 0 b 1 b 2 p b k under the first four GaussMarkov assumptions MLR1 through MLR4 In Chapter 5 we showed that the same four assumptions imply consistency of OLS The homoskedasticity assumption MLR5 stated in terms of the error variance as Var1u0x1 x2 p xk2 5 s2 played no role in showing whether OLS was unbiased c h a p t e r 8 Heteroskedasticity Copyright 2016 Cengage Learning All Rights Reserved May not be copied scanned or duplicated in whole or in part Due to electronic rights some third party content may be suppressed from the eBook andor eChapters Editorial review has deemed that any suppressed content does not materially affect the overall learning experience Cengage Learning reserves the right to remove additional content at any time if subsequent rights restrictions require it PART 1 Regression Analysis with CrossSectional 
Data 244 or consistent It is important to remember that heteroskedasticity does not cause bias or inconsistency in the OLS estimators of the bj whereas something like omitting an important variable would have this effect The interpretation of our goodnessoffit measures R2 and R2 is also unaffected by the pres ence of heteroskedasticity Why Recall from Section 63 that the usual Rsquared and the adjusted Rsquared are different ways of estimating the population Rsquared which is simply 1 2 s2 us2 y where s2 u is the population error variance and s2 y is the population variance of y The key point is that because both variances in the population Rsquared are unconditional variances the population Rsquared is unaffected by the presence of heteroskedasticity in Var1u0x1 p xk2 Further SSRn consistently estimates s2 u and SSTn consistently estimates s2 y whether or not Var1u0x1 p xk2 is constant The same is true when we use the degrees of freedom adjustments Therefore R2 and R2 are both consistent estimators of the population Rsquared whether or not the homoskedasticity assumption holds If heteroskedasticity does not cause bias or inconsistency in the OLS estimators why did we introduce it as one of the GaussMarkov assumptions Recall from Chapter 3 that the estimators of the variances Var1b j2 are biased without the homoskedasticity assumption Since the OLS standard errors are based directly on these variances they are no longer valid for constructing confidence in tervals and t statistics The usual OLS t statistics do not have t distributions in the presence of heter oskedasticity and the problem is not resolved by using large sample sizes We will see this explicitly for the simple regression case in the next section where we derive the variance of the OLS slope estimator under heteroskedasticity and propose a valid estimator in the presence of heteroskedasticity Similarly F statistics are no longer F distributed and the LM statistic no longer has an asymptotic chisquare distribution In summary the statistics we used to test hypotheses under the GaussMarkov assumptions are not valid in the presence of heteroskedasticity We also know that the GaussMarkov Theorem which says that OLS is best linear unbiased relies crucially on the homoskedasticity assumption If Var1u0x2 is not constant OLS is no longer BLUE In addition OLS is no longer asymptotically efficient in the class of estimators described in Theorem 53 As we will see in Section 84 it is possible to find estimators that are more efficient than OLS in the presence of heteroskedasticity although it requires knowing the form of the heter oskedasticity With relatively large sample sizes it might not be so important to obtain an efficient estimator In the next section we show how the usual OLS test statistics can be modified so that they are valid at least asymptotically 82 HeteroskedasticityRobust Inference after OLS Estimation Because testing hypotheses is such an important component of any econometric analysis and the usual OLS inference is generally faulty in the presence of heteroskedasticity we must decide if we should en tirely abandon OLS Fortunately OLS is still useful In the last two decades econometricians have learned how to adjust standard errors and t F and LM statistics so that they are valid in the presence of heteroske dasticity of unknown form This is very convenient because it means we can report new statistics that work regardless of the kind of heteroskedasticity present in the population The methods in this section are known 
as heteroskedasticityrobust procedures because they are validat least in large sampleswhether or not the errors have constant variance and we do not need to know which is the case We begin by sketching how the variances Var1b j2 can be estimated in the presence of heteroske dasticity A careful derivation of the theory is well beyond the scope of this text but the application of heteroskedasticityrobust methods is very easy now because many statistics and econometrics pack ages compute these statistics as an option First consider the model with a single independent variable where we include an i subscript for emphasis yi 5 b0 1 b1xi 1 ui Copyright 2016 Cengage Learning All Rights Reserved May not be copied scanned or duplicated in whole or in part Due to electronic rights some third party content may be suppressed from the eBook andor eChapters Editorial review has deemed that any suppressed content does not materially affect the overall learning experience Cengage Learning reserves the right to remove additional content at any time if subsequent rights restrictions require it CHAPTER 8 Heteroskedasticity 245 We assume throughout that the first four GaussMarkov assumptions hold If the errors contain heter oskedasticity then Var1ui0xi2 5 s2 i where we put an i subscript on s2 to indicate that the variance of the error depends upon the particular value of xi Write the OLS estimator as b 1 5 b1 1 a n i51 1xi 2 x2ui a n i51 1xi 2 x2 2 Under Assumptions MLR1 through MLR4 that is without the homoskedasticity assumption and conditioning on the values xi in the sample we can use the same arguments from Chapter 2 to show that Var1b 12 5 a n i51 1xi 2 x2 2s2 i SST2 x 82 where SSTx 5 g n i511xi 2 x2 2 is the total sum of squares of the xi When s2 i 5 s2 for all i this formula reduces to the usual form s2SSTx Equation 82 explicitly shows that for the simple regression case the variance formula derived under homoskedasticity is no longer valid when heteroskedasticity is present Since the standard error of b 1 is based directly on estimating Var1b 12 we need a way to estimate equation 82 when heteroskedasticity is present White 1980 showed how this can be done Let u i denote the OLS residuals from the initial regression of y on x Then a valid estimator of Var1b 12 for heteroskedasticity of any form including homoskedasticity is a n i51 1xi 2 x2 2u 2 i SST2 x 83 which is easily computed from the data after the OLS regression In what sense is 83 a valid estimator of Var1b 12 This is pretty subtle Briefly it can be shown that when equation 83 is multiplied by the sample size n it converges in probability to E3 1xi 2 mx2 2u2 i 41s2 x2 2 which is the probability limit of n times 82 Ultimately this is what is neces sary for justifying the use of standard errors to construct confidence intervals and t statistics The law of large numbers and the central limit theorem play key roles in establishing these convergences You can refer to Whites original paper for details but that paper is quite technical See also Wooldridge 2010 Chapter 4 A similar formula works in the general multiple regression model y 5 b0 1 b1x1 1 p 1 bkxk 1 u It can be shown that a valid estimator of Var1b j2 under Assumptions MLR1 through MLR4 is Var1b j2 5 a n i51 r2 iju 2 i SSR2 j 84 Copyright 2016 Cengage Learning All Rights Reserved May not be copied scanned or duplicated in whole or in part Due to electronic rights some third party content may be suppressed from the eBook andor eChapters Editorial review has deemed that any suppressed 
content does not materially affect the overall learning experience Cengage Learning reserves the right to remove additional content at any time if subsequent rights restrictions require it PART 1 Regression Analysis with CrossSectional Data 246 where rij denotes the ith residual from regressing xj on all other independent variables and SSRj is the sum of squared residuals from this regression see Section 32 for the partialling out representation of the OLS estimates The square root of the quantity in 84 is called the heteroskedasticityrobust stand ard error for b j In econometrics these robust standard errors are usually attributed to White 1980 Earlier works in statistics notably those by Eicker 1967 and Huber 1967 pointed to the possibility of obtaining such robust standard errors In applied work these are sometimes called White Huber or Eicker standard errors or some hyphenated combination of these names We will just refer to them as heteroskedasticityrobust standard errors or even just robust standard errors when the context is clear Sometimes as a degrees of freedom correction 84 is multiplied by n1n 2 k 2 12 before taking the square root The reasoning for this adjustment is that if the squared OLS residuals u 2 i were the same for all observations ithe strongest possible form of homoskedasticity in a samplewe would get the usual OLS standard errors Other modifications of 84 are studied in MacKinnon and White 1985 Since all forms have only asymptotic justification and they are asymptotically equivalent no form is uniformly pre ferred above all others Typically we use whatever form is computed by the regression package at hand Once heteroskedasticityrobust standard errors are obtained it is simple to construct a heteroskedasticityrobust t statistic Recall that the general form of the t statistic is t 5 estimate 2 hypothesized value standard error 85 Because we are still using the OLS estimates and we have chosen the hypothesized value ahead of time the only difference between the usual OLS t statistic and the heteroskedasticityrobust t statistic is in how the standard error in the denominator is computed The term SSRj in equation 84 can be replaced with SSTj11 2 R2 j 2 where SSTj is the total sum of squares of xj and R2 j is the usual Rsquared from regressing xj on all other explanatory vari ables We implicitly used this equivalence in deriving equation 351 Consequently little sample variation in xj or a strong linear relationship between xj and the other explanatory variablesthat is multicollinearitycan cause the heteroskedasticityrobust standard errors to be large We discussed these issues with the usual OLS standard errors in Section 34 ExamplE 81 log Wage Equation with HeteroskedasticityRobust Standard Errors We estimate the model in Example 76 but we report the heteroskedasticityrobust standard errors along with the usual OLS standard errors Some of the estimates are reported to more digits so that we can compare the usual standard errors with the heteroskedasticityrobust standard errors log1wage2 5 321 1 213 marrmale 2 198 marrfem 2 110 singfem 11002 10552 10582 10562 31094 30574 30584 30574 1 0789 educ 1 0268 exper 2 00054 exper2 100672 100552 1000112 86 300744 300514 3000114 1 0291 tenure 2 00053 tenure2 100682 1000232 300694 3000244 n 5 526 R2 5 461 The usual OLS standard errors are in parentheses below the corresponding OLS estimate and the heteroskedasticityrobust standard errors are in brackets The numbers in brackets are the only new things since the equation is still 
estimated by OLS Copyright 2016 Cengage Learning All Rights Reserved May not be copied scanned or duplicated in whole or in part Due to electronic rights some third party content may be suppressed from the eBook andor eChapters Editorial review has deemed that any suppressed content does not materially affect the overall learning experience Cengage Learning reserves the right to remove additional content at any time if subsequent rights restrictions require it CHAPTER 8 Heteroskedasticity 247 Several things are apparent from equation 86 First in this particular application any variable that was statistically significant using the usual t statistic is still statistically significant using the heteroskedasticityrobust t statistic This occurs because the two sets of standard errors are not very different The associated pvalues will differ slightly because the robust t statistics are not identical to the usual nonrobust t statistics The largest relative change in standard errors is for the coefficient on educ the usual standard error is 0067 and the robust standard error is 0074 Still the robust stand ard error implies a robust t statistic above 10 Equation 86 also shows that the robust standard errors can be either larger or smaller than the usual standard errors For example the robust standard error on exper is 0051 whereas the usual standard error is 0055 We do not know which will be larger ahead of time As an empirical matter the robust standard errors are often found to be larger than the usual standard errors Before leaving this example we must emphasize that we do not know at this point whether het eroskedasticity is even present in the population model underlying equation 86 All we have done is report along with the usual standard errors those that are valid asymptotically whether or not heter oskedasticity is present We can see that no important conclusions are overturned by using the robust standard errors in this example This often happens in applied work but in other cases the differences between the usual and robust standard errors are much larger As an example of where the differences are substantial see Computer Exercise C2 At this point you may be asking the following question if the heteroskedasticityrobust stand ard errors are valid more often than the usual OLS standard errors why do we bother with the usual standard errors at all This is a sensible question One reason the usual standard errors are still used in crosssectional work is that if the homoskedasticity assumption holds and the errors are normally distributed then the usual t statistics have exact t distributions regardless of the sample size see Chapter 4 The robust standard errors and robust t statistics are justified only as the sample size be comes large even if the CLM assumptions are true With small sample sizes the robust t statistics can have distributions that are not very close to the t distribution and that could throw off our inference In large sample sizes we can make a case for always reporting only the heteroskedasticityrobust standard errors in crosssectional applications and this practice is being followed more and more in applied work It is also common to report both standard errors as in equation 86 so that a reader can determine whether any conclusions are sensitive to the standard error in use It is also possible to obtain F and LM statistics that are robust to heteroskedasticity of an un known arbitrary form The heteroskedasticityrobust F statistic or a simple transformation of it is also 
Again, the differences between the usual standard errors and the heteroskedasticity-robust standard errors are not very big, and use of the robust t statistics does not change the statistical significance of any independent variable. Joint significance tests are not much affected either. Suppose we wish to test the null hypothesis that, after the other factors are controlled for, there are no differences in cumgpa by race. This is stated as H0: βblack = 0, βwhite = 0. The usual F statistic is easily obtained, once we have the R-squared from the restricted model; this turns out to be .3983. The F statistic is then [(.4006 − .3983)/(1 − .4006)](359/2) ≈ .69. If heteroskedasticity is present, this version of the test is invalid. The heteroskedasticity-robust version has no simple form, but it can be computed using certain statistical packages. The value of the heteroskedasticity-robust F statistic turns out to be .75, which differs only slightly from the nonrobust version. The p-value for the robust test is .474, which is not close to standard significance levels. We fail to reject the null hypothesis using either test.

Because the usual sum of squared residuals form of the F statistic is not valid under heteroskedasticity, we must be careful in computing a Chow test of common coefficients across two groups. The form of the statistic in equation (7.24) is not valid if heteroskedasticity is present, including the simple case where the error variance differs across the two groups. Instead, we can obtain a heteroskedasticity-robust Chow test by including a dummy variable distinguishing the two groups, along with interactions between that dummy variable and all other explanatory variables. We can then test whether there is no difference in the two regression functions (by testing that the coefficients on the dummy variable and all interactions are zero), or just test whether the slopes are all the same, in which case we leave the coefficient on the dummy variable unrestricted. See Computer Exercise C14 for an example.
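In software that supports robust covariance matrices, both the robust test of exclusion restrictions in Example 8.2 and the robust Chow test just described reduce to the same operation: estimate with a robust covariance and test the relevant zero restrictions. A hedged sketch, assuming df holds the spring-semester GPA3 observations (the exact value of the statistic will depend on which robust covariance form the package uses):

```python
import statsmodels.formula.api as smf

model = smf.ols(
    "cumgpa ~ sat + hsperc + tothrs + female + black + white",
    data=df,
).fit(cov_type="HC1")  # robust covariance is used by all subsequent tests

# Heteroskedasticity-robust joint test of H0: beta_black = 0, beta_white = 0
print(model.f_test("black = 0, white = 0"))
```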
8.2a Computing Heteroskedasticity-Robust LM Tests

Not all regression packages compute F statistics that are robust to heteroskedasticity. Therefore, it is sometimes convenient to have a way of obtaining a test of multiple exclusion restrictions that is robust to heteroskedasticity and does not require a particular kind of econometric software. It turns out that a heteroskedasticity-robust LM statistic is easily obtained using virtually any regression package.

To illustrate computation of the robust LM statistic, consider the model

    y = β0 + β1x1 + β2x2 + β3x3 + β4x4 + β5x5 + u,

and suppose we would like to test H0: β4 = 0, β5 = 0. To obtain the usual LM statistic, we would first estimate the restricted model (that is, the model without x4 and x5) to obtain the residuals, ũ. Then, we would regress ũ on all of the independent variables, and LM = n·R²ũ, where R²ũ is the usual R-squared from this regression.

Obtaining a version that is robust to heteroskedasticity requires more work. One way to compute the statistic requires only OLS regressions. We need the residuals, say r̃1, from the regression of x4 on x1, x2, x3. Also, we need the residuals, say r̃2, from the regression of x5 on x1, x2, x3. Thus, we regress each of the independent variables excluded under the null on all of the included independent variables. We keep the residuals each time. The final step appears odd, but it is, after all, just a computational device. Run the regression of

    1 on r̃1ũ, r̃2ũ,    (8.8)

without an intercept. Yes, we actually define a dependent variable equal to the value one for all observations. We regress this onto the products r̃1ũ and r̃2ũ. The robust LM statistic turns out to be n − SSR1, where SSR1 is just the usual sum of squared residuals from regression (8.8).

Exploring Further 8.1: Evaluate the following statement: "The heteroskedasticity-robust standard errors are always bigger than the usual standard errors."

The reason this works is somewhat technical. Basically, this is doing for the LM test what the robust standard errors do for the t test. [See Wooldridge (1991b) or Davidson and MacKinnon (1993) for a more detailed discussion.]

We now summarize the computation of the heteroskedasticity-robust LM statistic in the general case.

A Heteroskedasticity-Robust LM Statistic:
1. Obtain the residuals ũ from the restricted model.
2. Regress each of the independent variables excluded under the null on all of the included independent variables; if there are q excluded variables, this leads to q sets of residuals (r̃1, r̃2, ..., r̃q).
3. Find the products between each r̃j and ũ (for all observations).
4. Run the regression of 1 on r̃1ũ, r̃2ũ, ..., r̃qũ, without an intercept. The heteroskedasticity-robust LM statistic is n − SSR1, where SSR1 is just the usual sum of squared residuals from this final regression. Under H0, LM is distributed approximately as χ²q.

Once the robust LM statistic is obtained, the rejection rule and computation of p-values are the same as for the usual LM statistic in Section 5.2.
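Because only auxiliary OLS regressions are involved, the four steps can be scripted in a few lines. The helper below is a hypothetical illustration (the function and argument names are ours, not from the text): y is the outcome vector, X_incl the included regressors, and X_excl the q regressors excluded under the null, all numpy arrays without constant columns.

```python
import numpy as np
import statsmodels.api as sm
from scipy import stats

def robust_lm(y, X_incl, X_excl):
    """Heteroskedasticity-robust LM test of excluding the columns of X_excl."""
    n, q = X_excl.shape
    Z = sm.add_constant(X_incl)

    # Step 1: residuals from the restricted model
    u = sm.OLS(y, Z).fit().resid

    # Steps 2-3: partial each excluded regressor out of the included ones,
    # then form the products r_j * u observation by observation
    R = np.column_stack(
        [sm.OLS(X_excl[:, j], Z).fit().resid * u for j in range(q)]
    )

    # Step 4: regress 1 on the products, with no intercept; LM = n - SSR1
    ssr1 = sm.OLS(np.ones(n), R).fit().ssr
    lm = n - ssr1
    return lm, stats.chi2.sf(lm, q)  # statistic and chi-square(q) p-value
```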
Example 8.3: Heteroskedasticity-Robust LM Statistic

We use the data in CRIME1 to test whether the average sentence length served for past convictions affects the number of arrests in the current year (1986). The estimated model is

    narr86 = .561 − .136 pcnv + .0178 avgsen − .00052 avgsen²
            (.036)  (.040)      (.0097)        (.00030)
            [.040]  [.034]      [.0101]        [.00021]
           − .0394 ptime86 − .0505 qemp86 − .00148 inc86
            (.0087)           (.0144)        (.00034)             (8.9)
            [.0062]           [.0142]        [.00023]
           + .325 black + .193 hispan
            (.045)        (.040)
            [.058]        [.040]
    n = 2,725, R² = .0728.

In this example, there are more substantial differences between some of the usual standard errors and the robust standard errors. For example, the usual t statistic on avgsen² is about 1.73, while the robust t statistic is about 2.48. Thus, avgsen² is more significant using the robust standard error.

The effect of avgsen on narr86 is somewhat difficult to reconcile. Because the relationship is quadratic, we can figure out where avgsen has a positive effect on narr86 and where the effect becomes negative. The turning point is .0178/[2(.00052)] ≈ 17.12; recall that this is measured in months. Literally, this means that narr86 is positively related to avgsen when avgsen is less than 17 months; then avgsen has the expected deterrent effect after 17 months.

To see whether average sentence length has a statistically significant effect on narr86, we must test the joint hypothesis H0: βavgsen = 0, βavgsen² = 0. Using the usual LM statistic (see Section 5.2), we obtain LM = 3.54; in a chi-square distribution with two df, this yields a p-value = .170. Thus, we do not reject H0 at even the 15% level. The heteroskedasticity-robust LM statistic is LM = 4.00 (rounded to two decimal places), with a p-value = .135. This is still not very strong evidence against H0; avgsen does not appear to have a strong effect on narr86. [Incidentally, when avgsen appears alone in (8.9), that is, without the quadratic term, its usual t statistic is .658 and its robust t statistic is .592.]

8.3 Testing for Heteroskedasticity

The heteroskedasticity-robust standard errors provide a simple method for computing t statistics that are asymptotically t distributed whether or not heteroskedasticity is present. We have also seen that heteroskedasticity-robust F and LM statistics are available. Implementing these tests does not require knowing whether or not heteroskedasticity is present. Nevertheless, there are still some good reasons for having simple tests that can detect its presence. First, as we mentioned in the previous section, the usual t statistics have exact t distributions under the classical linear model assumptions. For this reason, many economists still prefer to see the usual OLS standard errors and test statistics reported, unless there is evidence of heteroskedasticity. Second, if heteroskedasticity is present, the OLS estimator is no longer the best linear unbiased estimator. As we will see in Section 8.4, it is possible to obtain a better estimator than OLS when the form of heteroskedasticity is known.

Many tests for heteroskedasticity have been suggested over the years. Some of them, while having the ability to detect heteroskedasticity, do not directly test the assumption that the variance of the error does not depend upon the independent variables. We will restrict ourselves to more modern tests, which detect the kind of heteroskedasticity that invalidates the usual OLS statistics. This also has the benefit of putting all tests in the same framework.

As usual, we start with the linear model

    y = β0 + β1x1 + β2x2 + ... + βkxk + u,    (8.10)

where Assumptions MLR.1 through MLR.4 are maintained in this section.
In particular, we assume that E(u|x1, x2, ..., xk) = 0, so that OLS is unbiased and consistent.

We take the null hypothesis to be that Assumption MLR.5 is true:

    H0: Var(u|x1, x2, ..., xk) = σ².    (8.11)

That is, we assume that the ideal assumption of homoskedasticity holds, and we require the data to tell us otherwise. If we cannot reject (8.11) at a sufficiently small significance level, we usually conclude that heteroskedasticity is not a problem. However, remember that we never accept H0; we simply fail to reject it.

Because we are assuming that u has a zero conditional expectation, Var(u|x) = E(u²|x), and so the null hypothesis of homoskedasticity is equivalent to

    H0: E(u²|x1, x2, ..., xk) = E(u²) = σ².

This shows that, in order to test for violation of the homoskedasticity assumption, we want to test whether u² is related (in expected value) to one or more of the explanatory variables. If H0 is false, the expected value of u², given the independent variables, can be virtually any function of the xj. A simple approach is to assume a linear function:

    u² = δ0 + δ1x1 + δ2x2 + ... + δkxk + v,    (8.12)

where v is an error term with mean zero given the xj. Pay close attention to the dependent variable in this equation: it is the square of the error in the original regression equation, (8.10). The null hypothesis of homoskedasticity is

    H0: δ1 = δ2 = ... = δk = 0.    (8.13)

Under the null hypothesis, it is often reasonable to assume that the error in (8.12), v, is independent of x1, x2, ..., xk. Then, we know from Section 5.2 that either the F or LM statistics for the overall significance of the independent variables in explaining u² can be used to test (8.13). Both statistics would have asymptotic justification, even though u² cannot be normally distributed. (For example, if u is normally distributed, then u²/σ² is distributed as χ²1.) If we could observe the u² in the sample, then we could easily compute this statistic by running the OLS regression of u² on x1, x2, ..., xk, using all n observations.

As we have emphasized before, we never know the actual errors in the population model, but we do have estimates of them: the OLS residual, ûi, is an estimate of the error ui for observation i. Thus, we can estimate the equation

    û² = δ0 + δ1x1 + δ2x2 + ... + δkxk + error    (8.14)

and compute the F or LM statistics for the joint significance of x1, ..., xk. It turns out that using the OLS residuals in place of the errors does not affect the large sample distribution of the F or LM statistics, although showing this is pretty complicated.

The F and LM statistics both depend on the R-squared from regression (8.14); call this R²û², to distinguish it from the R-squared in estimating equation (8.10). Then, the F statistic is

    F = (R²û²/k) / [(1 − R²û²)/(n − k − 1)],    (8.15)

where k is the number of regressors in (8.14); this is the same number of independent variables in (8.10). Computing (8.15) by hand is rarely necessary, because most regression packages automatically compute the F statistic for overall significance of a regression. This F statistic has (approximately) an F(k, n−k−1) distribution under the null hypothesis of homoskedasticity.
The LM statistic for heteroskedasticity is just the sample size times the R-squared from (8.14):

    LM = n·R²û².    (8.16)

Under the null hypothesis, LM is distributed asymptotically as χ²k. This is also very easy to obtain after running regression (8.14).

The LM version of the test is typically called the Breusch-Pagan test for heteroskedasticity (BP test). Breusch and Pagan (1979) suggested a different form of the test that assumes the errors are normally distributed. Koenker (1981) suggested the form of the LM statistic in (8.16), and it is generally preferred due to its greater applicability.

We summarize the steps for testing for heteroskedasticity using the BP test.

The Breusch-Pagan Test for Heteroskedasticity:
1. Estimate the model (8.10) by OLS, as usual. Obtain the squared OLS residuals, û² (one for each observation).
2. Run the regression in (8.14). Keep the R-squared from this regression, R²û².
3. Form either the F statistic or the LM statistic and compute the p-value [using the F(k, n−k−1) distribution in the former case and the χ²k distribution in the latter case]. If the p-value is sufficiently small, that is, below the chosen significance level, then we reject the null hypothesis of homoskedasticity.

If the BP test results in a small enough p-value, some corrective measure should be taken. One possibility is to just use the heteroskedasticity-robust standard errors and test statistics discussed in the previous section. Another possibility is discussed in Section 8.4.
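Either form of the BP statistic is easy to compute from the two regressions; statsmodels also ships a prepackaged version (het_breuschpagan in statsmodels.stats.diagnostic). A sketch of the manual computation, with y the response and X the explanatory variables as numpy arrays (no constant column):

```python
import statsmodels.api as sm
from scipy import stats

def breusch_pagan(y, X):
    """Breusch-Pagan test: Koenker LM form in (8.16) plus the F form in (8.15)."""
    n, k = X.shape
    Z = sm.add_constant(X)

    # Step 1: squared OLS residuals from the original model
    usq = sm.OLS(y, Z).fit().resid ** 2

    # Step 2: regress the squared residuals on the regressors, keep R-squared
    r2 = sm.OLS(usq, Z).fit().rsquared

    # Step 3: LM and F statistics with their p-values
    lm = n * r2
    f = (r2 / k) / ((1 - r2) / (n - k - 1))
    return lm, stats.chi2.sf(lm, k), f, stats.f.sf(f, k, n - k - 1)
```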
Example 8.4: Heteroskedasticity in Housing Price Equations

We use the data in HPRICE1 to test for heteroskedasticity in a simple housing price equation. The estimated equation using the levels of all variables is

    price = −21.77 + .00207 lotsize + .123 sqrft + 13.85 bdrms
            (29.48)  (.00064)         (.013)       (9.01)         (8.17)
    n = 88, R² = .672.

This equation tells us nothing about whether the error in the population model is heteroskedastic. We need to regress the squared OLS residuals on the independent variables. The R-squared from the regression of û² on lotsize, sqrft, and bdrms is R²û² = .1601. With n = 88 and k = 3, this produces an F statistic for significance of the independent variables of F = [.1601/(1 − .1601)](84/3) ≈ 5.34. The associated p-value is .002, which is strong evidence against the null. The LM statistic is 88(.1601) ≈ 14.09; this gives a p-value ≈ .0028 (using the χ²3 distribution), giving essentially the same conclusion as the F statistic. This means that the usual standard errors reported in (8.17) are not reliable.

In Chapter 6, we mentioned that one benefit of using the logarithmic functional form for the dependent variable is that heteroskedasticity is often reduced. In the current application, let us put price, lotsize, and sqrft in logarithmic form, so that the elasticities of price with respect to lotsize and sqrft are constant. The estimated equation is

    log(price) = −1.30 + .168 log(lotsize) + .700 log(sqrft) + .037 bdrms
                 (.65)   (.038)              (.093)            (.028)     (8.18)
    n = 88, R² = .643.

Regressing the squared OLS residuals from this regression on log(lotsize), log(sqrft), and bdrms gives R²û² = .0480. Thus, F = 1.41 (p-value = .245) and LM = 4.22 (p-value = .239). Therefore, we fail to reject the null hypothesis of homoskedasticity in the model with the logarithmic functional forms. The occurrence of less heteroskedasticity with the dependent variable in logarithmic form has been noticed in many empirical applications.

If we suspect that heteroskedasticity depends only upon certain independent variables, we can easily modify the Breusch-Pagan test: we simply regress û² on whatever independent variables we choose and carry out the appropriate F or LM test. Remember that the appropriate degrees of freedom depends upon the number of independent variables in the regression with û² as the dependent variable; the number of independent variables showing up in equation (8.10) is irrelevant.

If the squared residuals are regressed on only a single independent variable, the test for heteroskedasticity is just the usual t statistic on the variable. A significant t statistic suggests that heteroskedasticity is a problem.

8.3a The White Test for Heteroskedasticity

In Chapter 5, we showed that the usual OLS standard errors and test statistics are asymptotically valid, provided all of the Gauss-Markov assumptions hold. It turns out that the homoskedasticity assumption, Var(u|x1, ..., xk) = σ², can be replaced with the weaker assumption that the squared error, u², is uncorrelated with all the independent variables (xj), the squares of the independent variables (x²j), and all the cross products (xj·xh for j ≠ h). This observation motivated White (1980) to propose a test for heteroskedasticity that adds the squares and cross products of all the independent variables to equation (8.14). The test is explicitly intended to test for forms of heteroskedasticity that invalidate the usual OLS standard errors and test statistics.

Exploring Further 8.2: Consider wage equation (7.11), where you think that the conditional variance of log(wage) does not depend on educ, exper, or tenure. However, you are worried that the variance of log(wage) differs across the four demographic groups of married males, married females, single males, and single females. What regression would you run to test for heteroskedasticity? What are the degrees of freedom in the F test?
When the model contains k = 3 independent variables, the White test is based on an estimation of

    û² = δ0 + δ1x1 + δ2x2 + δ3x3 + δ4x1² + δ5x2² + δ6x3²
       + δ7x1x2 + δ8x1x3 + δ9x2x3 + error.    (8.19)

Compared with the Breusch-Pagan test, this equation has six more regressors. The White test for heteroskedasticity is the LM statistic for testing that all of the δj in equation (8.19) are zero, except for the intercept. Thus, nine restrictions are being tested in this case. We can also use an F test of this hypothesis; both tests have asymptotic justification.

With only three independent variables in the original model, equation (8.19) has nine independent variables. With six independent variables in the original model, the White regression would generally involve 27 regressors (unless some are redundant). This abundance of regressors is a weakness in the pure form of the White test: it uses many degrees of freedom for models with just a moderate number of independent variables.

It is possible to obtain a test that is easier to implement than the White test and more conserving on degrees of freedom. To create the test, recall that the difference between the White and Breusch-Pagan tests is that the former includes the squares and cross products of the independent variables. We can preserve the spirit of the White test while conserving on degrees of freedom by using the OLS fitted values in a test for heteroskedasticity. Remember that the fitted values are defined, for each observation i, by

    ŷi = β̂0 + β̂1xi1 + β̂2xi2 + ... + β̂kxik.

These are just linear functions of the independent variables. If we square the fitted values, we get a particular function of all the squares and cross products of the independent variables. This suggests testing for heteroskedasticity by estimating the equation

    û² = δ0 + δ1ŷ + δ2ŷ² + error,    (8.20)

where ŷ stands for the fitted values. It is important not to confuse ŷ and y in this equation. We use the fitted values because they are functions of the independent variables (and the estimated parameters); using y in (8.20) does not produce a valid test for heteroskedasticity.

We can use the F or LM statistic for the null hypothesis H0: δ1 = 0, δ2 = 0 in equation (8.20). This results in two restrictions in testing the null of homoskedasticity, regardless of the number of independent variables in the original model. Conserving on degrees of freedom in this way is often a good idea, and it also makes the test easy to implement.

Since ŷ is an estimate of the expected value of y given the xj, using (8.20) to test for heteroskedasticity is useful in cases where the variance is thought to change with the level of the expected value, E(y|x). The test from (8.20) can be viewed as a special case of the White test, since equation (8.20) can be shown to impose restrictions on the parameters in equation (8.19).

A Special Case of the White Test for Heteroskedasticity:
1. Estimate the model (8.10) by OLS, as usual. Obtain the OLS residuals û and the fitted values ŷ. Compute the squared OLS residuals û² and the squared fitted values ŷ².
2. Run the regression in equation (8.20). Keep the R-squared from this regression, R²û².
3. Form either the F or LM statistic and compute the p-value [using the F(2, n−3) distribution in the former case and the χ²2 distribution in the latter case].
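The special case is just as easy to compute directly; a sketch along the same lines as the earlier helpers (y the response, X the regressors without a constant column):

```python
import numpy as np
import statsmodels.api as sm
from scipy import stats

def white_special(y, X):
    """Special case of the White test, based on fitted values and their squares."""
    n = len(y)
    fit = sm.OLS(y, sm.add_constant(X)).fit()
    usq = fit.resid ** 2

    # Regress squared residuals on yhat and yhat^2, as in equation (8.20)
    Z = sm.add_constant(
        np.column_stack([fit.fittedvalues, fit.fittedvalues ** 2])
    )
    r2 = sm.OLS(usq, Z).fit().rsquared

    lm = n * r2                            # asymptotically chi-square with 2 df
    f = (r2 / 2) / ((1 - r2) / (n - 3))    # approximately F(2, n - 3)
    return lm, stats.chi2.sf(lm, 2), f, stats.f.sf(f, 2, n - 3)
```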
heteroskedasticity This is appropriate provided we maintain Assumptions MLR1 through MLR4 But if MLR4 is violatedin particular if the functional form of E1y0x2 is misspecifiedthen a test for heteroskedasticity can reject H0 even if Var1y0x2 is constant For example if we omit one or more quadratic terms in a regression model or use the level model when we should use the log a test for heteroskedasticity can be significant This has led some economists to view tests for heteroskedasticity as general misspecification tests However there are better more direct tests for functional form misspecification and we will cover some of them in Section 91 It is better to use explicit tests for functional form first since functional form misspecification is more important than heteroskedasticity Then once we are satisfied with the functional form we can test for heteroskedasticity 84 Weighted Least Squares Estimation If heteroskedasticity is detected using one of the tests in Section 83 we know from Section 82 that one possible response is to use heteroskedasticityrobust statistics after estimation by OLS Before the development of heteroskedasticityrobust statistics the response to a finding of heteroskedasticity was to specify its form and use a weighted least squares method which we develop in this section As we will argue if we have correctly specified the form of the variance as a function of explana tory variables then weighted least squares WLS is more efficient than OLS and WLS leads to new t and F statistics that have t and F distributions We will also discuss the implications of using the wrong form of the variance in the WLS procedure 84a The Heteroskedasticity Is Known up to a Multiplicative Constant Let x denote all the explanatory variables in equation 810 and assume that Var1u0x2 5 s2h1x2 821 where h1x2 is some function of the explanatory variables that determines the heteroskedasticity Since variances must be positive h1x2 0 for all possible values of the independent variables For now we assume that the function h1x2 is known The population parameter s2 is unknown but we will be able to estimate it from a data sample For a random drawing from the population we can write s2 i 5 Var1ui0xi2 5 s2h1xi2 5 s2hi where we again use the notation xi to denote all independent variables for observation i and hi changes with each observation because the independent variables change across observations For example consider the simple savings function savi 5 b0 1 biinci 1 ui 822 Var1ui0inci2 5 s2inci 823 Copyright 2016 Cengage Learning All Rights Reserved May not be copied scanned or duplicated in whole or in part Due to electronic rights some third party content may be suppressed from the eBook andor eChapters Editorial review has deemed that any suppressed content does not materially affect the overall learning experience Cengage Learning reserves the right to remove additional content at any time if subsequent rights restrictions require it CHAPTER 8 Heteroskedasticity 255 Here h1x2 5 h1inc2 5 inc the variance of the error is proportional to the level of income This means that as income increases the variability in savings increases If b1 0 the expected value of savings also increases with income Because inc is always positive the variance in equation 823 is always guaranteed to be positive The standard deviation of ui conditional on inci is sinci How can we use the information in equation 821 to estimate the bj Essentially we take the original equation yi 5 b0 1 b1xi1 1 b2xi2 1 p 1 bkxik 1 ui 824 
Here, h(x) = h(inc) = inc: the variance of the error is proportional to the level of income. This means that, as income increases, the variability in savings increases. (If β1 > 0, the expected value of savings also increases with income.) Because inc is always positive, the variance in equation (8.23) is always guaranteed to be positive. The standard deviation of ui, conditional on inci, is σ√inci.

How can we use the information in equation (8.21) to estimate the βj? Essentially, we take the original equation,

    yi = β0 + β1xi1 + β2xi2 + ... + βkxik + ui,    (8.24)

which contains heteroskedastic errors, and transform it into an equation that has homoskedastic errors (and satisfies the other Gauss-Markov assumptions). Since hi is just a function of xi, ui/√hi has a zero expected value conditional on xi. Further, since Var(ui|xi) = E(u²i|xi) = σ²hi, the variance of ui/√hi (conditional on xi) is σ²:

    E[(ui/√hi)²] = E(u²i)/hi = (σ²hi)/hi = σ²,

where we have suppressed the conditioning on xi for simplicity. We can divide equation (8.24) by √hi to get

    yi/√hi = β0/√hi + β1(xi1/√hi) + β2(xi2/√hi) + ...
           + βk(xik/√hi) + (ui/√hi)    (8.25)

or

    y*i = β0x*i0 + β1x*i1 + ... + βkx*ik + u*i,    (8.26)

where x*i0 = 1/√hi and the other starred variables denote the corresponding original variables divided by √hi.

Equation (8.26) looks a little peculiar, but the important thing to remember is that we derived it so we could obtain estimators of the βj that have better efficiency properties than OLS. The intercept β0 in the original equation (8.24) is now multiplying the variable x*i0 = 1/√hi. Each slope parameter βj multiplies a new variable that rarely has a useful interpretation. This should not cause problems if we recall that, for interpreting the parameters and the model, we always want to return to the original equation (8.24).

In the preceding savings example, the transformed equation looks like

    savi/√inci = β0(1/√inci) + β1√inci + u*i,

where we use the fact that inci/√inci = √inci. Nevertheless, β1 is the marginal propensity to save out of income, an interpretation we obtain from equation (8.22).

Equation (8.26) is linear in its parameters (so it satisfies MLR.1), and the random sampling assumption has not changed. Further, u*i has a zero mean and a constant variance (σ²), conditional on x*i. This means that if the original equation satisfies the first four Gauss-Markov assumptions, then the transformed equation (8.26) satisfies all five Gauss-Markov assumptions. Also, if ui has a normal distribution, then u*i has a normal distribution with variance σ². Therefore, the transformed equation satisfies the classical linear model assumptions (MLR.1 through MLR.6) if the original model does so, except for the homoskedasticity assumption.

Since we know that OLS has appealing properties (is BLUE, for example) under the Gauss-Markov assumptions, the discussion in the previous paragraph suggests estimating the parameters in equation (8.26) by ordinary least squares. These estimators, β*0, β*1, ..., β*k, will be different from the OLS estimators in the original equation. The β*j are examples of generalized least squares (GLS) estimators. In this case, the GLS estimators are used to account for heteroskedasticity in the errors. We will encounter other GLS estimators in Chapter 12.

Because equation (8.26) satisfies all of the ideal assumptions, standard errors, t statistics, and F statistics can all be obtained from regressions using the transformed variables. The sum of squared residuals from (8.26) divided by the degrees of freedom is an unbiased estimator of σ². Further, the GLS estimators, because they are the best linear unbiased estimators of the βj, are necessarily more efficient than the OLS estimators β̂j obtained from the untransformed equation.
Essentially, after we have transformed the variables, we simply use standard OLS analysis. But we must remember to interpret the estimates in light of the original equation.

The GLS estimators for correcting heteroskedasticity are called weighted least squares (WLS) estimators. This name comes from the fact that the β*j minimize the weighted sum of squared residuals, where each squared residual is weighted by 1/hi. The idea is that less weight is given to observations with a higher error variance; OLS gives each observation the same weight because it is best when the error variance is identical for all partitions of the population. Mathematically, the WLS estimators are the values of the bj that make

    Σᵢ₌₁ⁿ (yi − b0 − b1xi1 − b2xi2 − ... − bkxik)²/hi    (8.27)

as small as possible. Bringing the square root of 1/hi inside the squared residual shows that the weighted sum of squared residuals is identical to the sum of squared residuals in the transformed variables:

    Σᵢ₌₁ⁿ (y*i − b0x*i0 − b1x*i1 − b2x*i2 − ... − bkx*ik)².

Since OLS minimizes the sum of squared residuals (regardless of the definitions of the dependent and independent variables), it follows that the WLS estimators that minimize (8.27) are simply the OLS estimators from (8.26). Note carefully that the squared residuals in (8.27) are weighted by 1/hi, whereas the transformed variables in (8.26) are weighted by 1/√hi.

A weighted least squares estimator can be defined for any set of positive weights. OLS is the special case that gives equal weight to all observations. The efficient procedure, GLS, weights each squared residual by the inverse of the conditional variance of ui given xi.

Obtaining the transformed variables in equation (8.25) in order to manually perform weighted least squares can be tedious, and the chance of making mistakes is nontrivial. Fortunately, most modern regression packages have a feature for computing weighted least squares. Typically, along with the dependent and independent variables in the original model, we just specify the weighting function, 1/hi, appearing in (8.27). That is, we specify weights proportional to the inverse of the variance. In addition to making mistakes less likely, this forces us to interpret weighted least squares estimates in the original model. In fact, we can write out the estimated equation in the usual way: the estimates and standard errors will be different from OLS, but the way we interpret those estimates, standard errors, and test statistics is the same.
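In statsmodels, for example, WLS takes weights proportional to the inverse of the error variance, exactly the 1/hi in (8.27). A sketch for the savings example (df is a hypothetical DataFrame with columns sav and inc), including a check that the built-in WLS matches OLS on the manually transformed variables:

```python
import statsmodels.formula.api as smf

# WLS with h(inc) = inc, so the weights are 1/h = 1/inc
wls = smf.wls("sav ~ inc", data=df, weights=1.0 / df["inc"]).fit()
print(wls.params)

# Equivalent by hand: OLS on the transformed variables, with no added
# intercept; the column 1/sqrt(inc) plays the role of the intercept regressor
ols_check = smf.ols(
    "I(sav / inc**0.5) ~ 0 + I(1 / inc**0.5) + I(inc**0.5)", data=df
).fit()
print(ols_check.params)  # same point estimates as wls.params
```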
Econometrics packages that have a built-in WLS option will report an R-squared (and adjusted R-squared) along with WLS estimates and standard errors. Typically, the WLS R-squared is obtained from the weighted SSR, obtained from minimizing equation (8.27), and a weighted total sum of squares (SST), obtained by using the same weights but setting all of the slope coefficients in equation (8.27), b1, b2, ..., bk, to zero. As a goodness-of-fit measure, this R-squared is not especially useful, as it effectively measures explained variation in y*i rather than yi. Nevertheless, the WLS R-squareds computed as just described are appropriate for computing F statistics for exclusion restrictions (provided we have properly specified the variance function). As in the case of OLS, the SST terms cancel, and so we obtain the F statistic based on the weighted SSR.

The R-squared from running the OLS regression in equation (8.26) is even less useful as a goodness-of-fit measure, as the computation of SST would make little sense: one would necessarily exclude an intercept from the regression, in which case regression packages typically compute the SST without properly centering the y*i. This is another reason for using a WLS option that is preprogrammed in a regression package, because at least the reported R-squared properly compares the model with all of the independent variables to a model with only an intercept. Because the SST cancels out when testing exclusion restrictions, improperly computing SST does not affect the R-squared form of the F statistic. Nevertheless, computing such an R-squared tempts one to think the equation fits better than it does.

Example 8.6: Financial Wealth Equation

We now estimate equations that explain net total financial wealth (nettfa, measured in $1,000s) in terms of income (inc, also measured in $1,000s) and some other variables, including age, gender, and an indicator for whether the person is eligible for a 401(k) pension plan. We use the data on single people (fsize = 1) in 401KSUBS. In Computer Exercise C12 in Chapter 6, it was found that a specific quadratic function in age, namely (age − 25)², fit the data just as well as an unrestricted quadratic. Plus, the restricted form gives a simplified interpretation because the minimum age in the sample is 25: nettfa is an increasing function of age after age = 25.

The results are reported in Table 8.1. Because we suspect heteroskedasticity, we report the heteroskedasticity-robust standard errors for OLS. The weighted least squares estimates, and their standard errors, are obtained under the assumption Var(u|inc) = σ²inc.

Without controlling for other factors, another dollar of income is estimated to increase nettfa by about 82¢ when OLS is used; the WLS estimate is smaller, about 79¢. The difference is not large; we certainly do not expect them to be identical. The WLS coefficient does have a smaller standard error than OLS, almost 40% smaller, provided we assume the model Var(nettfa|inc) = σ²inc is correct.

Adding the other controls reduced the inc coefficient somewhat, with the OLS estimate still larger than the WLS estimate. Again, the WLS estimate of βinc is more precise. Age has an increasing effect starting at age = 25, with the OLS estimate showing a larger effect. The WLS estimate of βage is more precise in this case. Gender does not have a statistically significant effect on nettfa, but being eligible for a 401(k) plan does: the OLS estimate is that those eligible, holding fixed income, age, and gender, have net total financial assets about $6,890 higher. The WLS estimate is substantially below the OLS estimate and suggests a misspecification of the functional form in the mean equation. (One possibility is to interact e401k and inc; see Computer Exercise C11.)

Using WLS, the F statistic for joint significance of (age − 25)², male, and e401k is about 30.8 if we use the R-squareds reported in Table 8.1. With 3 and 2,012 degrees of freedom, the p-value is zero to more than 15 decimal places; of course, this is not surprising given the very large t statistics for the age and 401(k) variables.

TABLE 8.1 Dependent Variable: nettfa
(standard errors in parentheses; robust standard errors are reported for the OLS columns)

Independent Variables   (1) OLS          (2) WLS         (3) OLS           (4) WLS
inc                     .821 (.104)      .787 (.063)     .771 (.100)       .740 (.064)
(age − 25)²             —                —               .0251 (.0043)     .0175 (.0019)
male                    —                —               2.48 (2.06)       1.84 (1.56)
e401k                   —                —               6.89 (2.29)       5.19 (1.70)
intercept               −10.57 (2.53)    −9.58 (1.65)    −20.98 (3.50)     −16.70 (1.96)
Observations            2,017            2,017           2,017             2,017
R-squared               .0827            .0709           .1279             .1115
Exploring Further 8.3: Using the OLS residuals obtained from the OLS regression reported in column (1) of Table 8.1, the regression of û² on inc yields a t statistic of 2.96. Does it appear we should worry about heteroskedasticity in the financial wealth equation?

Assuming that the error variance in the financial wealth equation has a variance proportional to income is essentially arbitrary. In fact, in most cases, our choice of weights in WLS has a degree of arbitrariness. However, there is one case where the weights needed for WLS arise naturally from an underlying econometric model. This happens when, instead of using individual-level data, we only have averages of data across some group or geographic region. For example, suppose we are interested in determining the relationship between the amount a worker contributes to his or her 401(k) pension plan and the plan generosity. Let i denote a particular firm and let e denote an employee within the firm. A simple model is

    contrib_{i,e} = β0 + β1earns_{i,e} + β2age_{i,e} + β3mrate_i + u_{i,e},    (8.28)

where contrib_{i,e} is the annual contribution by employee e who works for firm i, earns_{i,e} is annual earnings for this person, and age_{i,e} is the person's age. The variable mrate_i is the amount the firm puts into an employee's account for every dollar the employee contributes.

If (8.28) satisfies the Gauss-Markov assumptions, then we could estimate it, given a sample on individuals across various employers. Suppose, however, that we only have average values of contributions, earnings, and age by employer. In other words, individual-level data are not available. Thus, let contrib̄_i denote average contribution for people at firm i (the bar denotes a within-firm average), and similarly for earns̄_i and agē_i. Let m_i denote the number of employees at firm i; we assume that this is a known quantity. Then, if we average equation (8.28) across all employees at firm i, we obtain the firm-level equation

    contrib̄_i = β0 + β1earns̄_i + β2agē_i + β3mrate_i + ū_i,    (8.29)

where ū_i = (1/m_i)Σ_{e=1}^{m_i} u_{i,e} is the average error across all employees in firm i. If we have n firms in our sample, then (8.29) is just a standard multiple linear regression model that can be estimated by OLS. The estimators are unbiased if the original model (8.28) satisfies the Gauss-Markov assumptions and the individual errors, u_{i,e}, are independent of the firm's size, m_i [because then the expected value of ū_i, given the explanatory variables in (8.29), is zero].

If the individual-level equation (8.28) satisfies the homoskedasticity assumption, and the errors within firm i are uncorrelated across employees, then we can show that the firm-level equation (8.29) has a particular kind of heteroskedasticity. Specifically, if Var(u_{i,e}) = σ² for all i and e, and Cov(u_{i,e}, u_{i,g}) = 0 for every pair of employees e ≠ g within firm i, then Var(ū_i) = σ²/m_i; this is just the usual formula for the variance of an average of uncorrelated random variables with common variance.
In other words, the variance of the error term, ū_i, decreases with firm size. In this case, h_i = 1/m_i, and so the most efficient procedure is weighted least squares, with weights equal to the number of employees at the firm (1/h_i = m_i). This ensures that larger firms receive more weight. This gives us an efficient way of estimating the parameters in the individual-level model when we only have averages at the firm level.

A similar weighting arises when we are using per capita data at the city, county, state, or country level. If the individual-level equation satisfies the Gauss-Markov assumptions, then the error in the per capita equation has a variance proportional to one over the size of the population. Therefore, weighted least squares with weights equal to the population is appropriate. For example, suppose we have city-level data on per capita beer consumption (in ounces), the percentage of people in the population over 21 years old, average adult education levels, average income levels, and the city price of beer. Then, the city-level model

    beerpc = β0 + β1perc21 + β2avgeduc + β3incpc + β4price + u

can be estimated by weighted least squares, with the weights being the city population.

The advantage of weighting by firm size, city population, and so on relies on the underlying individual equation being homoskedastic. If heteroskedasticity exists at the individual level, then the proper weighting depends on the form of heteroskedasticity. Further, if there is correlation across errors within a group (say, firm), then Var(ū_i) ≠ σ²/m_i; see Problem 7. Uncertainty about the form of Var(ū_i) in equations such as (8.29) is why more and more researchers simply use OLS and compute robust standard errors and test statistics when estimating models using per capita data. An alternative is to weight by group size but to report the heteroskedasticity-robust statistics in the WLS estimation. This ensures that, while the estimation is efficient if the individual-level model satisfies the Gauss-Markov assumptions, heteroskedasticity at the individual level or within-group correlation are accounted for through robust inference.
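In code, estimating a group-average equation efficiently only requires passing the group sizes as the weights; pairing that with a robust covariance implements the "weight, but report robust statistics" strategy just described. A sketch with a hypothetical firm-level DataFrame firms (firm-average contrib, earns, and age, plus mrate and the employee count m):

```python
import statsmodels.formula.api as smf

# Var(u_bar_i) = sigma^2 / m_i, so h_i = 1/m_i and the weights are 1/h_i = m_i
wls = smf.wls(
    "contrib ~ earns + age + mrate", data=firms, weights=firms["m"]
).fit(cov_type="HC1")  # robust inference in case the implied weighting is wrong
print(wls.summary())
```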
8.4b The Heteroskedasticity Function Must Be Estimated: Feasible GLS

In the previous subsection, we saw some examples of where the heteroskedasticity is known up to a multiplicative form. In most cases, the exact form of heteroskedasticity is not obvious. In other words, it is difficult to find the function h(xi) of the previous section. Nevertheless, in many cases we can model the function h and use the data to estimate the unknown parameters in this model. This results in an estimate of each hi, denoted as ĥi. Using ĥi instead of hi in the GLS transformation yields an estimator called the feasible GLS (FGLS) estimator. Feasible GLS is sometimes called estimated GLS, or EGLS.

There are many ways to model heteroskedasticity, but we will study one particular, fairly flexible approach. Assume that

    Var(u|x) = σ²exp(δ0 + δ1x1 + δ2x2 + ... + δkxk),    (8.30)

where x1, x2, ..., xk are the independent variables appearing in the regression model [see equation (8.1)], and the δj are unknown parameters. Other functions of the xj can appear, but we will focus primarily on (8.30). In the notation of the previous subsection, h(x) = exp(δ0 + δ1x1 + δ2x2 + ... + δkxk).

You may wonder why we have used the exponential function in (8.30). After all, when testing for heteroskedasticity using the Breusch-Pagan test, we assumed that heteroskedasticity was a linear function of the xj. Linear alternatives such as (8.12) are fine when testing for heteroskedasticity, but they can be problematic when correcting for heteroskedasticity using weighted least squares. We have encountered the reason for this problem before: linear models do not ensure that predicted values are positive, and our estimated variances must be positive in order to perform WLS.

If the parameters δj were known, then we would just apply WLS, as in the previous subsection. This is not very realistic. It is better to use the data to estimate these parameters, and then to use these estimates to construct weights. How can we estimate the δj? Essentially, we will transform this equation into a linear form that, with slight modification, can be estimated by OLS.

Under assumption (8.30), we can write

    u² = σ²exp(δ0 + δ1x1 + δ2x2 + ... + δkxk)v,

where v has a mean equal to unity, conditional on x = (x1, x2, ..., xk). If we assume that v is actually independent of x, we can write

    log(u²) = α0 + δ1x1 + δ2x2 + ... + δkxk + e,    (8.31)

where e has a zero mean and is independent of x; the intercept in this equation is different from δ0, but this is not important in implementing WLS. The dependent variable is the log of the squared error. Since (8.31) satisfies the Gauss-Markov assumptions, we can get unbiased estimators of the δj by using OLS.

As usual, we must replace the unobserved u with the OLS residuals. Therefore, we run the regression of

    log(û²) on x1, x2, ..., xk.    (8.32)

Actually, what we need from this regression are the fitted values; call these ĝi. Then, the estimates of hi are simply

    ĥi = exp(ĝi).    (8.33)

We now use WLS with weights 1/ĥi in place of 1/hi in equation (8.27). We summarize the steps.
A Feasible GLS Procedure to Correct for Heteroskedasticity:
1. Run the regression of y on x1, x2, ..., xk and obtain the residuals, û.
2. Create log(û²) by first squaring the OLS residuals and then taking the natural log.
3. Run the regression in equation (8.32) and obtain the fitted values, ĝ.
4. Exponentiate the fitted values from (8.32): ĥ = exp(ĝ).
5. Estimate the equation y = β0 + β1x1 + ... + βkxk + u by WLS, using weights 1/ĥ. In other words, we replace hi with ĥi in equation (8.27). Remember, the squared residual for observation i gets weighted by 1/ĥi. If instead we first transform all variables and run OLS, each variable gets multiplied by 1/√ĥi, including the intercept.

If we could use hi rather than ĥi in the WLS procedure, we know that our estimators would be unbiased; in fact, they would be the best linear unbiased estimators, assuming that we have properly modeled the heteroskedasticity. Having to estimate hi using the same data means that the FGLS estimator is no longer unbiased (so it cannot be BLUE, either). Nevertheless, the FGLS estimator is consistent and asymptotically more efficient than OLS. This is difficult to show because of estimation of the variance parameters. But if we ignore this (as it turns out we may), the proof is similar to showing that OLS is efficient in the class of estimators in Theorem 5.3. At any rate, for large sample sizes, FGLS is an attractive alternative to OLS when there is evidence of heteroskedasticity that inflates the standard errors of the OLS estimates.

We must remember that the FGLS estimators are estimators of the parameters in the usual population model

    y = β0 + β1x1 + ... + βkxk + u.

Just as the OLS estimates measure the marginal impact of each xj on y, so do the FGLS estimates. We use the FGLS estimates in place of the OLS estimates because the FGLS estimators are more efficient and have associated test statistics with the usual t and F distributions, at least in large samples. If we have some doubt about the variance specified in equation (8.30), we can use heteroskedasticity-robust standard errors and test statistics in the transformed equation.

Another useful alternative for estimating hi is to replace the independent variables in regression (8.32) with the OLS fitted values and their squares. In other words, obtain the ĝi as the fitted values from the regression of

    log(û²) on ŷ, ŷ²    (8.34)

and then obtain the ĥi exactly as in equation (8.33). This changes only step (3) in the previous procedure.

If we use regression (8.32) to estimate the variance function, you may be wondering if we can simply test for heteroskedasticity using this same regression (an F or LM test can be used). In fact, Park (1966) suggested this. Unfortunately, when compared with the tests discussed in Section 8.3, the Park test has some problems. First, the null hypothesis must be something stronger than homoskedasticity: effectively, u and x must be independent. This is not required in the Breusch-Pagan or White tests. Second, using the OLS residuals û in place of u in (8.32) can cause the F statistic to deviate from the F distribution, even in large sample sizes. This is not an issue in the other tests we have covered. For these reasons, the Park test is not recommended when testing for heteroskedasticity. Regression (8.32) works well for weighted least squares because we only need consistent estimators of the δj, and regression (8.32) certainly delivers those.
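The five-step FGLS procedure translates directly into a short function; a sketch under assumption (8.30), with y the response and X the regressors as numpy arrays (no constant column):

```python
import numpy as np
import statsmodels.api as sm

def fgls(y, X):
    """Feasible GLS with variance function sigma^2 * exp(x'delta)."""
    Z = sm.add_constant(X)

    # Steps 1-2: OLS residuals, then the log of their squares
    logu2 = np.log(sm.OLS(y, Z).fit().resid ** 2)

    # Steps 3-4: fitted values g-hat from regression (8.32), then
    # h-hat = exp(g-hat), which is guaranteed to be positive
    h_hat = np.exp(sm.OLS(logu2, Z).fit().fittedvalues)

    # Step 5: WLS on the original equation with weights 1/h-hat
    return sm.WLS(y, Z, weights=1.0 / h_hat).fit()
```

For the variant based on (8.34), only the middle regression changes: regress log(û²) on a constant, the OLS fitted values, and their squares, instead of on Z.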
Example 8.7: Demand for Cigarettes

We use the data in SMOKE to estimate a demand function for daily cigarette consumption. Since most people do not smoke, the dependent variable, cigs, is zero for most observations. A linear model is not ideal because it can result in negative predicted values. Nevertheless, we can still learn something about the determinants of cigarette smoking by using a linear model.

The equation estimated by ordinary least squares, with the usual OLS standard errors in parentheses, is

    cigs = −3.64 + .880 log(income) − .751 log(cigpric)
           (24.08) (.728)             (5.773)
          − .501 educ + .771 age − .0090 age² − 2.83 restaurn    (8.35)
           (.167)       (.160)     (.0017)      (1.11)
    n = 807, R² = .0526,

where
    cigs = number of cigarettes smoked per day,
    income = annual income,
    cigpric = the per-pack price of cigarettes (in cents),
    educ = years of schooling,
    age = age measured in years, and
    restaurn = a binary indicator equal to unity if the person resides in a state with restaurant smoking restrictions.

Since we are also going to do weighted least squares, we do not report the heteroskedasticity-robust standard errors for OLS. (Incidentally, 13 out of the 807 fitted values are less than zero; this is less than 2% of the sample and is not a major cause for concern.)

Neither income nor cigarette price is statistically significant in (8.35), and their effects are not practically large. For example, if income increases by 10%, cigs is predicted to increase by (.880/100)(10) = .088, or less than one-tenth of a cigarette per day. The magnitude of the price effect is similar.

Each year of education reduces the average cigarettes smoked per day by one-half of a cigarette, and the effect is statistically significant. Cigarette smoking is also related to age, in a quadratic fashion. Smoking increases with age up until age = .771/[2(.0090)] ≈ 42.83, and then smoking decreases with age. Both terms in the quadratic are statistically significant. The presence of a restriction on smoking in restaurants decreases cigarette smoking by almost three cigarettes per day, on average.

Do the errors underlying equation (8.35) contain heteroskedasticity? The Breusch-Pagan regression of the squared OLS residuals on the independent variables in (8.35) [see equation (8.14)] produces R²û² = .040. This small R-squared may seem to indicate no heteroskedasticity, but we must remember to compute either the F or LM statistic. If the sample size is large, a seemingly small R²û² can result in a very strong rejection of homoskedasticity. The LM statistic is LM = 807(.040) = 32.28, and this is the outcome of a χ²6 random variable. The p-value is less than .000015, which is very strong evidence of heteroskedasticity.

Therefore, we estimate the equation using the feasible GLS procedure based on equation (8.32). The weighted least squares estimates are

    cigs = 5.64 + 1.30 log(income) − 2.94 log(cigpric)
           (17.80) (.44)             (4.46)
          − .463 educ + .482 age − .0056 age² − 3.46 restaurn    (8.36)
           (.120)       (.097)     (.0009)      (.80)
    n = 807, R² = .1134.
Once we have obtained the weights we can use them to estimate the restricted model as well The F statistic can be computed as usual Fortu nately many regression packages have a simple command for testing joint restrictions after WLS estimation so we need not perform the restricted regression ourselves Example 87 hints at an issue that sometimes arises in applications of weighted least squares the OLS and WLS estimates can be substantially different This is not such a big problem in the demand for cigarettes equation because all the coefficients maintain the same signs and the biggest changes are on variables that were statistically insignificant when the equation was estimated by OLS The OLS and WLS estimates will always differ due to sam pling error The issue is whether their difference is enough to change important conclusions If OLS and WLS produce statistically significant estimates that differ in signfor example the OLS price elasticity is positive and significant while the WLS price elasticity is negative and significant or the difference in magnitudes of the estimates is practically large we should be suspicious Typi cally this indicates that one of the other Gauss Markov assumptions is false particularly the zero conditional mean assumption on the error MLR4 If E1y0x2 2 b0 1 b1x1 1 p 1 bkxk then OLS and WLS have different expected values and probability limits For WLS to be consistent for the bj it is not enough for u to be uncorrelated with each xj we need the stronger assumption MLR4 in the linear model MLR1 Therefore a significant difference between OLS and WLS can indicate a functional form mis specification in E1y0x2 The Hausman test Hausman 1978 can be used to formally compare the OLS and WLS estimates to see if they differ by more than sampling error suggests they should but this test is beyond the scope of this text In many cases an informal eyeballing of the estimates is sufficient to detect a problem 84c What If the Assumed Heteroskedasticity Function Is Wrong We just noted that if OLS and WLS produce very different estimates it is likely that the conditional mean E1y0x2 is misspecified What are the properties of WLS if the variance function we use is mis specified in the sense that Var1y0x2 2 s2h1x2 for our chosen function hx The most important issue Let u i be the WLS residuals from 836 which are not weighted and let cigsi be the fitted values These are obtained us ing the same formulas as OLS they differ because of different estimates of the bj One way to determine whether heteroske dasticity has been eliminated is to use the u 2 i h i 5 1u ih i2 2 in a test for heteroskedas ticity If hi 5 Var1ui0xi2 then the transformed residuals should have little evidence of het eroskedasticity There are many possibili ties but onebased on Whites test in the transformed equationis to regress u 2 i h i on cigsi h i and cigs2 i h i including an in tercept The joint F statistic when we use SMOKE is 1115 Does it appear that our correction for heteroskedasticity has actually eliminated the heteroskedasticity Exploring FurthEr 84 Copyright 2016 Cengage Learning All Rights Reserved May not be copied scanned or duplicated in whole or in part Due to electronic rights some third party content may be suppressed from the eBook andor eChapters Editorial review has deemed that any suppressed content does not materially affect the overall learning experience Cengage Learning reserves the right to remove additional content at any time if subsequent rights restrictions require it CHAPTER 8 
The most important issue is whether misspecification of h(x) causes bias or inconsistency in the WLS estimator. Fortunately, the answer is no, at least under MLR.4. Recall that, if E(u|x) = 0, then any function of x is uncorrelated with u, and so the weighted error, u/√h(x), is uncorrelated with the weighted regressors, xⱼ/√h(x), for any function h(x) that is always positive. This is why, as we just discussed, we can take large differences between the OLS and WLS estimators as indicative of functional form misspecification. If we estimate parameters in the function, say h(x, δ̂), then we can no longer claim that WLS is unbiased, but it will generally be consistent (whether or not the variance function is correctly specified).

If WLS is at least consistent under MLR.1 to MLR.4, what are the consequences of using WLS with a misspecified variance function? There are two. The first, which is very important, is that the usual WLS standard errors and test statistics, computed under the assumption that Var(y|x) = σ²h(x), are no longer valid, even in large samples. For example, the WLS estimates and standard errors in column (4) of Table 8.1 assume that Var(nettfa|inc, age, male, e401k) = Var(nettfa|inc) = σ²inc; so we are assuming not only that the variance depends just on income, but also that it is a linear function of income. If this assumption is false, the standard errors (and any statistics we obtain using those standard errors) are not valid. Fortunately, there is an easy fix: just as we can obtain standard errors for the OLS estimates that are robust to arbitrary heteroskedasticity, we can obtain standard errors for WLS that allow the variance function to be arbitrarily misspecified.

It is easy to see why this works. Write the transformed equation as

yᵢ/√hᵢ = β₀(1/√hᵢ) + β₁(xᵢ₁/√hᵢ) + … + βₖ(xᵢₖ/√hᵢ) + uᵢ/√hᵢ.

Now, if Var(uᵢ|xᵢ) ≠ σ²hᵢ, then the weighted error uᵢ/√hᵢ is heteroskedastic. So we can just apply the usual heteroskedasticity-robust standard errors after estimating this equation by OLS, which, remember, is identical to WLS.

To see how robust inference with WLS works in practice, column (1) of Table 8.2 reproduces the last column of Table 8.1, and column (2) contains standard errors robust to Var(uᵢ|xᵢ) ≠ σ²incᵢ. The standard errors in column (2) allow the variance function to be misspecified. We see that, for the income and age variables, the robust standard errors are somewhat above the usual WLS standard errors, certainly by enough to stretch the confidence intervals. On the other hand, the robust standard errors for male and e401k are actually smaller than those that assume a correct variance function. We saw this could happen with the heteroskedasticity-robust standard errors for OLS, too.

Even if we use flexible forms of variance functions, such as that in (8.30), there is no guarantee that we have the correct model. While exponential heteroskedasticity is appealing and reasonably flexible, it is, after all, just a model. Therefore, it is always a good idea to compute fully robust standard errors and test statistics after WLS estimation.

Table 8.2  WLS Estimation of the nettfa Equation

Independent Variables | With Nonrobust Standard Errors | With Robust Standard Errors
inc          | .740 (.064)    | .740 (.075)
(age − 25)²  | .0175 (.0019)  | .0175 (.0026)
male         | 1.84 (1.56)    | 1.84 (1.31)
e401k        | 5.19 (1.70)    | 5.19 (1.57)
intercept    | −16.70 (1.96)  | −16.70 (2.24)
Observations | 2,017          | 2,017
R-squared    | .1115          | .1115
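A minimal sketch of fully robust inference after WLS for the nettfa equation follows. The DataFrame df and the column names (401KSUBS data, with agesq25 standing in for (age − 25)²) are illustrative assumptions.

import statsmodels.api as sm

y = df["nettfa"]
X = sm.add_constant(df[["inc", "agesq25", "male", "e401k"]])
w = 1.0 / df["inc"]                     # assumes Var(y|x) = sigma^2 * inc

wls_usual = sm.WLS(y, X, weights=w).fit()                  # column (1) SEs
wls_robust = sm.WLS(y, X, weights=w).fit(cov_type="HC1")   # column (2) SEs
print(wls_usual.bse)
print(wls_robust.bse)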
A modern criticism of WLS is that, if the variance function is misspecified, it is not guaranteed to be more efficient than OLS. In fact, that is the case: if Var(y|x) is neither constant nor equal to σ²h(x), where h(x) is the proposed model of heteroskedasticity, then we cannot rank OLS and WLS in terms of variances (or asymptotic variances when the variance parameters must be estimated). However, this theoretically correct criticism misses an important practical point. Namely, in cases of strong heteroskedasticity, it is often better to use a wrong form of heteroskedasticity and apply WLS than to ignore heteroskedasticity altogether in estimation and use OLS. Models such as (8.30) can well approximate a variety of heteroskedasticity functions and may produce estimators with smaller asymptotic variances. Even in Example 8.6, where the form of heteroskedasticity was assumed to have the simple form Var(nettfa|x) = σ²inc, the fully robust standard errors for WLS are well below the fully robust standard errors for OLS. (Comparing robust standard errors for the two estimators puts them on equal footing: we assume neither homoskedasticity nor that the variance has the form σ²inc.) For example, the robust standard error for the WLS estimator of β_inc is about .075, which is 25% lower than the robust standard error for OLS (about .100). For the coefficient on (age − 25)², the robust standard error of WLS is about .0026, almost 40% below the robust standard error for OLS (about .0043).

8.4d Prediction and Prediction Intervals with Heteroskedasticity

If we start with the standard linear model under MLR.1 to MLR.4, but allow for heteroskedasticity of the form Var(y|x) = σ²h(x) [see equation (8.21)], the presence of heteroskedasticity affects the point prediction of y only insofar as it affects estimation of the βⱼ. Of course, it is natural to use WLS on a sample of size n to obtain the β̂ⱼ. Our prediction of an unobserved outcome, y⁰, given known values of the explanatory variables x⁰, has the same form as in Section 6.4: ŷ⁰ = β̂₀ + x⁰β̂. This makes sense: once we know E(y|x), we base our prediction on it; the structure of Var(y|x) plays no direct role.

On the other hand, prediction intervals do depend directly on the nature of Var(y|x). Recall in Section 6.4 that we constructed a prediction interval under the classical linear model assumptions. Suppose now that all the CLM assumptions hold, except that (8.21) replaces the homoskedasticity assumption, MLR.5. We know that the WLS estimators are BLUE and, because of normality, have (conditional) normal distributions. We can obtain se(ŷ⁰) using the same method in Section 6.4, except that now we use WLS. [A simple approach is to write yᵢ = θ₀ + β₁(xᵢ₁ − x⁰₁) + … + βₖ(xᵢₖ − x⁰ₖ) + uᵢ, where the x⁰ⱼ are the values of the explanatory variables for which we want a predicted value of y. We can estimate this equation by WLS and then obtain ŷ⁰ = θ̂₀ and se(ŷ⁰) = se(θ̂₀).] We also need to estimate the standard deviation of u⁰, the unobserved part of y⁰. But Var(u⁰|x = x⁰) = σ²h(x⁰), and so se(u⁰) = σ̂√h(x⁰), where σ̂ is the standard error of the regression from the WLS estimation. Therefore, a 95% prediction interval is

ŷ⁰ ± t.025 · se(ê⁰),        (8.37)

where se(ê⁰) = {[se(ŷ⁰)]² + σ̂²h(x⁰)}¹ᐟ².

This interval is exact only if we do not have to estimate the variance function. If we estimate parameters, as in model (8.30), then we cannot obtain an exact interval. In fact, accounting for the estimation error in the β̂ⱼ and the δ̂ⱼ (the variance parameters) becomes very difficult. We saw two examples in Section 6.4 where the estimation error in the parameters was swamped by the variation in the unobservables, u⁰. Therefore, we might still use equation (8.37) with h(x⁰) simply replaced by ĥ(x⁰). In fact, if we are to ignore the parameter estimation error entirely, we can drop se(ŷ⁰) from se(ê⁰). [Remember, se(ŷ⁰) converges to zero at the rate 1/√n, while se(û⁰) is roughly constant.]
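A sketch of the interval in (8.37) in Python follows. It assumes y, X, the estimated in-sample variance values h, the variance function evaluated at the new point h_x0, and a row vector x0 (with a leading 1 for the intercept) are already available; all of these names are placeholders.

import numpy as np
from scipy import stats
import statsmodels.api as sm

wls = sm.WLS(y, X, weights=1.0 / h).fit()

y0_hat = float(x0 @ wls.params)                        # point prediction y-hat^0
se_y0 = float(np.sqrt(x0 @ wls.cov_params() @ x0))     # se(y-hat^0)
sigma_hat = np.sqrt(wls.scale)                         # sigma-hat from the WLS fit

se_e0 = np.sqrt(se_y0**2 + sigma_hat**2 * h_x0)        # se(e-hat^0) as in (8.37)
t025 = stats.t.ppf(0.975, df=wls.df_resid)
lo, hi = y0_hat - t025 * se_e0, y0_hat + t025 * se_e0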
We can also obtain a prediction for y in the model

log(y) = β₀ + β₁x₁ + … + βₖxₖ + u,        (8.38)

where u is heteroskedastic. We assume that u has a conditional normal distribution with a specific form of heteroskedasticity. We assume the exponential form in equation (8.30), but add the normality assumption:

u|x₁, x₂, …, xₖ ~ Normal[0, exp(δ₀ + δ₁x₁ + … + δₖxₖ)].        (8.39)

As a notational shorthand, write the variance function as exp(δ₀ + xδ). Then, because log(y) given x has a normal distribution with mean β₀ + xβ and variance exp(δ₀ + xδ), it follows that

E(y|x) = exp(β₀ + xβ + exp(δ₀ + xδ)/2).        (8.40)

Now, we estimate the βⱼ and δⱼ using WLS estimation of (8.38). That is, after using OLS to obtain the residuals, run the regression in (8.32) to obtain fitted values,

ĝᵢ = α̂₀ + δ̂₁xᵢ₁ + … + δ̂ₖxᵢₖ,        (8.41)

and then compute the ĥᵢ as in (8.33). Using these ĥᵢ, obtain the WLS estimates, β̂ⱼ, and also compute σ̂² from the weighted squared residuals. Now, compared with the original model for Var(u|x), δ₀ = α₀ + log(σ²), and so Var(u|x) = σ²exp(α₀ + δ₁x₁ + … + δₖxₖ). Therefore, the estimated variance is σ̂²exp(ĝᵢ) = σ̂²ĥᵢ, and the fitted value for yᵢ is

ŷᵢ = exp(logŷᵢ + σ̂²ĥᵢ/2),        (8.42)

where logŷᵢ denotes the fitted value from the WLS estimation of (8.38). We can use these fitted values to obtain an R-squared measure, as described in Section 6.4: use the squared correlation coefficient between yᵢ and ŷᵢ.

For any values of the explanatory variables x⁰, we can estimate E(y|x = x⁰) as

Ê(y|x = x⁰) = exp(β̂₀ + x⁰β̂ + σ̂²exp(α̂₀ + x⁰δ̂)/2),        (8.43)

where the β̂ⱼ are the WLS estimates, α̂₀ is the intercept in (8.41), the δ̂ⱼ are the slopes from the same regression, and σ̂² is obtained from the WLS estimation. Obtaining a proper standard error for the prediction in (8.42) is very complicated analytically, but, as in Section 6.4, it would be fairly easy to obtain a standard error using a resampling method such as the bootstrap described in Appendix 6A.

Obtaining a prediction interval is more of a challenge when we estimate a model for heteroskedasticity, and a full treatment is complicated. Nevertheless, we saw in Section 6.4 two examples where the error variance swamps the estimation error, and we would make only a small mistake by ignoring the estimation error in all parameters. Using arguments similar to those in Section 6.4, an approximate 95% prediction interval (for large sample sizes) is

exp[−1.96·σ̂√ĥ(x⁰)]·exp(β̂₀ + x⁰β̂)  to  exp[1.96·σ̂√ĥ(x⁰)]·exp(β̂₀ + x⁰β̂),

where ĥ(x⁰) is the estimated variance function evaluated at x⁰, ĥ(x⁰) = exp(α̂₀ + δ̂₁x⁰₁ + … + δ̂ₖx⁰ₖ). As in Section 6.4, we obtain this approximate interval by simply exponentiating the endpoints.
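The prediction steps (8.41) through (8.43) can be sketched as follows; X, logy, and x0 are assumed placeholders, and this is only an illustration of the sequence just described, not a packaged routine.

import numpy as np
import statsmodels.api as sm

ols = sm.OLS(logy, X).fit()

# (8.41): regress log(u-hat^2) on the regressors to get g-hat_i ...
var_reg = sm.OLS(np.log(ols.resid**2), X).fit()
h = np.exp(var_reg.fittedvalues)              # h-hat_i as in (8.33)

# ... then WLS with weights 1/h-hat_i; sigma-hat^2 comes from the weighted SSR.
wls = sm.WLS(logy, X, weights=1.0 / h).fit()
sigma2 = wls.scale

# (8.42): fitted values for y itself.
yhat = np.exp(wls.fittedvalues + sigma2 * h / 2)

# (8.43): estimated E(y | x = x0).
h_x0 = np.exp(float(x0 @ var_reg.params))
Ey_x0 = np.exp(float(x0 @ wls.params) + sigma2 * h_x0 / 2)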
8.5 The Linear Probability Model Revisited

As we saw in Section 7.5, when the dependent variable y is a binary variable, the model must contain heteroskedasticity, unless all of the slope parameters are zero. We are now in a position to deal with this problem.

The simplest way to deal with heteroskedasticity in the linear probability model is to continue to use OLS estimation, but to also compute robust standard errors in test statistics. This ignores the fact that we actually know the form of heteroskedasticity for the LPM. Nevertheless, OLS estimation of the LPM is simple and often produces satisfactory results.

Example 8.8 Labor Force Participation of Married Women

In the labor force participation example in Section 7.5 [see equation (7.29)], we reported the usual OLS standard errors. Now, we compute the heteroskedasticity-robust standard errors as well. These are reported in brackets below the usual standard errors:

înlf = .586 − .0034 nwifeinc + .038 educ + .039 exper
      (.154)  (.0014)          (.007)      (.006)
      [.151]  [.0015]          [.007]      [.006]
     − .00060 exper² − .016 age − .262 kidslt6 + .0130 kidsge6        (8.44)
       (.00018)        (.002)     (.034)         (.0132)
       [.00019]        [.002]     [.032]         [.0135]
n = 753, R² = .264.

Several of the robust and OLS standard errors are the same to the reported degree of precision; in all cases, the differences are practically very small. Therefore, while heteroskedasticity is a problem in theory, it is not in practice, at least not for this example. It often turns out that the usual OLS standard errors and test statistics are similar to their heteroskedasticity-robust counterparts. Furthermore, it requires a minimal effort to compute both.

Generally, the OLS estimators are inefficient in the LPM. Recall that the conditional variance of y in the LPM is

Var(y|x) = p(x)[1 − p(x)],        (8.45)

where

p(x) = β₀ + β₁x₁ + … + βₖxₖ        (8.46)

is the response probability (probability of success, y = 1). It seems natural to use weighted least squares, but there are a couple of hitches. The probability p(x) clearly depends on the unknown population parameters, βⱼ. Nevertheless, we do have unbiased estimators of these parameters, namely the OLS estimators. When the OLS estimators are plugged into equation (8.46), we obtain the OLS fitted values. Thus, for each observation i, Var(yᵢ|xᵢ) is estimated by

ĥᵢ = ŷᵢ(1 − ŷᵢ),        (8.47)

where ŷᵢ is the OLS fitted value for observation i. Now, we apply feasible GLS, just as in Section 8.4.

Unfortunately, being able to estimate hᵢ for each i does not mean that we can proceed directly with WLS estimation. The problem is one that we briefly discussed in Section 7.5: the fitted values ŷᵢ need not fall in the unit interval. If either ŷᵢ < 0 or ŷᵢ > 1, equation (8.47) shows that ĥᵢ will be negative. Since WLS proceeds by multiplying observation i by 1/√ĥᵢ, the method will fail if ĥᵢ is negative (or zero) for any observation. In other words, all of the weights for WLS must be positive. In some cases, 0 < ŷᵢ < 1 for all i, in which case WLS can be used to estimate the LPM. In cases with many observations and small probabilities of success or failure, it is very common to find some fitted values outside the unit interval.
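A minimal sketch of the LPM weighting, mirroring the four-step procedure listed below (y is a 0/1 Series and X a design matrix with a constant; both names are assumptions):

import statsmodels.api as sm

ols = sm.OLS(y, X).fit()
yhat = ols.fittedvalues

# All fitted values must lie strictly inside the unit interval for (8.47)
# to deliver positive weights.
if ((yhat <= 0) | (yhat >= 1)).any():
    raise ValueError("some fitted values fall outside (0, 1); see the text")

h = yhat * (1 - yhat)                      # h-hat_i = y-hat_i(1 - y-hat_i)
wls = sm.WLS(y, X, weights=1.0 / h).fit()  # weight each observation by 1/h-hat_i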
If this happens, as it does in the labor force participation example in equation (8.44), it is easiest to abandon WLS and to report the heteroskedasticity-robust statistics. An alternative is to adjust those fitted values that are less than zero or greater than unity, and then to apply WLS. One suggestion is to set ŷᵢ = .01 if ŷᵢ < 0 and ŷᵢ = .99 if ŷᵢ > 1. Unfortunately, this requires an arbitrary choice on the part of the researcher (for example, why not use .001 and .999 as the adjusted values?). If many fitted values are outside the unit interval, the adjustment to the fitted values can affect the results; in this situation, it is probably best to just use OLS.

Estimating the Linear Probability Model by Weighted Least Squares:
1. Estimate the model by OLS and obtain the fitted values, ŷ.
2. Determine whether all of the fitted values are inside the unit interval. If so, proceed to step 3. If not, some adjustment is needed to bring all fitted values into the unit interval.
3. Construct the estimated variances in equation (8.47).
4. Estimate the equation y = β₀ + β₁x₁ + … + βₖxₖ + u by WLS, using weights 1/ĥ.

Example 8.9 Determinants of Personal Computer Ownership

We use the data in GPA1 to estimate the probability of owning a computer. Let PC denote a binary indicator equal to unity if the student owns a computer, and zero otherwise. The variable hsGPA is high school GPA, ACT is achievement test score, and parcoll is a binary indicator equal to unity if at least one parent attended college. (Separate college indicators for the mother and the father do not yield individually significant results, as these are pretty highly correlated.) The equation estimated by OLS is

P̂C = −.0004 + .065 hsGPA + .0006 ACT + .221 parcoll
     (.4905)   (.137)       (.0155)     (.093)        (8.48)
     [.4888]   [.139]       [.0158]     [.087]
n = 141, R² = .0415.

Just as with Example 8.8, there are no striking differences between the usual and robust standard errors. Nevertheless, we also estimate the model by WLS. Because all of the OLS fitted values are inside the unit interval, no adjustments are needed:

P̂C = .026 + .033 hsGPA + .0043 ACT + .215 parcoll
     (.477)  (.130)       (.0155)     (.086)          (8.49)
n = 141, R² = .0464.

There are no important differences in the OLS and WLS estimates. The only significant explanatory variable is parcoll, and in both cases we estimate that the probability of PC ownership is about .22 higher if at least one parent attended college.

Summary

We began by reviewing the properties of ordinary least squares in the presence of heteroskedasticity. Heteroskedasticity does not cause bias or inconsistency in the OLS estimators, but the usual standard errors and test statistics are no longer valid. We showed how to compute heteroskedasticity-robust standard errors and t statistics, something that is routinely done by many regression packages. Most regression packages also compute a heteroskedasticity-robust F-type statistic.

We discussed two common ways to test for heteroskedasticity: the Breusch-Pagan test and a special case of the White test. Both of these statistics involve regressing the squared OLS residuals on either the independent variables (BP) or the fitted and squared fitted values (White).
A simple F test is asymptotically valid; there are also Lagrange multiplier versions of the tests.

OLS is no longer the best linear unbiased estimator in the presence of heteroskedasticity. When the form of heteroskedasticity is known, GLS estimation can be used. This leads to weighted least squares as a means of obtaining the BLUE estimator. The test statistics from the WLS estimation are either exactly valid when the error term is normally distributed or asymptotically valid under nonnormality. This assumes, of course, that we have the proper model of heteroskedasticity.

More commonly, we must estimate a model for the heteroskedasticity before applying WLS. The resulting feasible GLS estimator is no longer unbiased, but it is consistent and asymptotically efficient. The usual statistics from the WLS regression are asymptotically valid. We discussed a method to ensure that the estimated variances are strictly positive for all observations, something needed to apply WLS.

As we discussed in Chapter 7, the linear probability model for a binary dependent variable necessarily has a heteroskedastic error term. A simple way to deal with this problem is to compute heteroskedasticity-robust statistics. Alternatively, if all the fitted values (that is, the estimated probabilities) are strictly between zero and one, weighted least squares can be used to obtain asymptotically efficient estimators.

Key Terms

Breusch-Pagan Test for Heteroskedasticity (BP Test); Feasible GLS (FGLS) Estimator; Generalized Least Squares (GLS) Estimators; Heteroskedasticity of Unknown Form; Heteroskedasticity-Robust F Statistic; Heteroskedasticity-Robust LM Statistic; Heteroskedasticity-Robust Standard Error; Heteroskedasticity-Robust t Statistic; Weighted Least Squares (WLS) Estimators; White Test for Heteroskedasticity

Problems

1 Which of the following are consequences of heteroskedasticity?
(i) The OLS estimators, β̂ⱼ, are inconsistent.
(ii) The usual F statistic no longer has an F distribution.
(iii) The OLS estimators are no longer BLUE.

2 Consider a linear model to explain monthly beer consumption:

beer = β₀ + β₁inc + β₂price + β₃educ + β₄female + u,
E(u|inc, price, educ, female) = 0,
Var(u|inc, price, educ, female) = σ²inc².

Write the transformed equation that has a homoskedastic error term.

3 True or False: WLS is preferred to OLS when an important variable has been omitted from the model.

4 Using the data in GPA3, the following equation was estimated for the fall and second semester students:

t̂rmgpa = −2.12 + .900 crsgpa + .193 cumgpa + .0014 tothrs
         (.55)   (.175)        (.064)        (.0012)
         [.55]   [.166]        [.074]        [.0012]
       + .0018 sat − .0039 hsperc + .351 female − .157 season
         (.0002)     (.0018)        (.085)        (.098)
         [.0002]     [.0019]        [.079]        [.080]
n = 269, R² = .465.

Here, trmgpa is term GPA, crsgpa is a weighted average of overall GPA in courses taken, cumgpa is GPA prior to the current semester, tothrs is total credit hours prior to the semester, sat is SAT score, hsperc is graduating percentile in high school class, female is a gender dummy, and season is a dummy variable equal to unity if the student's sport is in season during the fall. The usual and heteroskedasticity-robust standard errors are reported in parentheses and brackets, respectively.
(i) Do the variables crsgpa, cumgpa, and tothrs have the expected estimated effects? Which of these variables are statistically significant at the 5% level? Does it matter which standard errors are used?
(ii) Why does the hypothesis H₀: β_crsgpa = 1 make sense? Test this hypothesis against the two-sided alternative at the 5% level, using both standard errors. Describe your conclusions.
(iii) Test whether there is an in-season effect on term GPA, using both standard errors. Does the significance level at which the null can be rejected depend on the standard error used?

5 The variable smokes is a binary variable equal to one if a person smokes, and zero otherwise. Using the data in SMOKE, we estimate a linear probability model for smokes:

ŝmokes = .656 − .069 log(cigpric) + .012 log(income) − .029 educ
         (.855)  (.204)              (.026)             (.006)
         [.856]  [.207]              [.026]             [.006]
       + .020 age − .00026 age² − .101 restaurn − .026 white
         (.006)     (.00006)      (.039)          (.052)
         [.005]     [.00006]      [.038]          [.050]
n = 807, R² = .062.

The variable white equals one if the respondent is white, and zero otherwise; the other independent variables are defined in Example 8.7. Both the usual and heteroskedasticity-robust standard errors are reported.
(i) Are there any important differences between the two sets of standard errors?
(ii) Holding other factors fixed, if education increases by four years, what happens to the estimated probability of smoking?
(iii) At what point does another year of age reduce the probability of smoking?
(iv) Interpret the coefficient on the binary variable restaurn (a dummy variable equal to one if the person lives in a state with restaurant smoking restrictions).
(v) Person number 206 in the data set has the following characteristics: cigpric = 67.44, income = 6,500, educ = 16, age = 77, restaurn = 0, white = 0, and smokes = 0. Compute the predicted probability of smoking for this person and comment on the result.

6 There are different ways to combine features of the Breusch-Pagan and White tests for heteroskedasticity. One possibility not covered in the text is to run the regression

ûᵢ² on xᵢ₁, xᵢ₂, …, xᵢₖ, ŷᵢ², i = 1, …, n,

where the ûᵢ are the OLS residuals and the ŷᵢ are the OLS fitted values. Then, we would test joint significance of xᵢ₁, xᵢ₂, …, xᵢₖ and ŷᵢ². (Of course, we always include an intercept in this regression.)
(i) What are the df associated with the proposed F test for heteroskedasticity?
(ii) Explain why the R-squared from the regression above will always be at least as large as the R-squareds for the BP regression and the special case of the White test.
(iii) Does part (ii) imply that the new test always delivers a smaller p-value than either the BP or special case of the White statistic? Explain.
(iv) Suppose someone suggests also adding ŷᵢ to the newly proposed test. What do you think of this idea?

7 Consider a model at the employee level,

y_{i,e} = β₀ + β₁x_{i,e,1} + β₂x_{i,e,2} + … + βₖx_{i,e,k} + f_i + v_{i,e},

where the unobserved variable f_i is a "firm effect" to each employee at a given firm i. The error term v_{i,e} is specific to employee e at firm i. The composite error is u_{i,e} = f_i + v_{i,e}, such as in equation (8.28).
(i) Assume that Var(f_i) = σ_f², Var(v_{i,e}) = σ_v², and f_i and v_{i,e} are uncorrelated. Show that Var(u_{i,e}) = σ_f² + σ_v²; call this σ².
(ii) Now, suppose that for e ≠ g, v_{i,e} and v_{i,g} are uncorrelated. Show that Cov(u_{i,e}, u_{i,g}) = σ_f².
(iii) Let ū_i = m_i⁻¹ Σ_{e=1}^{m_i} u_{i,e} be the average of the composite errors within a firm. Show that Var(ū_i) = σ_f² + σ_v²/m_i.
(iv) Discuss the relevance of part (iii) for WLS estimation using data averaged at the firm level, where the weight used for observation i is the usual firm size.

8 The following equations were estimated using the data in ECONMATH. The first equation is for men and the second is for women. The third and fourth equations combine men and women:

ŝcore = 20.52 + 13.60 colgpa + 0.670 act
        (3.72)  (0.94)         (0.150)
n = 406, R² = .4025, SSR = 38,781.38

ŝcore = 13.79 + 11.89 colgpa + 1.03 act
        (4.11)  (1.09)         (0.18)
n = 408, R² = .3666, SSR = 48,029.82

ŝcore = 15.60 + 3.17 male + 12.82 colgpa + 0.838 act
        (2.80)  (0.73)      (0.72)         (0.116)
n = 814, R² = .3946, SSR = 87,128.96

ŝcore = 13.79 + 6.73 male + 11.89 colgpa + 1.03 act + 1.72 male·colgpa − 0.364 male·act
        (3.91)  (5.55)      (1.04)         (0.17)     (1.44)             (0.232)
n = 814, R² = .3968, SSR = 86,811.20

(i) Compute the usual Chow statistic for testing the null hypothesis that the regression equations are the same for men and women. Find the p-value of the test.
(ii) Compute the usual Chow statistic for testing the null hypothesis that the slope coefficients are the same for men and women, and report the p-value.
(iii) Do you have enough information to compute heteroskedasticity-robust versions of the tests in parts (i) and (ii)? Explain.

Computer Exercises

C1 Consider the following model to explain sleeping behavior:

sleep = β₀ + β₁totwrk + β₂educ + β₃age + β₄age² + β₅yngkid + β₆male + u.

(i) Write down a model that allows the variance of u to differ between men and women. The variance should not depend on other factors.
(ii) Use the data in SLEEP75 to estimate the parameters of the model for heteroskedasticity. (You have to estimate the sleep equation by OLS, first, to obtain the OLS residuals.) Is the estimated variance of u higher for men or for women?
(iii) Is the variance of u statistically different for men and for women?

C2 (i) Use the data in HPRICE1 to obtain the heteroskedasticity-robust standard errors for equation (8.17). Discuss any important differences with the usual standard errors.
(ii) Repeat part (i) for equation (8.18).
(iii) What does this example suggest about heteroskedasticity and the transformation used for the dependent variable?

C3 Apply the full White test for heteroskedasticity [see equation (8.19)] to equation (8.18).
Using the chi-square form of the statistic, obtain the p-value. What do you conclude?

C4 Use VOTE1 for this exercise.
(i) Estimate a model with voteA as the dependent variable and prtystrA, democA, log(expendA), and log(expendB) as independent variables. Obtain the OLS residuals, ûᵢ, and regress these on all of the independent variables. Explain why you obtain R² = 0.
(ii) Now, compute the Breusch-Pagan test for heteroskedasticity. Use the F statistic version and report the p-value.
(iii) Compute the special case of the White test for heteroskedasticity, again using the F statistic form. How strong is the evidence for heteroskedasticity now?

C5 Use the data in PNTSPRD for this exercise.
(i) The variable sprdcvr is a binary variable equal to one if the Las Vegas point spread for a college basketball game was covered. The expected value of sprdcvr, say μ, is the probability that the spread is covered in a randomly selected game. Test H₀: μ = .5 against H₁: μ ≠ .5 at the 10% significance level and discuss your findings. (Hint: This is easily done using a t test by regressing sprdcvr on an intercept only.)
(ii) How many games in the sample of 553 were played on a neutral court?
(iii) Estimate the linear probability model

sprdcvr = β₀ + β₁favhome + β₂neutral + β₃fav25 + β₄und25 + u

and report the results in the usual form. (Report the usual OLS standard errors and the heteroskedasticity-robust standard errors.) Which variable is most significant, both practically and statistically?
(iv) Explain why, under the null hypothesis H₀: β₁ = β₂ = β₃ = β₄ = 0, there is no heteroskedasticity in the model.
(v) Use the usual F statistic to test the hypothesis in part (iv). What do you conclude?
(vi) Given the previous analysis, would you say that it is possible to systematically predict whether the Las Vegas spread will be covered using information available prior to the game?

C6 In Example 7.12, we estimated a linear probability model for whether a young man was arrested during 1986:

arr86 = β₀ + β₁pcnv + β₂avgsen + β₃tottime + β₄ptime86 + β₅qemp86 + u.

(i) Using the data in CRIME1, estimate this model by OLS and verify that all fitted values are strictly between zero and one. What are the smallest and largest fitted values?
(ii) Estimate the equation by weighted least squares, as discussed in Section 8.5.
(iii) Use the WLS estimates to determine whether avgsen and tottime are jointly significant at the 5% level.

C7 Use the data in LOANAPP for this exercise.
(i) Estimate the equation in part (iii) of Computer Exercise C8 in Chapter 7, computing the heteroskedasticity-robust standard errors. Compare the 95% confidence interval on β_white with the nonrobust confidence interval.
(ii) Obtain the fitted values from the regression in part (i). Are any of them less than zero? Are any of them greater than one? What does this mean about applying weighted least squares?

C8 Use the data set GPA1 for this exercise.
(i) Use OLS to estimate a model relating colGPA to hsGPA, ACT, skipped, and PC. Obtain the OLS residuals.
(ii) Compute the special case of the White test for heteroskedasticity. In the regression of ûᵢ² on ĉolGPAᵢ and ĉolGPAᵢ² (where the ĉolGPAᵢ are the OLS fitted values), obtain the fitted values, say ĥᵢ.
(iii) Verify that the fitted values from part (ii) are all strictly positive. Then, obtain the weighted least squares estimates using weights 1/ĥᵢ. Compare the weighted least squares estimates for the effect of skipping lectures and the effect of PC ownership with the corresponding OLS estimates. What about their statistical significance?
(iv) In the WLS estimation from part (iii), obtain heteroskedasticity-robust standard errors. In other words, allow for the fact that the variance function estimated in part (ii) might be misspecified. (See Exploring Further 8.4.) Do the standard errors change much from part (iii)?

C9 In Example 8.7, we computed the OLS and a set of WLS estimates in a cigarette demand equation.
(i) Obtain the OLS estimates in equation (8.35).
(ii) Obtain the ĥᵢ used in the WLS estimation of equation (8.36) and reproduce equation (8.36). From this equation, obtain the unweighted residuals and fitted values; call these ûᵢ and ŷᵢ, respectively. (For example, in Stata, the unweighted residuals and fitted values are given by default.)
(iii) Let ũᵢ = ûᵢ/√ĥᵢ and ỹᵢ = ŷᵢ/√ĥᵢ be the weighted quantities. Carry out the special case of the White test for heteroskedasticity by regressing ũᵢ² on ỹᵢ, ỹᵢ², being sure to include an intercept, as always. Do you find heteroskedasticity in the weighted residuals?
(iv) What does the finding from part (iii) imply about the proposed form of heteroskedasticity used in obtaining (8.36)?
(v) Obtain valid standard errors for the WLS estimates that allow the variance function to be misspecified.

C10 Use the data set 401KSUBS for this exercise.
(i) Using OLS, estimate a linear probability model for e401k, using as explanatory variables inc, inc², age, age², and male. Obtain both the usual OLS standard errors and the heteroskedasticity-robust versions. Are there any important differences?
(ii) In the special case of the White test for heteroskedasticity, where we regress the squared OLS residuals on a quadratic in the OLS fitted values, ûᵢ² on ŷᵢ, ŷᵢ², i = 1, …, n, argue that the probability limit of the coefficient on ŷᵢ should be one, the probability limit of the coefficient on ŷᵢ² should be −1, and the probability limit of the intercept should be zero. [Hint: Remember that Var(y|x₁, …, xₖ) = p(x)[1 − p(x)], where p(x) = β₀ + β₁x₁ + … + βₖxₖ.]
(iii) For the model estimated from part (i), obtain the White test and see if the coefficient estimates roughly correspond to the theoretical values described in part (ii).
(iv) After verifying that the fitted values from part (i) are all between zero and one, obtain the weighted least squares estimates of the linear probability model. Do they differ in important ways from the OLS estimates?

C11 Use the data in 401KSUBS for this question, restricting the sample to fsize = 1.
(i) To the model estimated in Table 8.1, add the interaction term e401k·inc. Estimate the equation by OLS and obtain the usual and robust standard errors. What do you conclude about the statistical significance of the interaction term?
(ii) Now, estimate the more general model by WLS using the same weights, 1/incᵢ, as in Table 8.1. Compute the usual and robust standard error for the WLS estimator. Is the interaction term statistically significant using the robust standard error?
(iii) Discuss the WLS coefficient on e401k in the more general model. Is it of much interest by itself? Explain.
(iv) Reestimate the model by WLS, but use the interaction term e401k·(inc − 30); the average income in the sample is about 29.44. Now, interpret the coefficient on e401k.

C12 Use the data in MEAP00 to answer this question.
(i) Estimate the model
math4 = β₀ + β₁lunch + β₂log(enroll) + β₃log(exppp) + u

by OLS, and obtain the usual standard errors and the fully robust standard errors. How do they generally compare?
(ii) Apply the special case of the White test for heteroskedasticity. What is the value of the F test? What do you conclude?
(iii) Obtain ĝᵢ as the fitted values from the regression of log(ûᵢ²) on m̂ath4ᵢ, m̂ath4ᵢ², where the m̂ath4ᵢ are the OLS fitted values and the ûᵢ are the OLS residuals. Let ĥᵢ = exp(ĝᵢ). Use the ĥᵢ to obtain WLS estimates. Are there big differences with the OLS coefficients?
(iv) Obtain the standard errors for WLS that allow misspecification of the variance function. Do these differ much from the usual WLS standard errors?
(v) For estimating the effect of spending on math4, does OLS or WLS appear to be more precise?

C13 Use the data in FERTIL2 to answer this question.
(i) Estimate the model

children = β₀ + β₁age + β₂age² + β₃educ + β₄electric + β₅urban + u

and report the usual and heteroskedasticity-robust standard errors. Are the robust standard errors always bigger than the nonrobust ones?
(ii) Add the three religious dummy variables and test whether they are jointly significant. What are the p-values for the nonrobust and robust tests?
(iii) From the regression in part (ii), obtain the fitted values ŷ and the residuals, û. Regress û² on ŷ, ŷ² and test the joint significance of the two regressors. Conclude that heteroskedasticity is present in the equation for children.
(iv) Would you say the heteroskedasticity you found in part (iii) is practically important?

C14 Use the data in BEAUTY for this question.
(i) Using the data pooled for men and women, estimate the equation

lwage = β₀ + β₁belavg + β₂abvavg + β₃female + β₄educ + β₅exper + β₆exper² + u

and report the results using heteroskedasticity-robust standard errors below coefficients. Are any of the coefficients surprising in either their signs or magnitudes? Is the coefficient on female practically large and statistically significant?
(ii) Add interactions of female with all other explanatory variables in the equation from part (i) (five interactions in all). Compute the usual F test of joint significance of the five interactions and a heteroskedasticity-robust version. Does using the heteroskedasticity-robust version change the outcome in any important way?
(iii) In the full model with interactions, determine whether those involving the looks variables, female·belavg and female·abvavg, are jointly significant. Are their coefficients practically small?

Chapter 9  More on Specification and Data Issues

In Chapter 8, we dealt with one failure of the Gauss-Markov assumptions. While heteroskedasticity in the errors can be
viewed as a problem with a model, it is a relatively minor one. The presence of heteroskedasticity does not cause bias or inconsistency in the OLS estimators. Also, it is fairly easy to adjust confidence intervals and t and F statistics to obtain valid inference after OLS estimation, or even to get more efficient estimators by using weighted least squares.

In this chapter, we return to the much more serious problem of correlation between the error, u, and one or more of the explanatory variables. Remember from Chapter 3 that if u is, for whatever reason, correlated with the explanatory variable xⱼ, then we say that xⱼ is an endogenous explanatory variable. We also provide a more detailed discussion on three reasons why an explanatory variable can be endogenous; in some cases, we discuss possible remedies.

We have already seen in Chapters 3 and 5 that omitting a key variable can cause correlation between the error and some of the explanatory variables, which generally leads to bias and inconsistency in all of the OLS estimators. In the special case that the omitted variable is a function of an explanatory variable in the model, the model suffers from functional form misspecification. We begin in the first section by discussing the consequences of functional form misspecification and how to test for it. In Section 9.2, we show how the use of proxy variables can solve, or at least mitigate, omitted variables bias. In Section 9.3, we derive and explain the bias in OLS that can arise under certain forms of measurement error. Additional data problems are discussed in Section 9.4.

All of the procedures in this chapter are based on OLS estimation. As we will see, certain problems that cause correlation between the error and some explanatory variables cannot be solved by using OLS on a single cross section. We postpone a treatment of alternative estimation methods until Part 3.

9.1 Functional Form Misspecification

A multiple regression model suffers from functional form misspecification when it does not properly account for the relationship between the dependent and the observed explanatory variables. For example, if hourly wage is determined by log(wage) = β₀ + β₁educ + β₂exper + β₃exper² + u, but we omit the squared experience term, exper², then we are committing a functional form misspecification. We already know from Chapter 3 that this generally leads to biased estimators of β₀, β₁, and β₂. (We do not estimate β₃ because exper² is excluded from the model.) Thus, misspecifying how exper affects log(wage) generally results in a biased estimator of the return to education, β₁. The amount of this bias depends on the size of β₃ and the correlation among educ, exper, and exper².

Things are worse for estimating the return to experience: even if we could get an unbiased estimator of β₂, we would not be able to estimate the return to experience because it equals β₂ + 2β₃exper (in decimal form). Just using the biased estimator of β₂ can be misleading, especially at extreme values of exper.

As another example, suppose the log(wage) equation is
log(wage) = β₀ + β₁educ + β₂exper + β₃exper²
          + β₄female + β₅female·educ + u,        (9.1)

where female is a binary variable. If we omit the interaction term, female·educ, then we are misspecifying the functional form. In general, we will not get unbiased estimators of any of the other parameters, and since the return to education depends on gender, it is not clear what return we would be estimating by omitting the interaction term.

Omitting functions of independent variables is not the only way that a model can suffer from misspecified functional form. For example, if (9.1) is the true model satisfying the first four Gauss-Markov assumptions, but we use wage rather than log(wage) as the dependent variable, then we will not obtain unbiased or consistent estimators of the partial effects. The tests that follow have some ability to detect this kind of functional form problem, but there are better tests that we will mention in the subsection on testing against nonnested alternatives.

Misspecifying the functional form of a model can certainly have serious consequences. Nevertheless, in one important respect, the problem is minor: by definition, we have data on all the necessary variables for obtaining a functional relationship that fits the data well. This can be contrasted with the problem addressed in the next section, where a key variable is omitted on which we cannot collect data.

We already have a very powerful tool for detecting misspecified functional form: the F test for joint exclusion restrictions. It often makes sense to add quadratic terms of any significant variables to a model and to perform a joint test of significance. If the additional quadratics are significant, they can be added to the model (at the cost of complicating the interpretation of the model). However, significant quadratic terms can be symptomatic of other functional form problems, such as using the level of a variable when the logarithm is more appropriate, or vice versa. It can be difficult to pinpoint the precise reason that a functional form is misspecified. Fortunately, in many cases, using logarithms of certain variables and adding quadratics are sufficient for detecting many important nonlinear relationships in economics.

Example 9.1 Economic Model of Crime

Table 9.1 contains OLS estimates of the economic model of crime (see Example 8.3). We first estimate the model without any quadratic terms; those results are in column (1).

In column (2), the squares of pcnv, ptime86, and inc86 are added; we chose to include the squares of these variables because each level term is significant in column (1). The variable qemp86 is a discrete variable taking on only five values, so we do not include its square in column (2).

Each of the squared terms is significant, and together they are jointly very significant (F = 31.37, with df = 3 and 2,713; the p-value is essentially zero). Thus, it appears that the initial model overlooked some potentially important nonlinearities.

The presence of the quadratics makes interpreting the model somewhat difficult. For example, pcnv no longer has a strict deterrent effect: the relationship
between narr86 and pcnv is positive up until pcnv = .365, and then the relationship is negative. We might conclude that there is little or no deterrent effect at lower values of pcnv; the effect only kicks in at higher prior conviction rates. We would have to use more sophisticated functional forms than the quadratic to verify this conclusion. It may be that pcnv is not entirely exogenous. For example, men who have not been convicted in the past (so that pcnv = 0) are perhaps casual criminals, and so they are less likely to be arrested in 1986. (This could be biasing the estimates.)

Similarly, the relationship between narr86 and ptime86 is positive up until ptime86 = 4.85 (almost five months in prison), and then the relationship is negative. The vast majority of men in the sample spent no time in prison in 1986, so again we must be careful in interpreting the results.

Exploring Further 9.1
Why do we not include the squares of black and hispan in column (2) of Table 9.1? Would it make sense to add interactions of black and hispan with some of the other variables reported in the table?

Table 9.1  Dependent Variable: narr86

Independent Variables | (1) | (2)
pcnv      | −.133 (.040)   | .533 (.154)
pcnv²     | —              | −.730 (.156)
avgsen    | −.011 (.012)   | −.017 (.012)
tottime   | .012 (.009)    | .012 (.009)
ptime86   | −.041 (.009)   | .287 (.004)
ptime86²  | —              | −.0296 (.0039)
qemp86    | −.051 (.014)   | −.014 (.017)
inc86     | −.0015 (.0003) | −.0034 (.0008)
inc86²    | —              | .000007 (.000003)
black     | .327 (.045)    | .292 (.045)
hispan    | .194 (.040)    | .164 (.039)
intercept | .596 (.036)    | .505 (.037)
Observations | 2,725 | 2,725
R-squared    | .0723 | .1035
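The turning points quoted in this example all come from the same formula: for a fit b₁x + b₂x², the turning point is x* = |b₁/(2b₂)|. A quick check, using only the column (2) coefficients from Table 9.1 (plain arithmetic, nothing assumed beyond the table):

print(0.533 / (2 * 0.730))        # pcnv:    about .365
print(0.287 / (2 * 0.0296))       # ptime86: about 4.85
print(0.0034 / (2 * 0.000007))    # inc86:   about 242.86 (discussed next)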
we must decide how many functions of the fitted values to include in an expanded regression There is no right answer to this question but the squared and cubed terms have proven to be useful in most applications Let y denote the OLS fitted values from estimating 92 Consider the expanded equation y 5 b0 1 b1x1 1 p 1 bkxk 1 d1y2 1 d2y3 1 error 93 This equation seems a little odd because functions of the fitted values from the initial estimation now appear as explanatory variables In fact we will not be interested in the estimated parameters from 93 we only use this equation to test whether 92 has missed important nonlinearities The thing to remember is that y2 and y3 are just nonlinear functions of the xj The null hypothesis is that 92 is correctly specified Thus RESET is the F statistic for test ing H0 d1 5 0 d2 5 0 in the expanded model 93 A significant F statistic suggests some sort of functional form problem The distribution of the F statistic is approximately F2n2k23 in large samples under the null hypothesis and the GaussMarkov assumptions The df in the expanded equation 93 is n 2 k 2 1 2 2 5 n 2 k 2 3 An LM version is also available and the chisquare distribution will have two df Further the test can be made robust to heteroskedasticity using the methods discussed in Section 82 ExamplE 92 Housing price Equation We estimate two models for housing prices The first one has all variables in level form price 5 b0 1 b1lotsize 1 b2sqrft 1 b3bdrms 1 u 94 The second one uses the logarithms of all variables except bdrms lprice 5 b0 1 b1llotsize 1 b2lsqrft 1 b3bdrms 1 u 95 Copyright 2016 Cengage Learning All Rights Reserved May not be copied scanned or duplicated in whole or in part Due to electronic rights some third party content may be suppressed from the eBook andor eChapters Editorial review has deemed that any suppressed content does not materially affect the overall learning experience Cengage Learning reserves the right to remove additional content at any time if subsequent rights restrictions require it PART 1 Regression Analysis with CrossSectional Data 278 Using n 5 88 houses in HPRICE1 the RESET statistic for equation 94 turns out to be 467 this is the value of an F282 random variable 1n 5 88 k 5 32 and the associated pvalue is 012 This is evidence of functional form misspecification in 94 The RESET statistic in 95 is 256 with pvalue 5 084 Thus we do not reject 95 at the 5 significance level although we would at the 10 level On the basis of RESET the loglog model in 95 is preferred In the previous example we tried two models for explaining housing prices One was rejected by RESET while the other was not at least at the 5 level Often things are not so simple A drawback with RESET is that it provides no real direction on how to proceed if the model is rejected Rejecting 94 by using RESET does not immediately suggest that 95 is the next step Equation 95 was estimated because constant elasticity models are easy to interpret and can have nice statistical proper ties In this example it so happens that it passes the functional form test as well Some have argued that RESET is a very general test for model misspecification including unob served omitted variables and heteroskedasticity Unfortunately such use of RESET is largely mis guided It can be shown that RESET has no power for detecting omitted variables whenever they have expectations that are linear in the included independent variables in the model see Wooldridge 2001 Section 21 for a precise statement Further if the functional form is 
properly specified RESET has no power for detecting heteroskedasticity The bottom line is that RESET is a functional form test and nothing more 91b Tests against Nonnested Alternatives Obtaining tests for other kinds of functional form misspecificationfor example trying to decide whether an independent variable should appear in level or logarithmic formtakes us outside the realm of classical hypothesis testing It is possible to test the model y 5 b0 1 b1x1 1 b2x2 1 u 96 against the model y 5 b0 1 b1log1x12 1 b2log1x22 1 u 97 and vice versa However these are nonnested models see Chapter 6 and so we cannot simply use a standard F test Two different approaches have been suggested The first is to construct a comprehen sive model that contains each model as a special case and then to test the restrictions that led to each of the models In the current example the comprehensive model is y 5 g0 1 g1x1 1 g2x2 1 g3log1x12 1 g4log1x22 1 u 98 We can first test H0 g3 5 0 g4 5 0 as a test of 96 We can also test H0 g1 5 0 g2 5 0 as a test of 97 This approach was suggested by Mizon and Richard 1986 Another approach has been suggested by Davidson and MacKinnon 1981 They point out that if model 96 holds with E1u0x1 x22 5 0 the fitted values from the other model 97 should be insig nificant when added to equation 96 Therefore to test whether 96 is the correct model we first estimate model 97 by OLS to obtain the fitted values call these yˇ The DavidsonMacKinnon test is obtained from the t statistic on yˇ in the auxiliary equation y 5 b0 1 b1x1 1 b2x2 1 u1yˇ 1 error Copyright 2016 Cengage Learning All Rights Reserved May not be copied scanned or duplicated in whole or in part Due to electronic rights some third party content may be suppressed from the eBook andor eChapters Editorial review has deemed that any suppressed content does not materially affect the overall learning experience Cengage Learning reserves the right to remove additional content at any time if subsequent rights restrictions require it CHAPTER 9 More on Specification and Data Issues 279 Because the yˇ are just nonlinear functions of x1 and x2 they should be insignificant if 96 is the cor rect conditional mean model Therefore a significant t statistic against a twosided alternative is a rejection of 96 Similarly if y denotes the fitted values from estimating 96 the test of 97 is the t statistic on y in the model y 5 b0 1 b1log1x12 1 b2log1x22 1 u1y 1 error a significant t statistic is evidence against 97 The same two tests can be used for testing any two nonnested models with the same dependent variable There are a few problems with nonnested testing First a clear winner need not emerge Both models could be rejected or neither model could be rejected In the latter case we can use the adjusted Rsquared to choose between them If both models are rejected more work needs to be done However it is important to know the practical consequences from using one form or the other if the effects of key independent variables on y are not very different then it does not really matter which model is used A second problem is that rejecting 96 using say the DavidsonMacKinnon test does not mean that 97 is the correct model Model 96 can be rejected for a variety of functional form misspecifications An even more difficult problem is obtaining nonnested tests when the competing models have different dependent variables The leading case is y versus log1y2 We saw in Chapter 6 that just obtaining goodnessoffit measures that can be compared requires some care Tests 
have been pro posed to solve this problem but they are beyond the scope of this text See Wooldridge 1994a for a test that has a simple interpretation and is easy to implement 92 Using Proxy Variables for Unobserved Explanatory Variables A more difficult problem arises when a model excludes a key variable usually because of data una vailability Consider a wage equation that explicitly recognizes that ability abil affects log1wage2 log1wage2 5 b0 1 b1educ 1 b2exper 1 b3abil 1 u 99 This model shows explicitly that we want to hold ability fixed when measuring the return to educ and exper If say educ is correlated with abil then putting abil in the error term causes the OLS estimator of b1 and b2 to be biased a theme that has appeared repeatedly Our primary interest in equation 99 is in the slope parameters b1 and b2 We do not really care whether we get an unbiased or consistent estimator of the intercept b0 as we will see shortly this is not usually possible Also we can never hope to estimate b3 because abil is not observed in fact we would not know how to interpret b3 anyway since ability is at best a vague concept How can we solve or at least mitigate the omitted variables bias in an equation like 99 One possibility is to obtain a proxy variable for the omitted variable Loosely speaking a proxy variable is something that is related to the unobserved variable that we would like to control for in our analy sis In the wage equation one possibility is to use the intelligence quotient or IQ as a proxy for abil ity This does not require IQ to be the same thing as ability what we need is for IQ to be correlated with ability something we clarify in the following discussion All of the key ideas can be illustrated in a model with three independent variables two of which are observed y 5 b0 1 b1x1 1 b2x2 1 b3xp 3 1 u 910 Copyright 2016 Cengage Learning All Rights Reserved May not be copied scanned or duplicated in whole or in part Due to electronic rights some third party content may be suppressed from the eBook andor eChapters Editorial review has deemed that any suppressed content does not materially affect the overall learning experience Cengage Learning reserves the right to remove additional content at any time if subsequent rights restrictions require it PART 1 Regression Analysis with CrossSectional Data 280 We assume that data are available on y x1 and x2in the wage example these are log1wage2 educ and exper respectively The explanatory variable xp 3 is unobserved but we have a proxy variable for xp 3 Call the proxy variable x3 What do we require of x3 At a minimum it should have some relationship to xp 3 This is captured by the simple regression equation xp 3 5 d0 1 d3x3 1 v3 911 where v3 is an error due to the fact that xp 3 and x3 are not exactly related The parameter d3 measures the relationship between xp 3 and x3 typically we think of xp 3 and x3 as being positively related so that d3 0 If d3 5 0 then x3 is not a suitable proxy for xp 3 The intercept d0 in 911 which can be positive or negative simply allows xp 3 and x3 to be measured on different scales For exam ple unobserved ability is certainly not required to have the same average value as IQ in the US population How can we use x3 to get unbiased or at least consistent estimators of b1 and b2 The proposal is to pretend that x3 and xp 3 are the same so that we run the regression of y on x1 x2 x3 912 We call this the plugin solution to the omitted variables problem because x3 is just plugged in for xp 3 before we run OLS If x3 is truly 
If x3 is truly related to x3*, this seems like a sensible procedure. However, since x3 and x3* are not the same, we should determine when it does in fact give consistent estimators of β1 and β2.

The assumptions needed for the plug-in solution to provide consistent estimators of β1 and β2 can be broken down into assumptions about u and v3:

(1) The error u is uncorrelated with x1, x2, and x3*, which is just the standard assumption in model (9.10). In addition, u is uncorrelated with x3. This latter assumption just means that x3 is irrelevant in the population model once x1, x2, and x3* have been included. This is essentially true by definition, since x3 is a proxy variable for x3*: it is x3* that directly affects y, not x3. Thus, the assumption that u is uncorrelated with x1, x2, x3*, and x3 is not very controversial. (Another way to state this assumption is that the expected value of u, given all these variables, is zero.)

(2) The error v3 is uncorrelated with x1, x2, and x3. Assuming that v3 is uncorrelated with x1 and x2 requires x3 to be a "good" proxy for x3*. This is easiest to see by writing the analog of these assumptions in terms of conditional expectations:

E(x3*|x1, x2, x3) = E(x3*|x3) = δ0 + δ3·x3.  (9.13)

The first equality, which is the most important one, says that, once x3 is controlled for, the expected value of x3* does not depend on x1 or x2. Alternatively, x3* has zero correlation with x1 and x2 once x3 is partialled out.

In the wage equation (9.9), where IQ is the proxy for ability, condition (9.13) becomes

E(abil|educ, exper, IQ) = E(abil|IQ) = δ0 + δ3·IQ.

Thus, the average level of ability changes only with IQ, not with educ and exper. Is this reasonable? Maybe it is not exactly true, but it may be close to being true. It is certainly worth including IQ in the wage equation to see what happens to the estimated return to education.

We can easily see why the previous assumptions are enough for the plug-in solution to work. If we plug equation (9.11) into equation (9.10) and do simple algebra, we get

y = (β0 + β3·δ0) + β1·x1 + β2·x2 + β3·δ3·x3 + u + β3·v3.

Call the composite error in this equation e = u + β3·v3; it depends on the error in the model of interest, (9.10), and the error in the proxy variable equation, v3. Since u and v3 both have zero mean and each is uncorrelated with x1, x2, and x3, e also has zero mean and is uncorrelated with x1, x2, and x3. Write this equation as

y = α0 + β1·x1 + β2·x2 + α3·x3 + e,

where α0 = β0 + β3·δ0 is the new intercept and α3 = β3·δ3 is the slope parameter on the proxy variable x3. As we alluded to earlier, when we run the regression in (9.12), we will not get unbiased estimators of β0 and β3; instead, we will get unbiased (or at least consistent) estimators of α0, β1, β2, and α3. The important thing is that we get good estimates of the parameters β1 and β2. In most cases, the estimate of α3 is actually more interesting than an estimate of β3 would be anyway; in the wage equation, α3 measures the effect of one more point of IQ on log(wage).
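The consistency argument is easy to check by simulation. The sketch below is my own illustration (not from the text; all parameter values are arbitrary): it generates data in which v3 is drawn independently of x1, x2, and x3, as condition (9.13) requires, and compares the regression that wrongly omits x3* with the plug-in regression (9.12).

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
b1, b2, b3 = 0.5, 0.3, 0.8       # slopes in (9.10); intercept b0 = 1.0
d0, d3 = 2.0, 1.5                # proxy equation (9.11)

x3 = rng.normal(size=n)                      # observed proxy (think: IQ)
x1 = 0.6 * x3 + rng.normal(size=n)           # related to x3* only through x3
x2 = rng.normal(size=n)
v3 = rng.normal(size=n)                      # independent of x1, x2, x3, per (9.13)
x3_star = d0 + d3 * x3 + v3                  # unobserved variable (think: ability)
y = 1.0 + b1 * x1 + b2 * x2 + b3 * x3_star + rng.normal(size=n)

def ols(y, *cols):
    X = np.column_stack([np.ones_like(y)] + list(cols))
    return np.linalg.lstsq(X, y, rcond=None)[0]

print(ols(y, x1, x2))      # omitting x3*: slope on x1 well above 0.5
print(ols(y, x1, x2, x3))  # plug-in: slopes near 0.5 and 0.3; the x3 slope
                           # estimates a3 = b3*d3 = 1.2, not b3 = 0.8
```

As the comments indicate, the plug-in regression recovers β1 and β2, while the coefficient on the proxy estimates α3 = β3·δ3 rather than β3 itself.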
Example 9.3 (IQ as a Proxy for Ability). The file WAGE2, from Blackburn and Neumark (1992), contains information on monthly earnings, education, several demographic variables, and IQ scores for 935 men in 1980. As a method to account for omitted ability bias, we add IQ to a standard log wage equation. The results are shown in Table 9.2.

TABLE 9.2  Dependent Variable: log(wage)

Independent Variables    (1)             (2)              (3)
educ                     .065 (.006)     .054 (.007)      .018 (.041)
exper                    .014 (.003)     .014 (.003)      .014 (.003)
tenure                   .012 (.002)     .011 (.002)      .011 (.002)
married                  .199 (.039)     .200 (.039)      .201 (.039)
south                    -.091 (.026)    -.080 (.026)     -.080 (.026)
urban                    .184 (.027)     .182 (.027)      .184 (.027)
black                    -.188 (.038)    -.143 (.039)     -.147 (.040)
IQ                                       .0036 (.0010)    -.0009 (.0052)
educ·IQ                                                   .00034 (.00038)
intercept                5.395 (.113)    5.176 (.128)     5.648 (.546)
Observations             935             935              935
R-squared                .253            .263             .263

Our primary interest is in what happens to the estimated return to education. Column (1) contains the estimates without using IQ as a proxy variable: the estimated return to education is 6.5%. If we think omitted ability is positively correlated with educ, then we expect this estimate to be too high. (More precisely, the average estimate across all random samples would be too high.) When IQ is added to the equation, the return to education falls to 5.4%, which corresponds with our prior beliefs about omitted ability bias.

The effect of IQ on socioeconomic outcomes has been documented in the controversial book The Bell Curve by Herrnstein and Murray (1994). Column (2) shows that IQ does have a statistically significant, positive effect on earnings, after controlling for several other factors. Everything else being equal, an increase of 10 IQ points is predicted to raise monthly earnings by 3.6%. The standard deviation of IQ in the U.S. population is 15, so a one standard deviation increase in IQ is associated with higher earnings of 5.4%. This is identical to the predicted increase in wage due to another year of education. It is clear from column (2) that education still has an important role in increasing earnings, even though the effect is not as large as originally estimated.

Some other interesting observations emerge from columns (1) and (2). Adding IQ to the equation only increases the R-squared from .253 to .263; most of the variation in log(wage) is not explained by the factors in column (2). Also, adding IQ to the equation does not eliminate the estimated earnings difference between black and white men: a black man with the same IQ, education, experience, and so on, as a white man is predicted to earn about 14.3% less, and the difference is very statistically significant.

Column (3) in Table 9.2 includes the interaction term educ·IQ. This allows for the possibility that educ and abil interact in determining log(wage). We might think that the return to education is higher for people with more ability, but this turns out not to be the case: the interaction term is not significant, and its addition makes educ and IQ individually insignificant while complicating the model. Therefore, the estimates in column (2) are preferred.

Exploring Further 9.2: What do you make of the small and statistically insignificant coefficient on educ in column (3) of Table 9.2? (Hint: When educ·IQ is in the equation, what is the interpretation of the coefficient on educ?)

There is no reason to stop at a single proxy variable for ability in this example. The data set WAGE2 also contains a score for each man on the "Knowledge of the World of Work" (KWW) test. This provides a different measure of ability, which can be used in place of IQ, or along with IQ, to estimate the return to education (see Computer Exercise C2).
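Readers who want to reproduce Table 9.2 can do so in a few lines. The sketch below is a minimal illustration, assuming the WAGE2 data are available as a CSV file with the textbook's variable names; the file path, and the availability of the data in this form, are my assumptions, not part of the text.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical path: WAGE2 must be obtained separately, with the textbook's
# variable names (lwage, educ, exper, tenure, married, south, urban, black, IQ).
df = pd.read_csv("wage2.csv")

rhs = "educ + exper + tenure + married + south + urban + black"
col1 = smf.ols("lwage ~ " + rhs, data=df).fit()
col2 = smf.ols("lwage ~ " + rhs + " + IQ", data=df).fit()
# educ*IQ expands to educ + IQ + educ:IQ, reproducing column (3)
col3 = smf.ols("lwage ~ exper + tenure + married + south + urban + black + educ*IQ",
               data=df).fit()
print(col1.params["educ"], col2.params["educ"], col3.params["educ"])
```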
It is easy to see how using a proxy variable can still lead to bias if the proxy variable does not satisfy the preceding assumptions. Suppose that, instead of (9.11), the unobserved variable x3* is related to all of the observed variables by

x3* = δ0 + δ1·x1 + δ2·x2 + δ3·x3 + v3,  (9.14)

where v3 has a zero mean and is uncorrelated with x1, x2, and x3. Equation (9.11) assumes that δ1 and δ2 are both zero. By plugging equation (9.14) into (9.10), we get

y = (β0 + β3·δ0) + (β1 + β3·δ1)·x1 + (β2 + β3·δ2)·x2 + β3·δ3·x3 + u + β3·v3,  (9.15)

from which it follows that plim(β̂1) = β1 + β3·δ1 and plim(β̂2) = β2 + β3·δ2. [This follows because the error in (9.15), u + β3·v3, has zero mean and is uncorrelated with x1, x2, and x3.] In the previous example, where x1 = educ and x3* = abil, β3 > 0, so there is a positive bias (inconsistency) if abil has a positive partial correlation with educ (δ1 > 0). Thus, we could still be getting an upward bias in the return to education by using IQ as a proxy for abil if IQ is not a good proxy. But we can reasonably hope that this bias is smaller than if we ignored the problem of omitted ability entirely.

A complaint that is sometimes aired about including variables such as IQ in a regression that includes educ is that it exacerbates the problem of multicollinearity, likely leading to a less precise estimate of β_educ. But this complaint misses two important points. First, the inclusion of IQ reduces the error variance because the part of ability explained by IQ has been removed from the error. Typically, this will be reflected in a smaller standard error of the regression (although it need not get smaller because of its degrees-of-freedom adjustment). Second, and most importantly, the added multicollinearity is a necessary evil if we want to get an estimator of β_educ with less bias: the reason educ and IQ are correlated is that educ and abil are thought to be correlated, and IQ is a proxy for abil. If we could observe abil, we would include it in the regression, and, of course, there would be unavoidable multicollinearity caused by correlation between educ and abil.

Proxy variables can come in the form of binary information as well. In Example 7.9 [see equation (7.15)], we discussed Krueger's (1993) estimates of the return to using a computer on the job. Krueger also included a binary variable indicating whether the worker uses a computer at home, as well as an interaction term between computer usage at work and at home. His primary reason for including computer usage at home in the equation was to proxy for unobserved "technical ability" that could affect wage directly and be related to computer usage at work.

9-2a Using Lagged Dependent Variables as Proxy Variables

In some applications, like the earlier wage example, we have at least a vague idea about which unobserved factor we would like to control for. This facilitates choosing proxy variables. In other applications, we suspect that one or more of the independent variables is correlated with an omitted variable, but we have no idea how to obtain a proxy for that omitted variable. In such cases, we can include, as a control, the value of the dependent variable from an earlier time period. This is especially useful for policy analysis.
Using a lagged dependent variable in a cross-sectional equation increases the data requirements, but it also provides a simple way to account for historical factors that cause current differences in the dependent variable that are difficult to account for in other ways. For example, some cities have had high crime rates in the past. Many of the same unobserved factors contribute to both high current and past crime rates. Likewise, some universities are traditionally better in academics than other universities. Inertial effects are also captured by putting in lags of y.

Consider a simple equation to explain city crime rates:

crime = β0 + β1·unem + β2·expend + β3·crime_{-1} + u,  (9.16)

where crime is a measure of per capita crime, unem is the city unemployment rate, expend is per capita spending on law enforcement, and crime_{-1} indicates the crime rate measured in some earlier year (this could be the past year or several years ago). We are interested in the effects of unem on crime, as well as of law enforcement expenditures on crime.

What is the purpose of including crime_{-1} in the equation? Certainly we expect that β3 > 0, because crime has inertia. But the main reason for putting this in the equation is that cities with high historical crime rates may spend more on crime prevention. Thus, factors unobserved to us (the econometricians) that affect crime are likely to be correlated with expend (and unem). If we use a pure cross-sectional analysis, we are unlikely to get an unbiased estimator of the causal effect of law enforcement expenditures on crime. But, by including crime_{-1} in the equation, we can at least do the following experiment: if two cities have the same previous crime rate and current unemployment rate, then β2 measures the effect of another dollar of law enforcement on crime.

Example 9.4 (City Crime Rates). We estimate a constant elasticity version of the crime model in equation (9.16) (unem, because it is a percentage, is left in level form). The data in CRIME2 are from 46 cities for the year 1987. The crime rate is also available for 1982, and we use that as an additional independent variable in trying to control for city unobservables that affect crime and may be correlated with current law enforcement expenditures. Table 9.3 contains the results.

TABLE 9.3  Dependent Variable: log(crmrte87)

Independent Variables    (1)             (2)
unem87                   -.029 (.032)    .009 (.020)
log(lawexpc87)           .203 (.173)     -.140 (.109)
log(crmrte82)                            1.194 (.132)
intercept                3.34 (1.25)     .076 (.821)
Observations             46              46
R-squared                .057            .680

Without the lagged crime rate in the equation, the effects of the unemployment rate and expenditures on law enforcement are counterintuitive; neither is statistically significant, although the t statistic on log(lawexpc87) is 1.17. One possibility is that increased law enforcement expenditures improve reporting conventions, and so more crimes are reported. But it is also likely that cities with high recent crime rates spend more on law enforcement.
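A stylized simulation (my own illustration, not from the text) makes the logic of including crime_{-1} concrete: when an unobserved crime history drives both current crime and spending, omitting it can even flip the sign of the spending coefficient, much as in column (1) of Table 9.3.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50_000
crime_past = rng.normal(size=n)                  # history, unobserved in col (1)
unem = rng.normal(size=n)
expend = 0.7 * crime_past + rng.normal(size=n)   # high-crime cities spend more
crime = 0.2 * unem - 0.15 * expend + 0.8 * crime_past + rng.normal(size=n)

def ols(y, *cols):
    X = np.column_stack([np.ones_like(y)] + list(cols))
    return np.linalg.lstsq(X, y, rcond=None)[0]

print(ols(crime, unem, expend))              # spending slope biased upward: sign flips
print(ols(crime, unem, expend, crime_past))  # close to the true -0.15
```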
Adding the log of the crime rate from five years earlier has a large effect on the expenditures coefficient. The elasticity of the crime rate with respect to expenditures becomes -.14, with t = -1.28. This is not strongly significant, but it suggests that a more sophisticated model with more cities in the sample could produce significant results.

Not surprisingly, the current crime rate is strongly related to the past crime rate. The estimate indicates that if the crime rate in 1982 was 1% higher, then the crime rate in 1987 is predicted to be about 1.19% higher. We cannot reject the hypothesis that the elasticity of current crime with respect to past crime is unity [t = (1.194 - 1)/.132 ≈ 1.47]. Adding the past crime rate increases the explanatory power of the regression markedly, but this is no surprise. The primary reason for including the lagged crime rate is to obtain a better estimate of the ceteris paribus effect of log(lawexpc87) on log(crmrte87).

The practice of putting in a lagged y as a general way of controlling for unobserved variables is hardly perfect. But it can aid in getting a better estimate of the effects of policy variables on various outcomes. When the data are available, additional lags also can be included.

Adding lagged values of y is not the only way to use two years of data to control for omitted factors. When we discuss panel data methods in Chapters 13 and 14, we will cover other ways to use repeated data on the same cross-sectional units at different points in time.

9-2b A Different Slant on Multiple Regression

The discussion of proxy variables in this section suggests an alternative way of interpreting a multiple regression analysis when we do not necessarily observe all relevant explanatory variables. Until now, we have specified the population model of interest with an additive error, as in equation (9.9). Our discussion of that example hinged on whether we have a suitable proxy variable (IQ score in this case, other test scores more generally) for the unobserved explanatory variable, which we called "ability."

A less structured, more general approach to multiple regression is to forgo specifying models with unobservables. Rather, we begin with the premise that we have access to a set of observable explanatory variables, which includes the variable of primary interest, such as years of schooling, and controls, such as observable test scores. We then model the mean of y conditional on the observed explanatory variables. For example, in the wage example with lwage denoting log(wage), we can estimate E(lwage|educ, exper, tenure, south, urban, black, IQ), which is exactly what is reported in Table 9.2. The difference now is that we set our goals more modestly. Namely, rather than introduce the nebulous concept of "ability" in equation (9.9), we state from the outset that we will estimate the ceteris paribus effect of education, holding IQ (and the other observed factors) fixed.
There is no need to discuss whether IQ is a suitable proxy for ability. Consequently, while we may not be answering the question underlying equation (9.9), we are answering a question of interest: if two people have the same IQ levels, and the same values of experience, tenure, and so on, yet their education levels differ by a year, what is the expected difference in their log wages?

As another example, if we include as an explanatory variable the poverty rate in a school-level regression to assess the effects of spending on standardized test scores, we should recognize that the poverty rate only crudely captures the relevant differences in children and parents across schools. But often it is all we have, and it is better to control for the poverty rate than to do nothing because we cannot find suitable proxies for student "ability," parental involvement, and so on. Almost certainly, controlling for the poverty rate gets us closer to the ceteris paribus effects of spending than if we leave the poverty rate out of the analysis.

In some applications of regression analysis, we are interested simply in predicting the outcome, y, given a set of explanatory variables (x1, ..., xk). In such cases, it makes little sense to think in terms of "bias" in estimated coefficients due to omitted variables. Instead, we should focus on obtaining a model that predicts as well as possible, and make sure we do not include as regressors variables that cannot be observed at the time of prediction. For example, an admissions officer for a college or university might be interested in predicting success in college, as measured by grade point average, in terms of variables that can be measured at application time. Those variables would include high school performance (maybe just grade point average, but perhaps performance in specific kinds of courses), standardized test scores, participation in various activities (such as debate or math club), and even family background variables. We would not include a variable measuring college class attendance, because we do not observe attendance in college at application time. Nor would we wring our hands about potential biases caused by omitting an attendance variable: we have no interest in, say, measuring the effect of high school GPA holding attendance in college fixed. Likewise, we would not worry about biases in coefficients because we cannot observe factors such as motivation. Naturally, for predictive purposes, it would probably help substantially if we had a measure of motivation, but in its absence we fit the best model we can with observed explanatory variables.

9-3 Models with Random Slopes

In our treatment of regression so far, we have assumed that the slope coefficients are the same across individuals in the population, or that, if the slopes differ, they differ by measurable characteristics, in which case we are led to regression models containing interaction terms. For example, as we saw in Section 7-4, we can allow the return to education to differ by men and women by interacting education with a gender dummy in a log wage equation.

Here we are interested in a related but different question: What if the partial effect of a variable depends on unobserved factors that vary by population unit? If we have only one explanatory variable, x, we can write a general model (for a random draw i from the population, for emphasis) as

y_i = a_i + b_i·x_i,  (9.17)

where a_i is the intercept for unit i and b_i is the slope. In the simple regression model from Chapter 2, we assumed b_i = β and labeled a_i as the error, u_i.
The model in (9.17) is sometimes called a random coefficient model or random slope model, because the unobserved slope coefficient b_i is viewed as a random draw from the population, along with the observed data (x_i, y_i) and the unobserved intercept a_i. As an example, if y_i = log(wage_i) and x_i = educ_i, then (9.17) allows the return to education, b_i, to vary by person. If, say, b_i contains unmeasured ability (just as a_i would), the partial effect of another year of schooling can depend on ability.

With a random sample of size n, we implicitly draw n values of b_i, along with n values of a_i (and the observed data on x and y). Naturally, we cannot estimate a slope (or, for that matter, an intercept) for each i. But we can hope to estimate the average slope and average intercept, where the average is across the population. Therefore, define α = E(a_i) and β = E(b_i). Then β is the average of the partial effect of x on y, and so we call β the average partial effect (APE), or the average marginal effect (AME). In the context of a log wage equation, β is the average return to a year of schooling in the population.

If we write a_i = α + c_i and b_i = β + d_i, then d_i is the individual-specific deviation from the APE. By construction, E(c_i) = 0 and E(d_i) = 0. Substituting into (9.17) gives

y_i = α + β·x_i + c_i + d_i·x_i ≡ α + β·x_i + u_i,  (9.18)

where u_i = c_i + d_i·x_i. (To make the notation easier to follow, we now use α, the mean value of a_i, as the intercept, and β, the mean of b_i, as the slope.) In other words, we can write the random coefficient model as a constant coefficient model, but where the error term contains an interaction between an unobservable, d_i, and the observed explanatory variable, x_i.

When would a simple regression of y_i on x_i provide an unbiased estimate of β (and α)? We can apply the result for unbiasedness from Chapter 2: if E(u_i|x_i) = 0, then OLS is generally unbiased. When u_i = c_i + d_i·x_i, it suffices that E(c_i|x_i) = E(c_i) = 0 and E(d_i|x_i) = E(d_i) = 0. We can write these conditions in terms of the unit-specific intercept and slope as

E(a_i|x_i) = E(a_i) and E(b_i|x_i) = E(b_i);  (9.19)

that is, a_i and b_i are both mean independent of x_i. This is a useful finding: if we allow for unit-specific slopes, OLS consistently estimates the population average of those slopes when they are mean independent of the explanatory variable. (See Problem 6 for a weaker set of conditions that imply consistency of OLS.)

The error term in (9.18) almost certainly contains heteroskedasticity. In fact, if Var(c_i|x_i) = σ²_c, Var(d_i|x_i) = σ²_d, and Cov(c_i, d_i|x_i) = 0, then

Var(u_i|x_i) = σ²_c + σ²_d·x_i²,  (9.20)

and so there must be heteroskedasticity in u_i unless σ²_d = 0, which means b_i = β for all i. We know how to account for heteroskedasticity of this kind. We can use OLS and compute heteroskedasticity-robust standard errors and test statistics, or we can estimate the variance function in (9.20) and apply weighted least squares. Of course, the latter strategy imposes homoskedasticity on the random intercept and slope, and so we would want to make a WLS analysis fully robust to violations of (9.20).
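A quick simulation (my own illustration, with arbitrary parameter values) confirms both claims: OLS recovers the average intercept and slope when (9.19) holds, and robust standard errors account for the heteroskedasticity in (9.20).

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
n = 10_000
x = rng.normal(size=n)
a_i = 1.0 + 0.5 * rng.normal(size=n)   # random intercepts, alpha = E(a_i) = 1
b_i = 2.0 + 0.7 * rng.normal(size=n)   # random slopes, beta = E(b_i) = 2,
                                       # drawn independently of x, as in (9.19)
y = a_i + b_i * x                      # model (9.17)

fit = sm.OLS(y, sm.add_constant(x)).fit(cov_type="HC1")
print(fit.params)   # close to (1, 2): OLS estimates the APE
print(fit.bse)      # heteroskedasticity-robust standard errors, per (9.20)
```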
Because of equation (9.20), some authors like to view heteroskedasticity in regression models generally as arising from random slope coefficients. But we should remember that the form of (9.20) is special, and it does not allow for heteroskedasticity in a_i or b_i. We cannot convincingly distinguish between a random slope model, where the intercept and slope are independent of x_i, and a constant slope model with heteroskedasticity in a_i.

The treatment for multiple regression is similar. Generally, write

y_i = a_i + b_i1·x_i1 + b_i2·x_i2 + ... + b_ik·x_ik.  (9.21)

Then, by writing a_i = α + c_i and b_ij = β_j + d_ij, we have

y_i = α + β1·x_i1 + ... + βk·x_ik + u_i,  (9.22)

where u_i = c_i + d_i1·x_i1 + ... + d_ik·x_ik. If we maintain the mean independence assumptions E(a_i|x_i) = E(a_i) and E(b_ij|x_i) = E(b_ij), j = 1, ..., k, then E(y_i|x_i) = α + β1·x_i1 + ... + βk·x_ik, and so OLS using a random sample produces unbiased estimators of α and the β_j. As in the simple regression case, Var(u_i|x_i) is almost certainly heteroskedastic.

We can allow the b_ij to depend on observable explanatory variables as well as unobservables. For example, suppose that, with k = 2, the effect of x_i2 depends on x_i1, and we write b_i2 = β2 + d1·(x_i1 - μ1) + d_i2, where μ1 = E(x_i1). If we assume E(d_i2|x_i) = 0 (and similarly for c_i and d_i1), then E(y_i|x_i1, x_i2) = α + β1·x_i1 + β2·x_i2 + d1·(x_i1 - μ1)·x_i2, which means we have an interaction between x_i1 and x_i2. Because we have subtracted the mean, μ1, from x_i1, β2 is the APE of x_i2.
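The demeaning device just described is easy to implement. The sketch below is my own illustration of the algebra above (not an example from the text): the interaction is formed with x1 centered at its sample mean, so the coefficient on x2 estimates the APE, β2.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
n = 10_000
x1 = 2.0 + rng.normal(size=n)          # mu_1 = E(x1) = 2
x2 = rng.normal(size=n)
y = 1.0 + 0.5 * x1 + 1.5 * x2 + 0.4 * (x1 - 2.0) * x2 + rng.normal(size=n)

# Center x1 before forming the interaction, so the coefficient on x2
# estimates the APE of x2 (here 1.5) rather than its effect at x1 = 0.
inter = (x1 - x1.mean()) * x2
X = sm.add_constant(np.column_stack([x1, x2, inter]))
print(sm.OLS(y, X).fit().params)       # approximately (1.0, 0.5, 1.5, 0.4)
```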
The bottom line of this section is that allowing for random slopes is fairly straightforward if the slopes are independent, or at least mean independent, of the explanatory variables. In addition, we can easily model the slopes as functions of the exogenous variables, which leads to models with squares and interactions. Of course, in Chapter 6, we discussed how such models can be useful without ever introducing the notion of a random slope. The random slopes specification provides a separate justification for such models. Estimation becomes considerably more difficult if the random intercept, as well as some slopes, are correlated with some of the regressors. We cover the problem of endogenous explanatory variables in Chapter 15.

9-4 Properties of OLS under Measurement Error

Sometimes, in economic applications, we cannot collect data on the variable that truly affects economic behavior. A good example is the marginal income tax rate facing a family that is trying to choose how much to contribute to charity in a given year. The marginal rate may be hard to obtain or summarize as a single number for all income levels. Instead, we might compute the average tax rate based on total income and tax payments.

When we use an imprecise measure of an economic variable in a regression model, then our model contains measurement error. In this section, we derive the consequences of measurement error for ordinary least squares estimation. OLS will be consistent under certain assumptions, but there are others under which it is inconsistent. In some of these cases, we can derive the size of the asymptotic bias.

As we will see, the measurement error problem has a similar statistical structure to the omitted variable/proxy variable problem discussed in the previous section, but they are conceptually different. In the proxy variable case, we are looking for a variable that is somehow associated with the unobserved variable. In the measurement error case, the variable that we do not observe has a well-defined, quantitative meaning (such as a marginal tax rate or annual income), but our recorded measures of it may contain error. For example, reported annual income is a measure of actual annual income, whereas IQ score is a proxy for ability.

Another important difference between the proxy variable and measurement error problems is that, in the latter case, often the mismeasured independent variable is the one of primary interest. In the proxy variable case, the partial effect of the omitted variable is rarely of central interest: we are usually concerned with the effects of the other independent variables.

Before we consider details, we should remember that measurement error is an issue only when the variables for which the econometrician can collect data differ from the variables that influence decisions by individuals, families, firms, and so on.

9-4a Measurement Error in the Dependent Variable

We begin with the case where only the dependent variable is measured with error. Let y* denote the variable (in the population, as always) that we would like to explain. For example, y* could be annual family savings. The regression model has the usual form:

y* = β0 + β1·x1 + ... + βk·xk + u,  (9.23)

and we assume it satisfies the Gauss-Markov assumptions. We let y represent the observable measure of y*. In the savings case, y is reported annual savings. Unfortunately, families are not perfect in their reporting of annual family savings; it is easy to leave out categories or to overestimate the amount contributed to a fund. Generally, we can expect y and y* to differ, at least for some subset of families in the population.

The measurement error (in the population) is defined as the difference between the observed value and the actual value:

e0 = y - y*.  (9.24)

For a random draw i from the population, we can write e_i0 = y_i - y_i*, but the important thing is how the measurement error in the population is related to other factors. To obtain an estimable model, we write y* = y - e0, plug this into equation (9.23), and rearrange:

y = β0 + β1·x1 + ... + βk·xk + u + e0.  (9.25)

The error term in equation (9.25) is u + e0. Because y, x1, x2, ..., xk are observed, we can estimate this model by OLS. In effect, we just ignore the fact that y is an imperfect measure of y* and proceed as usual.

When does OLS with y in place of y* produce consistent estimators of the β_j? Since the original model (9.23) satisfies the Gauss-Markov assumptions, u has zero mean and is uncorrelated with each x_j. It is only natural to assume that the measurement error has zero mean; if it does not, then we simply get a biased estimator of the intercept, β0, which is rarely a cause for concern. Of much more importance is our assumption about the relationship between the measurement error, e0, and the explanatory variables, x_j.
The usual assumption is that the measurement error in y is statistically independent of each explanatory variable. If this is true, then the OLS estimators from (9.25) are unbiased and consistent. Further, the usual OLS inference procedures (t, F, and LM statistics) are valid.

If e0 and u are uncorrelated, as is usually assumed, then Var(u + e0) = σ²_u + σ²_0 > σ²_u. This means that measurement error in the dependent variable results in a larger error variance than when no error occurs; this, of course, results in larger variances of the OLS estimators. This is to be expected, and there is nothing we can do about it (except collect better data). The bottom line is that, if the measurement error is uncorrelated with the independent variables, then OLS estimation has good properties.

Example 9.5 (Savings Function with Measurement Error). Consider a savings function

sav* = β0 + β1·inc + β2·size + β3·educ + β4·age + u,

but where actual savings (sav*) may deviate from reported savings (sav). The question is whether the size of the measurement error in sav is systematically related to the other variables. It might be reasonable to assume that the measurement error is not correlated with inc, size, educ, and age. On the other hand, we might think that families with higher incomes, or more education, report their savings more accurately. We can never know whether the measurement error is correlated with inc or educ, unless we can collect data on sav*; then, the measurement error can be computed for each observation as e_i0 = sav_i - sav_i*.

When the dependent variable is in logarithmic form, so that log(y*) is the dependent variable, it is natural for the measurement error equation to be of the form

log(y) = log(y*) + e0.  (9.26)

This follows from a multiplicative measurement error for y: y = y*·a0, where a0 > 0 and e0 = log(a0).

Example 9.6 (Measurement Error in Scrap Rates). In Section 7-6, we discussed an example where we wanted to determine whether job training grants reduce the scrap rate in manufacturing firms. We certainly might think the scrap rate reported by firms is measured with error. (In fact, most firms in the sample do not even report a scrap rate.) In a simple regression framework, this is captured by

log(scrap*) = β0 + β1·grant + u,

where scrap* is the true scrap rate and grant is the dummy variable indicating whether a firm received a grant. The measurement error equation is

log(scrap) = log(scrap*) + e0.

Is the measurement error, e0, independent of whether the firm receives a grant? A cynical person might think that a firm receiving a grant is more likely to underreport its scrap rate in order to make the grant look effective. If this happens, then, in the estimable equation

log(scrap) = β0 + β1·grant + u + e0,

the error u + e0 is negatively correlated with grant. This would produce a downward bias in β̂1, which would tend to make the training program look more effective than it actually was. (Remember, a more negative β1 means the program was more effective, since increased worker productivity is associated with a lower scrap rate.)

The bottom line of this subsection is that measurement error in the dependent variable can cause biases in OLS if the error is systematically related to one or more of the explanatory variables. If the measurement error is just a random reporting error that is independent of the explanatory variables, as is often assumed, then OLS is perfectly appropriate.
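Both conclusions are easy to see in a simulation (my own sketch, with illustrative numbers): reporting error in y that is independent of x leaves the slope estimate centered on the truth (only noisier), while reporting error related to x does not.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 200_000
x = rng.normal(size=n)
y_star = 1.0 + 0.5 * x + rng.normal(size=n)     # true model, as in (9.23)

e0_indep = rng.normal(size=n)                   # reporting error independent of x
e0_corr = 0.5 * x + rng.normal(size=n)          # reporting error related to x

def slope(y, x):
    return np.cov(x, y)[0, 1] / np.var(x, ddof=1)

print(slope(y_star + e0_indep, x))   # near 0.5: unbiased, just noisier
print(slope(y_star + e0_corr, x))    # near 1.0: biased when e0 depends on x
```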
9-4b Measurement Error in an Explanatory Variable

Traditionally, measurement error in an explanatory variable has been considered a much more important problem than measurement error in the dependent variable. In this subsection, we will see why this is the case.

We begin with the simple regression model

y = β0 + β1·x1* + u,  (9.27)

and we assume that it satisfies at least the first four Gauss-Markov assumptions. This means that estimation of (9.27) by OLS would produce unbiased and consistent estimators of β0 and β1. The problem is that x1* is not observed. Instead, we have a measure of x1*; call it x1. For example, x1* could be actual income, and x1 could be reported income.

The measurement error in the population is simply

e1 = x1 - x1*,  (9.28)

and this can be positive, negative, or zero. We assume that the average measurement error in the population is zero: E(e1) = 0. This is natural, and, in any case, it does not affect the important conclusions that follow. A maintained assumption in what follows is that u is uncorrelated with x1* and x1. In conditional expectation terms, we can write this as E(y|x1*, x1) = E(y|x1*), which just says that x1 does not affect y after x1* has been controlled for. We used the same assumption in the proxy variable case, and it is not controversial; it holds almost by definition.

We want to know the properties of OLS if we simply replace x1* with x1 and run the regression of y on x1. They depend crucially on the assumptions we make about the measurement error. Two assumptions have been the focus in the econometrics literature, and they both represent polar extremes. The first assumption is that e1 is uncorrelated with the observed measure, x1:

Cov(x1, e1) = 0.  (9.29)

From the relationship in (9.28), if assumption (9.29) is true, then e1 must be correlated with the unobserved variable x1*. To determine the properties of OLS in this case, we write x1* = x1 - e1 and plug this into equation (9.27):

y = β0 + β1·x1 + (u - β1·e1).  (9.30)

Because we have assumed that u and e1 both have zero mean and are uncorrelated with x1, u - β1·e1 has zero mean and is uncorrelated with x1. It follows that OLS estimation with x1 in place of x1* produces a consistent estimator of β1 (and also β0). Since u is uncorrelated with e1, the variance of the error in (9.30) is Var(u - β1·e1) = σ²_u + β1²·σ²_e1. Thus, except when β1 = 0, measurement error increases the error variance. But this does not affect any of the OLS properties (except that the variances of the β̂_j will be larger than if we observed x1* directly).

The assumption that e1 is uncorrelated with x1 is analogous to the proxy variable assumption we made in Section 9-2. Since this assumption implies that OLS has all of its nice properties, this is not usually what econometricians have in mind when they refer to measurement error in an explanatory variable.
The classical errors-in-variables (CEV) assumption is that the measurement error is uncorrelated with the unobserved explanatory variable:

Cov(x1*, e1) = 0.  (9.31)

This assumption comes from writing the observed measure as the sum of the true explanatory variable and the measurement error,

x1 = x1* + e1,

and then assuming the two components of x1 are uncorrelated. (This has nothing to do with assumptions about u; we always maintain that u is uncorrelated with x1* and x1, and therefore with e1.)

If assumption (9.31) holds, then x1 and e1 must be correlated:

Cov(x1, e1) = E(x1·e1) = E(x1*·e1) + E(e1²) = 0 + σ²_e1 = σ²_e1.  (9.32)

Thus, the covariance between x1 and e1 is equal to the variance of the measurement error under the CEV assumption. Referring to equation (9.30), we can see that correlation between x1 and e1 is going to cause problems. Because u and x1 are uncorrelated, the covariance between x1 and the composite error u - β1·e1 is

Cov(x1, u - β1·e1) = -β1·Cov(x1, e1) = -β1·σ²_e1.

Thus, in the CEV case, the OLS regression of y on x1 gives a biased and inconsistent estimator. Using the asymptotic results in Chapter 5, we can determine the amount of inconsistency in OLS. The probability limit of β̂1 is β1 plus the ratio of the covariance between x1 and u - β1·e1 to the variance of x1:

plim(β̂1) = β1 + Cov(x1, u - β1·e1)/Var(x1)
         = β1 - β1·σ²_e1/(σ²_x1* + σ²_e1)  (9.33)
         = β1·[1 - σ²_e1/(σ²_x1* + σ²_e1)]
         = β1·[σ²_x1*/(σ²_x1* + σ²_e1)],

where we have used the fact that Var(x1) = Var(x1*) + Var(e1).

Equation (9.33) is very interesting. The term multiplying β1, which is the ratio Var(x1*)/Var(x1), is always less than one [an implication of the CEV assumption (9.31)]. Thus, plim(β̂1) is always closer to zero than is β1. This is called the attenuation bias in OLS due to classical errors-in-variables: on average (or in large samples), the estimated OLS effect will be attenuated. In particular, if β1 is positive, β̂1 will tend to underestimate β1. This is an important conclusion, but it relies on the CEV setup.

If the variance of x1* is large relative to the variance in the measurement error, then the inconsistency in OLS will be small. This is because Var(x1*)/Var(x1) will be close to unity when σ²_x1*/σ²_e1 is large. Therefore, depending on how much variation there is in x1* relative to e1, measurement error need not cause large biases.
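The attenuation factor in (9.33) can be verified directly. In the sketch below (my own illustration with arbitrary variances, not from the text), σ²_x1* = 4 and σ²_e1 = 1, so the OLS slope converges to .8·β1.

```python
import numpy as np

rng = np.random.default_rng(5)
n, b1 = 200_000, 1.0
sig2_xstar, sig2_e = 4.0, 1.0
x_star = np.sqrt(sig2_xstar) * rng.normal(size=n)
e1 = np.sqrt(sig2_e) * rng.normal(size=n)     # CEV: e1 independent of x_star
x1 = x_star + e1                              # observed, mismeasured regressor
y = b1 * x_star + rng.normal(size=n)

slope = np.cov(x1, y)[0, 1] / np.var(x1, ddof=1)
print(slope)                                  # about 0.8
print(sig2_xstar / (sig2_xstar + sig2_e))     # attenuation factor in (9.33): 0.8
```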
Things are more complicated when we add more explanatory variables. For illustration, consider the model

y = β0 + β1·x1* + β2·x2 + β3·x3 + u,  (9.34)

where the first of the three explanatory variables is measured with error. We make the natural assumption that u is uncorrelated with x1*, x2, x3, and x1. Again, the crucial assumption concerns the measurement error e1. In almost all cases, e1 is assumed to be uncorrelated with x2 and x3 (the explanatory variables not measured with error). The key issue is whether e1 is uncorrelated with x1. If it is, then the OLS regression of y on x1, x2, and x3 produces consistent estimators. This is easily seen by writing

y = β0 + β1·x1 + β2·x2 + β3·x3 + u - β1·e1,  (9.35)

where u and e1 are both uncorrelated with all the explanatory variables.

Under the CEV assumption in (9.31), OLS will be biased and inconsistent, because e1 is correlated with x1 in equation (9.35). Remember, this means that, in general, all OLS estimators will be biased, not just β̂1. What about the attenuation bias derived in equation (9.33)? It turns out that there is still an attenuation bias for estimating β1: it can be shown that

plim(β̂1) = β1·[σ²_r1*/(σ²_r1* + σ²_e1)],  (9.36)

where r1* is the population error in the equation x1* = α0 + α1·x2 + α2·x3 + r1*. Formula (9.36) also works in the general k variable case when x1 is the only mismeasured variable.

Things are less clear-cut for estimating the β_j on the variables not measured with error. In the special case that x1* is uncorrelated with x2 and x3, β̂2 and β̂3 are consistent. But this is rare in practice. Generally, measurement error in a single variable causes inconsistency in all estimators. Unfortunately, the sizes, and even the directions, of the biases are not easily derived.

Example 9.7 (GPA Equation with Measurement Error). Consider the problem of estimating the effect of family income on college grade point average, after controlling for hsGPA (high school grade point average) and SAT (scholastic aptitude test score). It could be that, though family income is important for performance before college, it has no direct effect on college performance. To test this, we might postulate the model

colGPA = β0 + β1·faminc* + β2·hsGPA + β3·SAT + u,

where faminc* is actual annual family income. (This might appear in logarithmic form, but for the sake of illustration, we leave it in level form.) Precise data on colGPA, hsGPA, and SAT are relatively easy to obtain. But family income, especially as reported by students, could be easily mismeasured. If faminc = faminc* + e1 and the CEV assumptions hold, then using reported family income in place of actual family income will bias the OLS estimator of β1 toward zero. One consequence of the downward bias is that a test of H0: β1 = 0 will have less chance of detecting β1 > 0.

Of course, measurement error can be present in more than one explanatory variable, or in some explanatory variables and the dependent variable. As we discussed earlier, any measurement error in the dependent variable is usually assumed to be uncorrelated with all the explanatory variables, whether it is observed or not. Deriving the bias in the OLS estimators under extensions of the CEV assumptions is complicated and does not lead to clear results.

In some cases, it is clear that the CEV assumption in (9.31) cannot be true. Consider a variant on Example 9.7:

colGPA = β0 + β1·smoked* + β2·hsGPA + β3·SAT + u,

where smoked* is the actual number of times a student smoked marijuana in the last 30 days. The variable smoked is the answer to the question: "On how many separate occasions did you smoke marijuana in the last 30 days?" Suppose we postulate the standard measurement error model,

smoked = smoked* + e1.

Even if we assume that students try to report the truth, the CEV assumption is unlikely to hold. People who do not smoke marijuana at all (so that smoked* = 0) are likely to report smoked = 0, so the measurement error is probably zero for students who never smoke marijuana.
When smoked* > 0, it is much more likely that the student miscounts how many times he or she smoked marijuana in the last 30 days. This means that the measurement error e1 and the actual number of times smoked, smoked*, are correlated, which violates the CEV assumption in (9.31). Unfortunately, deriving the implications of measurement error that does not satisfy (9.29) or (9.31) is difficult and beyond the scope of this text.

Exploring Further 9.3: Let educ* be actual amount of schooling, measured in years (which can be a noninteger), and let educ be reported highest grade completed. Do you think educ and educ* are related by the CEV model?

Before leaving this section, we emphasize that the CEV assumption (9.31), while more believable than assumption (9.29), is still a strong assumption. The truth is probably somewhere in between, and if e1 is correlated with both x1* and x1, OLS is inconsistent. This raises an important question: must we live with inconsistent estimators under CEV, or other kinds of measurement error that are correlated with x1? Fortunately, the answer is no. Chapter 15 shows how, under certain assumptions, the parameters can be consistently estimated in the presence of general measurement error. We postpone this discussion until later because it requires us to leave the realm of OLS estimation. (See Problem 7 for how multiple measures can be used to reduce the attenuation bias.)

9-5 Missing Data, Nonrandom Samples, and Outlying Observations

The measurement error problem discussed in the previous section can be viewed as a data problem: we cannot obtain data on the variables of interest. Further, under the CEV model, the composite error term is correlated with the mismeasured independent variable, violating the Gauss-Markov assumptions.

Another data problem we discussed frequently in earlier chapters is multicollinearity among the explanatory variables. Remember that correlation among the explanatory variables does not violate any assumptions. When two independent variables are highly correlated, it can be difficult to estimate the partial effect of each. But this is properly reflected in the usual OLS statistics.

In this section, we provide an introduction to data problems that can violate the random sampling assumption, MLR.2. We can isolate cases in which nonrandom sampling has no practical effect on OLS. In other cases, nonrandom sampling causes the OLS estimators to be biased and inconsistent. A more complete treatment that establishes several of the claims made here is given in Chapter 17.

9-5a Missing Data

The missing data problem can arise in a variety of forms. Often, we collect a random sample of people, schools, cities, and so on, and then discover later that information is missing on some key variables for several units in the sample. For example, in the data set BWGHT, 196 of the 1,388 observations have no information on father's education. In the data set on median starting law school salaries, LAWSCH85, six of the 156 schools have no reported information on median LSAT scores for the entering class; other variables are also missing for some of the law schools.
If data are missing for an observation on either the dependent variable or one of the independent variables, then the observation cannot be used in a standard multiple regression analysis. In fact, provided missing data have been properly indicated, all modern regression packages keep track of missing data and simply ignore observations when computing a regression. We saw this explicitly in the birth weight scenario in Example 4.9, when 197 observations were dropped due to missing information on parents' education.

In the literature on missing data, an estimator that uses only observations with a complete set of data on y and x1, ..., xk is called a complete cases estimator; as mentioned earlier, this estimator is computed as the default for OLS and all estimators covered later in the text. Other than reducing the sample size, are there any statistical consequences of using the OLS estimator and ignoring the missing data? If, in the language of the missing data literature [see, for example, Little and Rubin (2002, Chapter 1)], the data are missing completely at random (sometimes called MCAR), then missing data cause no statistical problems. The MCAR assumption implies that the reason the data are missing is independent, in a statistical sense, of both the observed and unobserved factors affecting y. In effect, we can still assume that the data have been obtained by random sampling from the population, so that Assumption MLR.2 continues to hold.

When MCAR holds, there are ways to use partial information obtained from units that are dropped from the complete case estimation. Suppose, for example, that for a multiple regression model, data are always available for y and x1, x2, ..., xk-1, but are sometimes missing for the explanatory variable xk. A common solution is to create two new variables. For a unit i, the first variable, say z_ik, is defined to be x_ik when x_ik is observed, and zero otherwise. The second variable is a missing data indicator, say m_ik, which equals one when x_ik is missing and equals zero when x_ik is observed. Having defined these two variables, all of the units are used in the regression

y_i on x_i1, x_i2, ..., x_i,k-1, z_ik, m_ik,  i = 1, ..., n.

This procedure can be shown to produce unbiased and consistent estimators of all parameters, provided the missing data mechanism for xk is MCAR. Incidentally, it is a very poor idea to omit m_ik from the regression, as that is the same thing as assuming x_ik is zero whenever it is missing: replacing missing values with zero and not including the missing data indicator can cause substantial bias in the OLS estimators. A similar trick can be used when data are missing on more than one explanatory variable, but not on y. (Problem 9.10 provides the argument in the simple regression model.)

An important point is that the estimator that uses all of the data and adds missing data indicators is actually less robust than the complete cases estimator. As will be seen in the next subsection, the complete cases estimator turns out to be consistent even when the reason the data are missing is a function of (x1, x2, ..., xk), provided it does not depend on the unobserved error u.
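The indicator procedure described above is simple to implement. The sketch below is my own illustration (not from the text); for simplicity, xk is drawn independently of x1, and the missingness is literally random, as MCAR requires.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(6)
n = 50_000
x1 = rng.normal(size=n)
xk = rng.normal(size=n)            # drawn independently of x1 for simplicity
y = 1.0 + 1.0 * x1 + 2.0 * xk + rng.normal(size=n)

miss = rng.random(n) < 0.3         # MCAR: missingness unrelated to anything else
m = miss.astype(float)             # m_ik = 1 when x_k is missing
z = np.where(miss, 0.0, xk)        # z_ik = x_k when observed, 0 otherwise

X = sm.add_constant(np.column_stack([x1, z, m]))
print(sm.OLS(y, X).fit().params)   # slopes on x1 and z near 1 and 2 under MCAR
```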
There are more complicated schemes for using partial information that are based on "filling in" the missing data, but these are beyond the scope of this text. [The reader is referred to Little and Rubin (2002).]

9-5b Nonrandom Samples

The MCAR assumption ensures that units for which we observe a full set of data are not systematically different from units for which some variables are missing. Unfortunately, MCAR is often unrealistic. An example of a missing data mechanism that does not satisfy MCAR can be gotten by looking at the data set CARD, where the measure of IQ is missing for 949 men. If the probability that the IQ score is missing is, say, higher for men with lower IQ scores, the mechanism violates MCAR. For example, in the birth weight data set, what if the probability that education is missing is higher for those people with lower than average levels of education? Or, in Section 9-2, we used a wage data set that included IQ scores. This data set was constructed by omitting several people from the sample for whom IQ scores were not available. If obtaining an IQ score is easier for those with higher IQs, the sample is not representative of the population. The random sampling assumption MLR.2 is violated, and we must worry about the consequences for OLS estimation.

Fortunately, certain types of nonrandom sampling do not cause bias or inconsistency in OLS. Under the Gauss-Markov assumptions (but without MLR.2), it turns out that the sample can be chosen on the basis of the independent variables without causing any statistical problems. This is called sample selection based on the independent variables, and it is an example of exogenous sample selection. In the statistics literature, exogenous sample selection due to missing data is often called missing at random, but this is not a particularly good label because the probability of missing data is allowed to depend on the explanatory variables. [See Little and Rubin (2002, Chapter 1).]

To illustrate exogenously missing data, suppose that we are estimating a saving function, where annual saving depends on income, age, family size, and some unobserved factors, u. A simple model is

saving = β0 + β1·income + β2·age + β3·size + u.  (9.37)

Suppose that our data set was based on a survey of people over 35 years of age, thereby leaving us with a nonrandom sample of all adults. While this is not ideal, we can still get unbiased and consistent estimators of the parameters in the population model (9.37) using the nonrandom sample. We will not show this formally here, but the reason OLS on the nonrandom sample is unbiased is that the regression function E(saving|income, age, size) is the same for any subset of the population described by income, age, or size. Provided there is enough variation in the independent variables in the subpopulation, selection on the basis of the independent variables is not a serious problem, other than that it results in smaller sample sizes.

In the IQ example just mentioned, things are not so clear-cut, because no fixed rule based on IQ is used to include someone in the sample. Rather, the probability of being in the sample increases with IQ. If the other factors determining selection into the sample are independent of the error term in the wage equation, then we have another case of exogenous sample selection, and OLS using the selected sample will have all of its desirable properties under the other Gauss-Markov assumptions.
The situation is much different when selection is based on the dependent variable, y, which is called sample selection based on the dependent variable and is an example of endogenous sample selection. If the sample is based on whether the dependent variable is above or below a given value, bias always occurs in OLS in estimating the population model. For example, suppose we wish to estimate the relationship between individual wealth and several other factors in the population of all adults:

wealth = β0 + β1·educ + β2·exper + β3·age + u.  (9.38)

Suppose that only people with wealth below $250,000 are included in the sample. This is a nonrandom sample from the population of interest, and it is based on the value of the dependent variable. Using a sample on people with wealth below $250,000 will result in biased and inconsistent estimators of the parameters in (9.38). Briefly, this occurs because the population regression E(wealth|educ, exper, age) is not the same as the expected value conditional on wealth being less than $250,000.

Other sampling schemes lead to nonrandom samples from the population, usually intentionally. A common method of data collection is stratified sampling, in which the population is divided into nonoverlapping, exhaustive groups, or strata. Then, some groups are sampled more frequently than is dictated by their population representation, and some groups are sampled less frequently. For example, some surveys purposely oversample minority groups or low-income groups. Whether special methods are needed again hinges on whether the stratification is exogenous (based on exogenous explanatory variables) or endogenous (based on the dependent variable). Suppose that a survey of military personnel oversampled women because the initial interest was in studying the factors that determine pay for women in the military. (Oversampling a group that is relatively small in the population is common in collecting stratified samples.) Provided men were sampled as well, we can use OLS on the stratified sample to estimate any gender differential, along with the returns to education and experience for all military personnel. (We might be willing to assume that the returns to education and experience are not gender specific.) OLS is unbiased and consistent because the stratification is with respect to an explanatory variable, namely, gender.

If, instead, the survey oversampled lower-paid military personnel, then OLS using the stratified sample does not consistently estimate the parameters of the military wage equation, because the stratification is endogenous. In such cases, special econometric methods are needed [see Wooldridge (2010, Chapter 19)].
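A simulation contrasts the two cases. The sketch below is my own illustration: it reuses the saving model (9.37) for both, with selection on the level of saving standing in for the wealth example in (9.38); all numbers are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 200_000
income = 50 + 15 * rng.normal(size=n)
age = rng.integers(25, 65, size=n).astype(float)
saving = -5.0 + 0.2 * income + 0.1 * age + 10 * rng.normal(size=n)

def ols(y, *cols):
    X = np.column_stack([np.ones_like(y)] + list(cols))
    return np.linalg.lstsq(X, y, rcond=None)[0]

on_x = age > 35          # exogenous: selection on an explanatory variable
on_y = saving < 10.0     # endogenous: selection on the dependent variable
print(ols(saving[on_x], income[on_x], age[on_x]))  # slopes near 0.2 and 0.1
print(ols(saving[on_y], income[on_y], age[on_y]))  # slopes attenuated: biased
```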
Stratified sampling is a fairly obvious form of nonrandom sampling. Other sample selection issues are more subtle. For instance, in several previous examples, we have estimated the effects of various variables, particularly education and experience, on hourly wage. The data set WAGE1 that we have used throughout is essentially a random sample of working individuals. Labor economists are often interested in estimating the effect of, say, education on the wage offer. The idea is this: every person of working age faces an hourly wage offer, and he or she can either work at that wage or not work. For someone who does work, the wage offer is just the wage earned. For people who do not work, we usually cannot observe the wage offer. Now, since the wage offer equation

$\log(wage^o) = \beta_0 + \beta_1 educ + \beta_2 exper + u \qquad (9.39)$

represents the population of all working-age people, we cannot estimate it using a random sample from this population; instead, we have data on the wage offer only for working people (although we can get data on educ and exper for nonworking people). If we use a random sample of working people to estimate (9.39), will we get unbiased estimators? This case is not clear-cut. Since the sample is selected based on someone's decision to work (as opposed to the size of the wage offer), this is not like the previous case. However, since the decision to work might be related to unobserved factors that affect the wage offer, selection might be endogenous, and this can result in a sample selection bias in the OLS estimators. We will cover methods that can be used to test and correct for sample selection bias in Chapter 17.

Exploring Further 9.4
Suppose we are interested in the effects of campaign expenditures by incumbents on voter support. Some incumbents choose not to run for reelection. If we can only collect voting and spending outcomes on incumbents who actually do run, is there likely to be endogenous sample selection?

9.5c Outliers and Influential Observations

In some applications, especially, but not only, with small data sets, the OLS estimates are sensitive to the inclusion of one or several observations. A complete treatment of outliers and influential observations is beyond the scope of this book, because a formal development requires matrix algebra. Loosely speaking, an observation is an influential observation if dropping it from the analysis changes the key OLS estimates by a practically large amount. The notion of an outlier is also a bit vague, because it requires comparing values of the variables for one observation with those for the remaining sample. Nevertheless, one wants to be on the lookout for unusual observations because they can greatly affect the OLS estimates.

OLS is susceptible to outlying observations because it minimizes the sum of squared residuals: large residuals (positive or negative) receive a lot of weight in the least squares minimization problem. If the estimates change by a practically large amount when we slightly modify our sample, we should be concerned.

When statisticians and econometricians study the problem of outliers theoretically, sometimes the data are viewed as being from a random sample from a given population (albeit with an unusual distribution that can result in extreme values), and sometimes the outliers are assumed to come from a different population. From a practical perspective, outlying observations can occur for two reasons. The easiest case to deal with is when a mistake has been made in entering the data.
Adding extra zeros to a number or misplacing a decimal point can throw off the OLS estimates, especially in small sample sizes. It is always a good idea to compute summary statistics, especially minimums and maximums, in order to catch mistakes in data entry. Unfortunately, incorrect entries are not always obvious.

Outliers can also arise when sampling from a small population if one or several members of the population are very different in some relevant aspect from the rest of the population. The decision to keep or drop such observations in a regression analysis can be a difficult one, and the statistical properties of the resulting estimators are complicated. Outlying observations can provide important information by increasing the variation in the explanatory variables (which reduces standard errors). But OLS results should probably be reported with and without outlying observations in cases where one or several data points substantially change the results.

Example 9.8: R&D Intensity and Firm Size

Suppose that R&D expenditures as a percentage of sales (rdintens) are related to sales (in millions) and profits as a percentage of sales (profmarg):

$rdintens = \beta_0 + \beta_1 sales + \beta_2 profmarg + u. \qquad (9.40)$

The OLS equation using data on 32 chemical companies in RDCHEM is

$\widehat{rdintens} = 2.625 + .000053\,sales + .0446\,profmarg$
  (.586)   (.000044)   (.0462)
$n = 32,\ R^2 = .0761,\ \bar{R}^2 = .0124.$

Neither sales nor profmarg is statistically significant at even the 10% level in this regression.

Of the 32 firms, 31 have annual sales less than $20 billion. One firm has annual sales of almost $40 billion. Figure 9.1 shows how far this firm is from the rest of the sample. In terms of sales, this firm is over twice as large as every other firm, so it might be a good idea to estimate the model without it. When we do this, we obtain

$\widehat{rdintens} = 2.297 + .000186\,sales + .0478\,profmarg$
  (.592)   (.000084)   (.0445)
$n = 31,\ R^2 = .1728,\ \bar{R}^2 = .1137.$

[Figure 9.1: Scatterplot of R&D intensity against firm sales. R&D as a percentage of sales is plotted against firm sales in millions of dollars; one firm, a possible outlier, lies far to the right of the rest of the sample.]

When the largest firm is dropped from the regression, the coefficient on sales more than triples, and it now has a t statistic over two. Using the sample of smaller firms, we would conclude that there is a statistically significant positive effect between R&D intensity and firm size. The profit margin is still not significant, and its coefficient has not changed by much. (A code sketch of this comparison appears below.)

Sometimes, outliers are defined by the size of the residual in an OLS regression where all of the observations are used. Generally, this is not a good idea, because the OLS estimates adjust to make the sum of squared residuals as small as possible. In the previous example, including the largest firm flattened the OLS regression line considerably, which made the residual for that estimation not especially large. In fact, the residual for the largest firm is 1.62 when all 32 observations are used. This value of the residual is not even one estimated standard deviation, $\hat{\sigma} = 1.82$, from the mean of the residuals, which is zero by construction.
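Returning to Example 9.8, the with-and-without comparison is a quick refit once the data are loaded. The sketch below assumes the RDCHEM variables sit in a CSV file; the file name and the use of pandas and statsmodels are assumptions, not part of the text.

```python
# Sketch of the drop-one-outlier comparison in Example 9.8.
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical file with columns rdintens, sales, profmarg.
df = pd.read_csv("rdchem.csv")

full = smf.ols("rdintens ~ sales + profmarg", data=df).fit()

# Drop the single firm with sales of almost $40 billion (sales in millions).
no_outlier = smf.ols("rdintens ~ sales + profmarg",
                     data=df[df["sales"] < 20_000]).fit()

# The sales coefficient roughly triples once the largest firm is excluded.
print(full.params["sales"], no_outlier.params["sales"])
```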
Studentized residuals are obtained from the original OLS residuals by dividing them by an estimate of their standard deviation (conditional on the explanatory variables in the sample). The formula for the studentized residuals relies on matrix algebra, but it turns out there is a simple trick to compute a studentized residual for any observation. Namely, define a dummy variable equal to one for that observation (say, observation h) and then include the dummy variable in the regression (using all observations) along with the other explanatory variables. The coefficient on the dummy variable has a useful interpretation: it is the residual for observation h computed from the regression line using only the other observations. Therefore, the dummy's coefficient can be used to see how far off the observation is from the regression line obtained without using that observation. Even better, the t statistic on the dummy variable is equal to the studentized residual for observation h. Under the classical linear model assumptions, this t statistic has a $t_{n-k-2}$ distribution. Therefore, a large value of the t statistic (in absolute value) implies a large residual relative to its estimated standard deviation.

For Example 9.8, if we define a dummy variable for the largest firm (observation 10 in the data file) and include it as an additional regressor, its coefficient is 6.57, verifying that the observation for the largest firm is very far from the regression line obtained using the other observations. However, when studentized, the residual is only 1.82. While this is a marginally significant t statistic (two-sided p-value = .08), it is not close to being the largest studentized residual in the sample. If we use the same method for the observation with the highest value of rdintens (the first observation, with rdintens of about 9.42), the coefficient on the dummy variable is 6.72, with a t statistic of 4.56. Therefore, by this measure, the first observation is more of an outlier than the tenth. Yet dropping the first observation changes the coefficient on sales by only a small amount (to about .000051 from .000053), although the coefficient on profmarg becomes larger and statistically significant. So, is the first observation an outlier, too? These calculations show the conundrum one can enter when trying to determine observations that should be excluded from a regression analysis, even when the data set is small. Unfortunately, the size of the studentized residual need not correspond to how influential an observation is for the OLS slope estimates, and certainly not for all of them at once.

A general problem with using studentized residuals is that, in effect, all other observations are used to estimate the regression line to compute the residual for a particular observation. In other words, when the studentized residual is obtained for the first observation, the tenth observation has been used in estimating the intercept and slope. Given how flat the regression line is with the largest firm (tenth observation) included, it is not too surprising that the first observation, with its high value of rdintens, is far off the regression line. (The sketch below shows how to compute a studentized residual with the dummy-variable trick.)
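Here is a sketch of the dummy-variable trick just described. The data setup mirrors the previous sketch, and the helper function is hypothetical; the point is only that the t statistic on a one-observation dummy is that observation's studentized residual.

```python
# Studentized residual for observation h via a one-observation dummy.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("rdchem.csv")  # hypothetical file name

def studentized_residual(data, h):
    data = data.copy()
    data["d_h"] = 0.0
    data.loc[data.index[h], "d_h"] = 1.0
    fit = smf.ols("rdintens ~ sales + profmarg + d_h", data=data).fit()
    # params["d_h"]: observation h's residual from the line estimated
    # without observation h; tvalues["d_h"]: its studentized residual.
    return fit.params["d_h"], fit.tvalues["d_h"]

# For the largest firm (observation 10, i.e., row index 9 if the row
# ordering matches the text), this should return roughly 6.57 and 1.82.
print(studentized_residual(df, 9))
```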
Of course, we can add two dummy variables at the same time, one for the first observation and one for the tenth, which has the effect of using only the remaining 30 observations to estimate the regression line. If we estimate the equation without the first and tenth observations, the results are

$\widehat{rdintens} = 1.939 + .000160\,sales + .0701\,profmarg$
  (.459)   (.000065)   (.0343)
$n = 30,\ R^2 = .2711,\ \bar{R}^2 = .2171.$

The coefficient on the dummy for the first observation is 6.47 (t = 4.58), and for the tenth observation it is 5.41 (t = 2.195). Notice that the coefficients on sales and profmarg are both statistically significant, the latter at just about the 5% level against a two-sided alternative (p-value = .051). Even in this regression, there are still two observations with studentized residuals greater than two (corresponding to the two remaining observations with R&D intensities above six).

Certain functional forms are less sensitive to outlying observations. In Section 6.2, we mentioned that, for most economic variables, the logarithmic transformation significantly narrows the range of the data and also yields functional forms (such as constant elasticity models) that can explain a broader range of data.

Example 9.9: R&D Intensity

We can test whether R&D intensity increases with firm size by starting with the model

$rd = sales^{\beta_1}\exp(\beta_0 + \beta_2 profmarg + u). \qquad (9.41)$

Then, holding other factors fixed, R&D intensity increases with sales if and only if $\beta_1 > 1$. Taking the log of (9.41) gives

$\log(rd) = \beta_0 + \beta_1\log(sales) + \beta_2 profmarg + u. \qquad (9.42)$

When we use all 32 firms, the regression equation is

$\widehat{\log(rd)} = -4.378 + 1.084\,\log(sales) + .0217\,profmarg$
  (.468)   (.060)   (.0128)
$n = 32,\ R^2 = .9180,\ \bar{R}^2 = .9123,$

while dropping the largest firm gives

$\widehat{\log(rd)} = -4.404 + 1.088\,\log(sales) + .0218\,profmarg$
  (.511)   (.067)   (.0130)
$n = 31,\ R^2 = .9037,\ \bar{R}^2 = .8968.$

Practically, these results are the same. In neither case do we reject the null $H_0\colon \beta_1 = 1$ against $H_1\colon \beta_1 > 1$. (Why?)

In some cases, certain observations are suspected at the outset of being fundamentally different from the rest of the sample. This often happens when we use data at very aggregated levels, such as the city, county, or state level. The following is an example.

Example 9.10: State Infant Mortality Rates

Data on infant mortality, per capita income, and measures of health care can be obtained at the state level from the Statistical Abstract of the United States. We will provide a fairly simple analysis here, just to illustrate the effect of outliers. The data are for the year 1990, and we have all 50 states in the United States, plus the District of Columbia (DC). The variable infmort is the number of deaths within the first year per 1,000 live births, pcinc is per capita income, physic is physicians per 100,000 members of the civilian population, and popul is the population (in thousands). The data are contained in INFMRT. We include all independent variables in logarithmic form:

$\widehat{infmort} = 33.86 - 4.68\,\log(pcinc) + 4.15\,\log(physic) - .088\,\log(popul) \qquad (9.43)$
  (20.43)   (2.60)   (1.51)   (.287)
$n = 51,\ R^2 = .139,\ \bar{R}^2 = .084.$

Higher per capita income is estimated to lower infant mortality, an expected result.
But more physicians per capita is associated with higher infant mortality rates, something that is counterintuitive. Infant mortality rates do not appear to be related to population size.

The District of Columbia is unusual in that it has pockets of extreme poverty and great wealth in a small area. In fact, the infant mortality rate for DC in 1990 was 20.7, compared with 12.4 for the highest state. It also has 615 physicians per 100,000 members of the civilian population, compared with 337 for the highest state. The high number of physicians coupled with the high infant mortality rate in DC could certainly influence the results. If we drop DC from the regression, we obtain

$\widehat{infmort} = 23.95 - .57\,\log(pcinc) - 2.74\,\log(physic) + .629\,\log(popul) \qquad (9.44)$
  (12.42)   (1.64)   (1.19)   (.191)
$n = 50,\ R^2 = .273,\ \bar{R}^2 = .226.$

We now find that more physicians per capita lowers infant mortality, and the estimate is statistically different from zero at the 5% level. The effect of per capita income has fallen sharply and is no longer statistically significant. In equation (9.44), infant mortality rates are higher in more populous states, and the relationship is very statistically significant. Also, much more variation in infmort is explained when DC is dropped from the regression. Clearly, DC had substantial influence on the initial estimates, and we would probably leave it out of any further analysis.

As Example 9.8 demonstrates, inspecting observations in trying to determine which are outliers, and even which ones have substantial influence on the OLS estimates, is a difficult endeavor. More advanced treatments allow more formal approaches to determine which observations are likely to be influential observations. Using matrix algebra, Belsley, Kuh, and Welsch (1980) define the leverage of an observation, which formalizes the notion that an observation has a large or small influence on the OLS estimates. These authors also provide a more in-depth discussion of standardized and studentized residuals.

9.6 Least Absolute Deviations Estimation

Rather than trying to determine which observations, if any, have undue influence on the OLS estimates, a different approach to guarding against outliers is to use an estimation method that is less sensitive to outliers than OLS. One such method, which has become popular among applied econometricians, is called least absolute deviations (LAD). The LAD estimators of the $\beta_j$ in a linear model minimize the sum of the absolute values of the residuals:

$\min_{b_0, b_1, \ldots, b_k}\ \sum_{i=1}^{n} |y_i - b_0 - b_1 x_{i1} - \cdots - b_k x_{ik}|. \qquad (9.45)$

Unlike OLS, which minimizes the sum of squared residuals, the LAD estimates are not available in closed form; that is, we cannot write down formulas for them. In fact, historically, solving the problem in equation (9.45) was computationally difficult, especially with large sample sizes and many explanatory variables. But with the vast improvements in computational speed over the past two decades, LAD estimates are fairly easy to obtain even for large data sets.
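Because LAD is a special case of quantile regression at the median (a connection discussed at the end of this section), one practical way to compute it is statsmodels' QuantReg at the 0.5 quantile. The sketch below uses simulated data with one gross outlier; the data-generating process and all numbers are assumptions for illustration.

```python
# LAD (median regression) versus OLS in the presence of one wild outlier.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
df = pd.DataFrame({"x": rng.uniform(0, 10, 200)})
df["y"] = 1.0 + 0.5 * df["x"] + rng.normal(0, 1, 200)
df.loc[0, "y"] = 500.0                            # a single gross outlier

ols = smf.ols("y ~ x", data=df).fit()
lad = smf.quantreg("y ~ x", data=df).fit(q=0.5)   # minimizes sum of |residuals|

# The OLS slope is pulled around by the outlier; the LAD slope stays
# near the true value of 0.5.
print(ols.params["x"], lad.params["x"])
```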
Figure 9.2 shows the OLS and LAD objective functions. The LAD objective function is linear on either side of zero, so that if, say, a positive residual increases by one unit, the LAD objective function increases by one unit. By contrast, the OLS objective function gives increasing importance to large residuals, and this makes OLS more sensitive to outlying observations.

[Figure 9.2: The OLS and LAD objective functions plotted against the residual u. The OLS objective is quadratic in u; the LAD objective is the absolute value of u.]

Because LAD does not give increasing weight to larger residuals, it is much less sensitive to changes in the extreme values of the data than OLS. In fact, it is known that LAD is designed to estimate the parameters of the conditional median of y given $x_1, x_2, \ldots, x_k$, rather than the conditional mean. Because the median is not affected by large changes in the extreme observations, it follows that the LAD parameter estimates are more resilient to outlying observations. (See Section A.1 for a brief discussion of the sample median.) In choosing the estimates, OLS squares each residual, and so the OLS estimates can be very sensitive to outlying observations, as we saw in Examples 9.8 and 9.10.

In addition to LAD being more computationally intensive than OLS, a second drawback of LAD is that all statistical inference involving the LAD estimators is justified only as the sample size grows. [The formulas are somewhat complicated and require matrix algebra, and we do not need them here. Koenker (2005) provides a comprehensive treatment.] Recall that, under the classical linear model assumptions, the OLS t statistics have exact t distributions, and F statistics have exact F distributions. While asymptotic versions of these statistics are available for LAD (and reported routinely by software packages that compute LAD estimates), these are justified only in large samples. Like the additional computational burden involved in computing LAD estimates, the lack of exact inference for LAD is only of minor concern, because most applications of LAD involve several hundred, if not several thousand, observations. Of course, we might be pushing it if we apply large-sample approximations in an example such as Example 9.8, with n = 32. In a sense, this is not very different from OLS because, more often than not, we must appeal to large-sample approximations to justify OLS inference whenever any of the CLM assumptions fail.

A more subtle but important drawback to LAD is that it does not always consistently estimate the parameters appearing in the conditional mean function, $E(y|x_1, \ldots, x_k)$. As mentioned earlier, LAD is intended to estimate the effects on the conditional median. Generally, the mean and median are the same only when the distribution of y given the covariates $x_1, \ldots, x_k$ is symmetric about $\beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k$. (Equivalently, the population error term, u, is symmetric about zero.) Recall that OLS produces unbiased and consistent estimators of the parameters in the conditional mean whether or not the error distribution is symmetric; symmetry does not appear among the Gauss-Markov assumptions. When LAD and OLS are applied to cases with asymmetric distributions, the estimated partial effect of, say, $x_1$, obtained from LAD can be very different from the partial effect obtained from OLS. But such a difference could just reflect the difference between the median and the mean,
and might not have anything to do with outliers. (See Computer Exercise C9 for an example.)

If we assume that the population error u in model (9.2) is independent of $(x_1, \ldots, x_k)$, then the OLS and LAD slope estimates should differ only by sampling error, whether or not the distribution of u is symmetric. The intercept estimates generally will be different to reflect the fact that, if the mean of u is zero, then its median is different from zero under asymmetry. Unfortunately, independence between the error and the explanatory variables is often unrealistically strong when LAD is applied. In particular, independence rules out heteroskedasticity, a problem that often arises in applications with asymmetric distributions.

An advantage that LAD has over OLS is that, because LAD estimates the median, it is easy to obtain partial effects, and predictions, using monotonic transformations. Here we consider the most common transformation, taking the natural log. Suppose that log(y) follows a linear model where the error has a zero conditional median:

$\log(y) = \beta_0 + \mathbf{x}\boldsymbol{\beta} + u \qquad (9.46)$
$\mathrm{Med}(u|\mathbf{x}) = 0, \qquad (9.47)$

which implies that $\mathrm{Med}[\log(y)|\mathbf{x}] = \beta_0 + \mathbf{x}\boldsymbol{\beta}$. A well-known feature of the conditional median [see, for example, Wooldridge (2010, Chapter 12)] is that it passes through increasing functions. Therefore,

$\mathrm{Med}(y|\mathbf{x}) = \exp(\beta_0 + \mathbf{x}\boldsymbol{\beta}). \qquad (9.48)$

It follows that $\beta_j$ is the semi-elasticity of $\mathrm{Med}(y|\mathbf{x})$ with respect to $x_j$. In other words, the partial effect of $x_j$ in the linear equation (9.46) can be used to uncover the partial effect in the nonlinear model (9.48). It is important to understand that this holds for any distribution of u such that (9.47) holds, and we need not assume u and x are independent. By contrast, if we specify a linear model for $E[\log(y)|\mathbf{x}]$, then, in general, there is no way to uncover $E(y|\mathbf{x})$. If we make a full distributional assumption for u given x, then, in principle, we can recover $E(y|\mathbf{x})$. We covered the special case in equation (6.40) under the assumption that log(y) follows a classical linear model. However, in general, there is no way to find $E(y|\mathbf{x})$ from a model for $E[\log(y)|\mathbf{x}]$, even though we can always obtain $\mathrm{Med}(y|\mathbf{x})$ from $\mathrm{Med}[\log(y)|\mathbf{x}]$. Problem 9 investigates how heteroskedasticity in a linear model for log(y) confounds our ability to find $E(y|\mathbf{x})$.
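A short simulation sketch of this retransformation (the data-generating process and library calls are assumptions, not from the text): the error below is heteroskedastic and asymmetric, but its conditional median is zero, so exponentiating the fitted values from a median regression of log(y) yields valid predictions of Med(y|x), exactly as (9.48) says. No comparable step recovers E(y|x) without distributional assumptions.

```python
# Median predictions of y from a median (LAD) regression of log(y).
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(2)
df = pd.DataFrame({"x": rng.uniform(1, 5, 500)})
scale = 0.3 * df["x"]
# Exponential error recentered so that Med(u|x) = 0 while E(u|x) > 0:
# the median of an Exponential(scale) draw is scale*log(2).
u = rng.exponential(scale) - scale * np.log(2.0)
df["logy"] = 0.5 + 0.8 * df["x"] + u
df["y"] = np.exp(df["logy"])

lad = smf.quantreg("logy ~ x", data=df).fit(q=0.5)
med_pred = np.exp(lad.fittedvalues)   # estimates of Med(y|x), per (9.48)
print(lad.params)
```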
LAD is a special case of what is often called robust regression. Unfortunately, the way "robust" is used here can be confusing. In the statistics literature, a robust regression estimator is relatively insensitive to extreme observations: effectively, observations with large residuals are given less weight than in least squares. [Berk (1990) contains an introductory treatment of estimators that are robust to outlying observations.] Based on our earlier discussion, in econometric parlance, LAD is not a robust estimator of the conditional mean, because it requires extra assumptions in order to consistently estimate the conditional mean parameters. In equation (9.2), either the distribution of u given $(x_1, \ldots, x_k)$ has to be symmetric about zero, or u must be independent of $(x_1, \ldots, x_k)$. Neither of these is required for OLS.

LAD is also a special case of quantile regression, which is used to estimate the effects of the $x_j$ on different parts of the distribution, not just the median (or mean). For example, in a study to see how having access to a particular pension plan affects wealth, it could be that access affects high-wealth people differently from low-wealth people, and these effects both differ from the median person. Wooldridge (2010, Chapter 12) contains a treatment and examples of quantile regression.

Summary

We have further investigated some important specification and data issues that often arise in empirical cross-sectional analysis. Misspecified functional form makes the estimated equation difficult to interpret. Nevertheless, incorrect functional form can be detected by adding quadratics, computing RESET, or testing against a nonnested alternative model using the Davidson-MacKinnon test. No additional data collection is needed.

Solving the omitted variables problem is more difficult. In Section 9.2, we discussed a possible solution based on using a proxy variable for the omitted variable. Under reasonable assumptions, including the proxy variable in an OLS regression eliminates, or at least reduces, bias. The hurdle in applying this method is that proxy variables can be difficult to find. A general possibility is to use data on a dependent variable from a prior year.

Applied economists are often concerned with measurement error. Under the classical errors-in-variables (CEV) assumptions, measurement error in the dependent variable has no effect on the statistical properties of OLS. In contrast, under the CEV assumptions for an independent variable, the OLS estimator for the coefficient on the mismeasured variable is biased toward zero. The bias in coefficients on the other variables can go either way and is difficult to determine.

Nonrandom samples from an underlying population can lead to biases in OLS. When sample selection is correlated with the error term u, OLS is generally biased and inconsistent. On the other hand, exogenous sample selection, which is either based on the explanatory variables or is otherwise independent of u, does not cause problems for OLS. Outliers in data sets can have large impacts on the OLS estimates, especially in small samples. It is important to at least informally identify outliers and to reestimate models with the suspected outliers excluded. Least absolute deviations estimation is an alternative to OLS that is less sensitive to outliers and that delivers consistent estimates of conditional median parameters. In the past 20 years, with computational advances and improved understanding of the pros and cons of LAD and OLS, LAD is used more and more in empirical research, often as a supplement to OLS.

Key Terms

Attenuation Bias; Average Marginal Effect; Average Partial Effect (APE); Classical Errors-in-Variables (CEV); Complete Cases Estimator; Conditional Median; Davidson-MacKinnon Test; Endogenous Explanatory Variable; Endogenous Sample Selection; Exogenous Sample Selection; Functional Form Misspecification;
Influential Observations; Lagged Dependent Variable; Least Absolute Deviations (LAD); Measurement Error; Missing at Random; Missing Completely at Random (MCAR); Missing Data; Multiplicative Measurement Error; Nonnested Models; Nonrandom Sample; Outliers; Plug-In Solution to the Omitted Variables Problem; Proxy Variable; Random Coefficient (Slope) Model; Regression Specification Error Test (RESET); Stratified Sampling; Studentized Residuals.

Problems

1. In Problem 11 in Chapter 4, the R-squared from estimating the model

$\log(salary) = \beta_0 + \beta_1\log(sales) + \beta_2\log(mktval) + \beta_3 profmarg + \beta_4 ceoten + \beta_5 comten + u,$

using the data in CEOSAL2, was $R^2 = .353$ (n = 177). When $ceoten^2$ and $comten^2$ are added, $R^2 = .375$. Is there evidence of functional form misspecification in this model?

2. Let us modify Computer Exercise C4 in Chapter 8 by using voting outcomes in 1990 for incumbents who were elected in 1988. Candidate A was elected in 1988 and was seeking reelection in 1990; voteA90 is Candidate A's share of the two-party vote in 1990. The 1988 voting share of Candidate A is used as a proxy variable for quality of the candidate. All other variables are for the 1990 election. The following equations were estimated, using the data in VOTE2:

$\widehat{voteA90} = 75.71 + .312\,prtystrA + 4.93\,democA - .929\,\log(expendA) - 1.950\,\log(expendB)$
  (9.25)   (.046)   (1.01)   (.684)   (.281)
$n = 186,\ R^2 = .495,\ \bar{R}^2 = .483$

and

$\widehat{voteA90} = 70.81 + .282\,prtystrA + 4.52\,democA - .839\,\log(expendA) - 1.846\,\log(expendB) + .067\,voteA88$
  (10.01)   (.052)   (1.06)   (.687)   (.292)   (.053)
$n = 186,\ R^2 = .499,\ \bar{R}^2 = .485.$

(i) Interpret the coefficient on voteA88 and discuss its statistical significance.
(ii) Does adding voteA88 have much effect on the other coefficients?

3. Let math10 denote the percentage of students at a Michigan high school receiving a passing score on a standardized math test (see also Example 4.2). We are interested in estimating the effect of per-student spending on math performance. A simple model is

$math10 = \beta_0 + \beta_1\log(expend) + \beta_2\log(enroll) + \beta_3 poverty + u,$

where poverty is the percentage of students living in poverty.
(i) The variable lnchprg is the percentage of students eligible for the federally funded school lunch program. Why is this a sensible proxy variable for poverty?
(ii) The table that follows contains OLS estimates, with and without lnchprg as an explanatory variable (standard errors in parentheses).

Dependent Variable: math10
Independent Variables    (1)              (2)
log(expend)              11.13 (3.30)     7.75 (3.04)
log(enroll)              .022 (.615)      -1.26 (.58)
lnchprg                  ---              -.324 (.036)
intercept                -69.24 (26.72)   -23.14 (24.99)
Observations             428              428
R-squared                .0297            .1893

Explain why the effect of expenditures on math10 is lower in column (2) than in column (1). Is the effect in column (2) still statistically greater than zero?
(iii) Does it appear that pass rates are lower at larger schools, other factors being equal? Explain.
(iv) Interpret the coefficient on lnchprg in column (2).
(v) What do you make of the substantial increase in $R^2$ from column (1) to column (2)?

4. The following equation explains weekly hours of television viewing by a child in terms of the child's age, mother's education, father's education, and number of siblings:

$tvhours^{*} = \beta_0 + \beta_1 age + \beta_2 age^2 + \beta_3 motheduc + \beta_4 fatheduc + \beta_5 sibs + u.$
We are worried that $tvhours^{*}$ is measured with error in our survey. Let tvhours denote the reported hours of television viewing per week.
(i) What do the classical errors-in-variables (CEV) assumptions require in this application?
(ii) Do you think the CEV assumptions are likely to hold? Explain.

5. In Example 4.4, we estimated a model relating number of campus crimes to student enrollment for a sample of colleges. The sample we used was not a random sample of colleges in the United States, because many schools in 1992 did not report campus crimes. Do you think that college failure to report crimes can be viewed as exogenous sample selection? Explain.

6. In the model (9.17), show that OLS consistently estimates $\alpha$ and $\beta$ if $a_i$ is uncorrelated with $x_i$ and $b_i$ is uncorrelated with $x_i$ and $x_i^2$, which are weaker assumptions than (9.19). [Hint: Write the equation as in (9.18) and recall from Chapter 5 that sufficient for consistency of OLS for the intercept and slope is $E(u_i) = 0$ and $\mathrm{Cov}(x_i, u_i) = 0$.]

7. Consider the simple regression model with classical measurement error, $y = \beta_0 + \beta_1 x^{*} + u$, where we have m measures on $x^{*}$. Write these as $z_h = x^{*} + e_h$, $h = 1, \ldots, m$. Assume that $x^{*}$ is uncorrelated with $u, e_1, \ldots, e_m$, and that the measurement errors are pairwise uncorrelated and have the same variance, $\sigma_e^2$. Let $w = (z_1 + \cdots + z_m)/m$ be the average of the measures on $x^{*}$, so that, for each observation i, $w_i = (z_{i1} + \cdots + z_{im})/m$ is the average of the m measures. Let $\hat{\beta}_1$ be the OLS estimator from the simple regression of $y_i$ on $w_i$, $i = 1, \ldots, n$, using a random sample of data.
(i) Show that

$\mathrm{plim}(\hat{\beta}_1) = \beta_1\left\{\frac{\sigma_{x^{*}}^2}{\sigma_{x^{*}}^2 + (\sigma_e^2/m)}\right\}.$

[Hint: The plim of $\hat{\beta}_1$ is $\mathrm{Cov}(w, y)/\mathrm{Var}(w)$.]
(ii) How does the inconsistency in $\hat{\beta}_1$ compare with that when only a single measure is available (that is, m = 1)? What happens as m grows? Comment.

8. The point of this exercise is to show that tests for functional form cannot be relied on as a general test for omitted variables. Suppose that, conditional on the explanatory variables $x_1$ and $x_2$, a linear model relating y to $x_1$ and $x_2$ satisfies the Gauss-Markov assumptions:

$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + u$
$E(u|x_1, x_2) = 0$
$\mathrm{Var}(u|x_1, x_2) = \sigma^2.$

To make the question interesting, assume $\beta_2 \neq 0$. Suppose further that $x_2$ has a simple linear relationship with $x_1$:

$x_2 = \delta_0 + \delta_1 x_1 + r$
$E(r|x_1) = 0$
$\mathrm{Var}(r|x_1) = \tau^2.$

(i) Show that $E(y|x_1) = (\beta_0 + \beta_2\delta_0) + (\beta_1 + \beta_2\delta_1)x_1$. Under random sampling, what is the probability limit of the OLS estimator from the simple regression of y on $x_1$? Is the simple regression estimator generally consistent for $\beta_1$?
(ii) If you run the regression of y on $x_1, x_1^2$, what will be the probability limit of the OLS estimator of the coefficient on $x_1^2$? Explain.
(iii) Using substitution, show that we can write

$y = (\beta_0 + \beta_2\delta_0) + (\beta_1 + \beta_2\delta_1)x_1 + u + \beta_2 r.$

It can be shown that, if we define $v = u + \beta_2 r$, then $E(v|x_1) = 0$ and $\mathrm{Var}(v|x_1) = \sigma^2 + \beta_2^2\tau^2$. What consequences does this have for the t statistic on $x_1^2$ from the regression in part (ii)?
(iv) What do you conclude about adding a nonlinear function of $x_1$, in particular $x_1^2$, in an attempt to detect omission of $x_2$?
9. Suppose that log(y) follows a linear model with a linear form of heteroskedasticity. We write this as

$\log(y) = \beta_0 + \mathbf{x}\boldsymbol{\beta} + u$
$u|\mathbf{x} \sim \mathrm{Normal}(0, h(\mathbf{x})),$

so that, conditional on x, u has a normal distribution with mean (and median) zero, but with variance h(x) that depends on x. Because Med(u|x) = 0, equation (9.48) holds: $\mathrm{Med}(y|\mathbf{x}) = \exp(\beta_0 + \mathbf{x}\boldsymbol{\beta})$. Further, using an extension of the result from Chapter 6, it can be shown that

$E(y|\mathbf{x}) = \exp[\beta_0 + \mathbf{x}\boldsymbol{\beta} + h(\mathbf{x})/2].$

(i) Given that h(x) can be any positive function, is it possible to conclude that $\partial E(y|\mathbf{x})/\partial x_j$ is the same sign as $\beta_j$?
(ii) Suppose $h(\mathbf{x}) = \delta_0 + \mathbf{x}\boldsymbol{\delta}$ (and ignore the problem that linear functions are not necessarily always positive). Show that a particular variable, say $x_1$, can have a negative effect on Med(y|x) but a positive effect on E(y|x).
(iii) Consider the case covered in Section 6.4, where $h(\mathbf{x}) = \sigma^2$. How would you predict y using an estimate of E(y|x)? How would you predict y using an estimate of Med(y|x)? Which prediction is always larger?

10. This exercise shows that, in a simple regression model, adding a dummy variable for missing data on the explanatory variable produces a consistent estimator of the slope coefficient if the missingness is unrelated to both the unobservable and observable factors affecting y. Let m be a variable such that m = 1 if we do not observe x and m = 0 if we observe x. We assume that y is always observed. The population model is

$y = \beta_0 + \beta_1 x + u$
$E(u|x) = 0.$

(i) Provide an interpretation of the stronger assumption E(u|x, m) = 0. In particular, what kind of missing data schemes would cause this assumption to fail?
(ii) Show that we can always write

$y = \beta_0 + \beta_1(1 - m)x + \beta_1 mx + u.$

(iii) Let $\{(x_i, y_i, m_i)\colon i = 1, \ldots, n\}$ be random draws from the population, where $x_i$ is missing when $m_i = 1$. Explain the nature of the variable $z_i = (1 - m_i)x_i$. In particular, what does this variable equal when $x_i$ is missing?
(iv) Let $\rho = P(m = 1)$ and assume that m and x are independent. Show that

$\mathrm{Cov}[(1 - m)x, mx] = -\rho(1 - \rho)\mu_x^2,$

where $\mu_x = E(x)$. What does this imply about estimating $\beta_1$ from the regression of $y_i$ on $z_i$, $i = 1, \ldots, n$?
(v) If m and x are independent, it can be shown that

$mx = \delta_0 + \delta_1 m + v,$

where v is uncorrelated with m and $z = (1 - m)x$. Explain why this makes m a suitable proxy variable for mx. What does this mean about the coefficient on $z_i$ in the regression of $y_i$ on $z_i$, $m_i$, $i = 1, \ldots, n$?
(vi) Suppose that, for a population of children, y is a standardized test score, obtained from school records, and x is family income, which is reported voluntarily by families (and so some families do not report their income). Is it realistic to assume m and x are independent? Explain.

Computer Exercises
C1 (i) Apply RESET from equation (9.3) to the model estimated in Computer Exercise C5 in Chapter 7. Is there evidence of functional form misspecification in the equation?
(ii) Compute a heteroskedasticity-robust form of RESET. Does your conclusion from part (i) change?

C2 Use the data set WAGE2 for this exercise.
(i) Use the variable KWW (the "knowledge of the world of work" test score) as a proxy for ability in place of IQ in Example 9.3. What is the estimated return to education in this case?
(ii) Now, use IQ and KWW together as proxy variables. What happens to the estimated return to education?
(iii) In part (ii), are IQ and KWW individually significant? Are they jointly significant?

C3 Use the data from JTRAIN for this exercise.
(i) Consider the simple regression model

$\log(scrap) = \beta_0 + \beta_1 grant + u,$

where scrap is the firm scrap rate and grant is a dummy variable indicating whether a firm received a job training grant. Can you think of some reasons why the unobserved factors in u might be correlated with grant?
(ii) Estimate the simple regression model using the data for 1988. (You should have 54 observations.) Does receiving a job training grant significantly lower a firm's scrap rate?
(iii) Now, add as an explanatory variable log(scrap87). How does this change the estimated effect of grant? Interpret the coefficient on grant. Is it statistically significant at the 5% level against the one-sided alternative $H_1\colon \beta_{grant} < 0$?
(iv) Test the null hypothesis that the parameter on log(scrap87) is one against the two-sided alternative. Report the p-value for the test.
(v) Repeat parts (iii) and (iv), using heteroskedasticity-robust standard errors, and briefly discuss any notable differences.

C4 Use the data for the year 1990 in INFMRT for this exercise.
(i) Reestimate equation (9.43), but now include a dummy variable for the observation on the District of Columbia (called DC). Interpret the coefficient on DC and comment on its size and significance.
(ii) Compare the estimates and standard errors from part (i) with those from equation (9.44). What do you conclude about including a dummy variable for a single observation?

C5 Use the data in RDCHEM to further examine the effects of outliers on OLS estimates and to see how LAD is less sensitive to outliers. The model is

$rdintens = \beta_0 + \beta_1 sales + \beta_2 sales^2 + \beta_3 profmarg + u,$

where you should first change sales to be in billions of dollars to make the estimates easier to interpret.
(i) Estimate the above equation by OLS, both with and without the firm having annual sales of almost $40 billion. Discuss any notable differences in the estimated coefficients.
(ii) Estimate the same equation by LAD, again with and without the largest firm. Discuss any important differences in estimated coefficients.
(iii) Based on your findings in (i) and (ii), would you say OLS or LAD is more resilient to outliers?

C6 Redo Example 4.10 by dropping schools where teacher benefits are less than 1% of salary.
(i) How many observations are lost?
(ii) Does dropping these observations have any important effects on the estimated tradeoff?
C7 Use the data in LOANAPP for this exercise.
(i) How many observations have obrat > 40, that is, other debt obligations more than 40% of total income?
(ii) Reestimate the model in part (iii) of Computer Exercise C8, excluding observations with obrat > 40. What happens to the estimate and t statistic on white?
(iii) Does it appear that the estimate of $\beta_{white}$ is overly sensitive to the sample used?

C8 Use the data in TWOYEAR for this exercise.
(i) The variable stotal is a standardized test variable, which can act as a proxy variable for unobserved ability. Find the sample mean and standard deviation of stotal.
(ii) Run simple regressions of jc and univ on stotal. Are both college education variables statistically related to stotal? Explain.
(iii) Add stotal to equation (4.17) and test the hypothesis that the returns to two- and four-year colleges are the same against the alternative that the return to four-year colleges is greater. How do your findings compare with those from Section 4.4?
(iv) Add $stotal^2$ to the equation estimated in part (iii). Does a quadratic in the test score variable seem necessary?
(v) Add the interaction terms stotal·jc and stotal·univ to the equation from part (iii). Are these terms jointly significant?
(vi) What would be your final model that controls for ability through the use of stotal? Justify your answer.

C9 In this exercise, you are to compare OLS and LAD estimates of the effects of 401(k) plan eligibility on net financial assets. The model is

$nettfa = \beta_0 + \beta_1 inc + \beta_2 inc^2 + \beta_3 age + \beta_4 age^2 + \beta_5 male + \beta_6 e401k + u.$

(i) Use the data in 401KSUBS to estimate the equation by OLS and report the results in the usual form. Interpret the coefficient on e401k.
(ii) Use the OLS residuals to test for heteroskedasticity using the Breusch-Pagan test. Is u independent of the explanatory variables?
(iii) Estimate the equation by LAD and report the results in the same form as for OLS. Interpret the LAD estimate of $\beta_6$.
(iv) Reconcile your findings from parts (i) and (iii).

C10 You need to use two data sets for this exercise, JTRAIN2 and JTRAIN3. The former is the outcome of a job training experiment. The file JTRAIN3 contains observational data, where individuals themselves largely determine whether they participate in job training. The data sets cover the same time period.
(i) In the data set JTRAIN2, what fraction of the men received job training? What is the fraction in JTRAIN3? Why do you think there is such a big difference?
(ii) Using JTRAIN2, run a simple regression of re78 on train. What is the estimated effect of participating in job training on real earnings?
(iii) Now, add as controls to the regression in part (ii) the variables re74, re75, educ, age, black, and hisp. Does the estimated effect of job training on re78 change much? How come? (Hint: Remember that these are experimental data.)
(iv) Do the regressions in parts (ii) and (iii) using the data in JTRAIN3, reporting only the estimated coefficients on train, along with their t statistics. What is the effect now of controlling for the extra factors, and why?
(v) Define avgre = (re74 + re75)/2. Find the sample averages, standard deviations, and minimum and maximum values in the two data sets. Are these data sets representative of the same populations in 1978?
(vi) Almost 96% of men in the data set JTRAIN2 have avgre less than $10,000. Using only these men, run the regression of re78 on train, re74, re75, educ, age, black, and hisp,
and report the training estimate and its t statistic. Run the same regression for JTRAIN3, using only men with avgre < 10. For the subsample of low-income men, how do the estimated training effects compare across the experimental and nonexperimental data sets?
(vii) Now use each data set to run the simple regression of re78 on train, but only for men who were unemployed in 1974 and 1975. How do the training estimates compare now?
(viii) Using your findings from the previous regressions, discuss the potential importance of having comparable populations underlying comparisons of experimental and nonexperimental estimates.

C11 Use the data in MURDER only for the year 1993 for this question, although you will need to first obtain the lagged murder rate, say $mrdrte_{-1}$.
(i) Run the regression of mrdrte on exec, unem. What are the coefficient and t statistic on exec? Does this regression provide any evidence for a deterrent effect of capital punishment?
(ii) How many executions are reported for Texas during 1993? (Actually, this is the sum of executions for the current and past two years.) How does this compare with the other states? Add a dummy variable for Texas to the regression in part (i). Is its t statistic unusually large? From this, does it appear Texas is an outlier?
(iii) To the regression in part (i), add the lagged murder rate. What happens to $\hat{\beta}_{exec}$ and its statistical significance?
(iv) For the regression in part (iii), does it appear Texas is an outlier? What is the effect on $\hat{\beta}_{exec}$ from dropping Texas from the regression?

C12 Use the data in ELEM94_95 to answer this question. See also Computer Exercise C10 in Chapter 4.
(i) Using all of the data, run the regression of lavgsal on bs, lenrol, lstaff, and lunch. Report the coefficient on bs, along with its usual and heteroskedasticity-robust standard errors. What do you conclude about the economic and statistical significance of $\hat{\beta}_{bs}$?
(ii) Now drop the four observations with bs > .5, that is, where average benefits are (supposedly) more than 50% of average salary. What is the coefficient on bs? Is it statistically significant using the heteroskedasticity-robust standard error?
(iii) Verify that the four observations with bs > .5 are 68, 1,127, 1,508, and 1,670. Define four dummy variables, one for each of these observations. (You might call them d68, d1127, d1508, and d1670.) Add these to the regression from part (i), and verify that the OLS coefficients and standard errors on the other variables are identical to those in part (ii). Which of the four dummies has a t statistic statistically different from zero at the 5% level?
(iv) Verify that, in this data set, the data point with the largest studentized residual (largest t statistic on the dummy variable) in part (iii) has a large influence on the OLS estimates. (That is, run OLS using all observations except the one with the large studentized residual.) Does dropping, in turn, each of the other observations with bs > .5 have important effects?
(v) What do you conclude about the sensitivity of OLS to a single observation, even with a large sample size?
(vi) Verify that the LAD estimator is not sensitive to the inclusion of the observation identified in part (iii).

C13 Use the data in CEOSAL2 to answer this question.
(i) Estimate the model

$lsalary = \beta_0 + \beta_1 lsales + \beta_2 lmktval + \beta_3 ceoten + \beta_4 ceoten^2 + u$

by OLS, using all of the observations, where lsalary, lsales, and lmktval are all natural logarithms. Report the results in the usual form, with the usual OLS standard errors.
(You may verify that the heteroskedasticity-robust standard errors are similar.)
(ii) In the regression from part (i), obtain the studentized residuals; call these $str_i$. How many studentized residuals are above 1.96 in absolute value? If the studentized residuals were independent draws from a standard normal distribution, about how many would you expect to be above two in absolute value with 177 draws?
(iii) Reestimate the equation in part (i) by OLS, using only the observations with $|str_i| \le 1.96$. How do the coefficients compare with those in part (i)?
(iv) Estimate the equation in part (i) by LAD, using all of the data. Is the estimate of $\beta_1$ closer to the OLS estimate using the full sample or the restricted sample? What about for $\beta_3$?
(v) Evaluate the following statement: "Dropping outliers based on extreme values of studentized residuals makes the resulting OLS estimates closer to the LAD estimates on the full sample."

C14 Use the data in ECONMATH to answer this question. The population model is

$score = \beta_0 + \beta_1 act + u.$

(i) For how many students is the ACT score missing? What is the fraction of the sample? Define a new variable, actmiss, which equals one if act is missing and zero otherwise.
(ii) Create a new variable, say act0, which is the act score when act is reported and zero when act is missing. Find the average of act0 and compare it with the average for act.
(iii) Run the simple regression of score on act, using only the complete cases. What do you obtain for the slope coefficient and its heteroskedasticity-robust standard error?
(iv) Run the simple regression of score on act0, using all of the cases. Compare the slope coefficient with that in part (iii) and comment.
(v) Now use all of the cases and run the regression of $score_i$ on $act0_i$, $actmiss_i$. What is the slope estimate on $act0_i$? How does it compare with the answers in parts (iii) and (iv)?
(vi) Comparing regressions (iii) and (v), does using all cases and adding the missing data estimator improve estimation of $\beta_1$?
(vii) If you add the variable colgpa to the regressions in parts (iii) and (v), does this change your answer to part (vi)?

Part 2: Regression Analysis with Time Series Data

Now that we have a solid understanding of how to use the multiple regression model for cross-sectional applications, we can turn to the econometric analysis of time series data. Since we will rely heavily on the method of ordinary least squares, most of the work concerning mechanics and inference has already been done. However, as we noted in Chapter 1, time series data have certain characteristics that cross-sectional data do not, and these can require special attention when applying OLS.
Chapter 10 covers basic regression analysis and gives attention to problems unique to time series data. We provide a set of Gauss-Markov and classical linear model assumptions for time series applications. The problems of functional form, dummy variables, trends, and seasonality are also discussed.

Because certain time series models necessarily violate the Gauss-Markov assumptions, Chapter 11 describes the nature of these violations and presents the large sample properties of ordinary least squares. As we can no longer assume random sampling, we must cover conditions that restrict the temporal correlation in a time series in order to ensure that the usual asymptotic analysis is valid.

Chapter 12 turns to an important new problem: serial correlation in the error terms in time series regressions. We discuss the consequences, ways of testing, and methods for dealing with serial correlation. Chapter 12 also contains an explanation of how heteroskedasticity can arise in time series models.

Chapter 10: Basic Regression Analysis with Time Series Data

In this chapter, we begin to study the properties of OLS for estimating linear regression models using time series data. In Section 10.1, we discuss some conceptual differences between time series and cross-sectional data. Section 10.2 provides some examples of time series regressions that are often estimated in the empirical social sciences. We then turn our attention to the finite sample properties of the OLS estimators and state the Gauss-Markov assumptions and the classical linear model assumptions for time series regression. Although these assumptions have features in common with those for the cross-sectional case, they also have some significant differences that we will need to highlight.

In addition, we return to some issues that we treated in regression with cross-sectional data, such as how to use and interpret the logarithmic functional form and dummy variables. The important topics of how to incorporate trends and account for seasonality in multiple regression are taken up in Section 10.5.

10.1 The Nature of Time Series Data

An obvious characteristic of time series data that distinguishes them from cross-sectional data is temporal ordering. For example, in Chapter 1, we briefly discussed a time series data set on employment, the minimum wage, and other economic variables for Puerto Rico. In this data set, we must know that the data for 1970 immediately precede the data for 1971. For analyzing time series data in the social sciences, we must recognize that the past can affect the future, but not vice versa (unlike in the Star Trek universe). To emphasize the proper ordering of time series data, Table 10.1 gives a partial listing
of the data on U.S. inflation and unemployment rates from various editions of the Economic Report of the President, including the 2004 Report (Tables B-42 and B-64).

Table 10.1: Partial Listing of Data on U.S. Inflation and Unemployment Rates, 1948-2003

Year   Inflation   Unemployment
1948     8.1          3.8
1949    -1.2          5.9
1950     1.3          5.3
1951     7.9          3.3
 ...     ...          ...
1998     1.6          4.5
1999     2.2          4.2
2000     3.4          4.0
2001     2.8          4.7
2002     1.6          5.8
2003     2.3          6.0

Another difference between cross-sectional and time series data is more subtle. In Chapters 3 and 4, we studied statistical properties of the OLS estimators based on the notion that samples were randomly drawn from the appropriate population. Understanding why cross-sectional data should be viewed as random outcomes is fairly straightforward: a different sample drawn from the population will generally yield different values of the independent and dependent variables (such as education, experience, wage, and so on). Therefore, the OLS estimates computed from different random samples will generally differ, and this is why we consider the OLS estimators to be random variables.

How should we think about randomness in time series data? Certainly, economic time series satisfy the intuitive requirements for being outcomes of random variables. For example, today we do not know what the Dow Jones Industrial Average will be at the close of the next trading day. We do not know what the annual growth in output will be in Canada during the coming year. Since the outcomes of these variables are not foreknown, they should clearly be viewed as random variables.

Formally, a sequence of random variables indexed by time is called a stochastic process or a time series process. ("Stochastic" is a synonym for random.) When we collect a time series data set, we obtain one possible outcome, or realization, of the stochastic process. We can only see a single realization, because we cannot go back in time and start the process over again. (This is analogous to cross-sectional analysis, where we can collect only one random sample.) However, if certain conditions in history had been different, we would generally obtain a different realization for the stochastic process, and this is why we think of time series data as the outcome of random variables. The set of all possible realizations of a time series process plays the role of the population in cross-sectional analysis. The sample size for a time series data set is the number of time periods over which we observe the variables of interest.

10.2 Examples of Time Series Regression Models

In this section, we discuss two examples of time series models that have been useful in empirical time series analysis and that are easily estimated by ordinary least squares. We will study additional models in Chapter 11.

10.2a Static Models

Suppose that we have time series data available on two variables, say y and z, where $y_t$ and $z_t$ are dated contemporaneously. A static model relating y to z is

$y_t = \beta_0 + \beta_1 z_t + u_t,\quad t = 1, 2, \ldots, n. \qquad (10.1)$

The name "static model" comes from the fact that we are modeling a contemporaneous relationship between y and z. Usually, a static model is postulated when a change in z at time t is believed to have an immediate effect on y: $\Delta y_t = \beta_1\Delta z_t$, when $\Delta u_t = 0$. Static regression models are also used when we are interested in knowing the tradeoff between y and z.

An example of a static model is the static Phillips curve, given by

$inf_t = \beta_0 + \beta_1 unem_t + u_t, \qquad (10.2)$

where $inf_t$ is the annual inflation rate and $unem_t$ is the annual unemployment rate. This form of the Phillips curve assumes a constant natural rate of unemployment and constant inflationary expectations, and it can be used to study the contemporaneous tradeoff between inflation and unemployment. [See, for example, Mankiw (1994, Section 11.2).]

Naturally, we can have several explanatory variables in a static regression model. Let $mrdrte_t$ denote the murders per 10,000 people in a particular city during year t, let $convrte_t$ denote the murder conviction rate, let $unem_t$ be the local unemployment rate, and let $yngmle_t$ be the fraction of the population consisting of males between the ages of 18 and 25. Then, a static multiple regression model explaining murder rates is

$mrdrte_t = \beta_0 + \beta_1 convrte_t + \beta_2 unem_t + \beta_3 yngmle_t + u_t. \qquad (10.3)$

Using a model such as this, we can hope to estimate, for example, the ceteris paribus effect of an increase in the conviction rate on a particular criminal activity.
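As a minimal sketch (the file name and column names are hypothetical, and the inflation variable is called infl here just to keep the column name distinct from other symbols), estimating a static model such as (10.2) is ordinary OLS applied to the time-ordered observations.

```python
# OLS estimation of a static Phillips curve like (10.2).
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("phillips.csv").sort_values("year")  # annual data, in order

static = smf.ols("infl ~ unem", data=df).fit()
print(static.params["unem"])   # the contemporaneous tradeoff, beta_1
```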
Usually, a static model is postulated when a change in z at time t is believed to have an immediate effect on y: Δy_t = β1 Δz_t, when Δu_t = 0. Static regression models are also used when we are interested in knowing the tradeoff between y and z.

An example of a static model is the static Phillips curve, given by

inf_t = β0 + β1 unem_t + u_t,   (10.2)

where inf_t is the annual inflation rate and unem_t is the annual unemployment rate. This form of the Phillips curve assumes a constant natural rate of unemployment and constant inflationary expectations, and it can be used to study the contemporaneous tradeoff between inflation and unemployment. [See, for example, Mankiw (1994, Section 11-2).]

Naturally, we can have several explanatory variables in a static regression model. Let mrdrte_t denote the murders per 10,000 people in a particular city during year t, let convrte_t denote the murder conviction rate, let unem_t be the local unemployment rate, and let yngmle_t be the fraction of the population consisting of males between the ages of 18 and 25. Then, a static multiple regression model explaining murder rates is

mrdrte_t = β0 + β1 convrte_t + β2 unem_t + β3 yngmle_t + u_t.   (10.3)

Using a model such as this, we can hope to estimate, for example, the ceteris paribus effect of an increase in the conviction rate on a particular criminal activity.

10-2b Finite Distributed Lag Models

In a finite distributed lag (FDL) model, we allow one or more variables to affect y with a lag. For example, for annual observations, consider the model

gfr_t = α0 + δ0 pe_t + δ1 pe_{t-1} + δ2 pe_{t-2} + u_t,   (10.4)

where gfr_t is the general fertility rate (children born per 1,000 women of childbearing age) and pe_t is the real dollar value of the personal tax exemption. The idea is to see whether, in the aggregate, the decision to have children is linked to the tax value of having a child. Equation (10.4) recognizes that, for both biological and behavioral reasons, decisions to have children would not immediately result from changes in the personal exemption.

Equation (10.4) is an example of the model

y_t = α0 + δ0 z_t + δ1 z_{t-1} + δ2 z_{t-2} + u_t,   (10.5)

which is an FDL of order two. To interpret the coefficients in (10.5), suppose that z is a constant, equal to c, in all time periods before time t. At time t, z increases by one unit to c + 1 and then reverts to its previous level at time t + 1. (That is, the increase in z is temporary.) More precisely,

..., z_{t-2} = c, z_{t-1} = c, z_t = c + 1, z_{t+1} = c, z_{t+2} = c, ....

To focus on the ceteris paribus effect of z on y, we set the error term in each time period to zero. Then,

y_{t-1} = α0 + δ0 c + δ1 c + δ2 c
y_t = α0 + δ0 (c + 1) + δ1 c + δ2 c
y_{t+1} = α0 + δ0 c + δ1 (c + 1) + δ2 c
y_{t+2} = α0 + δ0 c + δ1 c + δ2 (c + 1)
y_{t+3} = α0 + δ0 c + δ1 c + δ2 c,

and so on. From the first two equations, y_t - y_{t-1} = δ0, which shows that δ0 is the immediate change in y due to the one-unit increase in z at time t. δ0 is usually called the impact propensity or impact multiplier. Similarly, δ1 = y_{t+1} - y_{t-1} is the change in y one period after the temporary change, and δ2 = y_{t+2} - y_{t-1} is the change in y two periods after the change.
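These ceteris paribus calculations are easy to verify numerically. Below is a minimal Python sketch (the coefficient values and the baseline level c are made up for illustration) that traces out y before, during, and after a temporary one-unit increase in z, with the errors held at zero:

```python
import numpy as np

# Hypothetical FDL(2) coefficients for y_t = a0 + d0*z_t + d1*z_{t-1} + d2*z_{t-2}
a0 = 0.0
d = np.array([0.5, 0.8, 0.3])           # (delta_0, delta_1, delta_2), made up

c = 10.0                                # z equals c in every period...
z = np.full(8, c)
z[3] = c + 1.0                          # ...except for a temporary one-unit increase

# With the errors set to zero, compute y_t for t = 2, ..., 7
y = np.array([a0 + d @ z[[t, t - 1, t - 2]] for t in range(2, len(z))])

baseline = a0 + d.sum() * c             # value of y when z = c in every period
print(np.round(y - baseline, 2))        # [0.  0.5 0.8 0.3 0.  0. ] -- the lag distribution
```

The deviations of y from its original level reproduce δ0, δ1, and δ2 exactly, and then return to zero, just as the algebra above shows.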
At time t + 3, y has reverted back to its initial level: y_{t+3} = y_{t-1}. This is because we have assumed that only two lags of z appear in (10.5). When we graph the δ_j as a function of j, we obtain the lag distribution, which summarizes the dynamic effect that a temporary increase in z has on y. A possible lag distribution for the FDL of order two is given in Figure 10.1. (Of course, we would never know the parameters δ_j; instead, we will estimate the δ_j and then plot the estimated lag distribution.)

[Figure 10.1: A lag distribution with two nonzero lags. The horizontal axis shows the lag, j = 0, 1, 2, 3, 4; the vertical axis shows the coefficient. The maximum effect is at the first lag.]

The lag distribution in Figure 10.1 implies that the largest effect is at the first lag. The lag distribution has a useful interpretation. If we standardize the initial value of y at y_{t-1} = 0, the lag distribution traces out all subsequent values of y due to a one-unit, temporary increase in z.

We are also interested in the change in y due to a permanent increase in z. Before time t, z equals the constant c. At time t, z increases permanently to c + 1: z_s = c, s < t, and z_s = c + 1, s ≥ t. Again, setting the errors to zero, we have

y_{t-1} = α0 + δ0 c + δ1 c + δ2 c
y_t = α0 + δ0 (c + 1) + δ1 c + δ2 c
y_{t+1} = α0 + δ0 (c + 1) + δ1 (c + 1) + δ2 c
y_{t+2} = α0 + δ0 (c + 1) + δ1 (c + 1) + δ2 (c + 1),

and so on. With the permanent increase in z, after one period, y has increased by δ0 + δ1, and after two periods, y has increased by δ0 + δ1 + δ2. There are no further changes in y after two periods. This shows that the sum of the coefficients on current and lagged z, δ0 + δ1 + δ2, is the long-run change in y given a permanent increase in z; it is called the long-run propensity (LRP) or long-run multiplier. The LRP is often of interest in distributed lag models.

As an example, in equation (10.4), δ0 measures the immediate change in fertility due to a one-dollar increase in pe. As we mentioned earlier, there are reasons to believe that δ0 is small, if not zero. But δ1 or δ2, or both, might be positive. If pe permanently increases by one dollar, then, after two years, gfr will have changed by δ0 + δ1 + δ2. This model assumes that there are no further changes after two years. Whether this is actually the case is an empirical matter.

An FDL of order q is written as

y_t = α0 + δ0 z_t + δ1 z_{t-1} + ... + δq z_{t-q} + u_t.   (10.6)

This contains the static model as a special case by setting δ1, δ2, ..., δq equal to zero. Sometimes, a primary purpose for estimating a distributed lag model is to test whether z has a lagged effect on y. The impact propensity is always the coefficient on the contemporaneous z, δ0. Occasionally, we omit z_t from (10.6), in which case the impact propensity is zero. In the general case, the lag distribution can be plotted by graphing the estimated δ_j as a function of j. For any horizon h, we can define the cumulative effect as δ0 + δ1 + ... + δh, which is interpreted as the change in the expected outcome h periods after a permanent, one-unit increase in z. Once the δ_j have been estimated, one may plot the estimated cumulative effects as a function of h. The LRP is the cumulative effect after all changes have taken place; it is simply the sum of all of the coefficients on the z_{t-j}:

LRP = δ0 + δ1 + ... + δq.   (10.7)
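Given estimated (or hypothesized) coefficients, the cumulative effects and the LRP are just partial sums. A short sketch, reusing the made-up δ_j from the previous snippet:

```python
import numpy as np

d = np.array([0.5, 0.8, 0.3])     # hypothetical (delta_0, delta_1, delta_2)

cumulative = np.cumsum(d)          # cumulative effects at horizons h = 0, 1, 2
print(cumulative)                  # [0.5 1.3 1.6]
print(cumulative[-1])              # about 1.6 -- the long-run propensity (LRP)
```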
Because of the often substantial correlation in z at different lags (that is, due to multicollinearity in (10.6)), it can be difficult to obtain precise estimates of the individual δ_j. Interestingly, even when the δ_j cannot be precisely estimated, we can often get good estimates of the LRP. We will see an example later.

We can have more than one explanatory variable appearing with lags, or we can add contemporaneous variables to an FDL model. For example, the average education level for women of childbearing age could be added to (10.4), which allows us to account for changing education levels for women.

10-2c A Convention about the Time Index

When models have lagged explanatory variables (and, as we will see in the next chapter, for models with lagged y), confusion can arise concerning the treatment of initial observations. For example, if in (10.5) we assume that the equation holds starting at t = 1, then the explanatory variables for the first time period are z_1, z_0, and z_{-1}. Our convention will be that these are the initial values in our sample, so that we can always start the time index at t = 1. In practice, this is not very important because regression packages automatically keep track of the observations available for estimating models with lags. But for this and the next two chapters, we need some convention concerning the first time period being represented by the regression equation.

Exploring Further 10.1: In an equation for annual data, suppose that

int_t = 1.6 + .48 inf_t - .15 inf_{t-1} + .32 inf_{t-2} + u_t,

where int is an interest rate and inf is the inflation rate. What are the impact and long-run propensities?
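To see concretely the bookkeeping that a regression package does behind the scenes, here is a minimal pandas sketch; the numbers are invented, and in practice the columns would come from a data file such as the FERTIL3 data used later in this chapter:

```python
import pandas as pd

# Invented values for a short annual series
df = pd.DataFrame({'year': [1913, 1914, 1915, 1916, 1917],
                   'gfr':  [124.7, 126.6, 125.0, 123.4, 121.0],
                   'pe':   [0.0, 0.0, 0.0, 0.0, 19.3]})

df['pe_1'] = df['pe'].shift(1)    # pe lagged one period
df['pe_2'] = df['pe'].shift(2)    # pe lagged two periods

# The first two rows have missing lags; packages drop them automatically
print(df.dropna())
```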
10-3 Finite Sample Properties of OLS under Classical Assumptions

In this section, we give a complete listing of the finite sample, or small sample, properties of OLS under standard assumptions. We pay particular attention to how the assumptions must be altered from our cross-sectional analysis to cover time series regressions.

10-3a Unbiasedness of OLS

The first assumption simply states that the time series process follows a model that is linear in its parameters.

Assumption TS.1 (Linear in Parameters): The stochastic process {(x_{t1}, x_{t2}, ..., x_{tk}, y_t): t = 1, 2, ..., n} follows the linear model

y_t = β0 + β1 x_{t1} + ... + βk x_{tk} + u_t,   (10.8)

where {u_t: t = 1, 2, ..., n} is the sequence of errors or disturbances. Here, n is the number of observations (time periods).

In the notation x_{tj}, t denotes the time period, and j is, as usual, a label to indicate one of the k explanatory variables. The terminology used in cross-sectional regression applies here: y_t is the dependent variable, explained variable, or regressand; the x_{tj} are the independent variables, explanatory variables, or regressors.

We should think of Assumption TS.1 as being essentially the same as Assumption MLR.1 (the first cross-sectional assumption), but we are now specifying a linear model for time series data. The examples covered in Section 10-2 can be cast in the form of (10.8) by appropriately defining x_{tj}. For example, equation (10.5) is obtained by setting x_{t1} = z_t, x_{t2} = z_{t-1}, and x_{t3} = z_{t-2}.

To state and discuss several of the remaining assumptions, we let x_t = (x_{t1}, x_{t2}, ..., x_{tk}) denote the set of all independent variables in the equation at time t. Further, X denotes the collection of all independent variables for all time periods. It is useful to think of X as being an array, with n rows and k columns. This reflects how time series data are stored in econometric software packages: the t-th row of X is x_t, consisting of all independent variables for time period t. Therefore, the first row of X corresponds to t = 1, the second row to t = 2, and the last row to t = n. An example is given in Table 10.2, using n = 8 and the explanatory variables in equation (10.3).

Table 10.2 Example of X for the Explanatory Variables in Equation (10.3)

t    convrte    unem    yngmle
1      .46      .074      .12
2      .42      .071      .12
3      .42      .063      .11
4      .47      .062      .09
5      .48      .060      .10
6      .50      .059      .11
7      .55      .058      .12
8      .56      .059      .13

Naturally, as with cross-sectional regression, we need to rule out perfect collinearity among the regressors.

Assumption TS.2 (No Perfect Collinearity): In the sample (and therefore in the underlying time series process), no independent variable is constant nor a perfect linear combination of the others.

We discussed this assumption at length in the context of cross-sectional data in Chapter 3. The issues are essentially the same with time series data. Remember, Assumption TS.2 does allow the explanatory variables to be correlated, but it rules out perfect correlation in the sample.

The final assumption for unbiasedness of OLS is the time series analog of Assumption MLR.4, and it also obviates the need for random sampling in Assumption MLR.2.

Assumption TS.3 (Zero Conditional Mean): For each t, the expected value of the error u_t, given the explanatory variables for all time periods, is zero. Mathematically,

E(u_t|X) = 0, t = 1, 2, ..., n.   (10.9)

This is a crucial assumption, and we need to have an intuitive grasp of its meaning. As in the cross-sectional case, it is easiest to view this assumption in terms of uncorrelatedness: Assumption TS.3 implies that the error at time t, u_t, is uncorrelated with each explanatory variable in every time period. The fact that this is stated in terms of the conditional expectation means that we must also correctly specify the functional relationship between y_t and the explanatory variables. If u_t is independent of X and E(u_t) = 0, then Assumption TS.3 automatically holds.

Given the cross-sectional analysis from Chapter 3, it is not surprising that we require u_t to be uncorrelated with the explanatory variables also dated at time t: in conditional mean terms,

E(u_t|x_{t1}, ..., x_{tk}) = E(u_t|x_t) = 0.   (10.10)

When (10.10) holds, we say that the x_{tj} are contemporaneously exogenous. Equation (10.10) implies that u_t and the explanatory variables are contemporaneously uncorrelated: Corr(x_{tj}, u_t) = 0, for all j.

Assumption TS.3 requires more than contemporaneous exogeneity: u_t must be uncorrelated with x_{sj}, even when s ≠ t. This is a strong sense in which the explanatory variables must be exogenous, and when TS.3 holds, we say that the explanatory variables are strictly exogenous.
In Chapter 11, we will demonstrate that (10.10) is sufficient for proving consistency of the OLS estimator. But to show that OLS is unbiased, we need the strict exogeneity assumption.

In the cross-sectional case, we did not explicitly state how the error term for, say, person i, u_i, is related to the explanatory variables for other people in the sample. This was unnecessary because, with random sampling (Assumption MLR.2), u_i is automatically independent of the explanatory variables for observations other than i. In a time series context, random sampling is almost never appropriate, so we must explicitly assume that the expected value of u_t is not related to the explanatory variables in any time periods.

It is important to see that Assumption TS.3 puts no restriction on correlation in the independent variables or in the u_t across time. Assumption TS.3 only says that the average value of u_t is unrelated to the independent variables in all time periods.

Anything that causes the unobservables at time t to be correlated with any of the explanatory variables in any time period causes Assumption TS.3 to fail. Two leading candidates for failure are omitted variables and measurement error in some of the regressors. But the strict exogeneity assumption can also fail for other, less obvious reasons. In the simple static regression model

y_t = β0 + β1 z_t + u_t,

Assumption TS.3 requires not only that u_t and z_t are uncorrelated, but also that u_t is uncorrelated with past and future values of z. This has two implications. First, z can have no lagged effect on y. If z does have a lagged effect on y, then we should estimate a distributed lag model. A more subtle point is that strict exogeneity excludes the possibility that changes in the error term today can cause future changes in z. This effectively rules out feedback from y to future values of z. For example, consider a simple static model to explain a city's murder rate in terms of police officers per capita:

mrdrte_t = β0 + β1 polpc_t + u_t.

It may be reasonable to assume that u_t is uncorrelated with polpc_t and even with past values of polpc_t; for the sake of argument, assume this is the case. But suppose that the city adjusts the size of its police force based on past values of the murder rate. This means that, say, polpc_{t+1} might be correlated with u_t (since a higher u_t leads to a higher mrdrte_t). If this is the case, Assumption TS.3 is generally violated.

There are similar considerations in distributed lag models. Usually, we do not worry that u_t might be correlated with past z because we are controlling for past z in the model. But feedback from u to future z is always an issue.

Explanatory variables that are strictly exogenous cannot react to what has happened to y in the past. A factor such as the amount of rainfall in an agricultural production function satisfies this requirement: rainfall in any future year is not influenced by the output during the current or past years. But something like the amount of labor input might not be strictly exogenous, as it is chosen by the farmer, and the farmer may adjust the amount of labor based on last year's yield.
Policy variables, such as growth in the money supply, expenditures on welfare, and highway speed limits, are often influenced by what has happened to the outcome variable in the past. In the social sciences, many explanatory variables may very well violate the strict exogeneity assumption.

Even though Assumption TS.3 can be unrealistic, we begin with it in order to conclude that the OLS estimators are unbiased. Most treatments of static and FDL models assume TS.3 by making the stronger assumption that the explanatory variables are nonrandom, or fixed in repeated samples. The nonrandomness assumption is obviously false for time series observations; Assumption TS.3 has the advantage of being more realistic about the random nature of the x_{tj}, while it isolates the necessary assumption about how u_t and the explanatory variables are related in order for OLS to be unbiased.

Theorem 10.1 (Unbiasedness of OLS): Under Assumptions TS.1, TS.2, and TS.3, the OLS estimators are unbiased conditional on X, and therefore unconditionally as well when the expectations exist: E(β̂_j) = β_j, j = 0, 1, ..., k.

The proof of this theorem is essentially the same as that for Theorem 3.1 in Chapter 3, and so we omit it. When comparing Theorem 10.1 to Theorem 3.1, we have been able to drop the random sampling assumption by assuming that, for each t, u_t has zero mean given the explanatory variables at all time periods. If this assumption does not hold, OLS cannot be shown to be unbiased.

Exploring Further 10.2: In the FDL model y_t = α0 + δ0 z_t + δ1 z_{t-1} + u_t, what do we need to assume about the sequence {z_0, z_1, ..., z_n} in order for Assumption TS.3 to hold?

The analysis of omitted variables bias, which we covered in Section 3-3, is essentially the same in the time series case. In particular, Table 3.2 and the discussion surrounding it can be used as before to determine the directions of bias due to omitted variables.

10-3b The Variances of the OLS Estimators and the Gauss-Markov Theorem

We need to add two assumptions to round out the Gauss-Markov assumptions for time series regressions. The first one is familiar from cross-sectional analysis.

Assumption TS.4 (Homoskedasticity): Conditional on X, the variance of u_t is the same for all t: Var(u_t|X) = Var(u_t) = σ², t = 1, 2, ..., n.

This assumption means that Var(u_t|X) cannot depend on X (it is sufficient that u_t and X are independent) and that Var(u_t) is constant over time. When TS.4 does not hold, we say that the errors are heteroskedastic, just as in the cross-sectional case. For example, consider an equation for determining three-month T-bill rates (i3_t) based on the inflation rate (inf_t) and the federal deficit as a percentage of gross domestic product (def_t):

i3_t = β0 + β1 inf_t + β2 def_t + u_t.   (10.11)

Among other things, Assumption TS.4 requires that the unobservables affecting interest rates have a constant variance over time. Since policy regime changes are known to affect the variability of interest rates, this assumption might very well be false. Further, it could be that the variability in interest rates depends on the level of inflation or relative size of the deficit.
This would also violate the homoskedasticity assumption.

When Var(u_t|X) does depend on X, it often depends on the explanatory variables at time t, x_t. In Chapter 12, we will see that the tests for heteroskedasticity from Chapter 8 can also be used for time series regressions, at least under certain assumptions.

The final Gauss-Markov assumption for time series analysis is new.

Assumption TS.5 (No Serial Correlation): Conditional on X, the errors in two different time periods are uncorrelated: Corr(u_t, u_s|X) = 0, for all t ≠ s.

The easiest way to think of this assumption is to ignore the conditioning on X. Then, Assumption TS.5 is simply

Corr(u_t, u_s) = 0, for all t ≠ s.   (10.12)

(This is how the no serial correlation assumption is stated when X is treated as nonrandom.) When considering whether Assumption TS.5 is likely to hold, we focus on equation (10.12) because of its simple interpretation.

When (10.12) is false, we say that the errors in (10.8) suffer from serial correlation, or autocorrelation, because they are correlated across time. Consider the case of errors from adjacent time periods. Suppose that, when u_{t-1} > 0, then, on average, the error in the next time period, u_t, is also positive. Then, Corr(u_t, u_{t-1}) > 0, and the errors suffer from serial correlation. In equation (10.11), this means that, if interest rates are unexpectedly high for this period, then they are likely to be above average (for the given levels of inflation and deficits) for the next period. This turns out to be a reasonable characterization for the error terms in many time series applications, which we will see in Chapter 12. For now, we assume TS.5.

Importantly, Assumption TS.5 assumes nothing about temporal correlation in the independent variables. For example, in equation (10.11), inf_t is almost certainly correlated across time. But this has nothing to do with whether TS.5 holds.

A natural question that arises is: in Chapters 3 and 4, why did we not assume that the errors for different cross-sectional observations are uncorrelated? The answer comes from the random sampling assumption: under random sampling, u_i and u_h are independent for any two observations i and h. It can also be shown that, under random sampling, the errors for different observations are independent conditional on the explanatory variables in the sample. Thus, for our purposes, we consider serial correlation only to be a potential problem for regressions with time series data. (In Chapters 13 and 14, the serial correlation issue will come up in connection with panel data analysis.)

Assumptions TS.1 through TS.5 are the appropriate Gauss-Markov assumptions for time series applications, but they have other uses as well. Sometimes, TS.1 through TS.5 are satisfied in cross-sectional applications, even when random sampling is not a reasonable assumption, such as when the cross-sectional units are large relative to the population. Suppose that we have a cross-sectional data set at the city level. It might be that correlation exists across cities within the same state in some of the explanatory variables, such as property tax rates or per capita welfare payments.
Correlation of the explanatory variables across observations does not cause problems for verifying the Gauss-Markov assumptions, provided the error terms are uncorrelated across cities. However, in this chapter, we are primarily interested in applying the Gauss-Markov assumptions to time series regression problems.

Theorem 10.2 (OLS Sampling Variances): Under the time series Gauss-Markov Assumptions TS.1 through TS.5, the variance of β̂_j, conditional on X, is

Var(β̂_j|X) = σ² / [SST_j (1 - R²_j)], j = 1, ..., k,   (10.13)

where SST_j is the total sum of squares of x_{tj} and R²_j is the R-squared from the regression of x_j on the other independent variables.

Equation (10.13) is the same variance we derived in Chapter 3 under the cross-sectional Gauss-Markov assumptions. Because the proof is very similar to the one for Theorem 3.2, we omit it. The discussion from Chapter 3 about the factors causing large variances, including multicollinearity among the explanatory variables, applies immediately to the time series case.

The usual estimator of the error variance is also unbiased under Assumptions TS.1 through TS.5, and the Gauss-Markov Theorem holds.

Theorem 10.3 (Unbiased Estimation of σ²): Under Assumptions TS.1 through TS.5, the estimator σ̂² = SSR/df is an unbiased estimator of σ², where df = n - k - 1.

Theorem 10.4 (Gauss-Markov Theorem): Under Assumptions TS.1 through TS.5, the OLS estimators are the best linear unbiased estimators conditional on X.
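Equation (10.13), with σ² replaced by σ̂², is an exact algebraic identity for OLS, so it can be checked against any regression routine. A minimal sketch with simulated data (all names and numbers are invented for illustration):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 100
x1 = rng.normal(size=n)
x2 = 0.5 * x1 + rng.normal(size=n)            # deliberately correlated regressors
y = 1.0 + 2.0 * x1 - 1.0 * x2 + rng.normal(size=n)

X = sm.add_constant(np.column_stack([x1, x2]))
res = sm.OLS(y, X).fit()

# Reproduce se(b1) from (10.13): sigma2_hat / [SST_1 * (1 - R^2_1)]
sst1 = ((x1 - x1.mean()) ** 2).sum()
r2_1 = sm.OLS(x1, sm.add_constant(x2)).fit().rsquared
se_b1 = np.sqrt(res.mse_resid / (sst1 * (1 - r2_1)))

print(se_b1, res.bse[1])                      # the two numbers agree
```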
The bottom line here is that OLS has the same desirable finite sample properties under TS.1 through TS.5 that it has under MLR.1 through MLR.5.

10-3c Inference under the Classical Linear Model Assumptions

In order to use the usual OLS standard errors, t statistics, and F statistics, we need to add a final assumption that is analogous to the normality assumption we used for cross-sectional analysis.

Assumption TS.6 (Normality): The errors u_t are independent of X and are independently and identically distributed as Normal(0, σ²).

Assumption TS.6 implies TS.3, TS.4, and TS.5, but it is stronger because of the independence and normality assumptions.

Exploring Further 10.3: In the FDL model y_t = α0 + δ0 z_t + δ1 z_{t-1} + u_t, explain the nature of any multicollinearity in the explanatory variables.

Theorem 10.5 (Normal Sampling Distributions): Under Assumptions TS.1 through TS.6, the CLM assumptions for time series, the OLS estimators are normally distributed, conditional on X. Further, under the null hypothesis, each t statistic has a t distribution, and each F statistic has an F distribution. The usual construction of confidence intervals is also valid.

The implications of Theorem 10.5 are of utmost importance. It implies that, when Assumptions TS.1 through TS.6 hold, everything we have learned about estimation and inference for cross-sectional regressions applies directly to time series regressions. Thus, t statistics can be used for testing statistical significance of individual explanatory variables, and F statistics can be used to test for joint significance.

Just as in the cross-sectional case, the usual inference procedures are only as good as the underlying assumptions. The classical linear model assumptions for time series data are much more restrictive than those for cross-sectional data; in particular, the strict exogeneity and no serial correlation assumptions can be unrealistic. Nevertheless, the CLM framework is a good starting point for many applications.

Example 10.1 (Static Phillips Curve): To determine whether there is a tradeoff, on average, between unemployment and inflation, we can test H0: β1 = 0 against H1: β1 < 0 in equation (10.2). If the classical linear model assumptions hold, we can use the usual OLS t statistic.

We use the file PHILLIPS to estimate equation (10.2), restricting ourselves to the data through 1996. (In later exercises, for example, Computer Exercises C12 and C10 in Chapter 11, you are asked to use all years through 2003. In Chapter 18, we use the years 1997 through 2003 in various forecasting exercises.) The simple regression estimates are

inf_t = 1.42 + .468 unem_t   (10.14)
       (1.72)  (.289)
n = 49, R² = .053, R̄² = .033.
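A sketch of how (10.14) might be reproduced with Python's statsmodels, assuming the PHILLIPS data are available as a CSV file with columns year, inf, and unem (the file name and column names are assumptions):

```python
import pandas as pd
import statsmodels.formula.api as smf

phillips = pd.read_csv('phillips.csv')        # hypothetical file name
subset = phillips[phillips['year'] <= 1996]   # use data through 1996 only

res = smf.ols('inf ~ unem', data=subset).fit()
print(res.summary())                          # should match (10.14): 1.42 + .468 unem
```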
This equation does not suggest a tradeoff between unem and inf: β̂1 > 0. The t statistic for β̂1 is about 1.62, which gives a p-value against a two-sided alternative of about .11. Thus, if anything, there is a positive relationship between inflation and unemployment.

There are some problems with this analysis that we cannot address in detail now. In Chapter 12, we will see that the CLM assumptions do not hold. In addition, the static Phillips curve is probably not the best model for determining whether there is a short-run tradeoff between inflation and unemployment. Macroeconomists generally prefer the expectations augmented Phillips curve, a simple example of which is given in Chapter 11.

As a second example, we estimate equation (10.11) using annual data on the U.S. economy.

Example 10.2 (Effects of Inflation and Deficits on Interest Rates): The data in INTDEF come from the 2004 Economic Report of the President (Tables B-73 and B-79) and span the years 1948 through 2003. The variable i3 is the three-month T-bill rate, inf is the annual inflation rate based on the consumer price index (CPI), and def is the federal budget deficit as a percentage of GDP. The estimated equation is

i3_t = 1.73 + .606 inf_t + .513 def_t   (10.15)
      (0.43)  (.082)      (.118)
n = 56, R² = .602, R̄² = .587.

These estimates show that increases in inflation or the relative size of the deficit increase short-term interest rates, both of which are expected from basic economics. For example, a ceteris paribus one percentage point increase in the inflation rate increases i3 by .606 points. Both inf and def are very statistically significant, assuming, of course, that the CLM assumptions hold.

10-4 Functional Form, Dummy Variables, and Index Numbers

All of the functional forms we learned about in earlier chapters can be used in time series regressions. The most important of these is the natural logarithm: time series regressions with constant percentage effects appear often in applied work.

Example 10.3 (Puerto Rican Employment and the Minimum Wage): Annual data on the Puerto Rican employment rate, minimum wage, and other variables are used by Castillo-Freeman and Freeman (1992) to study the effects of the U.S. minimum wage on employment in Puerto Rico. A simplified version of their model is

log(prepop_t) = β0 + β1 log(mincov_t) + β2 log(usgnp_t) + u_t,   (10.16)

where prepop_t is the employment rate in Puerto Rico during year t (ratio of those working to total population), usgnp_t is real U.S. gross national product (in billions of dollars), and mincov measures the importance of the minimum wage relative to average wages. In particular, mincov = (avgmin/avgwage)·avgcov, where avgmin is the average minimum wage, avgwage is the average overall wage, and avgcov is the average coverage rate (the proportion of workers actually covered by the minimum wage law).
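Since mincov is a constructed variable, it must be built from its components before estimating (10.16). A minimal sketch (the file and column names are assumptions based on the definitions above):

```python
import numpy as np
import pandas as pd

prminwge = pd.read_csv('prminwge.csv')       # hypothetical file name

# mincov = (avgmin/avgwage)*avgcov, then logged for use in (10.16)
prminwge['mincov'] = (prminwge['avgmin'] / prminwge['avgwage']) * prminwge['avgcov']
prminwge['lmincov'] = np.log(prminwge['mincov'])
```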
Using the data in PRMINWGE for the years 1950 through 1987 gives

log(prepop_t) = -1.05 - .154 log(mincov_t) - .012 log(usgnp_t)   (10.17)
               (0.77)  (.065)              (.089)
n = 38, R² = .661, R̄² = .641.

The estimated elasticity of prepop with respect to mincov is -.154, and it is statistically significant with t = -2.37. Therefore, a higher minimum wage lowers the employment rate, something that classical economics predicts. The GNP variable is not statistically significant, but this changes when we account for a time trend in the next section.

We can use logarithmic functional forms in distributed lag models, too. For example, for quarterly data, suppose that money demand (M_t) and gross domestic product (GDP_t) are related by

log(M_t) = α0 + δ0 log(GDP_t) + δ1 log(GDP_{t-1}) + δ2 log(GDP_{t-2}) + δ3 log(GDP_{t-3}) + δ4 log(GDP_{t-4}) + u_t.

The impact propensity in this equation, δ0, is also called the short-run elasticity: it measures the immediate percentage change in money demand given a 1% increase in GDP. The LRP, δ0 + δ1 + ... + δ4, is sometimes called the long-run elasticity: it measures the percentage increase in money demand after four quarters given a permanent 1% increase in GDP.

Binary or dummy independent variables are also quite useful in time series applications. Since the unit of observation is time, a dummy variable represents whether, in each time period, a certain event has occurred. For example, for annual data, we can indicate in each year whether a Democrat or a Republican is president of the United States by defining a variable democ_t, which is unity if the president is a Democrat, and zero otherwise. Or, in looking at the effects of capital punishment on murder rates in Texas, we can define a dummy variable for each year equal to one if Texas had capital punishment during that year, and zero otherwise.

Often, dummy variables are used to isolate certain periods that may be systematically different from other periods covered by a data set.

Example 10.4 (Effects of Personal Exemption on Fertility Rates): The general fertility rate (gfr) is the number of children born to every 1,000 women of childbearing age. For the years 1913 through 1984, the equation

gfr_t = β0 + β1 pe_t + β2 ww2_t + β3 pill_t + u_t

explains gfr in terms of the average real dollar value of the personal tax exemption (pe) and two binary variables. The variable ww2 takes on the value unity during the years 1941 through 1945, when the United States was involved in World War II. The variable pill is unity from 1963 onward, when the birth control pill was made available for contraception.

Using the data in FERTIL3, which were taken from the article by Whittington, Alm, and Peters (1990),

gfr_t = 98.68 + .083 pe_t - 24.24 ww2_t - 31.59 pill_t   (10.18)
       (3.21)  (.030)     (7.46)        (4.08)
n = 72, R² = .473, R̄² = .450.

Each variable is statistically significant at the 1% level against a two-sided alternative. We see that the fertility rate was lower during World War II: given pe, there were about 24 fewer births for every 1,000 women of childbearing age, which is a large reduction. (From 1913 through 1984, gfr ranged from about 65 to 127.) Similarly, the fertility rate has been substantially lower since the introduction of the birth control pill.

The variable of economic interest is pe. The average pe over this time period is $100.40, ranging from zero to $243.83. The coefficient on pe implies that a $12.00 increase in pe increases gfr by about one birth per 1,000 women of childbearing age. This effect is hardly trivial.

In Section 10-2, we noted that the fertility rate may react to changes in pe with a lag. Estimating a distributed lag model with two lags gives

gfr_t = 95.87 + .073 pe_t - .0058 pe_{t-1} + .034 pe_{t-2}
       (3.28)  (.126)      (.1557)          (.126)
       - 22.12 ww2_t - 31.30 pill_t   (10.19)
         (10.73)       (3.98)
n = 70, R² = .499, R̄² = .459.

In this regression, we only have 70 observations because we lose two when we lag pe twice. The coefficients on the pe variables are estimated very imprecisely, and each one is individually insignificant. It turns out that there is substantial correlation between pe_t, pe_{t-1}, and pe_{t-2}, and this multicollinearity makes it difficult to estimate the effect at each lag. However, pe_t, pe_{t-1}, and pe_{t-2} are jointly significant: the F statistic has a p-value = .012. Thus, pe does have an effect on gfr, as we already saw in (10.18), but we do not have good enough estimates to determine whether it is contemporaneous or with a one- or two-year lag (or some of each). Actually, pe_{t-1} and pe_{t-2} are jointly insignificant in this equation (p-value = .95), so at this point, we would be justified in using the static model. But for illustrative purposes, let us obtain a confidence interval for the LRP in this model.

The estimated LRP in (10.19) is .073 - .0058 + .034 ≈ .101. However, we do not have enough information in (10.19) to obtain the standard error of this estimate. To obtain the standard error of the estimated LRP, we use the trick suggested in Section 4-4. Let θ0 = δ0 + δ1 + δ2 denote the LRP, and write δ0 in terms of θ0, δ1, and δ2 as δ0 = θ0 - δ1 - δ2. Next, substitute for δ0 in the model

gfr_t = α0 + δ0 pe_t + δ1 pe_{t-1} + δ2 pe_{t-2} + ...

to get

gfr_t = α0 + (θ0 - δ1 - δ2) pe_t + δ1 pe_{t-1} + δ2 pe_{t-2} + ...
      = α0 + θ0 pe_t + δ1 (pe_{t-1} - pe_t) + δ2 (pe_{t-2} - pe_t) + ....

From this last equation, we can obtain θ̂0 and its standard error by regressing gfr_t on pe_t, (pe_{t-1} - pe_t), (pe_{t-2} - pe_t), ww2_t, and pill_t. The coefficient and associated standard error on pe_t are what we need.
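In software, the substitution amounts to replacing the two lags by their differences from pe_t; the coefficient on pe_t is then θ̂0, with its usual OLS standard error. A minimal sketch (file and column names are assumptions):

```python
import pandas as pd
import statsmodels.formula.api as smf

fertil = pd.read_csv('fertil3.csv')          # hypothetical file with gfr, pe, ww2, pill

# Reparameterize: regress gfr on pe, (pe_{t-1} - pe_t), (pe_{t-2} - pe_t), ww2, pill
fertil['dif1'] = fertil['pe'].shift(1) - fertil['pe']
fertil['dif2'] = fertil['pe'].shift(2) - fertil['pe']

res = smf.ols('gfr ~ pe + dif1 + dif2 + ww2 + pill', data=fertil.dropna()).fit()
print(res.params['pe'], res.bse['pe'])       # theta0-hat (the LRP) and its standard error
```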
Running this regression gives θ̂0 = .101 as the coefficient on pe_t (as we already knew) and se(θ̂0) = .030 (which we could not compute from (10.19)). Therefore, the t statistic for θ̂0 is about 3.37, so θ̂0 is statistically different from zero at small significance levels. Even though none of the δ̂_j is individually significant, the LRP is very significant. The 95% confidence interval for the LRP is about .041 to .160.

Whittington, Alm, and Peters (1990) allow for further lags but restrict the coefficients to help alleviate the multicollinearity problem that hinders estimation of the individual δ_j. (See Problem 6 for an example of how to do this.) For estimating the LRP, which would seem to be of primary interest here, such restrictions are unnecessary. Whittington, Alm, and Peters also control for additional variables, such as average female wage and the unemployment rate.

Binary explanatory variables are the key component in what is called an event study. In an event study, the goal is to see whether a particular event influences some outcome. Economists who study industrial organization have looked at the effects of certain events on firm stock prices. For example, Rose (1985) studied the effects of new trucking regulations on the stock prices of trucking companies.

A simple version of an equation used for such event studies is

Rf_t = β0 + β1 Rm_t + β2 d_t + u_t,

where Rf_t is the stock return for firm f during period t (usually a week or a month), Rm_t is the market return (usually computed for a broad stock market index), and d_t is a dummy variable indicating when the event occurred. For example, if the firm is an airline, d_t might denote whether the airline experienced a publicized accident or near accident during week t. Including Rm_t in the equation controls for the possibility that broad market movements might coincide with airline accidents. Sometimes, multiple dummy variables are used. For example, if the event is the imposition of a new regulation that might affect a certain firm, we might include a dummy variable that is one for a few weeks before the regulation was publicly announced and a second dummy variable for a few weeks after the regulation was announced. The first dummy variable might detect the presence of inside information.
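Event-window dummies like these are easy to construct once the event date is fixed. A sketch with monthly data (the dates and window lengths are invented for illustration):

```python
import pandas as pd

dates = pd.period_range('1980-01', '1989-12', freq='M')
df = pd.DataFrame({'t': dates})

announce = pd.Period('1984-06', freq='M')    # hypothetical announcement month

# Dummy equal to 1 in the six months before the event, and another for the six after
df['before6'] = ((df['t'] >= announce - 6) & (df['t'] < announce)).astype(int)
df['after6'] = ((df['t'] >= announce) & (df['t'] < announce + 6)).astype(int)
```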
Before we give an example of an event study, we need to discuss the notion of an index number and the difference between nominal and real economic variables. An index number typically aggregates a vast amount of information into a single quantity. Index numbers are used regularly in time series analysis, especially in macroeconomic applications. An example of an index number is the index of industrial production (IIP), computed monthly by the Board of Governors of the Federal Reserve. The IIP is a measure of production across a broad range of industries, and, as such, its magnitude in a particular year has no quantitative meaning. In order to interpret the magnitude of the IIP, we must know the base period and the base value. In the 1997 Economic Report of the President (ERP), the base year is 1987, and the base value is 100. (Setting IIP to 100 in the base period is just a convention; it makes just as much sense to set IIP = 1 in 1987, and some indexes are defined with 1 as the base value.) Because the IIP was 107.7 in 1992, we can say that industrial production was 7.7% higher in 1992 than in 1987. We can use the IIP in any two years to compute the percentage difference in industrial output during those two years. For example, because IIP = 61.4 in 1970 and IIP = 85.7 in 1979, industrial production grew by about 39.6% during the 1970s.

It is easy to change the base period for any index number, and sometimes we must do this to give index numbers reported with different base years a common base year. For example, if we want to change the base year of the IIP from 1987 to 1982, we simply divide the IIP for each year by the 1982 value and then multiply by 100 to make the base period value 100. Generally, the formula is

newindex_t = 100(oldindex_t / oldindex_newbase),   (10.20)

where oldindex_newbase is the original value of the index in the new base year. For example, with base year 1987, the IIP in 1992 is 107.7; if we change the base year to 1982, the IIP in 1992 becomes 100(107.7/81.9) = 131.5 (because the IIP in 1982 was 81.9).

Another important example of an index number is a price index, such as the CPI. We already used the CPI to compute annual inflation rates in Example 10.1. As with the industrial production index, the CPI is only meaningful when we compare it across different years (or months, if we are using monthly data). In the 1997 ERP, CPI = 38.8 in 1970 and CPI = 130.7 in 1990. Thus, the general price level grew by almost 237% over this 20-year period. (In 1997, the CPI is defined so that its average in 1982, 1983, and 1984 equals 100; thus, the base period is listed as 1982-1984.)

In addition to being used to compute inflation rates, price indexes are necessary for turning a time series measured in nominal dollars (or current dollars) into real dollars (or constant dollars). Most economic behavior is assumed to be influenced by real, not nominal, variables. For example, classical labor economics assumes that labor supply is based on the real hourly wage, not the nominal wage. Obtaining the real wage from the nominal wage is easy if we have a price index such as the CPI. (We must be a little careful to first divide the CPI by 100, so that the value in the base year is 1.) Then, if w denotes the average hourly wage in nominal dollars and p = CPI/100, the real wage is simply w/p. This wage is measured in dollars for the base period of the CPI. For example, in Table B-45 in the 1997 ERP, average hourly earnings are reported in nominal terms and in 1982 dollars (which means that the CPI used in computing the real wage had the base year 1982). This table reports that the nominal hourly wage in 1960 was $2.09, but measured in 1982 dollars, the wage was $6.79. The real hourly wage had peaked in 1973 at $8.55 in 1982 dollars and had fallen to $7.40 by 1995. Thus, there was a nontrivial decline in real wages over those 22 years. If we compare nominal wages from 1973 and 1995, we get a very misleading picture: $3.94 in 1973 and $11.44 in 1995. Because the real wage fell, the increase in the nominal wage was due entirely to inflation.
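Both calculations, rebasing an index via (10.20) and deflating a nominal wage, are one-liners. A sketch using the numbers quoted above (the 1973 CPI value shown is not taken from the Report; it is simply the value implied by the $3.94 nominal and $8.55 real wages):

```python
# Change the IIP base year from 1987 to 1982 using equation (10.20)
iip_1992, iip_1982 = 107.7, 81.9
print(round(100 * (iip_1992 / iip_1982), 1))   # 131.5

# Real wage in base-period (1982) dollars: w / (CPI/100)
w_nominal = 3.94                                # 1973 nominal hourly wage
cpi_1973 = 46.1                                 # implied by the text's figures (assumption)
print(round(w_nominal / (cpi_1973 / 100), 2))   # about 8.55
```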
Standard measures of economic output are in real terms. The most important of these is gross domestic product, or GDP. When growth in GDP is reported in the popular press, it is always real GDP growth. In the 2012 ERP (Table B-2), GDP is reported in billions of 2005 dollars. We used a similar measure of output, real gross national product, in Example 10.3.

Interesting things happen when real dollar variables are used in combination with natural logarithms. Suppose, for example, that average weekly hours worked are related to the real wage as

log(hours) = β0 + β1 log(w/p) + u.

Using the fact that log(w/p) = log(w) - log(p), we can write this as

log(hours) = β0 + β1 log(w) + β2 log(p) + u,   (10.21)

but with the restriction that β2 = -β1. Therefore, the assumption that only the real wage influences labor supply imposes a restriction on the parameters of model (10.21). If β2 ≠ -β1, then the price level has an effect on labor supply, something that can happen if workers do not fully understand the distinction between real and nominal wages.

There are many practical aspects to the actual computation of index numbers, but it would take us too far afield to cover those here. Detailed discussions of price indexes can be found in most intermediate macroeconomic texts, such as Mankiw (1994, Chapter 2). For us, it is important to be able to use index numbers in regression analysis. As mentioned earlier, since the magnitudes of index numbers are not especially informative, they often appear in logarithmic form, so that regression coefficients have percentage change interpretations.

We now give an example of an event study that also uses index numbers.

Example 10.5 (Antidumping Filings and Chemical Imports): Krupp and Pollard (1996) analyzed the effects of antidumping filings by U.S. chemical industries on imports of various chemicals. We focus here on one industrial chemical, barium chloride, a cleaning agent used in various chemical processes and in gasoline production. The data are contained in the file BARIUM. In the early 1980s, U.S. barium chloride producers believed that China was offering its U.S. imports at an unfairly low price (an action known as dumping), and the barium chloride industry filed a complaint with the U.S. International Trade Commission (ITC) in October 1983. The ITC ruled in favor of the U.S. barium chloride industry in October 1984. There are several questions of interest in this case, but we will touch on only a few of them. First, were imports unusually high in the period immediately preceding the initial filing? Second, did imports change noticeably after an antidumping filing? Finally, what was the reduction in imports after a decision in favor of the U.S. industry?

To answer these questions, we follow Krupp and Pollard by defining three dummy variables: befile6 is equal to 1 during the six months before filing, affile6 indicates the six months after filing, and afdec6 denotes the six months after the positive decision. The dependent variable is the volume of imports of barium chloride from China, chnimp, which we use in logarithmic form. We include as explanatory variables (all in logarithmic form) an index of chemical production, chempi (to control for overall demand for barium chloride), the volume of gasoline production, gas (another demand variable), and an exchange rate index, rtwex, which measures the strength of the dollar against several other currencies. (The chemical production index was defined to be 100 in June 1977.) The analysis here differs somewhat from Krupp and Pollard in that we use natural logarithms of all variables (except the dummy variables, of course), and we include all three dummy variables in the same regression.
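A sketch of the corresponding regression in Python (the file name is an assumption; the BARIUM file mentioned above would supply chnimp, chempi, gas, rtwex, and the three dummies):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

barium = pd.read_csv('barium.csv')           # hypothetical file name

formula = ('np.log(chnimp) ~ np.log(chempi) + np.log(gas) + np.log(rtwex)'
           ' + befile6 + affile6 + afdec6')
res = smf.ols(formula, data=barium).fit()
print(res.summary())                         # compare with equation (10.22)
```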
Using monthly data from February 1978 through December 1988 gives the following:

log(chnimp_t) = -17.80 + 3.12 log(chempi_t) + .196 log(gas_t)
               (21.05)  (.48)               (.907)
              + .983 log(rtwex_t) + .060 befile6_t - .032 affile6_t - .565 afdec6_t   (10.22)
                (.400)             (.261)           (.264)           (.286)
n = 131, R² = .305, R̄² = .271.

The equation shows that befile6 is statistically insignificant, so there is no evidence that Chinese imports were unusually high during the six months before the suit was filed. Further, although the estimate on affile6 is negative, the coefficient is small (indicating about a 3.2% fall in Chinese imports), and it is statistically very insignificant. The coefficient on afdec6 shows a substantial fall in Chinese imports of barium chloride after the decision in favor of the U.S. industry, which is not surprising. Since the effect is so large, we compute the exact percentage change: 100[exp(-.565) - 1] ≈ -43.2%. The coefficient is statistically significant at the 5% level against a two-sided alternative.

The coefficient signs on the control variables are what we expect: an increase in overall chemical production increases the demand for the cleaning agent. Gasoline production does not affect Chinese imports significantly. The coefficient on log(rtwex) shows that an increase in the value of the dollar relative to other currencies increases the demand for Chinese imports, as is predicted by economic theory. (In fact, the elasticity is not statistically different from 1. Why?)
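Because the approximation "100 times the coefficient" deteriorates for log coefficients that are large in magnitude, the exact percentage change is worth computing, as here. A minimal check:

```python
import numpy as np

b_afdec6 = -0.565
approx = 100 * b_afdec6                      # log approximation: -56.5%
exact = 100 * (np.exp(b_afdec6) - 1)         # exact change: about -43.2%
print(round(approx, 1), round(exact, 1))
```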
Interactions among qualitative and quantitative variables are also used in time series analysis. An example with practical importance follows.

Example 10.6 (Election Outcomes and Economic Performance): Fair (1996) summarizes his work on explaining presidential election outcomes in terms of economic performance. He explains the proportion of the two-party vote going to the Democratic candidate using data for the years 1916 through 1992 (every four years) for a total of 20 observations. We estimate a simplified version of Fair's model (using variable names that are more descriptive than his):

demvote = β0 + β1 partyWH + β2 incum + β3 partyWH·gnews + β4 partyWH·inf + u,

where demvote is the proportion of the two-party vote going to the Democratic candidate. The explanatory variable partyWH is similar to a dummy variable, but it takes on the value 1 if a Democrat is in the White House and -1 if a Republican is in the White House. Fair uses this variable to impose the restriction that the effects of a Republican or a Democrat being in the White House have the same magnitude but the opposite sign. This is a natural restriction because the party shares must sum to one, by definition. It also saves two degrees of freedom, which is important with so few observations. Similarly, the variable incum is defined to be 1 if a Democratic incumbent is running, -1 if a Republican incumbent is running, and zero otherwise. The variable gnews is the number of quarters, during the administration's first 15 quarters, when the quarterly growth in real per capita output was above 2.9% (at an annual rate), and inf is the average annual inflation rate over the first 15 quarters of the administration. [See Fair (1996) for precise definitions.]

Economists are most interested in the interaction terms partyWH·gnews and partyWH·inf. Since partyWH equals 1 when a Democrat is in the White House, β3 measures the effect of good economic news on the party in power; we expect β3 > 0. Similarly, β4 measures the effect that inflation has on the party in power. Because inflation during an administration is considered to be bad news, we expect β4 < 0.

The estimated equation using the data in FAIR is

demvote = .481 - .0435 partyWH + .0544 incum
         (.012) (.0405)         (.0234)
        + .0108 partyWH·gnews - .0077 partyWH·inf   (10.23)
          (.0041)               (.0033)
n = 20, R² = .663, R̄² = .573.

All coefficients, except that on partyWH, are statistically significant at the 5% level. Incumbency is worth about 5.4 percentage points in the share of the vote. (Remember, demvote is measured as a proportion.) Further, the economic news variable has a positive effect: one more quarter of good news is worth about 1.1 percentage points. Inflation, as expected, has a negative effect: if average annual inflation is, say, two percentage points higher, the party in power loses about 1.5 percentage points of the two-party vote.

We could have used this equation to predict the outcome of the 1996 presidential election between Bill Clinton, the Democrat, and Bob Dole, the Republican. (The independent candidate, Ross Perot, is excluded because Fair's equation is for the two-party vote only.) Because Clinton ran as an incumbent, partyWH = 1 and incum = 1. To predict the election outcome, we need the variables gnews and inf. During Clinton's first 15 quarters in office, the annual growth rate of per capita real GDP exceeded 2.9% three times, so gnews = 3. Further, using the GDP price deflator reported in Table B-4 in the 1997 ERP, the average annual inflation rate (computed using Fair's formula) from the fourth quarter in 1991 to the third quarter in 1996 was 3.019. Plugging these into (10.23) gives

demvote = .481 - .0435 + .0544 + .0108(3) - .0077(3.019) ≈ .5011.

Therefore, based on information known before the election in November, Clinton was predicted to receive a very slight majority of the two-party vote: about 50.1%. In fact, Clinton won more handily: his share of the two-party vote was 54.65%.
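The prediction is simple arithmetic on (10.23) and is easy to verify:

```python
# Plug the 1996 values into the estimated equation (10.23)
partyWH, incum, gnews, inf = 1, 1, 3, 3.019

demvote = (0.481 - 0.0435 * partyWH + 0.0544 * incum
           + 0.0108 * partyWH * gnews - 0.0077 * partyWH * inf)
print(round(demvote, 4))    # about 0.5011
```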
10-5 Trends and Seasonality

10-5a Characterizing Trending Time Series

Many economic time series have a common tendency of growing over time. We must recognize that some series contain a time trend in order to draw causal inference using time series data. Ignoring the fact that two sequences are trending in the same or opposite directions can lead us to falsely conclude that changes in one variable are actually caused by changes in another variable. In many cases, two time series processes appear to be correlated only because they are both trending over time for reasons related to other, unobserved factors.

Figure 10.2 contains a plot of labor productivity (output per hour of work) in the United States for the years 1947 through 1987. This series displays a clear upward trend, which reflects the fact that workers have become more productive over time.

[Figure 10.2: Output per labor hour in the United States during the years 1947-1987. The horizontal axis shows the year (1947 to 1987); the vertical axis shows output per hour (roughly 50 to 110).]

Other series, at least over certain time periods, have clear downward trends. Because positive trends are more common, we will focus on those during our discussion.

What kind of statistical models adequately capture trending behavior? One popular formulation is to write the series {y_t} as

y_t = α0 + α1 t + e_t, t = 1, 2, ...,   (10.24)

where, in the simplest case, {e_t} is an independent, identically distributed (i.i.d.) sequence with E(e_t) = 0 and Var(e_t) = σ²_e. Note how the parameter α1 multiplies time, t, resulting in a linear time trend.

Interpreting α1 in (10.24) is simple: holding all other factors (those in e_t) fixed, α1 measures the change in y_t from one period to the next due to the passage of time. We can write this mathematically by defining the change in e_t from period t-1 to t as Δe_t = e_t - e_{t-1}. Equation (10.24) implies that, if Δe_t = 0, then Δy_t = y_t - y_{t-1} = α1.

Another way to think about a sequence that has a linear time trend is that its average value is a linear function of time:

E(y_t) = α0 + α1 t.   (10.25)

If α1 > 0, then, on average, y_t is growing over time and therefore has an upward trend. If α1 < 0, then y_t has a downward trend. The values of y_t do not fall exactly on the line in (10.25) due to randomness, but the expected values are on the line. Unlike the mean, the variance of y_t is constant across time: Var(y_t) = Var(e_t) = σ²_e.

If {e_t} is an i.i.d. sequence, then {y_t} is an independent, though not identically, distributed sequence. A more realistic characterization of trending time series allows {e_t} to be correlated over time, but this does not change the flavor of a linear time trend. In fact, what is important for regression analysis under the classical linear model assumptions is that E(y_t) is linear in t. When we cover large sample properties of OLS in Chapter 11, we will have to discuss how much temporal correlation in {e_t} is allowed.

Many economic time series are better approximated by an exponential trend, which follows when a series has the same average growth rate from period to period. Figure 10.3 plots data on annual nominal imports for the United States during the years 1948 through 1995 (ERP 1997, Table B-101). In the early years, we see that the change in imports over each year is relatively small, whereas the change increases as time passes. This is consistent with a constant average growth rate: the percentage change is roughly the same in each period.

[Figure 10.3: Nominal U.S. imports during the years 1948-1995 (in billions of U.S. dollars). The horizontal axis shows the year (1948 to 1995); the vertical axis shows imports (roughly 7 to 750).]

Exploring Further 10.4: In Example 10.4, we used the general fertility rate as the dependent variable in an FDL model. From 1950 through the mid-1980s, the gfr has a clear downward trend. Can a linear trend with α1 < 0 be realistic for all future time periods? Explain.
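Fitting (10.24) is just an OLS regression of the series on a constant and t. A sketch with an invented trending series:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
t = np.arange(1, 42)                                   # t = 1, ..., 41, e.g. 1947-1987
y = 50 + 1.5 * t + rng.normal(scale=3, size=t.size)    # made-up series with a linear trend

res = sm.OLS(y, sm.add_constant(t)).fit()
print(res.params)                                      # close to (50, 1.5)
```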
In practice, an exponential trend in a time series is captured by modeling the natural logarithm of the series as a linear trend (assuming that y_t > 0):

log(y_t) = β₀ + β₁t + e_t, t = 1, 2, ….   (10.26)

Exponentiating shows that y_t itself has an exponential trend: y_t = exp(β₀ + β₁t + e_t). Because we will want to use exponentially trending time series in linear regression models, (10.26) turns out to be the most convenient way of representing such series.

How do we interpret β₁ in (10.26)? Remember that, for small changes, Δlog(y_t) = log(y_t) − log(y_{t−1}) is approximately the proportionate change in y_t:

Δlog(y_t) ≈ (y_t − y_{t−1})/y_{t−1}.   (10.27)

The right-hand side of (10.27) is also called the growth rate in y from period t − 1 to period t. To turn the growth rate into a percentage, we simply multiply by 100. If y_t follows (10.26), then, taking changes and setting Δe_t = 0,

Δlog(y_t) = β₁, for all t.   (10.28)

In other words, β₁ is approximately the average per period growth rate in y_t. For example, if t denotes year and β₁ = .027, then y_t grows about 2.7% per year, on average.

Although linear and exponential trends are the most common, time trends can be more complicated. For example, instead of the linear trend model in (10.24), we might have a quadratic time trend:

y_t = α₀ + α₁t + α₂t² + e_t.   (10.29)

If α₁ and α₂ are positive, then the slope of the trend is increasing, as is easily seen by computing the approximate slope (holding e_t fixed):

Δy_t/Δt ≈ α₁ + 2α₂t.   (10.30)

If you are familiar with calculus, you recognize the right-hand side of (10.30) as the derivative of α₀ + α₁t + α₂t² with respect to t. If α₁ > 0, but α₂ < 0, the trend has a hump shape. This may not be a very good description of certain trending series, because it requires an increasing trend to be followed, eventually, by a decreasing trend. Nevertheless, over a given time span, it can be a flexible way of modeling time series that have more complicated trends than either (10.24) or (10.26).

10-5b Using Trending Variables in Regression Analysis

Accounting for explained or explanatory variables that are trending is fairly straightforward in regression analysis. First, nothing about trending variables necessarily violates the classical linear model Assumptions TS.1 through TS.6. However, we must be careful to allow for the fact that unobserved, trending factors that affect y_t might also be correlated with the explanatory variables. If we ignore this possibility, we may find a spurious relationship between y_t and one or more explanatory variables. The phenomenon of finding a relationship between two or more trending variables simply because each is growing over time is an example of a spurious regression problem. Fortunately, adding a time trend eliminates this problem.
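This claim is easy to preview with artificial data. The sketch below (Python with numpy assumed; the data are simulated, so the numbers are purely illustrative) regresses one trending series on another, unrelated trending series, with and without a time trend.

import numpy as np

rng = np.random.default_rng(1)
n = 100
t = np.arange(1, n + 1)
y = 0.05 * t + rng.normal(size=n)   # y and x are unrelated, except that
x = 0.03 * t + rng.normal(size=n)   # both drift upward over time

def slope_on_x(X):
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    return b[1]                      # coefficient on x

print(slope_on_x(np.column_stack([np.ones(n), x])))      # sizable "effect" of x: spurious
print(slope_on_x(np.column_stack([np.ones(n), x, t])))   # near zero once t is included

The first slope reflects only the shared trend; once t is included, x has essentially no explanatory power.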
For concreteness, consider a model where two observed factors, x_{t1} and x_{t2}, affect y_t. In addition, there are unobserved factors that are systematically growing or shrinking over time. A model that captures this is

y_t = β₀ + β₁x_{t1} + β₂x_{t2} + β₃t + u_t.   (10.31)

This fits into the multiple linear regression framework with x_{t3} = t. Allowing for the trend in this equation explicitly recognizes that y_t may be growing (β₃ > 0) or shrinking (β₃ < 0) over time for reasons essentially unrelated to x_{t1} and x_{t2}. If (10.31) satisfies Assumptions TS.1, TS.2, and TS.3, then omitting t from the regression and regressing y_t on x_{t1}, x_{t2} will generally yield biased estimators of β₁ and β₂: we have effectively omitted an important variable, t, from the regression. This is especially true if x_{t1} and x_{t2} are themselves trending, because they can then be highly correlated with t. The next example shows how omitting a time trend can result in spurious regression.

EXAMPLE 10.7 Housing Investment and Prices

The data in HSEINV are annual observations on housing investment and a housing price index in the United States for 1947 through 1988. Let invpc denote real per capita housing investment (in thousands of dollars), and let price denote a housing price index (equal to 1 in 1982). A simple regression in constant elasticity form (which can be thought of as a supply equation for housing stock) gives

\widehat{log(invpc)} = −.550 + 1.241 log(price)   (10.32)
                       (.043)  (.382)

n = 42, R² = .208, R̄² = .189.

The elasticity of per capita investment with respect to price is very large and statistically significant; it is not statistically different from one. We must be careful here. Both invpc and price have upward trends. In particular, if we regress log(invpc) on t, we obtain a coefficient on the trend equal to .0081 (standard error = .0018); the regression of log(price) on t yields a trend coefficient equal to .0044 (standard error = .0004). Although the standard errors on the trend coefficients are not necessarily reliable (these regressions tend to contain substantial serial correlation), the coefficient estimates do reveal upward trends.

To account for the trending behavior of the variables, we add a time trend:

\widehat{log(invpc)} = −.913 − .381 log(price) + .0098 t   (10.33)
                       (.136)  (.679)           (.0035)

n = 42, R² = .341, R̄² = .307.

The story is much different now: the estimated price elasticity is negative and not statistically different from zero. The time trend is statistically significant, and its coefficient implies an approximate 1% increase in invpc per year, on average. From this analysis, we cannot conclude that real per capita housing investment is influenced at all by price. There are other factors, captured in the time trend, that affect invpc, but we have not modeled these. The results in (10.32) show a spurious relationship between invpc and price due to the fact that price is also trending upward over time.
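The two regressions in Example 10.7 differ only by one column in the regressor matrix. The sketch below is runnable as written only because it generates placeholder series calibrated to the trend estimates quoted above (.0081 and .0044); in real work you would load the HSEINV data instead, and the placeholder noise levels are my own assumptions.

import numpy as np

# Placeholder data so the sketch runs; replace with the actual HSEINV series.
rng = np.random.default_rng(2)
n = 42                                     # annual observations, 1947-1988
t = np.arange(1, n + 1)
price = np.exp(0.0044 * t + 0.01 * rng.normal(size=n))
invpc = np.exp(0.0081 * t + 0.05 * rng.normal(size=n))

y = np.log(invpc)
X32 = np.column_stack([np.ones(n), np.log(price)])       # as in (10.32)
X33 = np.column_stack([np.ones(n), np.log(price), t])    # as in (10.33)
print(np.linalg.lstsq(X32, y, rcond=None)[0][1])   # large "elasticity": shared trend only
print(np.linalg.lstsq(X33, y, rcond=None)[0][1])   # elasticity collapses once t is added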
In some cases, adding a time trend can make a key explanatory variable more significant. This can happen if the dependent and independent variables have different kinds of trends (say, one upward and one downward), but movement in the independent variable about its trend line causes movement in the dependent variable away from its trend line.

EXAMPLE 10.8 Fertility Equation

If we add a linear time trend to the fertility equation (10.18), we obtain

\widehat{gfr}_t = 111.77 + .279 pe_t − 35.59 ww2_t + .997 pill_t − 1.15 t   (10.34)
                 (3.36)   (.040)      (6.30)        (6.626)      (.19)

n = 72, R² = .662, R̄² = .642.

The coefficient on pe is more than triple the estimate from (10.18), and it is much more statistically significant. Interestingly, pill is not significant once an allowance is made for a linear trend. As can be seen by the estimate, gfr was falling, on average, over this period, other factors being equal.

Since the general fertility rate exhibited both upward and downward trends during the period from 1913 through 1984, we can see how robust the estimated effect of pe is when we use a quadratic trend:

\widehat{gfr}_t = 124.09 + .348 pe_t − 35.88 ww2_t − 10.12 pill_t − 2.53 t + .0196 t²   (10.35)
                 (4.36)   (.040)      (5.71)        (6.34)        (.39)    (.0050)

n = 72, R² = .727, R̄² = .706.

The coefficient on pe is even larger and more statistically significant. Now, pill has the expected negative effect and is marginally significant, and both trend terms are statistically significant. The quadratic trend is a flexible way to account for the unusual trending behavior of gfr.

You might be wondering, in Example 10.8, why stop at a quadratic trend? Nothing prevents us from adding, say, t³ as an independent variable, and, in fact, this might be warranted (see Computer Exercise C6). But we have to be careful not to get carried away when including trend terms in a model. We want relatively simple trends that capture broad movements in the dependent variable that are not explained by the independent variables in the model. If we include enough polynomial terms in t, then we can track any series pretty well. But this offers little help in finding which explanatory variables affect y_t.

10-5c A Detrending Interpretation of Regressions with a Time Trend

Including a time trend in a regression model creates a nice interpretation in terms of detrending the original data series before using them in regression analysis. For concreteness, we focus on model (10.31), but our conclusions are much more general. When we regress y_t on x_{t1}, x_{t2}, and t, we obtain the fitted equation

ŷ_t = β̂₀ + β̂₁x_{t1} + β̂₂x_{t2} + β̂₃t.   (10.36)

We can extend the Frisch-Waugh result on the partialling out interpretation of OLS, which we covered in Section 3-2, to show that β̂₁ and β̂₂ can be obtained as follows:

(i) Regress each of y_t, x_{t1}, and x_{t2} on a constant and the time trend t, and save the residuals, say, ÿ_t, ẍ_{t1}, ẍ_{t2}, t = 1, 2, …, n. For example,

ÿ_t = y_t − α̂₀ − α̂₁t.

Thus, we can think of ÿ_t as being linearly detrended. In detrending y_t, we have estimated the model

y_t = α₀ + α₁t + e_t

by OLS; the residuals from this regression, ê_t = ÿ_t, have the time trend removed (at least in the sample). A similar interpretation holds for ẍ_{t1} and ẍ_{t2}.

(ii) Run the regression of

ÿ_t on ẍ_{t1}, ẍ_{t2}.   (10.37)

(No intercept is necessary, but including an intercept affects nothing: the intercept will be estimated to be zero.) This regression exactly yields β̂₁ and β̂₂ from (10.36).
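Steps (i) and (ii) are easy to verify numerically. The sketch below uses artificial data (Python with numpy assumed; the helper name resid is mine): it detrends y, x₁, and x₂ and confirms that regression (10.37) reproduces the slope coefficients from (10.36).

import numpy as np

rng = np.random.default_rng(3)
n = 120
t = np.arange(1.0, n + 1)
x1 = 0.2 * t + rng.normal(size=n)
x2 = -0.1 * t + rng.normal(size=n)
y = 1 + 0.5 * x1 - 0.8 * x2 + 0.05 * t + rng.normal(size=n)

def resid(v, X):
    # residuals from an OLS regression of v on X
    b, *_ = np.linalg.lstsq(X, v, rcond=None)
    return v - X @ b

T = np.column_stack([np.ones(n), t])
yd, x1d, x2d = resid(y, T), resid(x1, T), resid(x2, T)   # step (i): detrend each series

full, *_ = np.linalg.lstsq(np.column_stack([np.ones(n), x1, x2, t]), y, rcond=None)
part, *_ = np.linalg.lstsq(np.column_stack([x1d, x2d]), yd, rcond=None)  # step (ii)
print(full[1:3])   # beta_1_hat and beta_2_hat from (10.36)
print(part)        # the same two numbers, up to rounding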
This means that the estimates of primary interest, β̂₁ and β̂₂, can be interpreted as coming from a regression without a time trend, but where we first detrend the dependent variable and all other independent variables. The same conclusion holds with any number of independent variables and if the trend is quadratic or of some other polynomial degree.

If t is omitted from (10.36), then no detrending occurs, and y_t might seem to be related to one or more of the x_{tj} simply because each contains a trend; we saw this in Example 10.7. If the trend term is statistically significant, and the results change in important ways when a time trend is added to a regression, then the initial results without a trend should be treated with suspicion.

The interpretation of β̂₁ and β̂₂ shows that it is a good idea to include a trend in the regression if any independent variable is trending, even if y_t is not. If y_t has no noticeable trend, but, say, x_{t1} is growing over time, then excluding a trend from the regression may make it look as if x_{t1} has no effect on y_t, even though movements of x_{t1} about its trend may affect y_t. This will be captured if t is included in the regression.

EXAMPLE 10.9 Puerto Rican Employment

When we add a linear trend to equation (10.17), the estimates are

\widehat{log(prepop_t)} = −8.70 − .169 log(mincov_t) + 1.06 log(usgnp_t) − .032 t   (10.38)
                          (1.30)  (.044)              (.18)              (.005)

n = 38, R² = .847, R̄² = .834.

The coefficient on log(usgnp) has changed dramatically: from −.012 and insignificant to 1.06 and very significant. The coefficient on the minimum wage has changed only slightly, although the standard error is notably smaller, making log(mincov) more significant than before.

The variable prepop_t displays no clear upward or downward trend, but log(usgnp_t) has an upward, linear trend. (A regression of log(usgnp_t) on t gives an estimate of about .03, so that usgnp is growing by about 3% per year over the period.) We can think of the estimate 1.06 as follows: when usgnp increases by 1% above its long-run trend, prepop increases by about 1.06%.

10-5d Computing R-Squared When the Dependent Variable Is Trending

R-squareds in time series regressions are often very high, especially compared with typical R-squareds for cross-sectional data. Does this mean that we learn more about factors affecting y from time series data? Not necessarily. On one hand, time series data often come in aggregate form (such as average hourly wages in the U.S. economy), and aggregates are often easier to explain than outcomes on individuals, families, or firms, which is often the nature of cross-sectional data. But the usual and adjusted R-squareds for time series regressions can be artificially high when the dependent variable is trending. Remember that R² is a measure of how large the error variance is relative to the variance of y. The formula for the adjusted R-squared shows this directly:

R̄² = 1 − (σ̂_u²/σ̂_y²),

where σ̂_u² is the unbiased estimator of the error variance, σ̂_y² = SST/(n − 1), and SST = Σ_{t=1}^{n} (y_t − ȳ)². Now, estimating the error variance when y_t is trending is no problem, provided a time trend is included in the regression.
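As the next paragraph explains, SST/(n − 1) can badly overstate Var(y_t) when y_t trends. A short simulation makes the point concrete (artificial data; Python with numpy assumed): around its trend, the series below has variance 1, yet the sample variance is far larger.

import numpy as np

rng = np.random.default_rng(4)
n = 200
t = np.arange(1, n + 1)
y = 5 + 0.1 * t + rng.normal(size=n)   # Var(y_t) = Var(e_t) = 1 around the trend

print(np.var(y, ddof=1))               # SST/(n-1): roughly 34 here, not 1
X = np.column_stack([np.ones(n), t])
u = y - X @ np.linalg.lstsq(X, y, rcond=None)[0]
print(u @ u / (n - 2))                 # error variance estimate: close to 1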
However, when E(y_t) follows, say, a linear time trend [see (10.24)], SST/(n − 1) is no longer an unbiased or consistent estimator of Var(y_t). In fact, SST/(n − 1) can substantially overestimate the variance in y_t, because it does not account for the trend in y_t.

When the dependent variable satisfies linear, quadratic, or any other polynomial trends, it is easy to compute a goodness-of-fit measure that first nets out the effect of any time trend on y_t. The simplest method is to compute the usual R-squared in a regression where the dependent variable has already been detrended. For example, if the model is (10.31), then we first regress y_t on t and obtain the residuals ÿ_t. Then, we regress

ÿ_t on x_{t1}, x_{t2}, and t.   (10.39)

The R-squared from this regression is

1 − SSR/(Σ_{t=1}^{n} ÿ_t²),   (10.40)

where SSR is identical to the sum of squared residuals from (10.36). Since Σ_{t=1}^{n} ÿ_t² ≤ Σ_{t=1}^{n} (y_t − ȳ)² (and usually the inequality is strict), the R-squared from (10.40) is no greater than, and usually less than, the R-squared from (10.36). (The sum of squared residuals is identical in both regressions.) When y_t contains a strong linear time trend, (10.40) can be much less than the usual R-squared.

The R-squared in (10.40) better reflects how well x_{t1} and x_{t2} explain y_t, because it nets out the effect of the time trend. After all, we can always explain a trending variable with some sort of trend, but this does not mean we have uncovered any factors that cause movements in y_t. An adjusted R-squared can also be computed based on (10.40): divide SSR by (n − 4), because this is the df in (10.36), and divide Σ_{t=1}^{n} ÿ_t² by (n − 2), as there are two trend parameters estimated in detrending y_t. In general, SSR is divided by the df in the usual regression (that includes any time trends), and Σ_{t=1}^{n} ÿ_t² is divided by (n − p), where p is the number of trend parameters estimated in detrending y_t. Wooldridge (1991a) provides detailed suggestions for degrees-of-freedom corrections, but a computationally simple approach is fine as an approximation: use the adjusted R-squared from the regression of ÿ_t on t, t², …, t^p, x_{t1}, …, x_{tk}. This requires us only to remove the trend from y_t to obtain ÿ_t, and then we can use ÿ_t to compute the usual kinds of goodness-of-fit measures.

EXAMPLE 10.10 Housing Investment

In Example 10.7, we saw that including a linear time trend along with log(price) in the housing investment equation had a substantial effect on the price elasticity. But the R-squared from regression (10.33), taken literally, says that we are "explaining" 34.1% of the variation in log(invpc). This is misleading. If we first detrend log(invpc) and regress the detrended variable on log(price) and t, the R-squared becomes .008, and the adjusted R-squared is actually negative. Thus, movements in log(price) about its trend have virtually no explanatory power for movements in log(invpc) about its trend. This is consistent with the fact that the t statistic on log(price) in equation (10.33) is very small.
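The computation in (10.40) takes only a few lines. A sketch with artificial data (Python with numpy assumed; ssr is my helper, not a library function):

import numpy as np

rng = np.random.default_rng(5)
n = 150
t = np.arange(1.0, n + 1)
x1, x2 = rng.normal(size=n), rng.normal(size=n)
y = 1 + 0.3 * x1 + 0.2 * x2 + 0.08 * t + rng.normal(size=n)

def ssr(v, X):
    u = v - X @ np.linalg.lstsq(X, v, rcond=None)[0]
    return u @ u

T = np.column_stack([np.ones(n), t])
yd = y - T @ np.linalg.lstsq(T, y, rcond=None)[0]       # detrended y

X = np.column_stack([np.ones(n), x1, x2, t])            # regression (10.36)
usual = 1 - ssr(y, X) / np.sum((y - y.mean()) ** 2)
detrended = 1 - ssr(y, X) / np.sum(yd ** 2)             # equation (10.40)
print(usual, detrended)    # the detrended R-squared is the smaller of the two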
Before leaving this subsection, we must make a final point. In computing the R-squared form of an F statistic for testing multiple hypotheses, we just use the usual R-squareds, without any detrending. Remember, the R-squared form of the F statistic is just a computational device, and so the usual formula is always appropriate.

10-5e Seasonality

If a time series is observed at monthly or quarterly intervals (or even weekly or daily), it may exhibit seasonality. For example, monthly housing starts in the Midwest are strongly influenced by weather. Although weather patterns are somewhat random, we can be sure that the weather during January will usually be more inclement than in June, and so housing starts are generally higher in June than in January. One way to model this phenomenon is to allow the expected value of the series, y_t, to be different in each month. As another example, retail sales in the fourth quarter are typically higher than in the previous three quarters because of the Christmas holiday. Again, this can be captured by allowing the average retail sales to differ over the course of a year. This is in addition to possibly allowing for a trending mean. For example, retail sales in the most recent first quarter were higher than retail sales in the fourth quarter from 30 years ago, because retail sales have been steadily growing. Nevertheless, if we compare average sales within a typical year, the seasonal holiday factor tends to make sales larger in the fourth quarter.

Even though many monthly and quarterly data series display seasonal patterns, not all of them do. For example, there is no noticeable seasonal pattern in monthly interest or inflation rates. In addition, series that do display seasonal patterns are often seasonally adjusted before they are reported for public use. A seasonally adjusted series is one that, in principle, has had the seasonal factors removed from it. Seasonal adjustment can be done in a variety of ways, and a careful discussion is beyond the scope of this text. [See Harvey (1990) and Hylleberg (1992) for detailed treatments.]

Seasonal adjustment has become so common that it is not possible to get seasonally unadjusted data in many cases. Quarterly U.S. GDP is a leading example. In the annual Economic Report of the President, many macroeconomic data sets reported at monthly frequencies (at least for the most recent years), and those that display seasonal patterns, are all seasonally adjusted. The major sources for macroeconomic time series, including Citibase, also seasonally adjust many of the series. Thus, the scope for using our own seasonal adjustment is often limited.

Sometimes, we do work with seasonally unadjusted data, and it is useful to know that simple methods are available for dealing with seasonality in regression models. Generally, we can include a set of seasonal dummy variables to account for seasonality in the dependent variable, the independent variables, or both.

The approach is simple. Suppose that we have monthly data, and we think that seasonal patterns within a year are roughly constant across time. For example, since Christmas always comes at the same time of year, we can expect retail sales to be, on average, higher in months late in the year than in earlier months.
Or, since weather patterns are broadly similar across years, housing starts in the Midwest will be higher on average during the summer months than the winter months. A general model for monthly data that captures these phenomena is

y_t = β₀ + δ₁feb_t + δ₂mar_t + δ₃apr_t + … + δ₁₁dec_t + β₁x_{t1} + … + β_k x_{tk} + u_t,   (10.41)

where feb_t, mar_t, …, dec_t are dummy variables indicating whether time period t corresponds to the appropriate month. In this formulation, January is the base month, and β₀ is the intercept for January. If there is no seasonality in y_t, once the x_{tj} have been controlled for, then δ₁ through δ₁₁ are all zero. This is easily tested via an F test.

Exploring Further 10.5: In equation (10.41), what is the intercept for March? Explain why seasonal dummy variables satisfy the strict exogeneity assumption.

EXAMPLE 10.11 Effects of Antidumping Filings

In Example 10.5, we used monthly data (in the file BARIUM) that have not been seasonally adjusted. Therefore, we should add seasonal dummy variables to make sure none of the important conclusions change. It could be that the months just before the suit was filed are months where imports are higher or lower, on average, than in other months. When we add the 11 monthly dummy variables as in (10.41) and test their joint significance, we obtain p-value = .59, and so the seasonal dummies are jointly insignificant. In addition, nothing important changes in the estimates once statistical significance is taken into account. Krupp and Pollard (1996) actually used three dummy variables for the seasons (fall, spring, and summer, with winter as the base season), rather than a full set of monthly dummies; the outcome is essentially the same.

If the data are quarterly, then we would include dummy variables for three of the four quarters, with the omitted category being the base quarter. Sometimes, it is useful to interact seasonal dummies with some of the x_{tj} to allow the effect of x_{tj} on y_t to differ across the year.

Just as including a time trend in a regression has the interpretation of initially detrending the data, including seasonal dummies in a regression can be interpreted as deseasonalizing the data. For concreteness, consider equation (10.41) with k = 2. The OLS slope coefficients β̂₁ and β̂₂ on x₁ and x₂ can be obtained as follows:

(i) Regress each of y_t, x_{t1}, and x_{t2} on a constant and the monthly dummies feb_t, mar_t, …, dec_t, and save the residuals, say, ÿ_t, ẍ_{t1}, and ẍ_{t2}, for all t = 1, 2, …, n. For example,

ÿ_t = y_t − α̂₀ − α̂₁feb_t − α̂₂mar_t − … − α̂₁₁dec_t.

This is one method of deseasonalizing a monthly time series. A similar interpretation holds for ẍ_{t1} and ẍ_{t2}.

(ii) Run the regression, without the monthly dummies, of ÿ_t on ẍ_{t1} and ẍ_{t2} [just as in (10.37)]. This gives β̂₁ and β̂₂.

In some cases, if y_t has pronounced seasonality, a better goodness-of-fit measure is an R-squared based on the deseasonalized y_t. This nets out any seasonal effects that are not explained by the x_{tj}. Wooldridge (1991a) suggests specific degrees-of-freedom adjustments, or one may simply use the adjusted R-squared where the dependent variable has been deseasonalized.
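The mechanics of (10.41) and the F test for joint significance of the dummies fit in a few lines. A sketch with artificial monthly data (Python with numpy assumed; with real data, y and x would come from your data set, and the December effect built in below is only for illustration):

import numpy as np

rng = np.random.default_rng(6)
n = 120                               # ten years of monthly data
month = (np.arange(n) % 12) + 1       # 1 = January (base month), ..., 12 = December
x = rng.normal(size=n)
season = np.where(month == 12, 0.8, 0.0)      # a December (holiday) effect
y = 2 + 0.5 * x + season + rng.normal(size=n)

dummies = np.column_stack([(month == m).astype(float) for m in range(2, 13)])
X_r = np.column_stack([np.ones(n), x])            # restricted: no seasonality
X_ur = np.column_stack([X_r, dummies])            # unrestricted: as in (10.41)

def ssr(v, X):
    u = v - X @ np.linalg.lstsq(X, v, rcond=None)[0]
    return u @ u

q, df = 11, n - X_ur.shape[1]
F = ((ssr(y, X_r) - ssr(y, X_ur)) / q) / (ssr(y, X_ur) / df)
print(F)     # compare with the critical value from the F(11, df) distribution

The R-squared form of the F statistic, computed with the usual R-squareds, gives the same answer.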
Time series exhibiting seasonal patterns can be trending as well, in which case we should estimate a regression model with a time trend and seasonal dummy variables. The regressions can then be interpreted as regressions using both detrended and deseasonalized series. Goodness-of-fit statistics are discussed in Wooldridge (1991a); essentially, we detrend and deseasonalize y_t by regressing on both a time trend and seasonal dummies before computing R-squared or adjusted R-squared.

Summary

In this chapter, we have covered basic regression analysis with time series data. Under assumptions that parallel those for cross-sectional analysis, OLS is unbiased (under TS.1 through TS.3), OLS is BLUE (under TS.1 through TS.5), and the usual OLS standard errors, t statistics, and F statistics can be used for statistical inference (under TS.1 through TS.6). Because of the temporal correlation in most time series data, we must explicitly make assumptions about how the errors are related to the explanatory variables in all time periods and about the temporal correlation in the errors themselves. The classical linear model assumptions can be pretty restrictive for time series applications, but they are a natural starting point. We have applied them to both static regression and finite distributed lag models.

Logarithms and dummy variables are used regularly in time series applications and in event studies. We also discussed index numbers and time series measured in terms of nominal and real dollars.

Trends and seasonality can be easily handled in a multiple regression framework by including time and seasonal dummy variables in our regression equations. We presented problems with the usual R-squared as a goodness-of-fit measure and suggested some simple alternatives based on detrending or deseasonalizing.

Classical Linear Model Assumptions for Time Series Regression

Following is a summary of the six classical linear model (CLM) assumptions for time series regression applications. Assumptions TS.1 through TS.5 are the time series versions of the Gauss-Markov assumptions (which implies that OLS is BLUE and has the usual sampling variances). We only needed TS.1, TS.2, and TS.3 to establish unbiasedness of OLS. As in the case of cross-sectional regression, the normality assumption, TS.6, was used so that we could perform exact statistical inference for any sample size.

Assumption TS.1 (Linear in Parameters): The stochastic process {(x_{t1}, x_{t2}, …, x_{tk}, y_t): t = 1, 2, …, n} follows the linear model

y_t = β₀ + β₁x_{t1} + β₂x_{t2} + … + β_k x_{tk} + u_t,

where {u_t: t = 1, 2, …, n} is the sequence of errors or disturbances. Here, n is the number of observations (time periods).

Assumption TS.2 (No Perfect Collinearity): In the sample (and therefore in the underlying time series process), no independent variable is constant nor a perfect linear combination of the others.

Assumption TS.3 (Zero Conditional Mean): For each t, the expected value of the error u_t, given the explanatory variables for all time periods, is zero. Mathematically, E(u_t|X) = 0, t = 1, 2, …, n.

Assumption TS.3 replaces MLR.4 for cross-sectional regression, and it also means we do not have to make the random sampling assumption MLR.2. Remember, Assumption TS.3 implies that the error in each time period t is uncorrelated with all explanatory variables in all time periods (including, of course, time period t).
Assumption TS.4 (Homoskedasticity): Conditional on X, the variance of u_t is the same for all t: Var(u_t|X) = Var(u_t) = σ², t = 1, 2, …, n.

Assumption TS.5 (No Serial Correlation): Conditional on X, the errors in two different time periods are uncorrelated: Corr(u_t, u_s|X) = 0 for all t ≠ s.

Recall that we added the no serial correlation assumption, along with the homoskedasticity assumption, to obtain the same variance formulas that we derived for cross-sectional regression under random sampling. As we will see in Chapter 12, Assumption TS.5 is often violated in ways that can make the usual statistical inference very unreliable.

Assumption TS.6 (Normality): The errors u_t are independent of X and are independently and identically distributed as Normal(0, σ²).

Key Terms

Autocorrelation; Base Period; Base Value; Contemporaneously Exogenous; Cumulative Effect; Deseasonalizing; Detrending; Event Study; Exponential Trend; Finite Distributed Lag (FDL) Model; Growth Rate; Impact Multiplier; Impact Propensity; Index Number; Lag Distribution; Linear Time Trend; Long-Run Elasticity; Long-Run Multiplier; Long-Run Propensity (LRP); Seasonal Dummy Variables; Seasonality; Seasonally Adjusted; Serial Correlation; Short-Run Elasticity; Spurious Regression Problem; Static Model; Stochastic Process; Strictly Exogenous; Time Series Process; Time Trend

Problems

1 Decide if you agree or disagree with each of the following statements, and give a brief explanation of your decision:
(i) Like cross-sectional observations, we can assume that most time series observations are independently distributed.
(ii) The OLS estimator in a time series regression is unbiased under the first three Gauss-Markov assumptions.
(iii) A trending variable cannot be used as the dependent variable in multiple regression analysis.
(iv) Seasonality is not an issue when using annual time series observations.

2 Let gGDP_t denote the annual percentage change in gross domestic product and let int_t denote a short-term interest rate. Suppose that gGDP_t is related to interest rates by

gGDP_t = α₀ + δ₀int_t + δ₁int_{t−1} + u_t,

where u_t is uncorrelated with int_t, int_{t−1}, and all other past values of interest rates. Suppose that the Federal Reserve follows the policy rule

int_t = γ₀ + γ₁(gGDP_{t−1} − 3) + v_t,

where γ₁ > 0. (When last year's GDP growth is above 3%, the Fed increases interest rates to prevent an "overheated" economy.) If v_t is uncorrelated with all past values of int_t and u_t, argue that int_t must be correlated with u_{t−1}. [Hint: Lag the first equation for one time period and substitute for gGDP_{t−1} in the second equation.] Which Gauss-Markov assumption does this violate?

3 Suppose y_t follows a second order FDL model:

y_t = α₀ + δ₀z_t + δ₁z_{t−1} + δ₂z_{t−2} + u_t.

Let z* denote the equilibrium value of z_t, and let y* be the equilibrium value of y_t, such that

y* = α₀ + δ₀z* + δ₁z* + δ₂z*.

Show that the change in y* due to a change in z* equals the long-run propensity
times the change in z*:

Δy* = LRP·Δz*.

This gives an alternative way of interpreting the LRP.

4 When the three event indicators befile6, affile6, and afdec6 are dropped from equation (10.22), we obtain R² = .281 and R̄² = .264. Are the event indicators jointly significant at the 10% level?

5 Suppose you have quarterly data on new housing starts, interest rates, and real per capita income. Specify a model for housing starts that accounts for possible trends and seasonality in the variables.

6 In Example 10.4, we saw that our estimates of the individual lag coefficients in a distributed lag model were very imprecise. One way to alleviate the multicollinearity problem is to assume that the δ_j follow a relatively simple pattern. For concreteness, consider a model with four lags:

y_t = α₀ + δ₀z_t + δ₁z_{t−1} + δ₂z_{t−2} + δ₃z_{t−3} + δ₄z_{t−4} + u_t.

Now, let us assume that the δ_j follow a quadratic in the lag, j:

δ_j = γ₀ + γ₁j + γ₂j²,

for parameters γ₀, γ₁, and γ₂. This is an example of a polynomial distributed lag (PDL) model.
(i) Plug the formula for each δ_j into the distributed lag model and write the model in terms of the parameters γ_h, for h = 0, 1, 2.
(ii) Explain the regression you would run to estimate the γ_h.
(iii) The polynomial distributed lag model is a restricted version of the general model. How many restrictions are imposed? How would you test these? [Hint: Think F test.]

7 In Example 10.4, we wrote the model that explicitly contains the long-run propensity, θ₀, as

gfr_t = α₀ + θ₀pe_t + δ₁(pe_{t−1} − pe_t) + δ₂(pe_{t−2} − pe_t) + u_t,

where we omit the other explanatory variables for simplicity. As always with multiple regression analysis, θ₀ should have a ceteris paribus interpretation. Namely, if pe_t increases by one (dollar), holding (pe_{t−1} − pe_t) and (pe_{t−2} − pe_t) fixed, gfr_t should change by θ₀.
(i) If (pe_{t−1} − pe_t) and (pe_{t−2} − pe_t) are held fixed but pe_t is increasing, what must be true about changes in pe_{t−1} and pe_{t−2}?
(ii) How does your answer in part (i) help you to interpret θ₀ in the above equation as the LRP?

8 In the linear model given in equation (10.8), the explanatory variables x_t = (x_{t1}, …, x_{tk}) are said to be sequentially exogenous (sometimes called "weakly exogenous") if

E(u_t|x_t, x_{t−1}, …, x_1) = 0, t = 1, 2, …,

so that the errors are unpredictable given current and all past values of the explanatory variables.
(i) Explain why sequential exogeneity is implied by strict exogeneity.
(ii) Explain why contemporaneous exogeneity is implied by sequential exogeneity.
(iii) Are the OLS estimators generally unbiased under the sequential exogeneity assumption? Explain.
(iv) Consider a model to explain the annual rate of HIV infections (HIVrate) as a distributed lag of per capita condom usage (pccon) for a state, region, or province:

E(HIVrate_t|pccon_t, pccon_{t−1}, …) = α₀ + δ₀pccon_t + δ₁pccon_{t−1} + δ₂pccon_{t−2} + δ₃pccon_{t−3}.

Explain why this model satisfies the sequential exogeneity assumption. Does it seem likely that strict exogeneity holds too?

Computer Exercises

C1 In October 1979, the Federal Reserve changed its policy of using finely tuned interest rate adjustments and instead began targeting the money supply.
Using the data in INTDEF, define a dummy variable equal to 1 for years after 1979. Include this dummy in equation (10.15) to see if there is a shift in the interest rate equation after 1979. What do you conclude?

C2 Use the data in BARIUM for this exercise.
(i) Add a linear time trend to equation (10.22). Are any variables, other than the trend, statistically significant?
(ii) In the equation estimated in part (i), test for joint significance of all variables except the time trend. What do you conclude?
(iii) Add monthly dummy variables to this equation and test for seasonality. Does including the monthly dummies change any other estimates or their standard errors in important ways?

C3 Add the variable log(prgnp) to the minimum wage equation in (10.38). Is this variable significant? Interpret the coefficient. How does adding log(prgnp) affect the estimated minimum wage effect?

C4 Use the data in FERTIL3 to verify that the standard error for the LRP in equation (10.19) is about .030.

C5 Use the data in EZANDERS for this exercise. The data are on monthly unemployment claims in Anderson Township in Indiana, from January 1980 through November 1988. In 1984, an enterprise zone (EZ) was located in Anderson (as well as in other cities in Indiana). [See Papke (1994) for details.]
(i) Regress log(uclms) on a linear time trend and 11 monthly dummy variables. What was the overall trend in unemployment claims over this period? (Interpret the coefficient on the time trend.) Is there evidence of seasonality in unemployment claims?
(ii) Add ez, a dummy variable equal to one in the months Anderson had an EZ, to the regression in part (i). Does having the enterprise zone seem to decrease unemployment claims? By how much? [You should use formula (7.10) from Chapter 7.]
(iii) What assumptions do you need to make to attribute the effect in part (ii) to the creation of an EZ?

C6 Use the data in FERTIL3 for this exercise.
(i) Regress gfr_t on t and t², and save the residuals. This gives a detrended gfr_t, say, g̈fr_t.
(ii) Regress g̈fr_t on all of the variables in equation (10.35), including t and t². Compare the R-squared with that from (10.35). What do you conclude?
(iii) Reestimate equation (10.35), but add t³ to the equation. Is this additional term statistically significant?

C7 Use the data set CONSUMP for this exercise.
(i) Estimate a simple regression model relating the growth in real per capita consumption (of nondurables and services) to the growth in real per capita disposable income. Use the change in the logarithms in both cases. Report the results in the usual form. Interpret the equation and discuss statistical significance.
(ii) Add a lag of the growth in real per capita disposable income to the equation from part (i). What do you conclude about adjustment lags in consumption growth?
(iii) Add the real interest rate to the equation in part (i). Does it affect consumption growth?

C8 Use the data in FERTIL3 for this exercise.
(i) Add pe_{t−3} and pe_{t−4} to equation (10.19). Test for joint significance of these lags.
(ii) Find the estimated long-run propensity and its standard error in the model from part (i). Compare these with those obtained from equation (10.19).
(iii) Estimate the polynomial distributed lag model from Problem 6. Find the estimated LRP and compare this with what is obtained from the unrestricted model.

C9 Use the data in VOLAT for this exercise. The variable rsp500 is the monthly return on the Standard & Poor's 500 stock market index, at an annual rate. (This includes price changes as well as dividends.) The variable i3 is the return on three-month T-bills, and pcip is the percentage change in industrial production; these are also at an annual rate.
(i) Consider the equation

rsp500_t = β₀ + β₁pcip_t + β₂i3_t + u_t.

What signs do you think β₁ and β₂ should have?
(ii) Estimate the previous equation by OLS, reporting the results in standard form. Interpret the signs and magnitudes of the coefficients.
(iii) Which of the variables is statistically significant?
(iv) Does your finding from part (iii) imply that the return on the S&P 500 is predictable? Explain.

C10 Consider the model estimated in (10.15); use the data in INTDEF.
(i) Find the correlation between inf and def over this sample period and comment.
(ii) Add a single lag of inf and def to the equation and report the results in the usual form.
(iii) Compare the estimated LRP for the effect of inflation with that in equation (10.15). Are they vastly different?
(iv) Are the two lags in the model jointly significant at the 5% level?

C11 The file TRAFFIC2 contains 108 monthly observations on automobile accidents, traffic laws, and some other variables for California from January 1981 through December 1989. Use this data set to answer the following questions.
(i) During what month and year did California's seat belt law take effect? When did the highway speed limit increase to 65 miles per hour?
(ii) Regress the variable log(totacc) on a linear time trend and 11 monthly dummy variables, using January as the base month. Interpret the coefficient estimate on the time trend. Would you say there is seasonality in total accidents?
(iii) Add to the regression from part (ii) the variables wkends, unem, spdlaw, and beltlaw. Discuss the coefficient on the unemployment variable. Does its sign and magnitude make sense to you?
(iv) In the regression from part (iii), interpret the coefficients on spdlaw and beltlaw. Are the estimated effects what you expected? Explain.
(v) The variable prcfat is the percentage of accidents resulting in at least one fatality. (Note that this variable is a percentage, not a proportion.) What is the average of prcfat over this period? Does the magnitude seem about right?
(vi) Run the regression in part (iii), but use prcfat as the dependent variable in place of log(totacc). Discuss the estimated effects and significance of the speed and seat belt law variables.

C12
(i) Estimate equation (10.2) using all the data in PHILLIPS and report the results in the usual form. How many observations do you have now?
(ii) Compare the estimates from part (i) with those in equation (10.14). In particular, does adding the extra years help in obtaining an estimated tradeoff between inflation and unemployment? Explain.
(iii) Now run the regression using only the years 1997 through 2003. How do these estimates differ from
those in equation (10.14)? Are the estimates using the most recent seven years precise enough to draw any firm conclusions? Explain.
(iv) Consider a simple regression setup in which we start with n time series observations and then split them into an early time period and a later time period. In the first time period we have n₁ observations, and in the second period, n₂ observations. Draw on the previous parts of this exercise to evaluate the following statement: "Generally, we can expect the slope estimate using all n observations to be roughly equal to a weighted average of the slope estimates on the early and later subsamples, where the weights are n₁/n and n₂/n, respectively."

C13 Use the data in MINWAGE for this exercise. In particular, use the employment and wage series for sector 232 (Men's and Boys' Furnishings). The variable gwage232 is the monthly growth (change in logs) in the average wage in sector 232; gemp232 is the growth in employment in sector 232; gmwage is the growth in the federal minimum wage; and gcpi is the growth in the (urban) Consumer Price Index.
(i) Run the regression gwage232 on gmwage, gcpi. Do the sign and magnitude of β̂_gmwage make sense to you? Explain. Is gmwage statistically significant?
(ii) Add lags 1 through 12 of gmwage to the equation in part (i). Do you think it is necessary to include these lags to estimate the long-run effect of minimum wage growth on wage growth in sector 232? Explain.
(iii) Run the regression gemp232 on gmwage, gcpi. Does minimum wage growth appear to have a contemporaneous effect on gemp232?
(iv) Add lags 1 through 12 to the employment growth equation. Does growth in the minimum wage have a statistically significant effect on employment growth, either in the short run or long run? Explain.

C14 Use the data in APPROVAL to answer the following questions. The data set consists of 78 months of data during the presidency of George W. Bush. (The data end in July 2007, before Bush left office.) In addition to economic variables and binary indicators of various events, it includes an approval rate, approve, collected by Gallup. [Caution: One should also attempt Computer Exercise C14 in Chapter 11 to gain a more complete understanding of the econometric issues involved in analyzing these data.]
(i) What is the range of the variable approve? What is its average value?
(ii) Estimate the model

approve_t = β₀ + β₁lcpifood_t + β₂lrgasprice_t + β₃unemploy_t + u_t,

where the first two variables are in logarithmic form, and report the estimates in the usual way.
(iii) Interpret the coefficients in the estimates from part (ii). Comment on the signs and sizes of the effects, as well as statistical significance.
(iv) Add the binary variables sep11 and iraqinvade to the equation from part (ii). Interpret the coefficients on the dummy variables. Are they statistically significant?
(v) Does adding the dummy variables in part (iv) change the other estimates much? Are any of the coefficients in part (iv) hard to rationalize?
(vi) Add lsp500 to the regression in part (iv). Controlling for other factors, does the stock market have an important effect on the presidential approval rating?

Chapter 11 Further Issues in Using OLS with Time Series Data

In Chapter 10, we discussed the finite
sample properties of OLS for time series data under increasingly stronger sets of assumptions. Under the full set of classical linear model assumptions for time series, TS.1 through TS.6, OLS has exactly the same desirable properties that we derived for cross-sectional data. Likewise, statistical inference is carried out in the same way as it was for cross-sectional analysis.

From our cross-sectional analysis in Chapter 5, we know that there are good reasons for studying the large sample properties of OLS. For example, if the error terms are not drawn from a normal distribution, then we must rely on the central limit theorem (CLT) to justify the usual OLS test statistics and confidence intervals.

Large sample analysis is even more important in time series contexts. (This is somewhat ironic given that large time series samples can be difficult to come by, but we often have no choice other than to rely on large sample approximations.) In Section 10-3, we explained how the strict exogeneity assumption (TS.3) might be violated in static and distributed lag models. As we will show in Section 11-2, models with lagged dependent variables must violate Assumption TS.3.

Unfortunately, large sample analysis for time series problems is fraught with many more difficulties than it was for cross-sectional analysis. In Chapter 5, we obtained the large sample properties of OLS in the context of random sampling. Things are more complicated when we allow the observations to be correlated across time. Nevertheless, the major limit theorems hold for certain, although not all, time series processes. The key is whether the correlation between the variables at different time periods tends to zero quickly enough. Time series with substantial temporal correlation require special attention in regression analysis, and this chapter will alert you to the main issues pertaining to such series.

11-1 Stationary and Weakly Dependent Time Series

In this section, we present the key concepts that are needed to apply the usual large sample approximations in regression analysis with time series data. The details are not as important as a general understanding of the issues.

11-1a Stationary and Nonstationary Time Series

Historically, the notion of a stationary process has played an important role in the analysis of time series. A stationary time series process is one whose probability distributions are stable over time in the following sense: if we take any collection of random variables in the sequence and then shift that sequence ahead h time periods, the joint probability distribution must remain unchanged. A formal definition of stationarity follows.

Stationary Stochastic Process: The stochastic process {x_t: t = 1, 2, …} is stationary if, for every collection of time indices 1 ≤ t₁ < t₂ < … < t_m, the joint distribution of (x_{t₁}, x_{t₂}, …, x_{t_m}) is the same as the joint distribution of (x_{t₁+h}, x_{t₂+h}, …, x_{t_m+h}) for all integers h ≥ 1.

This definition is a little abstract, but its meaning is pretty
straightforward. One implication (by choosing m = 1 and t₁ = 1) is that x_t has the same distribution as x₁ for all t = 2, 3, …. In other words, the sequence {x_t: t = 1, 2, …} is identically distributed. Stationarity requires even more. For example, the joint distribution of (x₁, x₂) (the first two terms in the sequence) must be the same as the joint distribution of (x_t, x_{t+1}) for any t ≥ 1. Again, this places no restrictions on how x_t and x_{t+1} are related to one another; indeed, they may be highly correlated. Stationarity does require that the nature of any correlation between adjacent terms is the same across all time periods.

A stochastic process that is not stationary is said to be a nonstationary process. Since stationarity is an aspect of the underlying stochastic process, and not of the available single realization, it can be very difficult to determine whether the data we have collected were generated by a stationary process. However, it is easy to spot certain sequences that are not stationary. A process with a time trend of the type covered in Section 10-5 is clearly nonstationary: at a minimum, its mean changes over time.

Sometimes, a weaker form of stationarity suffices. If {x_t: t = 1, 2, …} has a finite second moment, that is, E(x_t²) < ∞ for all t, then the following definition applies.

Covariance Stationary Process: A stochastic process {x_t: t = 1, 2, …} with a finite second moment [E(x_t²) < ∞] is covariance stationary if (i) E(x_t) is constant; (ii) Var(x_t) is constant; and (iii) for any t, h ≥ 1, Cov(x_t, x_{t+h}) depends only on h and not on t.

Covariance stationarity focuses only on the first two moments of a stochastic process: the mean and variance of the process are constant across time, and the covariance between x_t and x_{t+h} depends only on the distance between the two terms, h, and not on the location of the initial time period, t. It follows immediately that the correlation between x_t and x_{t+h} also depends only on h.

If a stationary process has a finite second moment, then it must be covariance stationary, but the converse is certainly not true. Sometimes, to emphasize that stationarity is a stronger requirement than covariance stationarity, the former is referred to as strict stationarity. Because strict stationarity simplifies the statements of some of our subsequent assumptions, "stationarity" for us will always mean the strict form.

Exploring Further 11.1: Suppose that {y_t: t = 1, 2, …} is generated by y_t = δ₀ + δ₁t + e_t, where δ₁ ≠ 0 and {e_t: t = 1, 2, …} is an iid sequence with mean zero and variance σ_e². (i) Is {y_t} covariance stationary? (ii) Is y_t − E(y_t) covariance stationary?

How is stationarity used in time series econometrics? On a technical level, stationarity simplifies statements of the law of large numbers (LLN) and the CLT, although we will not worry about formal statements in this chapter. On a practical level, if we want to understand the relationship between two or more variables using regression analysis, we need to assume some sort of stability over time. If we allow the relationship between two variables, say y_t and x_t, to change arbitrarily in each time period, then we cannot hope to learn much about how a change in one variable affects the other variable if we only have access to a single time series realization. In stating a multiple regression model for time series data, we are assuming a certain form of stationarity in that the β_j do not change over time. Further, Assumptions TS.4 and TS.5 imply that the variance of the error process is constant over time and that the correlation between errors in two adjacent periods is equal to zero, which is clearly constant over time.
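The process in Exploring Further 11.1 can be previewed by simulation. The sketch below (artificial data; Python with numpy assumed) compares sample means over early and late subperiods: the level series has a time-varying mean, while its deviation from E(y_t) behaves like the iid errors.

import numpy as np

rng = np.random.default_rng(7)
n = 1000
t = np.arange(1, n + 1)
e = rng.normal(size=n)
y = 2 + 0.05 * t + e                    # E(y_t) = 2 + .05t changes with t

print(y[:100].mean(), y[-100:].mean())  # early vs. late means differ: not stationary
z = y - (2 + 0.05 * t)                  # y_t - E(y_t) = e_t
print(z[:100].mean(), z[-100:].mean())  # both near zero: constant mean over time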
11-1b Weakly Dependent Time Series

Stationarity has to do with the joint distributions of a process as it moves through time. A very different concept is that of weak dependence, which places restrictions on how strongly related the random variables x_t and x_{t+h} can be as the time distance between them, h, gets large. The notion of weak dependence is most easily discussed for a stationary time series: loosely speaking, a stationary time series process {x_t: t = 1, 2, …} is said to be weakly dependent if x_t and x_{t+h} are "almost independent" as h increases without bound. A similar statement holds true if the sequence is nonstationary, but then we must assume that the concept of being almost independent does not depend on the starting point, t.

The description of weak dependence given in the previous paragraph is necessarily vague. We cannot formally define weak dependence because there is no definition that covers all cases of interest. There are many specific forms of weak dependence that are formally defined, but these are well beyond the scope of this text. [See White (1984), Hamilton (1994), and Wooldridge (1994b) for advanced treatments of these concepts.] For our purposes, an intuitive notion of the meaning of weak dependence is sufficient. Covariance stationary sequences can be characterized in terms of correlations: a covariance stationary time series is weakly dependent if the correlation between x_t and x_{t+h} goes to zero "sufficiently quickly" as h → ∞. (Because of covariance stationarity, the correlation does not depend on the starting point, t.) In other words, as the variables get farther apart in time, the correlation between them becomes smaller and smaller. Covariance stationary sequences where Corr(x_t, x_{t+h}) → 0 as h → ∞ are said to be asymptotically uncorrelated. Intuitively, this is how we will usually characterize weak dependence. Technically, we need to assume that the correlation converges to zero fast enough, but we will gloss over this.

Why is weak dependence important for regression analysis? Essentially, it replaces the assumption of random sampling in implying that the LLN and the CLT hold. The most well-known CLT for time series data requires stationarity and some form of weak dependence: thus, stationary, weakly dependent time series are ideal for use in multiple regression analysis. In Section 11-2, we will argue that OLS can be justified quite generally by appealing to the LLN and the CLT. Time series that are not weakly dependent (examples of which we will see in Section 11-3) do not generally satisfy the CLT, which is why their use in multiple regression analysis can be tricky.

The simplest example of a weakly dependent time series is an independent, identically distributed sequence: a sequence that is independent is trivially weakly dependent. A more interesting example of a weakly dependent sequence is

x_t = e_t + α₁e_{t−1}, t = 1, 2, …,   (11.1)

where {e_t: t = 0, 1, …} is an iid sequence with zero mean and variance σ_e². The process {x_t} is called a moving average process of order one [MA(1)]: x_t is a weighted average of e_t and e_{t−1}; in the next period, we drop e_{t−1}, and then x_{t+1} depends on e_{t+1} and e_t. Setting the coefficient of e_t to 1 in (11.1) is done without loss of generality. [In equation (11.1), we use x_t and e_t as generic labels for time series processes. They need have nothing to do with the explanatory variables or errors in a time series regression model, although both the explanatory variables and errors could be MA(1) processes.]
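The MA(1) correlations derived in the next paragraph are easy to preview numerically. A sketch with simulated data (Python with numpy assumed; the helper name corr is mine):

import numpy as np

rng = np.random.default_rng(8)
n = 100_000
a1 = 0.5
e = rng.normal(size=n + 1)              # iid e_t with zero mean
x = e[1:] + a1 * e[:-1]                 # MA(1) process, equation (11.1)

def corr(h):
    return np.corrcoef(x[:-h], x[h:])[0, 1]

print(corr(1))   # close to a1/(1 + a1**2) = .4
print(corr(2))   # close to 0: terms two or more periods apart are independent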
Why is an MA(1) process weakly dependent? Adjacent terms in the sequence are correlated: because x_{t+1} = e_{t+1} + α₁e_t,

Cov(x_t, x_{t+1}) = α₁Var(e_t) = α₁σ_e².

Because Var(x_t) = (1 + α₁²)σ_e²,

Corr(x_t, x_{t+1}) = α₁/(1 + α₁²).

For example, if α₁ = .5, then Corr(x_t, x_{t+1}) = .4. (The maximum positive correlation occurs when α₁ = 1, in which case Corr(x_t, x_{t+1}) = .5.) However, once we look at variables in the sequence that are two or more time periods apart, these variables are uncorrelated because they are independent. For example, x_{t+2} = e_{t+2} + α₁e_{t+1} is independent of x_t because {e_t} is independent across t. Due to the identical distribution assumption on the e_t, {x_t} in (11.1) is actually stationary. Thus, an MA(1) is a stationary, weakly dependent sequence, and the LLN and the CLT can be applied to {x_t}.

A more popular example is the process

y_t = ρ₁y_{t−1} + e_t, t = 1, 2, ….   (11.2)

The starting point in the sequence is y₀ (at t = 0), and {e_t: t = 1, 2, …} is an iid sequence with zero mean and variance σ_e². We also assume that the e_t are independent of y₀ and that E(y₀) = 0. This is called an autoregressive process of order one [AR(1)]. The crucial assumption for weak dependence of an AR(1) process is the stability condition |ρ₁| < 1. Then, we say that {y_t} is a stable AR(1) process.

To see that a stable AR(1) process is asymptotically uncorrelated, it is useful to assume that the process is covariance stationary. (In fact, it can generally be shown that {y_t} is strictly stationary, but the proof is somewhat technical.) Then, we know that E(y_t) = E(y_{t−1}), and from (11.2) with ρ₁ ≠ 1, this can happen only if E(y_t) = 0. Taking the variance of (11.2) and using the fact that e_t and y_{t−1} are independent (and therefore uncorrelated), Var(y_t) = ρ₁²Var(y_{t−1}) + Var(e_t), and so, under covariance stationarity, we must have σ_y² = ρ₁²σ_y² + σ_e². Since ρ₁² < 1 by the stability condition, we can easily solve for σ_y²:

σ_y² = σ_e²/(1 − ρ₁²).   (11.3)

Now, we can find the covariance between y_t and y_{t+h} for h ≥ 1. Using repeated substitution,

y_{t+h} = ρ₁y_{t+h−1} + e_{t+h} = ρ₁(ρ₁y_{t+h−2} + e_{t+h−1}) + e_{t+h}
        = ρ₁²y_{t+h−2} + ρ₁e_{t+h−1} + e_{t+h} = …
        = ρ₁^h y_t + ρ₁^{h−1}e_{t+1} + … + ρ₁e_{t+h−1} + e_{t+h}.

Because E(y_t) = 0 for all t, we can multiply this last equation by y_t and take expectations to obtain Cov(y_t, y_{t+h}). Using the fact that e_{t+j} is uncorrelated with y_t for all j ≥ 1 gives

Cov(y_t, y_{t+h}) = E(y_t y_{t+h}) = ρ₁^h E(y_t²) + ρ₁^{h−1}E(y_t e_{t+1}) + … + E(y_t e_{t+h}) = ρ₁^h E(y_t²) = ρ₁^h σ_y².

Because σ_y is the standard deviation of both y_t and y_{t+h}, we can easily find the correlation between y_t and y_{t+h} for any h ≥ 1:

Corr(y_t, y_{t+h}) = Cov(y_t, y_{t+h})/(σ_y·σ_y) = ρ₁^h.   (11.4)

In particular, Corr(y_t, y_{t+1}) = ρ₁, so ρ₁ is the correlation coefficient between any two adjacent terms in the sequence.
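The geometric decay in (11.4) is easy to tabulate and to confirm by simulation; the theoretical values printed below reappear in the discussion that follows. (A sketch with artificial data; Python with numpy assumed.)

import numpy as np

rng = np.random.default_rng(9)
n, rho = 200_000, 0.9
e = rng.normal(size=n)
y = np.empty(n)
y[0] = e[0] / np.sqrt(1 - rho**2)       # start from the stationary distribution, see (11.3)
for t in range(1, n):
    y[t] = rho * y[t - 1] + e[t]        # stable AR(1), equation (11.2)

for h in (1, 5, 10, 20):
    print(h, rho**h, np.corrcoef(y[:-h], y[h:])[0, 1])   # rho**h matches the sample value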
Equation (11.4) is important because it shows that, although $y_t$ and $y_{t+h}$ are correlated for any $h \geq 1$, this correlation gets very small for large $h$: because $|\rho_1| < 1$, $\rho_1^h \to 0$ as $h \to \infty$. Even when $\rho_1$ is large (say, .9, which implies a very high, positive correlation between adjacent terms), the correlation between $y_t$ and $y_{t+h}$ tends to zero fairly rapidly. For example, $\mathrm{Corr}(y_t, y_{t+5}) = .591$, $\mathrm{Corr}(y_t, y_{t+10}) = .349$, and $\mathrm{Corr}(y_t, y_{t+20}) = .122$. If $t$ indexes year, this means that the correlation between the outcome of two $y$ that are 20 years apart is about .122. When $\rho_1$ is smaller, the correlation dies out much more quickly. (You might try $\rho_1 = .5$ to verify this.)

This analysis heuristically demonstrates that a stable AR(1) process is weakly dependent. The AR(1) model is especially important in multiple regression analysis with time series data. We will cover additional applications in Chapter 12 and the use of it for forecasting in Chapter 18.

There are many other types of weakly dependent time series, including hybrids of autoregressive and moving average processes. But the previous examples work well for our purposes.

Before ending this section, we must emphasize one point that often causes confusion in time series econometrics. A trending series, though certainly nonstationary, can be weakly dependent. In fact, in the simple linear time trend model in Chapter 10 [see equation (10.24)], the series $\{y_t\}$ was actually independent. A series that is stationary about its time trend, as well as weakly dependent, is often called a trend-stationary process. (Notice that the name is not completely descriptive because we assume weak dependence along with stationarity.) Such processes can be used in regression analysis just as in Chapter 10, provided appropriate time trends are included in the model.

11.2 Asymptotic Properties of OLS

In Chapter 10, we saw some cases in which the classical linear model assumptions are not satisfied for certain time series problems. In such cases, we must appeal to large sample properties of OLS, just as with cross-sectional analysis. In this section, we state the assumptions and main results that justify OLS more generally. The proofs of the theorems in this chapter are somewhat difficult and therefore omitted. See Wooldridge (1994b).

Assumption TS.1′ (Linearity and Weak Dependence)

We assume the model is exactly as in Assumption TS.1, but now we add the assumption that $\{(x_t, y_t): t = 1, 2, \ldots\}$ is stationary and weakly dependent. In particular, the LLN and the CLT can be applied to sample averages.

The linear in parameters requirement again means that we can write the model as

$y_t = \beta_0 + \beta_1 x_{t1} + \cdots + \beta_k x_{tk} + u_t,$  [11.5]

where the $\beta_j$ are the parameters to be estimated. Unlike in Chapter 10, the $x_{tj}$ can include lags of the dependent variable. As usual, lags of explanatory variables are also allowed.

We have included stationarity in Assumption TS.1′ for convenience in stating and interpreting assumptions. If we were carefully working through the asymptotic properties of OLS, as we do in Appendix E, stationarity would also simplify those derivations. But stationarity is not at all critical for OLS to have its standard asymptotic properties.
As mentioned in Section 11.1, by assuming the $\beta_j$ are constant across time, we are already assuming some form of stability in the distributions over time. The important extra restriction in Assumption TS.1′ as compared with Assumption TS.1 is the weak dependence assumption. In Section 11.1, we spent some effort discussing weak dependence for a time series process because it is by no means an innocuous assumption. Technically, Assumption TS.1′ requires weak dependence on multiple time series ($y_t$ and the elements of $x_t$), and this entails putting restrictions on the joint distribution across time. The details are not particularly important and are, anyway, beyond the scope of this text; see Wooldridge (1994). It is more important to understand the kinds of persistent time series processes that violate the weak dependence requirement, and we will turn to that in the next section. There, we also discuss the use of persistent processes in multiple regression models.

Naturally, we still rule out perfect collinearity.

Assumption TS.2′ (No Perfect Collinearity)

Same as Assumption TS.2.

Assumption TS.3′ (Zero Conditional Mean)

The explanatory variables $x_t = (x_{t1}, x_{t2}, \ldots, x_{tk})$ are contemporaneously exogenous, as in equation (10.10): $\mathrm{E}(u_t|x_t) = 0$.

This is the most natural assumption concerning the relationship between $u_t$ and the explanatory variables. It is much weaker than Assumption TS.3 because it puts no restrictions on how $u_t$ is related to the explanatory variables in other time periods. We will see examples that satisfy TS.3′ shortly. By stationarity, if contemporaneous exogeneity holds for one time period, it holds for them all. Relaxing stationarity would simply require us to assume the condition holds for all $t = 1, 2, \ldots$.

For certain purposes, it is useful to know that the following consistency result only requires $u_t$ to have zero unconditional mean and to be uncorrelated with each $x_{tj}$:

$\mathrm{E}(u_t) = 0, \quad \mathrm{Cov}(x_{tj}, u_t) = 0, \quad j = 1, \ldots, k.$  [11.6]

We will work mostly with the zero conditional mean assumption because it leads to the most straightforward asymptotic analysis.

Theorem 11.1 (Consistency of OLS)

Under TS.1′, TS.2′, and TS.3′, the OLS estimators are consistent: $\mathrm{plim}\ \hat{\beta}_j = \beta_j$, $j = 0, 1, \ldots, k$.

There are some key practical differences between Theorems 10.1 and 11.1. First, in Theorem 11.1, we conclude that the OLS estimators are consistent, but not necessarily unbiased. Second, in Theorem 11.1, we have weakened the sense in which the explanatory variables must be exogenous, but weak dependence is required in the underlying time series. Weak dependence is also crucial in obtaining approximate distributional results, which we cover later.

Example 11.1 (Static Model)

Consider a static model with two explanatory variables:

$y_t = \beta_0 + \beta_1 z_{t1} + \beta_2 z_{t2} + u_t.$  [11.7]

Under weak dependence, the condition sufficient for consistency of OLS is

$\mathrm{E}(u_t|z_{t1}, z_{t2}) = 0.$  [11.8]

This rules out omitted variables that are in $u_t$ and are correlated with either $z_{t1}$ or $z_{t2}$. Also, no function of $z_{t1}$ or $z_{t2}$ can be correlated with $u_t$, and so Assumption TS.3′ rules out misspecified functional form, just as in the cross-sectional case.
Other problems, such as measurement error in the variables $z_{t1}$ or $z_{t2}$, can cause (11.8) to fail.

Importantly, Assumption TS.3′ does not rule out correlation between, say, $u_{t-1}$ and $z_{t1}$. This type of correlation could arise if $z_{t1}$ is related to past $y_{t-1}$, such as

$z_{t1} = \delta_0 + \delta_1 y_{t-1} + v_t.$  [11.9]

For example, $z_{t1}$ might be a policy variable, such as monthly percentage change in the money supply, and this change might depend on last month's rate of inflation ($y_{t-1}$). Such a mechanism generally causes $z_{t1}$ and $u_{t-1}$ to be correlated (as can be seen by plugging in for $y_{t-1}$). This kind of feedback is allowed under Assumption TS.3′.

Example 11.2 (Finite Distributed Lag Model)

In the finite distributed lag model,

$y_t = \alpha_0 + \delta_0 z_t + \delta_1 z_{t-1} + \delta_2 z_{t-2} + u_t,$  [11.10]

a very natural assumption is that the expected value of $u_t$, given current and all past values of $z$, is zero:

$\mathrm{E}(u_t|z_t, z_{t-1}, z_{t-2}, z_{t-3}, \ldots) = 0.$  [11.11]

This means that, once $z_t$, $z_{t-1}$, and $z_{t-2}$ are included, no further lags of $z$ affect $\mathrm{E}(y_t|z_t, z_{t-1}, z_{t-2}, z_{t-3}, \ldots)$; if this were not true, we would put further lags into the equation. For example, $y_t$ could be the annual percentage change in investment and $z_t$ a measure of interest rates during year $t$. When we set $x_t = (z_t, z_{t-1}, z_{t-2})$, Assumption TS.3′ is then satisfied: OLS will be consistent. As in the previous example, TS.3′ does not rule out feedback from $y$ to future values of $z$.

The previous two examples do not necessarily require asymptotic theory because the explanatory variables could be strictly exogenous. The next example clearly violates the strict exogeneity assumption; therefore, we can only appeal to large sample properties of OLS.

Example 11.3 (AR(1) Model)

Consider the AR(1) model,

$y_t = \beta_0 + \beta_1 y_{t-1} + u_t,$  [11.12]

where the error $u_t$ has a zero expected value, given all past values of $y$:

$\mathrm{E}(u_t|y_{t-1}, y_{t-2}, \ldots) = 0.$  [11.13]

Combined, these two equations imply that

$\mathrm{E}(y_t|y_{t-1}, y_{t-2}, \ldots) = \mathrm{E}(y_t|y_{t-1}) = \beta_0 + \beta_1 y_{t-1}.$  [11.14]

This result is very important. First, it means that, once $y$ lagged one period has been controlled for, no further lags of $y$ affect the expected value of $y_t$. (This is where the name "first order" originates.) Second, the relationship is assumed to be linear.

Because $x_t$ contains only $y_{t-1}$, equation (11.13) implies that Assumption TS.3′ holds. By contrast, the strict exogeneity assumption needed for unbiasedness, Assumption TS.3, does not hold. Since the set of explanatory variables for all time periods includes all of the values on $y$ except the last, $(y_0, y_1, \ldots, y_{n-1})$, Assumption TS.3 requires that, for all $t$, $u_t$ is uncorrelated with each of $y_0, y_1, \ldots, y_{n-1}$. This cannot be true. In fact, because $u_t$ is uncorrelated with $y_{t-1}$ under (11.13), $u_t$ and $y_t$ must be correlated. Indeed, it is easily seen that $\mathrm{Cov}(y_t, u_t) = \mathrm{Var}(u_t) > 0$. Therefore, a model with a lagged dependent variable cannot satisfy the strict exogeneity Assumption TS.3.
For the weak dependence condition to hold, we must assume that $|\beta_1| < 1$, as we discussed in Section 11.1. If this condition holds, then Theorem 11.1 implies that the OLS estimator from the regression of $y_t$ on $y_{t-1}$ produces consistent estimators of $\beta_0$ and $\beta_1$. Unfortunately, $\hat{\beta}_1$ is biased, and this bias can be large if the sample size is small or if $\beta_1$ is near 1. (For $\beta_1$ near 1, $\hat{\beta}_1$ can have a severe downward bias.) In moderate to large samples, $\hat{\beta}_1$ should be a good estimator of $\beta_1$.

When using the standard inference procedures, we need to impose versions of the homoskedasticity and no serial correlation assumptions. These are less restrictive than their classical linear model counterparts from Chapter 10.

Assumption TS.4′ (Homoskedasticity)

The errors are contemporaneously homoskedastic, that is, $\mathrm{Var}(u_t|x_t) = \sigma^2$.

Assumption TS.5′ (No Serial Correlation)

For all $t \neq s$, $\mathrm{E}(u_t u_s|x_t, x_s) = 0$.

In TS.4′, note how we condition only on the explanatory variables at time $t$ (compare to TS.4). In TS.5′, we condition only on the explanatory variables in the time periods coinciding with $u_t$ and $u_s$. As stated, this assumption is a little difficult to interpret, but it is the right condition for studying the large sample properties of OLS in a variety of time series regressions. When considering TS.5′, we often ignore the conditioning on $x_t$ and $x_s$, and we think about whether $u_t$ and $u_s$ are uncorrelated, for all $t \neq s$.

Serial correlation is often a problem in static and finite distributed lag regression models: nothing guarantees that the unobservables $u_t$ are uncorrelated over time. Importantly, Assumption TS.5′ does hold in the AR(1) model stated in equations (11.12) and (11.13). Since the explanatory variable at time $t$ is $y_{t-1}$, we must show that $\mathrm{E}(u_t u_s|y_{t-1}, y_{s-1}) = 0$ for all $t \neq s$. To see this, suppose that $s < t$. (The other case follows by symmetry.) Then, since $u_s = y_s - \beta_0 - \beta_1 y_{s-1}$, $u_s$ is a function of $y$ dated before time $t$. But by (11.13), $\mathrm{E}(u_t|u_s, y_{t-1}, y_{s-1}) = 0$, and so $\mathrm{E}(u_t u_s|u_s, y_{t-1}, y_{s-1}) = u_s \mathrm{E}(u_t|u_s, y_{t-1}, y_{s-1}) = 0$. By the law of iterated expectations (see Appendix B), $\mathrm{E}(u_t u_s|y_{t-1}, y_{s-1}) = 0$. This is very important: as long as only one lag belongs in (11.12), the errors must be serially uncorrelated. We will discuss this feature of dynamic models more generally in Section 11.4.

We now obtain an asymptotic result that is practically identical to the cross-sectional case.

Theorem 11.2 (Asymptotic Normality of OLS)

Under TS.1′ through TS.5′, the OLS estimators are asymptotically normally distributed. Further, the usual OLS standard errors, t statistics, F statistics, and LM statistics are asymptotically valid.

This theorem provides additional justification for at least some of the examples estimated in Chapter 10: even if the classical linear model assumptions do not hold, OLS is still consistent, and the usual inference procedures are valid. Of course, this hinges on TS.1′ through TS.5′ being true. In the next section, we discuss ways in which the weak dependence assumption can fail. The problems of serial correlation and heteroskedasticity are treated in Chapter 12.
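The two properties just discussed for the AR(1) model, consistency under $|\beta_1| < 1$ together with a small-sample downward bias, can be seen in a short Monte Carlo experiment. The sketch below is an illustrative addition (assuming Python with NumPy), not part of the text.

```python
import numpy as np

# Monte Carlo sketch: OLS on y_t = beta0 + beta1*y_{t-1} + u_t is consistent
# under |beta1| < 1 but biased downward in small samples, more so near one.
rng = np.random.default_rng(2)
beta1 = 0.9

def ols_slope(n):
    """OLS slope from regressing y_t on (1, y_{t-1}) for one simulated path."""
    y = np.zeros(n + 1)
    for t in range(1, n + 1):
        y[t] = beta1 * y[t - 1] + rng.normal()
    X = np.column_stack([np.ones(n), y[:-1]])
    return np.linalg.lstsq(X, y[1:], rcond=None)[0][1]

for n in (25, 100, 1000):
    avg = np.mean([ols_slope(n) for _ in range(2000)])
    print(n, round(avg, 3))  # noticeably below .9 at n = 25; close to .9 by n = 1000
```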
Example 11.4 (Efficient Markets Hypothesis)

We can use asymptotic analysis to test a version of the efficient markets hypothesis (EMH). Let $y_t$ be the weekly percentage return (from Wednesday close to Wednesday close) on the New York Stock Exchange composite index. A strict form of the EMH states that information observable to the market prior to week $t$ should not help to predict the return during week $t$. If we use only past information on $y$, the EMH is stated as

$\mathrm{E}(y_t|y_{t-1}, y_{t-2}, \ldots) = \mathrm{E}(y_t).$  [11.15]

If (11.15) is false, then we could use information on past weekly returns to predict the current return. The EMH presumes that such investment opportunities will be noticed and will disappear almost instantaneously.

One simple way to test (11.15) is to specify the AR(1) model in (11.12) as the alternative model. Then, the null hypothesis is easily stated as $H_0: \beta_1 = 0$. Under the null hypothesis, Assumption TS.3′ is true by (11.15), and, as we discussed earlier, serial correlation is not an issue. The homoskedasticity assumption is $\mathrm{Var}(y_t|y_{t-1}) = \mathrm{Var}(y_t) = \sigma^2$, which we just assume is true for now. Under the null hypothesis, stock returns are serially uncorrelated, so we can safely assume that they are weakly dependent. Then, Theorem 11.2 says we can use the usual OLS t statistic for $\hat{\beta}_1$ to test $H_0: \beta_1 = 0$ against $H_1: \beta_1 \neq 0$.

The weekly returns in NYSE are computed using data from January 1976 through March 1989. (In the rare case that Wednesday was a holiday, the close at the next trading day was used.) The average weekly return over this period was .196 in percentage form, with the largest weekly return being 8.45% and the smallest being $-15.32$% (during the stock market crash of October 1987). Estimation of the AR(1) model gives

$\widehat{return}_t = .180 + .059\, return_{t-1}$  [11.16]
$(.081)\ \ (.038)$
$n = 689,\ R^2 = .0035,\ \bar{R}^2 = .0020.$

The t statistic for the coefficient on $return_{t-1}$ is about 1.55, and so $H_0: \beta_1 = 0$ cannot be rejected against the two-sided alternative, even at the 10% significance level. The estimate does suggest a slight positive correlation in the NYSE return from one week to the next, but it is not strong enough to warrant rejection of the EMH.
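The mechanics of this test are a one-line regression. The sketch below (an illustrative addition assuming Python with statsmodels; it is not the text's own code) uses a placeholder series in place of the NYSE returns, so its printed numbers will not match equation (11.16).

```python
import numpy as np
import statsmodels.api as sm

# Sketch of the EMH test: regress the return on its own lag and read off the
# usual t statistic, which is valid in large samples by Theorem 11.2.
# 'returns' is a placeholder for the weekly NYSE percentage returns.
rng = np.random.default_rng(3)
returns = rng.normal(0.2, 2.0, size=689)

y, ylag = returns[1:], returns[:-1]
res = sm.OLS(y, sm.add_constant(ylag)).fit()
print(res.params)   # intercept and slope estimates
print(res.tvalues)  # t statistic for H0: beta1 = 0
```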
In the previous example, using an AR(1) model to test the EMH might not detect correlation between weekly returns that are more than one week apart. It is easy to estimate models with more than one lag. For example, an autoregressive model of order two, or AR(2) model, is

$y_t = \beta_0 + \beta_1 y_{t-1} + \beta_2 y_{t-2} + u_t$
$\mathrm{E}(u_t|y_{t-1}, y_{t-2}, \ldots) = 0.$  [11.17]

There are stability conditions on $\beta_1$ and $\beta_2$ that are needed to ensure that the AR(2) process is weakly dependent, but this is not an issue here because the null hypothesis states that the EMH holds:

$H_0: \beta_1 = \beta_2 = 0.$  [11.18]

If we add the homoskedasticity assumption $\mathrm{Var}(u_t|y_{t-1}, y_{t-2}) = \sigma^2$, we can use a standard F statistic to test (11.18). If we estimate an AR(2) model for $return_t$, we obtain

$\widehat{return}_t = .186 + .060\, return_{t-1} - .038\, return_{t-2}$
$(.081)\ \ (.038)\ \ (.038)$
$n = 688,\ R^2 = .0048,\ \bar{R}^2 = .0019,$

where we lose one more observation because of the additional lag in the equation. The two lags are individually insignificant at the 10% level. They are also jointly insignificant: using $R^2 = .0048$, we find the F statistic is approximately $F = 1.65$; the p-value for this F statistic (with 2 and 685 degrees of freedom) is about .193. Thus, we do not reject (11.18) at even the 15% significance level.

Example 11.5 (Expectations Augmented Phillips Curve)

A linear version of the expectations augmented Phillips curve can be written as

$inf_t - inf^e_t = \beta_1(unem_t - \mu_0) + e_t,$

where $\mu_0$ is the natural rate of unemployment and $inf^e_t$ is the expected rate of inflation formed in year $t - 1$. This model assumes that the natural rate is constant, something that macroeconomists question. The difference between actual unemployment and the natural rate is called cyclical unemployment, while the difference between actual and expected inflation is called unanticipated inflation. The error term, $e_t$, is called a supply shock by macroeconomists. If there is a tradeoff between unanticipated inflation and cyclical unemployment, then $\beta_1 < 0$. [For a detailed discussion of the expectations augmented Phillips curve, see Mankiw (1994, Section 11.2).]

To complete this model, we need to make an assumption about inflationary expectations. Under adaptive expectations, the expected value of current inflation depends on recently observed inflation. A particularly simple formulation is that expected inflation this year is last year's inflation: $inf^e_t = inf_{t-1}$. (See Section 18.1 for an alternative formulation of adaptive expectations.) Under this assumption, we can write

$inf_t - inf_{t-1} = \beta_0 + \beta_1 unem_t + e_t$

or

$\Delta inf_t = \beta_0 + \beta_1 unem_t + e_t,$

where $\Delta inf_t = inf_t - inf_{t-1}$ and $\beta_0 = -\beta_1 \mu_0$. ($\beta_0$ is expected to be positive, since $\beta_1 < 0$ and $\mu_0 > 0$.) Therefore, under adaptive expectations, the expectations augmented Phillips curve relates the change in inflation to the level of unemployment and a supply shock, $e_t$. If $e_t$ is uncorrelated with $unem_t$, as is typically assumed, then we can consistently estimate $\beta_0$ and $\beta_1$ by OLS. (We do not have to assume that, say, future unemployment rates are unaffected by the current supply shock.) We assume that TS.1′ through TS.5′ hold. Using the data through 1996 in PHILLIPS, we estimate

$\widehat{\Delta inf}_t = 3.03 - .543\, unem_t$  [11.19]
$(1.38)\ \ (.230)$
$n = 48,\ R^2 = .108,\ \bar{R}^2 = .088.$

The tradeoff between cyclical unemployment and unanticipated inflation is pronounced in equation (11.19): a one-point increase in $unem$ lowers unanticipated inflation by over one-half of a point. The effect is statistically significant (two-sided p-value = .023). We can contrast this with the static Phillips curve in Example 10.1, where we found a slightly positive relationship between inflation and unemployment.

Because we can write the natural rate as $\mu_0 = \beta_0/(-\beta_1)$, we can use (11.19) to obtain our own estimate of the natural rate: $\hat{\mu}_0 = \hat{\beta}_0/(-\hat{\beta}_1) = 3.03/.543 \approx 5.58$. Thus, we estimate the natural rate to be about 5.6, which is well within the range suggested by macroeconomists: historically, 5% to 6% is a common range cited for the natural rate of unemployment. A standard error of this estimate is difficult to obtain because we have a nonlinear function of the OLS estimators. Wooldridge (2010, Chapter 3) contains the theory for general nonlinear functions. In the current application, the standard error is .657, which leads to an asymptotic 95% confidence interval (based on the standard normal distribution) of about 4.29 to 6.87 for the natural rate.
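One common route to a standard error for a nonlinear function such as $\hat{\mu}_0 = \hat{\beta}_0/(-\hat{\beta}_1)$ is the delta method; the text itself does not name a method, so the sketch below is only one possibility. The point estimates are taken from (11.19), but the covariance matrix of $(\hat{\beta}_0, \hat{\beta}_1)$ is a placeholder, since only the final standard error (.657) is reported; in practice the covariance matrix would come from the fitted regression (for example, `results.cov_params()` in statsmodels).

```python
import numpy as np

# Delta-method sketch for mu0_hat = b0/(-b1). V is a hypothetical covariance
# matrix of (b0_hat, b1_hat); only the point estimates come from (11.19).
b0, b1 = 3.03, -0.543
V = np.array([[1.38**2, -0.30],
              [-0.30, 0.230**2]])        # placeholder off-diagonal term

mu0 = b0 / (-b1)                          # about 5.58
grad = np.array([-1.0 / b1, b0 / b1**2])  # gradient of b0/(-b1) in (b0, b1)
se = np.sqrt(grad @ V @ grad)
print(mu0, (mu0 - 1.96 * se, mu0 + 1.96 * se))
```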
Under Assumptions TS.1′ through TS.5′, we can show that the OLS estimators are asymptotically efficient in the class of estimators described in Theorem 5.3, but we replace the cross-sectional observation index $i$ with the time series index $t$. Finally, models with trending explanatory variables can effectively satisfy Assumptions TS.1′ through TS.5′, provided they are trend stationary. As long as time trends are included in the equations when needed, the usual inference procedures are asymptotically valid.

11.3 Using Highly Persistent Time Series in Regression Analysis

The previous section shows that, provided the time series we use are weakly dependent, usual OLS inference procedures are valid under assumptions weaker than the classical linear model assumptions. Unfortunately, many economic time series cannot be characterized by weak dependence. Using time series with strong dependence in regression analysis poses no problem, if the CLM assumptions in Chapter 10 hold. But the usual inference procedures are very susceptible to violation of these assumptions when the data are not weakly dependent, because then we cannot appeal to the LLN and the CLT. In this section, we provide some examples of highly persistent (or strongly dependent) time series and show how they can be transformed for use in regression analysis.

11.3a Highly Persistent Time Series

In the simple AR(1) model (11.2), the assumption $|\rho_1| < 1$ is crucial for the series to be weakly dependent. It turns out that many economic time series are better characterized by the AR(1) model with $\rho_1 = 1$. In this case, we can write

$y_t = y_{t-1} + e_t, \quad t = 1, 2, \ldots,$  [11.20]

where we again assume that $\{e_t: t = 1, 2, \ldots\}$ is independent and identically distributed with mean zero and variance $\sigma_e^2$. We assume that the initial value, $y_0$, is independent of $e_t$ for all $t \geq 1$.

The process in (11.20) is called a random walk. The name comes from the fact that $y$ at time $t$ is obtained by starting at the previous value, $y_{t-1}$, and adding a zero mean random variable that is independent of $y_{t-1}$. Sometimes, a random walk is defined differently by assuming different properties of the innovations, $e_t$ (such as lack of correlation rather than independence), but the current definition suffices for our purposes.

First, we find the expected value of $y_t$. This is most easily done by using repeated substitution to get

$y_t = e_t + e_{t-1} + \cdots + e_1 + y_0.$

Taking the expected value of both sides gives

$\mathrm{E}(y_t) = \mathrm{E}(e_t) + \mathrm{E}(e_{t-1}) + \cdots + \mathrm{E}(e_1) + \mathrm{E}(y_0) = \mathrm{E}(y_0)$

for all $t \geq 1$. Therefore, the expected value of a random walk does not depend on $t$. A popular assumption is that $y_0 = 0$ (the process begins at zero at time zero), in which case $\mathrm{E}(y_t) = 0$ for all $t$.

By contrast, the variance of a random walk does change with $t$. To compute the variance of a random walk, for simplicity we assume that $y_0$ is nonrandom, so that $\mathrm{Var}(y_0) = 0$; this does not affect any important conclusions. Then, by the i.i.d. assumption for $\{e_t\}$,

$\mathrm{Var}(y_t) = \mathrm{Var}(e_t) + \mathrm{Var}(e_{t-1}) + \cdots + \mathrm{Var}(e_1) = \sigma_e^2 t.$  [11.21]

Exploring Further 11.2: Suppose that expectations are formed as $inf^e_t = (1/2)inf_{t-1} + (1/2)inf_{t-2}$. What regression would you run to estimate the expectations augmented Phillips curve?
In other words, the variance of a random walk increases as a linear function of time. This shows that the process cannot be stationary.

Even more importantly, a random walk displays highly persistent behavior in the sense that the value of $y$ today is important for determining the value of $y$ in the very distant future. To see this, write for $h$ periods hence,

$y_{t+h} = e_{t+h} + e_{t+h-1} + \cdots + e_{t+1} + y_t.$

Now, suppose at time $t$, we want to compute the expected value of $y_{t+h}$ given the current value $y_t$. Since the expected value of $e_{t+j}$, given $y_t$, is zero for all $j \geq 1$, we have

$\mathrm{E}(y_{t+h}|y_t) = y_t, \quad \text{for all } h \geq 1.$  [11.22]

This means that, no matter how far in the future we look, our best prediction of $y_{t+h}$ is today's value, $y_t$. We can contrast this with the stable AR(1) case, where a similar argument can be used to show that

$\mathrm{E}(y_{t+h}|y_t) = \rho_1^h y_t, \quad \text{for all } h \geq 1.$

Under stability, $|\rho_1| < 1$, and so $\mathrm{E}(y_{t+h}|y_t)$ approaches zero as $h \to \infty$: the value of $y_t$ becomes less and less important, and $\mathrm{E}(y_{t+h}|y_t)$ gets closer and closer to the unconditional expected value, $\mathrm{E}(y_t) = 0$.

When $h = 1$, equation (11.22) is reminiscent of the adaptive expectations assumption we used for the inflation rate in Example 11.5: if inflation follows a random walk, then the expected value of $inf_t$, given past values of inflation, is simply $inf_{t-1}$. Thus, a random walk model for inflation justifies the use of adaptive expectations.

We can also see that the correlation between $y_t$ and $y_{t+h}$ is close to one for large $t$ when $\{y_t\}$ follows a random walk. If $\mathrm{Var}(y_0) = 0$, it can be shown that

$\mathrm{Corr}(y_t, y_{t+h}) = \sqrt{t/(t+h)}.$

Thus, the correlation depends on the starting point, $t$ (so that $\{y_t\}$ is not covariance stationary). Further, although for fixed $t$ the correlation tends to zero as $h \to \infty$, it does not do so very quickly. In fact, the larger $t$ is, the more slowly the correlation tends to zero as $h$ gets large. If we choose $h$ to be something large, say, $h = 100$, we can always choose a large enough $t$ such that the correlation between $y_t$ and $y_{t+h}$ is arbitrarily close to one. (If $h = 100$ and we want the correlation to be greater than .95, then $t \geq 1{,}000$ does the trick.) Therefore, a random walk does not satisfy the requirement of an asymptotically uncorrelated sequence.

Figure 11.1 plots two realizations of a random walk, generated from a computer, with initial value $y_0 = 0$ and $e_t \sim \mathrm{Normal}(0, 1)$. Generally, it is not easy to look at a time series plot and determine whether it is a random walk. Next, we will discuss an informal method for making the distinction between weakly and highly dependent sequences; we will study formal statistical tests in Chapter 18.

A series that is generally thought to be well characterized by a random walk is the three-month T-bill rate. Annual data are plotted in Figure 11.2 for the years 1948 through 1996.

A random walk is a special case of what is known as a unit root process. The name comes from the fact that $\rho_1 = 1$ in the AR(1) model. A more general class of unit root processes is generated as in (11.20), but $\{e_t\}$ is now allowed to be a general, weakly dependent series. [For example, $\{e_t\}$ could itself follow an MA(1) or a stable AR(1) process.] When $\{e_t\}$ is not an i.i.d. sequence, the properties of the random walk we derived earlier no longer hold. But the key feature of $\{y_t\}$ is preserved: the value of $y$ today is highly correlated with $y$ even in the distant future.
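Both random walk properties derived above, $\mathrm{Var}(y_t) = \sigma_e^2 t$ and $\mathrm{Corr}(y_t, y_{t+h}) = \sqrt{t/(t+h)}$, can be checked by simulating many paths. The sketch below is an illustrative addition (assuming Python with NumPy), not from the text.

```python
import numpy as np

# Simulate many random walks and check the variance and correlation formulas.
rng = np.random.default_rng(4)
t, h, nsim = 1000, 100, 5000

e = rng.normal(0.0, 1.0, size=(nsim, t + h))
y = e.cumsum(axis=1)                       # nsim random walks with y_0 = 0

print(np.var(y[:, t - 1]), t)              # sample variance close to t = 1000
r_hat = np.corrcoef(y[:, t - 1], y[:, t + h - 1])[0, 1]
print(r_hat, np.sqrt(t / (t + h)))         # both close to .953
```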
From a policy perspective, it is often important to know whether an economic time series is highly persistent or not. Consider the case of gross domestic product in the United States. If GDP is asymptotically uncorrelated, then the level of GDP in the coming year is at best weakly related to what GDP was, say, 30 years ago. This means a policy that affected GDP long ago has very little lasting impact. On the other hand, if GDP is strongly dependent, then next year's GDP can be highly correlated with the GDP from many years ago. Then, we should recognize that a policy that causes a discrete change in GDP can have long-lasting effects.

[Figure 11.1: Two realizations of the random walk $y_t = y_{t-1} + e_t$, with $y_0 = 0$, $e_t \sim \mathrm{Normal}(0, 1)$, and $n = 50$.]

[Figure 11.2: The U.S. three-month T-bill rate, for the years 1948 through 1996.]

It is extremely important not to confuse trending and highly persistent behaviors. A series can be trending but not highly persistent, as we saw in Chapter 10. Further, factors such as interest rates, inflation rates, and unemployment rates are thought by many to be highly persistent, but they have no obvious upward or downward trend. However, it is often the case that a highly persistent series also contains a clear trend. One model that leads to this behavior is the random walk with drift:

$y_t = \alpha_0 + y_{t-1} + e_t, \quad t = 1, 2, \ldots,$  [11.23]

where $\{e_t: t = 1, 2, \ldots\}$ and $y_0$ satisfy the same properties as in the random walk model. What is new is the parameter $\alpha_0$, which is called the drift term. Essentially, to generate $y_t$, the constant $\alpha_0$ is added along with the random noise $e_t$ to the previous value, $y_{t-1}$. We can show that the expected value of $y_t$ follows a linear time trend by using repeated substitution:

$y_t = \alpha_0 t + e_t + e_{t-1} + \cdots + e_1 + y_0.$

Therefore, if $y_0 = 0$, $\mathrm{E}(y_t) = \alpha_0 t$: the expected value of $y_t$ is growing over time if $\alpha_0 > 0$ and shrinking over time if $\alpha_0 < 0$. By reasoning as we did in the pure random walk case, we can show that $\mathrm{E}(y_{t+h}|y_t) = \alpha_0 h + y_t$, and so the best prediction of $y_{t+h}$ at time $t$ is $y_t$ plus the drift, $\alpha_0 h$. The variance of $y_t$ is the same as it was in the pure random walk case.

Figure 11.3 contains a realization of a random walk with drift, where $n = 50$, $y_0 = 0$, $\alpha_0 = 2$, and the $e_t$ are Normal(0, 9) random variables. As can be seen from this graph, $y_t$ tends to grow over time, but the series does not regularly return to the trend line.

[Figure 11.3: A realization of the random walk with drift, $y_t = 2 + y_{t-1} + e_t$, with $y_0 = 0$, $e_t \sim \mathrm{Normal}(0, 9)$, and $n = 50$. The dashed line is the expected value of $y_t$, $\mathrm{E}(y_t) = 2t$.]
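A series like the one in Figure 11.3 is straightforward to generate. The following sketch (an illustrative addition assuming Python with NumPy) uses the figure's parameters: $\alpha_0 = 2$ and $e_t \sim \mathrm{Normal}(0, 9)$, so the innovations have standard deviation 3.

```python
import numpy as np

# Generate one random walk with drift, as in Figure 11.3.
rng = np.random.default_rng(5)
n, a0 = 50, 2.0

e = rng.normal(0.0, 3.0, size=n)            # Normal(0, 9) innovations: sd = 3
y = a0 * np.arange(1, n + 1) + e.cumsum()   # y_t = a0*t + e_1 + ... + e_t (y_0 = 0)

# The series drifts upward around E(y_t) = 2t but, as the figure shows,
# it does not regularly return to the trend line.
print(np.column_stack([y[:5], 2.0 * np.arange(1, 6)]))
```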
A random walk with drift is another example of a unit root process, because it is the special case $\rho_1 = 1$ in an AR(1) model with an intercept:

$y_t = \alpha_0 + \rho_1 y_{t-1} + e_t.$

When $\rho_1 = 1$ and $\{e_t\}$ is any weakly dependent process, we obtain a whole class of highly persistent time series processes that also have linearly trending means.

11.3b Transformations on Highly Persistent Time Series

Using time series with strong persistence of the type displayed by a unit root process in a regression equation can lead to very misleading results if the CLM assumptions are violated. We will study the spurious regression problem in more detail in Chapter 18, but for now we must be aware of potential problems. Fortunately, simple transformations are available that render a unit root process weakly dependent.

Weakly dependent processes are said to be integrated of order zero, or I(0). Practically, this means that nothing needs to be done to such series before using them in regression analysis: averages of such sequences already satisfy the standard limit theorems. Unit root processes, such as a random walk (with or without drift), are said to be integrated of order one, or I(1). This means that the first difference of the process is weakly dependent (and often stationary). A time series that is I(1) is often said to be a difference-stationary process, although the name is somewhat misleading with its emphasis on stationarity after differencing rather than weak dependence in the differences.

The concept of an I(1) process is easiest to see for a random walk. With $\{y_t\}$ generated as in (11.20) for $t = 1, 2, \ldots$,

$\Delta y_t = y_t - y_{t-1} = e_t, \quad t = 2, 3, \ldots;$  [11.24]

therefore, the first-differenced series $\{\Delta y_t: t = 2, 3, \ldots\}$ is actually an i.i.d. sequence. More generally, if $\{y_t\}$ is generated by (11.20) where $\{e_t\}$ is any weakly dependent process, then $\{\Delta y_t\}$ is weakly dependent. Thus, when we suspect processes are integrated of order one, we often first difference in order to use them in regression analysis; we will see some examples later. (Incidentally, the symbol "$\Delta$" can mean "change" as well as "difference." In actual data sets, if an original variable is named $y$, then its change or difference is often denoted $cy$ or $dy$. For example, the change in price might be denoted $cprice$.)

Many time series $y_t$ that are strictly positive are such that $\log(y_t)$ is integrated of order one. In this case, we can use the first difference in the logs, $\Delta\log(y_t) = \log(y_t) - \log(y_{t-1})$, in regression analysis. Alternatively, since

$\Delta\log(y_t) \approx (y_t - y_{t-1})/y_{t-1},$  [11.25]

we can use the proportionate or percentage change in $y_t$ directly; this is what we did in Example 11.4, where, rather than stating the EMH in terms of the stock price, $p_t$, we used the weekly percentage change, $return_t = 100[(p_t - p_{t-1})/p_{t-1}]$. The quantity in equation (11.25) is often called the growth rate, measured as a proportionate change. When using a particular data set, it is important to know how the growth rates are measured, whether as a proportionate or a percentage change.
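The effect of first differencing an I(1) series is easy to see numerically. The sketch below (an illustrative addition assuming Python with NumPy, with a placeholder positive series for the log-difference part) shows a highly persistent level series turning into an essentially uncorrelated difference, and that a difference in logs approximates a growth rate, as in (11.25).

```python
import numpy as np

# The level of a random walk is highly persistent; its first difference is not.
rng = np.random.default_rng(6)
y = 100.0 + rng.normal(size=2000).cumsum()     # an I(1) level series

dy = np.diff(y)                                # first difference
print(np.corrcoef(y[:-1], y[1:])[0, 1])        # near 1: level is persistent
print(np.corrcoef(dy[:-1], dy[1:])[0, 1])      # near 0: difference is weakly dependent

p = np.exp(y / 1000.0)                         # a strictly positive placeholder series
growth = np.diff(np.log(p))                    # Delta log(p_t) ~ (p_t - p_{t-1})/p_{t-1}
print(growth[:3], (np.diff(p) / p[:-1])[:3])   # the two measures nearly coincide
```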
Sometimes, if an original variable is $y$, its growth rate is denoted $gy$, so that, for each $t$, $gy_t = \log(y_t) - \log(y_{t-1})$ or $gy_t = (y_t - y_{t-1})/y_{t-1}$. Often these quantities are multiplied by 100 to turn a proportionate change into a percentage change.

Differencing time series before using them in regression analysis has another benefit: it removes any linear time trend. This is easily seen by writing a linearly trending variable as

$y_t = \gamma_0 + \gamma_1 t + v_t,$

where $v_t$ has a zero mean. Then $\Delta y_t = \gamma_1 + \Delta v_t$, and so $\mathrm{E}(\Delta y_t) = \gamma_1 + \mathrm{E}(\Delta v_t) = \gamma_1$. In other words, $\mathrm{E}(\Delta y_t)$ is constant. The same argument works for $\Delta\log(y_t)$ when $\log(y_t)$ follows a linear time trend. Therefore, rather than including a time trend in a regression, we can instead difference those variables that show obvious trends.

11.3c Deciding Whether a Time Series Is I(1)

Determining whether a particular time series realization is the outcome of an I(1) versus an I(0) process can be quite difficult. Statistical tests can be used for this purpose, but these are more advanced; we provide an introductory treatment in Chapter 18.

There are informal methods that provide useful guidance about whether a time series process is roughly characterized by weak dependence. A very simple tool is motivated by the AR(1) model: if $|\rho_1| < 1$, then the process is I(0), but it is I(1) if $\rho_1 = 1$. Earlier, we showed that, when the AR(1) process is stable, $\rho_1 = \mathrm{Corr}(y_t, y_{t-1})$. Therefore, we can estimate $\rho_1$ from the sample correlation between $y_t$ and $y_{t-1}$. This sample correlation coefficient is called the first order autocorrelation of $\{y_t\}$; we denote this by $\hat{\rho}_1$. By applying the LLN, $\hat{\rho}_1$ can be shown to be consistent for $\rho_1$, provided $|\rho_1| < 1$. (However, $\hat{\rho}_1$ is not an unbiased estimator of $\rho_1$.)

We can use the value of $\hat{\rho}_1$ to help decide whether the process is I(1) or I(0). Unfortunately, because $\hat{\rho}_1$ is an estimate, we can never know for sure whether $\rho_1 < 1$. Ideally, we could compute a confidence interval for $\rho_1$ to see if it excludes the value $\rho_1 = 1$, but this turns out to be rather difficult: the sampling distributions of the estimator $\hat{\rho}_1$ are extremely different when $\rho_1$ is close to one and when $\rho_1$ is much less than one. (In fact, when $\rho_1$ is close to one, $\hat{\rho}_1$ can have a severe downward bias.) In Chapter 18, we will show how to test $H_0: \rho_1 = 1$ against $H_1: \rho_1 < 1$. For now, we can only use $\hat{\rho}_1$ as a rough guide for determining whether a series needs to be differenced. No hard-and-fast rule exists for making this choice. Most economists think that differencing is warranted if $\hat{\rho}_1 > .9$; some would difference when $\hat{\rho}_1 > .8$.
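Computing $\hat{\rho}_1$ takes one line, as the following sketch shows (an illustrative addition assuming Python with NumPy, not from the text); the fertility example that follows applies the same diagnostic to real series.

```python
import numpy as np

# The informal I(1)-versus-I(0) diagnostic: the first order autocorrelation.
def rho1_hat(y):
    """Sample correlation between y_t and y_{t-1}."""
    y = np.asarray(y, dtype=float)
    return np.corrcoef(y[:-1], y[1:])[0, 1]

rng = np.random.default_rng(7)
print(rho1_hat(rng.normal(size=500).cumsum()))  # random walk: close to 1
print(rho1_hat(rng.normal(size=500)))           # i.i.d. series: close to 0
```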
Example 11.6 (Fertility Equation)

In Example 10.4, we explained the general fertility rate, $gfr$, in terms of the value of the personal exemption, $pe$. The first order autocorrelations for these series are very large: $\hat{\rho}_1 = .977$ for $gfr$ and $\hat{\rho}_1 = .964$ for $pe$. These autocorrelations are highly suggestive of unit root behavior, and they raise serious questions about our use of the usual OLS t statistics for this example back in Chapter 10. Remember, the t statistics only have exact t distributions under the full set of classical linear model assumptions. To relax those assumptions in any way and apply asymptotics, we generally need the underlying series to be I(0) processes.

We now estimate the equation using first differences (and drop the dummy variable, for simplicity):

$\widehat{\Delta gfr} = -.785 - .043\, \Delta pe$  [11.26]
$(.502)\ \ (.028)$
$n = 71,\ R^2 = .032,\ \bar{R}^2 = .018.$

Now, an increase in $pe$ is estimated to lower $gfr$ contemporaneously, although the estimate is not statistically different from zero at the 5% level. This gives very different results than when we estimated the model in levels, and it casts doubt on our earlier analysis.

If we add two lags of $\Delta pe$, things improve:

$\widehat{\Delta gfr} = -.964 - .036\, \Delta pe - .014\, \Delta pe_{-1} + .110\, \Delta pe_{-2}$  [11.27]
$(.468)\ \ (.027)\ \ (.028)\ \ (.027)$
$n = 69,\ R^2 = .233,\ \bar{R}^2 = .197.$

Even though $\Delta pe$ and $\Delta pe_{-1}$ have negative coefficients, their coefficients are small and jointly insignificant (p-value = .28). The second lag is very significant and indicates a positive relationship between changes in $pe$ and subsequent changes in $gfr$ two years hence. This makes more sense than having a contemporaneous effect. See Computer Exercise C5 for further analysis of the equation in first differences.

When the series in question has an obvious upward or downward trend, it makes more sense to obtain the first order autocorrelation after detrending. If the data are not detrended, the autoregressive correlation tends to be overestimated, which biases toward finding a unit root in a trending process.

Example 11.7 (Wages and Productivity)

The variable $hrwage$ is average hourly wage in the U.S. economy, and $outphr$ is output per hour. One way to estimate the elasticity of hourly wage with respect to output per hour is to estimate the equation

$\log(hrwage_t) = \beta_0 + \beta_1 \log(outphr_t) + \beta_2 t + u_t,$

where the time trend is included because $\log(hrwage_t)$ and $\log(outphr_t)$ both display clear, upward, linear trends. Using the data in EARNS for the years 1947 through 1987, we obtain

$\widehat{\log(hrwage_t)} = -5.33 + 1.64\, \log(outphr_t) - .018\, t$  [11.28]
$(.37)\ \ (.09)\ \ (.002)$
$n = 41,\ R^2 = .971,\ \bar{R}^2 = .970.$

(We have reported the usual goodness-of-fit measures here; it would be better to report those based on the detrended dependent variable, as in Section 10.5.)

The estimated elasticity seems too large: a 1% increase in productivity increases real wages by about 1.64%. Because the standard error is so small, the 95% confidence interval easily excludes a unit elasticity. U.S. workers would probably have trouble believing that their wages increase by more than 1.5% for every 1% increase in productivity.

The regression results in (11.28) must be viewed with caution. Even after linearly detrending $\log(hrwage)$, the first order autocorrelation is .967, and for detrended $\log(outphr)$, $\hat{\rho}_1 = .945$. These suggest that both series have unit roots, so we reestimate the equation in first differences (and we no longer need a time trend):

$\widehat{\Delta\log(hrwage_t)} = -.0036 + .809\, \Delta\log(outphr_t)$  [11.29]
$(.0042)\ \ (.173)$
$n = 40,\ R^2 = .364,\ \bar{R}^2 = .348.$

Now, a 1% increase in productivity is estimated to increase real wages by about .81%, and the estimate is not statistically different from one. The adjusted R-squared shows that the growth in output explains about 35% of the growth in real wages. See Computer Exercise C2 for a simple distributed lag version of the model in first differences.
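The detrended autocorrelations used in Example 11.7 amount to regressing each series on a linear trend and autocorrelating the residuals. The sketch below is an illustrative addition (assuming Python with NumPy, with simulated stand-in series rather than the EARNS data).

```python
import numpy as np

# First order autocorrelation after linear detrending, as used in Example 11.7.
def rho1_detrended(y):
    """Autocorrelation of residuals from regressing y on (1, t)."""
    y = np.asarray(y, dtype=float)
    t = np.arange(len(y), dtype=float)
    X = np.column_stack([np.ones_like(t), t])
    resid = y - X @ np.linalg.lstsq(X, y, rcond=None)[0]
    return np.corrcoef(resid[:-1], resid[1:])[0, 1]

rng = np.random.default_rng(8)
trend_I0 = 0.5 * np.arange(200) + rng.normal(size=200)           # trend-stationary
trend_I1 = 0.5 * np.arange(200) + rng.normal(size=200).cumsum()  # unit root + trend
print(rho1_detrended(trend_I0))  # small: no unit root once detrended
print(rho1_detrended(trend_I1))  # still large: detrending does not remove a unit root
```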
In the previous two examples, both the dependent and independent variables appear to have unit roots. In other cases, we might have a mixture of processes with unit roots and those that are weakly dependent (though possibly trending). An example is given in Computer Exercise C1.

11.4 Dynamically Complete Models and the Absence of Serial Correlation

In the AR(1) model (11.12), we showed that, under assumption (11.13), the errors $\{u_t\}$ must be serially uncorrelated in the sense that Assumption TS.5′ is satisfied: assuming that no serial correlation exists is practically the same thing as assuming that only one lag of $y$ appears in $\mathrm{E}(y_t|y_{t-1}, y_{t-2}, \ldots)$.

Can we make a similar statement for other regression models? The answer is yes, although the assumptions required for the errors to be serially uncorrelated might be implausible. Consider, for example, the simple static regression model

$y_t = \beta_0 + \beta_1 z_t + u_t,$  [11.30]

where $y_t$ and $z_t$ are contemporaneously dated. For consistency of OLS, we only need $\mathrm{E}(u_t|z_t) = 0$. Generally, the $\{u_t\}$ will be serially correlated. However, if we assume that

$\mathrm{E}(u_t|z_t, y_{t-1}, z_{t-1}, \ldots) = 0,$  [11.31]

then (as we will show generally later) Assumption TS.5′ holds. In particular, the $\{u_t\}$ are serially uncorrelated. Naturally, assumption (11.31) implies that $z_t$ is contemporaneously exogenous, that is, $\mathrm{E}(u_t|z_t) = 0$.

To gain insight into the meaning of (11.31), we can write (11.30) and (11.31) equivalently as

$\mathrm{E}(y_t|z_t, y_{t-1}, z_{t-1}, \ldots) = \mathrm{E}(y_t|z_t) = \beta_0 + \beta_1 z_t,$  [11.32]

where the first equality is the one of current interest. It says that, once $z_t$ has been controlled for, no lags of either $y$ or $z$ help to explain current $y$. This is a strong requirement and is implausible when the lagged dependent variable has predictive power, which is often the case; if it is false, then we can expect the errors to be serially correlated.

Next, consider a finite distributed lag model with two lags:

$y_t = \beta_0 + \beta_1 z_t + \beta_2 z_{t-1} + \beta_3 z_{t-2} + u_t.$  [11.33]

Since we are hoping to capture the lagged effects that $z$ has on $y$, we would naturally assume that (11.33) captures the distributed lag dynamics:

$\mathrm{E}(y_t|z_t, z_{t-1}, z_{t-2}, z_{t-3}, \ldots) = \mathrm{E}(y_t|z_t, z_{t-1}, z_{t-2});$  [11.34]

that is, at most two lags of $z$ matter. If (11.31) holds, we can make a stronger statement: once we have controlled for $z$ and its two lags, no lags of $y$ or additional lags of $z$ affect current $y$:

$\mathrm{E}(y_t|z_t, y_{t-1}, z_{t-1}, \ldots) = \mathrm{E}(y_t|z_t, z_{t-1}, z_{t-2}).$  [11.35]

Equation (11.35) is more likely than (11.32), but it still rules out lagged $y$ having extra predictive power for current $y$.

Next, consider a model with one lag of both $y$ and $z$:

$y_t = \beta_0 + \beta_1 z_t + \beta_2 y_{t-1} + \beta_3 z_{t-1} + u_t.$

Since this model includes a lagged dependent variable, (11.31) is a natural assumption, as it implies that

$\mathrm{E}(y_t|z_t, y_{t-1}, z_{t-1}, y_{t-2}, \ldots) = \mathrm{E}(y_t|z_t, y_{t-1}, z_{t-1});$

in other words, once $z_t$, $y_{t-1}$, and $z_{t-1}$ have been controlled for, no further lags of either $y$ or $z$ affect current $y$.

In the general model

$y_t = \beta_0 + \beta_1 x_{t1} + \cdots + \beta_k x_{tk} + u_t,$  [11.36]
where the explanatory variables $x_t = (x_{t1}, \ldots, x_{tk})$ may or may not contain lags of $y$ or $z$, (11.31) becomes

$\mathrm{E}(u_t|x_t, y_{t-1}, x_{t-1}, \ldots) = 0.$  [11.37]

Written in terms of $y_t$,

$\mathrm{E}(y_t|x_t, y_{t-1}, x_{t-1}, \ldots) = \mathrm{E}(y_t|x_t).$  [11.38]

In other words, whatever is in $x_t$, enough lags have been included so that further lags of $y$ and the explanatory variables do not matter for explaining $y_t$. When this condition holds, we have a dynamically complete model. As we saw earlier, dynamic completeness can be a very strong assumption for static and finite distributed lag models.

Once we start putting lagged $y$ as explanatory variables, we often think that the model should be dynamically complete. We will touch on some exceptions to this claim in Chapter 18.

Since (11.37) is equivalent to

$\mathrm{E}(u_t|x_t, u_{t-1}, x_{t-1}, u_{t-2}, \ldots) = 0,$  [11.39]

we can show that a dynamically complete model must satisfy Assumption TS.5′. (This derivation is not crucial and can be skipped without loss of continuity.) For concreteness, take $s < t$. Then, by the law of iterated expectations (see Appendix B),

$\mathrm{E}(u_t u_s|x_t, x_s) = \mathrm{E}[\mathrm{E}(u_t u_s|x_t, x_s, u_s)|x_t, x_s] = \mathrm{E}[u_s \mathrm{E}(u_t|x_t, x_s, u_s)|x_t, x_s],$

where the second equality follows from $\mathrm{E}(u_t u_s|x_t, x_s, u_s) = u_s \mathrm{E}(u_t|x_t, x_s, u_s)$. Now, since $s < t$, $(x_t, x_s, u_s)$ is a subset of the conditioning set in (11.39). Therefore, (11.39) implies that $\mathrm{E}(u_t|x_t, x_s, u_s) = 0$, and so

$\mathrm{E}(u_t u_s|x_t, x_s) = \mathrm{E}(u_s \cdot 0|x_t, x_s) = 0,$

which says that Assumption TS.5′ holds.

Since specifying a dynamically complete model means that there is no serial correlation, does it follow that all models should be dynamically complete? As we will see in Chapter 18, for forecasting purposes, the answer is yes. Some think that all models should be dynamically complete and that serial correlation in the errors of a model is a sign of misspecification. This stance is too rigid. Sometimes, we really are interested in a static model (such as a Phillips curve) or a finite distributed lag model (such as measuring the long-run percentage change in wages given a 1% increase in productivity). In the next chapter, we will show how to detect and correct for serial correlation in such models.

Example 11.8 (Fertility Equation)

In equation (11.27), we estimated a distributed lag model for $\Delta gfr$ on $\Delta pe$, allowing for two lags of $\Delta pe$. For this model to be dynamically complete in the sense of (11.38), neither lags of $\Delta gfr$ nor further lags of $\Delta pe$ should appear in the equation. We can easily see that this is false by adding $\Delta gfr_{-1}$: the coefficient estimate is .300, and its t statistic is 2.84. Thus, the model is not dynamically complete in the sense of (11.38).

Exploring Further 11.3: If (11.33) holds where $u_t = e_t + \alpha_1 e_{t-1}$ and where $\{e_t\}$ is an i.i.d. sequence with mean zero and variance $\sigma_e^2$, can equation (11.33) be dynamically complete?
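The check in Example 11.8, finding that adding $\Delta gfr_{-1}$ yields a coefficient of .300 with t statistic 2.84, is easy to reproduce mechanically. The sketch below is an illustrative addition (assuming Python with statsmodels); `dgfr` and `dpe` are placeholders for the differenced fertility-equation series, so the printed value will not match the text's numbers.

```python
import numpy as np
import statsmodels.api as sm

# Re-estimate the FDL model with one lag of the dependent variable added and
# examine its t statistic; a significant t suggests dynamic incompleteness.
rng = np.random.default_rng(9)
dgfr, dpe = rng.normal(size=72), rng.normal(size=72)   # placeholders only

y = dgfr[3:]
X = sm.add_constant(np.column_stack([dpe[3:],        # current Delta pe
                                     dpe[2:-1],      # first lag of Delta pe
                                     dpe[1:-2],      # second lag of Delta pe
                                     dgfr[2:-1]]))   # lagged dependent variable
res = sm.OLS(y, X).fit()
print(res.params[-1], res.tvalues[-1])
```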
What should we make of this? We will postpone an interpretation of general models with lagged dependent variables until Chapter 18. But the fact that (11.27) is not dynamically complete suggests that there may be serial correlation in the errors. We will see how to test and correct for this in Chapter 12.

The notion of dynamic completeness should not be confused with a weaker assumption concerning including the appropriate lags in a model. In the model (11.36), the explanatory variables $x_t$ are said to be sequentially exogenous if

$\mathrm{E}(u_t|x_t, x_{t-1}, \ldots) = \mathrm{E}(u_t) = 0, \quad t = 1, 2, \ldots.$  [11.40]

As discussed in Problem 8 in Chapter 10, sequential exogeneity is implied by strict exogeneity, and sequential exogeneity implies contemporaneous exogeneity. Further, because $(x_t, x_{t-1}, \ldots)$ is a subset of $(x_t, y_{t-1}, x_{t-1}, \ldots)$, sequential exogeneity is implied by dynamic completeness. If $x_t$ contains $y_{t-1}$, the dynamic completeness and sequential exogeneity are the same condition. The key point is that, when $x_t$ does not contain $y_{t-1}$, sequential exogeneity allows for the possibility that the dynamics are not complete in the sense of capturing the relationship between $y_t$ and all past values of $y$ and other explanatory variables. But, in finite distributed lag models, such as that estimated in equation (11.27), we may not care whether past $y$ has predictive power for current $y$. We are primarily interested in whether we have included enough lags of the explanatory variables to capture the distributed lag dynamics. For example, if we assume

$\mathrm{E}(y_t|z_t, z_{t-1}, z_{t-2}, z_{t-3}, \ldots) = \mathrm{E}(y_t|z_t, z_{t-1}, z_{t-2}) = \alpha_0 + \delta_0 z_t + \delta_1 z_{t-1} + \delta_2 z_{t-2},$

then the regressors $x_t = (z_t, z_{t-1}, z_{t-2})$ are sequentially exogenous because we have assumed that two lags suffice for the distributed lag dynamics. But typically, the model would not be dynamically complete in the sense that

$\mathrm{E}(y_t|z_t, y_{t-1}, z_{t-1}, y_{t-2}, z_{t-2}, \ldots) = \mathrm{E}(y_t|z_t, z_{t-1}, z_{t-2}),$

and we may not care. In addition, the explanatory variables in an FDL model may or may not be strictly exogenous.

11.5 The Homoskedasticity Assumption for Time Series Models

The homoskedasticity assumption for time series regressions, particularly TS.4′, looks very similar to that for cross-sectional regressions. However, since $x_t$ can contain lagged $y$ as well as lagged explanatory variables, we briefly discuss the meaning of the homoskedasticity assumption for different time series regressions.

In the simple static model, say,

$y_t = \beta_0 + \beta_1 z_t + u_t,$  [11.41]

Assumption TS.4′ requires that $\mathrm{Var}(u_t|z_t) = \sigma^2$. Therefore, even though $\mathrm{E}(y_t|z_t)$ is a linear function of $z_t$, $\mathrm{Var}(y_t|z_t)$ must be constant. This is pretty straightforward.

In Example 11.4, we saw that, for the AR(1) model in (11.12), the homoskedasticity assumption is $\mathrm{Var}(u_t|y_{t-1}) = \mathrm{Var}(y_t|y_{t-1}) = \sigma^2$; even though $\mathrm{E}(y_t|y_{t-1})$ depends on $y_{t-1}$, $\mathrm{Var}(y_t|y_{t-1})$ does not. Thus, the spread in the distribution of $y_t$ cannot depend on $y_{t-1}$.

Hopefully, the pattern is clear now. If we have the model

$y_t = \beta_0 + \beta_1 z_t + \beta_2 y_{t-1} + \beta_3 z_{t-1} + u_t,$

the homoskedasticity assumption is

$\mathrm{Var}(u_t|z_t, y_{t-1}, z_{t-1}) = \mathrm{Var}(y_t|z_t, y_{t-1}, z_{t-1}) = \sigma^2,$

so that the variance of $u_t$ cannot depend on $z_t$, $y_{t-1}$, or $z_{t-1}$ (or some other function of time).
Generally, whatever explanatory variables appear in the model, we must assume that the variance of $y_t$ given these explanatory variables is constant. If the model contains lagged $y$ or lagged explanatory variables, then we are explicitly ruling out dynamic forms of heteroskedasticity (something we study in Chapter 12). But, in a static model, we are only concerned with $\mathrm{Var}(y_t|z_t)$. In equation (11.41), no direct restrictions are placed on, say, $\mathrm{Var}(y_t|y_{t-1})$.

Summary

In this chapter, we have argued that OLS can be justified using asymptotic analysis, provided certain conditions are met. Ideally, the time series processes are stationary and weakly dependent, although stationarity is not crucial. Weak dependence is necessary for applying the standard large sample results, particularly the central limit theorem.

Processes with deterministic trends that are weakly dependent can be used directly in regression analysis, provided time trends are included in the model (as in Section 10.5). A similar statement holds for processes with seasonality.

When the time series are highly persistent (they have unit roots), we must exercise extreme caution in using them directly in regression models (unless we are convinced the CLM assumptions from Chapter 10 hold). An alternative to using the levels is to use the first differences of the variables. For most highly persistent economic time series, the first difference is weakly dependent. Using first differences changes the nature of the model, but this method is often as informative as a model in levels. When data are highly persistent, we usually have more faith in first-difference results. In Chapter 18, we will cover some recent, more advanced methods for using I(1) variables in multiple regression analysis.

When models have complete dynamics in the sense that no further lags of any variable are needed in the equation, we have seen that the errors will be serially uncorrelated. This is useful because certain models, such as autoregressive models, are assumed to have complete dynamics. In static and distributed lag models, the dynamically complete assumption is often false, which generally means the errors will be serially correlated. We will see how to address this problem in Chapter 12.

The Asymptotic Gauss-Markov Assumptions for Time Series Regression

Following is a summary of the five assumptions that we used in this chapter to perform large-sample inference for time series regressions. Recall that we introduced this new set of assumptions because the time series versions of the classical linear model assumptions are often violated, especially the strict exogeneity, no serial correlation, and normality assumptions. A key point in this chapter is that some sort of weak dependence is required to ensure that the central limit theorem applies. We only used Assumptions TS.1′ through TS.3′ for consistency (not unbiasedness) of OLS. When we add TS.4′ and TS.5′, we can use the usual confidence intervals, t statistics, and F statistics as being approximately valid in large samples. Unlike the Gauss-Markov and classical linear model assumptions, there is no historically significant name attached to Assumptions TS.1′ to TS.5′. Nevertheless, the assumptions are the analogs to the Gauss-Markov assumptions that allow us to use standard inference. As usual for large-sample analysis, we dispense with the normality assumption entirely.
Assumption TS.1′ (Linearity and Weak Dependence)

The stochastic process $\{(x_{t1}, x_{t2}, \ldots, x_{tk}, y_t): t = 1, 2, \ldots, n\}$ is stationary, weakly dependent, and follows the linear model

$y_t = \beta_0 + \beta_1 x_{t1} + \beta_2 x_{t2} + \cdots + \beta_k x_{tk} + u_t,$

where $\{u_t: t = 1, 2, \ldots, n\}$ is the sequence of errors or disturbances. Here, $n$ is the number of observations (time periods).

Assumption TS.2′ (No Perfect Collinearity)

In the sample (and therefore in the underlying time series process), no independent variable is constant nor a perfect linear combination of the others.

Assumption TS.3′ (Zero Conditional Mean)

The explanatory variables are contemporaneously exogenous, that is, $\mathrm{E}(u_t|x_{t1}, \ldots, x_{tk}) = 0$. Remember, TS.3′ is notably weaker than the strict exogeneity Assumption TS.3.

Assumption TS.4′ (Homoskedasticity)

The errors are contemporaneously homoskedastic, that is, $\mathrm{Var}(u_t|x_t) = \sigma^2$, where $x_t$ is shorthand for $(x_{t1}, x_{t2}, \ldots, x_{tk})$.

Assumption TS.5′ (No Serial Correlation)

For all $t \neq s$, $\mathrm{E}(u_t u_s|x_t, x_s) = 0$.

Key Terms

Asymptotically Uncorrelated; Autoregressive Process of Order One [AR(1)]; Contemporaneously Exogenous; Contemporaneously Homoskedastic; Covariance Stationary; Difference-Stationary Process; Dynamically Complete Model; First Difference; First Order Autocorrelation; Growth Rate; Highly Persistent; Integrated of Order One [I(1)]; Integrated of Order Zero [I(0)]; Moving Average Process of Order One [MA(1)]; Nonstationary Process; Random Walk; Random Walk with Drift; Sequentially Exogenous; Serially Uncorrelated; Stable AR(1) Process; Stationary Process; Strongly Dependent; Trend-Stationary Process; Unit Root Process; Weakly Dependent

Problems

1. Let $\{x_t: t = 1, 2, \ldots\}$ be a covariance stationary process and define $\gamma_h = \mathrm{Cov}(x_t, x_{t+h})$ for $h \geq 0$. [Therefore, $\gamma_0 = \mathrm{Var}(x_t)$.] Show that $\mathrm{Corr}(x_t, x_{t+h}) = \gamma_h/\gamma_0$.

2. Let $\{e_t: t = -1, 0, 1, \ldots\}$ be a sequence of independent, identically distributed random variables with mean zero and variance one. Define a stochastic process by

$x_t = e_t - (1/2)e_{t-1} + (1/2)e_{t-2}, \quad t = 1, 2, \ldots.$

(i) Find $\mathrm{E}(x_t)$ and $\mathrm{Var}(x_t)$. Do either of these depend on $t$?
(ii) Show that $\mathrm{Corr}(x_t, x_{t+1}) = -1/2$ and $\mathrm{Corr}(x_t, x_{t+2}) = 1/3$. (Hint: It is easiest to use the formula in Problem 1.)
(iii) What is $\mathrm{Corr}(x_t, x_{t+h})$ for $h > 2$?
(iv) Is $\{x_t\}$ an asymptotically uncorrelated process?

3. Suppose that a time series process $\{y_t\}$ is generated by $y_t = z + e_t$, for all $t = 1, 2, \ldots$, where $\{e_t\}$ is an i.i.d. sequence with mean zero and variance $\sigma_e^2$. The random variable $z$ does not change over time; it has mean zero and variance $\sigma_z^2$. Assume that each $e_t$ is uncorrelated with $z$.
(i) Find the expected value and variance of $y_t$. Do your answers depend on $t$?
(ii) Find $\mathrm{Cov}(y_t, y_{t+h})$ for any $t$ and $h$. Is $\{y_t\}$ covariance stationary?
(iii) Use parts (i) and (ii) to show that $\mathrm{Corr}(y_t, y_{t+h}) = \sigma_z^2/(\sigma_z^2 + \sigma_e^2)$ for all $t$ and $h$.
(iv) Does $y_t$ satisfy the intuitive requirement for being asymptotically uncorrelated? Explain.
4. Let $\{y_t: t = 1, 2, \ldots\}$ follow a random walk, as in (11.20), with $y_0 = 0$. Show that $\mathrm{Corr}(y_t, y_{t+h}) = \sqrt{t/(t+h)}$ for $t \geq 1$, $h > 0$.

5. For the U.S. economy, let gprice denote the monthly growth in the overall price level and let gwage be the monthly growth in hourly wages. [These are both obtained as differences of logarithms: $gprice = \Delta\log(price)$ and $gwage = \Delta\log(wage)$.] Using the monthly data in WAGEPRC, we estimate the following distributed lag model:
$$\widehat{gprice} = -.00093 + .119\,gwage + .097\,gwage_{-1} + .040\,gwage_{-2} + .038\,gwage_{-3} + .081\,gwage_{-4} + .107\,gwage_{-5} + .095\,gwage_{-6} + .104\,gwage_{-7} + .103\,gwage_{-8} + .159\,gwage_{-9} + .110\,gwage_{-10} + .103\,gwage_{-11} + .016\,gwage_{-12},$$
with standard errors (.00057), (.052), (.039), (.039), (.039), (.039), (.039), (.039), (.039), (.039), (.039), (.039), (.039), and (.052), respectively; $n = 273$, $R^2 = .317$, $\bar{R}^2 = .283$.
(i) Sketch the estimated lag distribution. At what lag is the effect of gwage on gprice largest? Which lag has the smallest coefficient?
(ii) For which lags are the $t$ statistics less than two?
(iii) What is the estimated long-run propensity? Is it much different than one? Explain what the LRP tells us in this example.
(iv) What regression would you run to obtain the standard error of the LRP directly?
(v) How would you test the joint significance of six more lags of gwage? What would be the df in the $F$ distribution? (Be careful here; you lose six more observations.)

6. Let $hy6_t$ denote the three-month holding yield (in percent) from buying a six-month T-bill at time $(t-1)$ and selling it at time $t$ (three months hence) as a three-month T-bill. Let $hy3_{t-1}$ be the three-month holding yield from buying a three-month T-bill at time $(t-1)$. At time $(t-1)$, $hy3_{t-1}$ is known, whereas $hy6_t$ is unknown because $p3_t$ (the price of three-month T-bills) is unknown at time $(t-1)$. The expectations hypothesis (EH) says that these two different three-month investments should be the same, on average. Mathematically, we can write this as a conditional expectation:
$$E(hy6_t|I_{t-1}) = hy3_{t-1},$$
where $I_{t-1}$ denotes all observable information up through time $t-1$. This suggests estimating the model
$$hy6_t = \beta_0 + \beta_1 hy3_{t-1} + u_t,$$
and testing $H_0: \beta_1 = 1$. (We can also test $H_0: \beta_0 = 0$, but we often allow for a term premium for buying assets with different maturities, so that $\beta_0 \neq 0$.)
(i) Estimating the previous equation by OLS using the data in INTQRT (spaced every three months) gives
$$\widehat{hy6} = -.058 + 1.104\,hy3_{-1},$$
with standard errors (.070) and (.039); $n = 123$, $R^2 = .866$. Do you reject $H_0: \beta_1 = 1$ against $H_1: \beta_1 \neq 1$ at the 1% significance level? Does the estimate seem practically different from one?
(ii) Another implication of the EH is that no other variables dated as $t-1$ or earlier should help explain $hy6_t$, once $hy3_{t-1}$ has been controlled for. Including one lag of the spread between six-month and three-month T-bill rates gives
$$\widehat{hy6} = -.123 + 1.053\,hy3_{-1} + .480\,(r6_{-1} - r3_{-1}),$$
with standard errors (.067), (.039), and (.109); $n = 123$, $R^2 = .885$. Now, is the coefficient on $hy3_{t-1}$ statistically different from one? Is the lagged spread term significant? According to this equation, if, at time $t-1$, $r6$ is above $r3$, should you invest in six-month or three-month T-bills?
(iii) The sample correlation between $hy3_t$ and $hy3_{t-1}$ is .914. Why might this raise some concerns with the
previous analysis?
(iv) How would you test for seasonality in the equation estimated in part (ii)?

7. A partial adjustment model is
$$y_t^* = \gamma_0 + \gamma_1 x_t + e_t$$
$$y_t - y_{t-1} = \lambda(y_t^* - y_{t-1}) + a_t,$$
where $y_t^*$ is the desired or optimal level of $y$ and $y_t$ is the actual (observed) level. For example, $y_t^*$ is the desired growth in firm inventories, and $x_t$ is growth in firm sales. The parameter $\gamma_1$ measures the effect of $x_t$ on $y_t^*$. The second equation describes how the actual $y$ adjusts depending on the relationship between the desired $y$ in time $t$ and the actual $y$ in time $t-1$. The parameter $\lambda$ measures the speed of adjustment and satisfies $0 < \lambda < 1$.
(i) Plug the first equation for $y_t^*$ into the second equation and show that we can write $y_t = \beta_0 + \beta_1 y_{t-1} + \beta_2 x_t + u_t$. In particular, find the $\beta_j$ in terms of the $\gamma_j$ and $\lambda$, and find $u_t$ in terms of $e_t$ and $a_t$. Therefore, the partial adjustment model leads to a model with a lagged dependent variable and a contemporaneous $x$.
(ii) If $E(e_t|x_t, y_{t-1}, x_{t-1}, \ldots) = E(a_t|x_t, y_{t-1}, x_{t-1}, \ldots) = 0$ and all series are weakly dependent, how would you estimate the $\beta_j$?
(iii) If $\hat{\beta}_1 = .7$ and $\hat{\beta}_2 = .2$, what are the estimates of $\gamma_1$ and $\lambda$?

8. Suppose that the equation
$$y_t = \alpha + \delta t + \beta_1 x_{t1} + \cdots + \beta_k x_{tk} + u_t$$
satisfies the sequential exogeneity assumption in equation (11.40).
(i) Suppose you difference the equation to obtain
$$\Delta y_t = \delta + \beta_1 \Delta x_{t1} + \cdots + \beta_k \Delta x_{tk} + \Delta u_t.$$
Why does applying OLS to the differenced equation not generally result in consistent estimators of the $\beta_j$?
(ii) What assumption on the explanatory variables in the original equation would ensure that OLS on the differences consistently estimates the $\beta_j$?
(iii) Let $z_{t1}, \ldots, z_{tk}$ be a set of explanatory variables dated contemporaneously with $y_t$. If we specify the static regression model $y_t = \beta_0 + \beta_1 z_{t1} + \cdots + \beta_k z_{tk} + u_t$, describe what we need to assume for $\mathbf{x}_t = \mathbf{z}_t$ to be sequentially exogenous. Do you think the assumptions are likely to hold in economic applications?
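As promised in Problem 2, here is a quick numerical sanity check of the two correlations in that problem. This sketch is an editorial addition, not part of the book's exercise set, and is no substitute for the derivation the problem asks for; it assumes only numpy.

```python
import numpy as np

rng = np.random.default_rng(1)
e = rng.standard_normal(1_000_002)

# x_t = e_t - (1/2)e_{t-1} + (1/2)e_{t-2}
x = e[2:] - 0.5 * e[1:-1] + 0.5 * e[:-2]

def corr(a, b):
    return np.corrcoef(a, b)[0, 1]

print(corr(x[:-1], x[1:]))   # approximately -1/2
print(corr(x[:-2], x[2:]))   # approximately 1/3
```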
Computer Exercises

C1. Use the data in HSEINV for this exercise.
(i) Find the first order autocorrelation in $\log(invpc)$. Now, find the autocorrelation after linearly detrending $\log(invpc)$. Do the same for $\log(price)$. Which of the two series may have a unit root?
(ii) Based on your findings in part (i), estimate the equation
$$\log(invpc_t) = \beta_0 + \beta_1 \Delta\log(price_t) + \beta_2 t + u_t$$
and report the results in standard form. Interpret the coefficient $\hat{\beta}_1$ and determine whether it is statistically significant.
(iii) Linearly detrend $\log(invpc_t)$ and use the detrended version as the dependent variable in the regression from part (ii) (see Section 10.5). What happens to $R^2$?
(iv) Now use $\Delta\log(invpc_t)$ as the dependent variable. How do your results change from part (ii)? Is the time trend still significant? Why or why not?

C2. In Example 11.7, define the growth in hourly wage and output per hour as the change in the natural log: $ghrwage = \Delta\log(hrwage)$ and $goutphr = \Delta\log(outphr)$. Consider a simple extension of the model estimated in (11.29):
$$ghrwage_t = \beta_0 + \beta_1 goutphr_t + \beta_2 goutphr_{t-1} + u_t.$$
This allows an increase in productivity growth to have both a current and lagged effect on wage growth.
(i) Estimate the equation using the data in EARNS and report the results in standard form. Is the lagged value of goutphr statistically significant?
(ii) If $\beta_1 + \beta_2 = 1$, a permanent increase in productivity growth is fully passed on in higher wage growth after one year. Test $H_0: \beta_1 + \beta_2 = 1$ against the two-sided alternative. (Remember, one way to do this is to write the equation so that $\theta = \beta_1 + \beta_2$ appears directly in the model, as in Example 10.4 from Chapter 10.)
(iii) Does $goutphr_{t-2}$ need to be in the model? Explain.

C3. (i) In Example 11.4, it may be that the expected value of the return at time $t$, given past returns, is a quadratic function of $return_{t-1}$. To check this possibility, use the data in NYSE to estimate
$$return_t = \beta_0 + \beta_1 return_{t-1} + \beta_2 return_{t-1}^2 + u_t;$$
report the results in standard form.
(ii) State and test the null hypothesis that $E(return_t|return_{t-1})$ does not depend on $return_{t-1}$. (Hint: There are two restrictions to test here.) What do you conclude?
(iii) Drop $return_{t-1}^2$ from the model, but add the interaction term $return_{t-1} \cdot return_{t-2}$. Now, test the efficient markets hypothesis.
(iv) What do you conclude about predicting weekly stock returns based on past stock returns?

C4. Use the data in PHILLIPS for this exercise, but only through 1996.
(i) In Example 11.5, we assumed that the natural rate of unemployment is constant. An alternative form of the expectations augmented Phillips curve allows the natural rate of unemployment to depend on past levels of unemployment. In the simplest case, the natural rate at time $t$ equals
$unem_{t-1}$. If we assume adaptive expectations, we obtain a Phillips curve where inflation and unemployment are in first differences:
$$\Delta inf = \beta_0 + \beta_1 \Delta unem + u.$$
Estimate this model, report the results in the usual form, and discuss the sign, size, and statistical significance of $\hat{\beta}_1$.
(ii) Which model fits the data better, (11.19) or the model from part (i)? Explain.

C5. (i) Add a linear time trend to equation (11.27). Is a time trend necessary in the first-difference equation?
(ii) Drop the time trend and add the variables ww2 and pill to (11.27) (do not difference these dummy variables). Are these variables jointly significant at the 5% level?
(iii) Add the linear time trend, ww2, and pill all to equation (11.27). What happens to the magnitude and statistical significance of the time trend as compared with that in part (i)? What about the coefficient on pill as compared with that in part (ii)?
(iv) Using the model from part (iii), estimate the LRP and obtain its standard error. Compare this to (10.19), where gfr and pe appeared in levels rather than in first differences. Would you say that the link between fertility and the value of the personal exemption is a particularly robust finding?

C6. Let $inven_t$ be the real value of inventories in the United States during year $t$, let $GDP_t$ denote real gross domestic product, and let $r3_t$ denote the (ex post) real interest rate on three-month T-bills. The ex post real interest rate is (approximately) $r3_t = i3_t - inf_t$, where $i3_t$ is the rate on three-month T-bills and $inf_t$ is the annual inflation rate [see Mankiw (1994, Section 6.4)]. The change in inventories, $cinven_t$, is the inventory investment for the year. The accelerator model of inventory investment relates cinven to cGDP, the change in GDP:
$$cinven_t = \beta_0 + \beta_1 cGDP_t + u_t,$$
where $\beta_1 > 0$. [See, for example, Mankiw (1994), Chapter 17.]
(i) Use the data in INVEN to estimate the accelerator model. Report the results in the usual form and interpret the equation. Is $\hat{\beta}_1$ statistically greater than zero?
(ii) If the real interest rate rises, then the opportunity cost of holding inventories rises, and so an increase in the real interest rate should decrease inventories. Add the real interest rate to the accelerator model and discuss the results.
(iii) Does the level of the real interest rate work better than the first difference, $cr3_t$?

C7. Use CONSUMP for this exercise. One version of the permanent income hypothesis (PIH) of consumption is that the growth in consumption is unpredictable. [Another version is that the change in consumption itself is unpredictable; see Mankiw (1994, Chapter 15) for discussion of the PIH.] Let $gc_t = \log(c_t) - \log(c_{t-1})$ be the growth in real per capita consumption (of nondurables and services). Then the PIH implies that $E(gc_t|I_{t-1}) = E(gc_t)$, where $I_{t-1}$ denotes information known at time $(t-1)$; in this case, $t$ denotes a year.
(i) Test the PIH by estimating $gc_t = \beta_0 + \beta_1 gc_{t-1} + u_t$. Clearly state the null and alternative hypotheses. What do you conclude?
(ii) To the regression in part (i), add the variables $gy_{t-1}$, $i3_{t-1}$, and $inf_{t-1}$. Are these new variables individually or jointly significant at the 5% level? (Be sure to report the appropriate p-values.)
(iii) In the regression from part (ii), what happens to the p-value for the $t$ statistic on $gc_{t-1}$? Does this mean the PIH hypothesis is now supported by the data?
(iv) In the regression from part (ii), what is the $F$ statistic and its associated p-value for joint significance of the four explanatory variables? Does your conclusion about the PIH now agree with what you found in part (i)?

C8. Use the data in PHILLIPS for this exercise.
(i) Estimate an AR(1) model for the unemployment rate. Use this equation to predict the unemployment rate for 2004. Compare this with the actual unemployment rate for 2004. (You can find this information in a recent Economic Report of the President.)
(ii) Add a lag of inflation to the AR(1) model from part (i). Is $inf_{t-1}$ statistically significant?
(iii) Use the equation from part (ii) to predict the unemployment rate for 2004. Is the result better or worse than in the model from part (i)?
(iv) Use the method from Section 6.4 to construct a 95% prediction interval for the 2004 unemployment rate. Is the 2004 unemployment rate in the interval?

C9. Use the data in TRAFFIC2 for this exercise. Computer Exercise C11 in Chapter 10 previously asked for an analysis of these data.
(i) Compute the first order autocorrelation coefficient for the variable prcfat. Are you concerned that prcfat contains a unit root? Do the same for the unemployment rate.
(ii) Estimate a multiple regression model relating the first difference of prcfat, $\Delta prcfat$, to the same variables in part (vi) of Computer Exercise C11 in Chapter 10, except you should first difference the unemployment rate, too. Then
include a linear time trend, monthly dummy variables, the weekend variable, and the two policy variables; do not difference these. Do you find any interesting results?
(iii) Comment on the following statement: "We should always first difference any time series we suspect of having a unit root before doing multiple regression, because it is the safe strategy and should give results similar to using the levels." [In answering this, you may want to do the regression from part (vi) of Computer Exercise C11 in Chapter 10, if you have not already.]

C10. Use all the data in PHILLIPS to answer this question. You should now use 56 years of data.
(i) Reestimate equation (11.19) and report the results in the usual form. Do the intercept and slope estimates change notably when you add the recent years of data?
(ii) Obtain a new estimate of the natural rate of unemployment. Compare this new estimate with that reported in Example 11.5.
(iii) Compute the first order autocorrelation for unem. In your opinion, is the root close to one?
(iv) Use cunem as the explanatory variable instead of unem. Which explanatory variable gives a higher R-squared?

C11. Okun's Law [see, for example, Mankiw (1994, Chapter 2)] implies the following relationship between the annual percentage change in real GDP, pcrgdp, and the change in the annual unemployment rate, cunem:
$$pcrgdp = 3 - 2\,cunem.$$
If the unemployment rate is stable, real GDP grows at 3% annually. For each percentage point increase in the unemployment rate, real GDP grows by two percentage points less. (This should not be interpreted in any causal sense; it is more like a statistical description.) To see if the data on the U.S. economy support Okun's Law, we specify a model that allows deviations via an error term:
$$pcrgdp_t = \beta_0 + \beta_1 cunem_t + u_t.$$
(i) Use the data in OKUN to estimate the equation. Do you get exactly 3 for the intercept and $-2$ for the slope? Did you expect to?
(ii) Find the $t$ statistic for testing $H_0: \beta_1 = -2$. Do you reject $H_0$ against the two-sided alternative at any reasonable significance level?
(iii) Find the $t$ statistic for testing $H_0: \beta_0 = 3$. Do you reject $H_0$ at the 5% level against the two-sided alternative? Is it a "strong" rejection?
(iv) Find the $F$ statistic and p-value for testing $H_0: \beta_0 = 3, \beta_1 = -2$ against the alternative that $H_0$ is false. Does the test reject at the 10% level? Overall, would you say the data reject or tend to support Okun's Law?

C12. Use the data in MINWAGE for this exercise, focusing on the wage and employment series for sector 232 (Men's and Boys' Furnishings). The variable gwage232 is the monthly growth (change in logs) in the average wage in sector 232; gemp232 is the growth in employment in sector 232; gmwage is the growth in the federal minimum wage; and gcpi is the growth in the (urban) Consumer Price Index.
(i) Find the first order autocorrelation in gwage232. Does this series appear to be weakly dependent?
(ii) Estimate the dynamic model
$$gwage232_t = \beta_0 + \beta_1 gwage232_{t-1} + \beta_2 gmwage_t + \beta_3 gcpi_t + u_t$$
by OLS. Holding fixed last month's growth in wage and the growth in the CPI, does an increase in the federal minimum wage result in a contemporaneous
increase in $gwage232_t$? Explain.
(iii) Now add the lagged growth in employment, $gemp232_{t-1}$, to the equation in part (ii). Is it statistically significant?
(iv) Compared with the model without $gwage232_{t-1}$ and $gemp232_{t-1}$, does adding the two lagged variables have much of an effect on the gmwage coefficient?
(v) Run the regression of $gmwage_t$ on $gwage232_{t-1}$ and $gemp232_{t-1}$, and report the R-squared. Comment on how the value of R-squared helps explain your answer to part (iv).

C13. Use the data in BEVERIDGE to answer this question. The data set includes monthly observations on vacancy rates and unemployment rates for the United States from December 2000 through February 2012.
(i) Find the correlation between urate and $urate_{-1}$. Would you say the correlation points more toward a unit root process or a weakly dependent process?
(ii) Repeat part (i) but with the vacancy rate, vrate.
(iii) The Beveridge Curve relates the unemployment rate to the vacancy rate, with the simplest relationship being linear:
$$urate_t = \beta_0 + \beta_1 vrate_t + u_t,$$
where $\beta_1 < 0$ is expected. Estimate $\beta_0$ and $\beta_1$ by OLS and report the results in the usual form. Do you find a negative relationship?
(iv) Explain why you cannot trust the confidence interval for $\beta_1$ reported by the OLS output in part (iii). (The tools needed to study regressions of this type are presented in Chapter 18.)
(v) If you difference urate and vrate before running the regression, how does the estimated slope coefficient compare with part (iii)? Is it statistically different from zero? (This example shows that differencing before running an OLS regression is not always a sensible strategy. But we cannot say more until Chapter 18.)

C14. Use the data in APPROVAL to answer the following questions. (See also Computer Exercise C14 in Chapter 10.)
(i) Compute the first order autocorrelations for the variables approve and lrgasprice. Do they seem close enough to unity to worry about unit roots?
(ii) Consider the model
$$approve_t = \beta_0 + \beta_1 lcpifood_t + \beta_2 lrgasprice_t + \beta_3 unemploy_t + \beta_4 sep11_t + \beta_5 iraqinvade_t + u_t,$$
where the first two variables are in logarithmic form. Given what you found in part (i), why might you hesitate to estimate this model by OLS?
(iii) Estimate the equation in part (ii) by differencing all variables (including the dummy variables). How do you interpret your estimate of $\beta_2$? Is it statistically significant? (Report the p-value.)
(iv) Interpret your estimate of $\beta_4$ and discuss its statistical significance.
(v) Add lsp500 to the model in part (ii) and estimate the equation by first differencing. Discuss what you find for the stock market variable.

Chapter 12. Serial Correlation and Heteroskedasticity in Time Series Regressions

In this chapter, we discuss the critical problem of serial correlation in the error terms of a multiple regression model. We saw in Chapter 11 that when, in an appropriate sense, the dynamics of a model have been completely specified, the errors will not be serially correlated. Thus, testing for serial correlation can be used to detect dynamic misspecification. Furthermore, static and finite distributed lag models often have serially correlated errors even if there is no underlying misspecification of the model. Therefore, it is important to know the consequences and remedies for serial
correlation for these useful classes of models.

In Section 12.1, we present the properties of OLS when the errors contain serial correlation. In Section 12.2, we demonstrate how to test for serial correlation. We cover tests that apply to models with strictly exogenous regressors and tests that are asymptotically valid with general regressors, including lagged dependent variables. Section 12.3 explains how to correct for serial correlation under the assumption of strictly exogenous explanatory variables, while Section 12.4 shows how using differenced data often eliminates serial correlation in the errors. Section 12.5 covers more recent advances on how to adjust the usual OLS standard errors and test statistics in the presence of very general serial correlation.

In Chapter 8, we discussed testing and correcting for heteroskedasticity in cross-sectional applications. In Section 12.6, we show how the methods used in the cross-sectional case can be extended to the time series case. The mechanics are essentially the same, but there are a few subtleties associated with the temporal correlation in time series observations that must be addressed. In addition, we briefly touch on the consequences of dynamic forms of heteroskedasticity.

12.1 Properties of OLS with Serially Correlated Errors

12.1a Unbiasedness and Consistency

In Chapter 10, we proved unbiasedness of the OLS estimator under the first three Gauss-Markov assumptions for time series regressions (TS.1 through TS.3). In particular, Theorem 10.1 assumed nothing about serial correlation in the errors. It follows that, as long as the explanatory variables are strictly exogenous, the $\hat{\beta}_j$ are unbiased, regardless of the degree of serial correlation in the errors. This is analogous to the observation that heteroskedasticity in the errors does not cause bias in the $\hat{\beta}_j$.

In Chapter 11, we relaxed the strict exogeneity assumption to $E(u_t|\mathbf{x}_t) = 0$ and showed that, when the data are weakly dependent, the $\hat{\beta}_j$ are still consistent (although not necessarily unbiased). This result did not hinge on any assumption about serial correlation in the errors.

12.1b Efficiency and Inference

Because the Gauss-Markov Theorem (Theorem 10.4) requires both homoskedasticity and serially uncorrelated errors, OLS is no longer BLUE in the presence of serial correlation. Even more importantly, the usual OLS standard errors and test statistics are not valid, even asymptotically. We can see this by computing the variance of the OLS estimator under the first four Gauss-Markov assumptions and the AR(1) serial correlation model for the error terms. More precisely, we assume that

$$u_t = \rho u_{t-1} + e_t, \quad t = 1, 2, \ldots, n \qquad (12.1)$$
$$|\rho| < 1, \qquad (12.2)$$

where the $e_t$ are uncorrelated random variables with mean zero and variance $\sigma_e^2$; recall from Chapter 11 that assumption (12.2) is the stability condition.

We consider the variance of the OLS slope estimator in the simple regression model

$$y_t = \beta_0 + \beta_1 x_t + u_t,$$

and, just to simplify the formula, we assume that the sample
average of the $x_t$ is zero ($\bar{x} = 0$). Then, the OLS estimator $\hat{\beta}_1$ of $\beta_1$ can be written as

$$\hat{\beta}_1 = \beta_1 + \mathrm{SST}_x^{-1} \sum_{t=1}^n x_t u_t, \qquad (12.3)$$

where $\mathrm{SST}_x = \sum_{t=1}^n x_t^2$. Now, in computing the variance of $\hat{\beta}_1$ (conditional on $X$), we must account for the serial correlation in the $u_t$:

$$\mathrm{Var}(\hat{\beta}_1) = \mathrm{SST}_x^{-2}\,\mathrm{Var}\left(\sum_{t=1}^n x_t u_t\right) = \mathrm{SST}_x^{-2}\left[\sum_{t=1}^n x_t^2 \mathrm{Var}(u_t) + 2\sum_{t=1}^{n-1}\sum_{j=1}^{n-t} x_t x_{t+j} E(u_t u_{t+j})\right]$$
$$= \sigma^2/\mathrm{SST}_x + 2(\sigma^2/\mathrm{SST}_x^2)\sum_{t=1}^{n-1}\sum_{j=1}^{n-t} \rho^j x_t x_{t+j}, \qquad (12.4)$$

where $\sigma^2 = \mathrm{Var}(u_t)$ and we have used the fact that $E(u_t u_{t+j}) = \mathrm{Cov}(u_t, u_{t+j}) = \rho^j \sigma^2$ [see equation (11.4)]. The first term in equation (12.4), $\sigma^2/\mathrm{SST}_x$, is the variance of $\hat{\beta}_1$ when $\rho = 0$, which is the familiar OLS variance under the Gauss-Markov assumptions. If we ignore the serial correlation and estimate the variance in the usual way, the variance estimator will usually be biased when $\rho \neq 0$ because it ignores the second term in (12.4). As we will see through later examples, $\rho > 0$ is most common, in which case $\rho^j > 0$ for all $j$. Further, the independent variables in regression models are often positively correlated over time, so that $x_t x_{t+j}$ is positive for most pairs $t$ and $t+j$. Therefore, in most economic applications, the term $\sum_{t=1}^{n-1}\sum_{j=1}^{n-t} \rho^j x_t x_{t+j}$ is positive, and so the usual OLS variance formula $\sigma^2/\mathrm{SST}_x$ understates the true variance of the OLS estimator. If $\rho$ is large or $x_t$ has a high degree of positive serial correlation (a common case), the bias in the usual OLS variance estimator can be substantial. We will tend to think the OLS slope estimator is more precise than it actually is.

When $\rho < 0$, $\rho^j$ is negative when $j$ is odd and positive when $j$ is even, and so it is difficult to determine the sign of $\sum_{t=1}^{n-1}\sum_{j=1}^{n-t} \rho^j x_t x_{t+j}$. In fact, it is possible that the usual OLS variance formula actually overstates the true variance of $\hat{\beta}_1$. In either case, the usual variance estimator will be biased for $\mathrm{Var}(\hat{\beta}_1)$ in the presence of serial correlation.

Because the standard error of $\hat{\beta}_1$ is an estimate of the standard deviation of $\hat{\beta}_1$, using the usual OLS standard error in the presence of serial correlation is invalid. Therefore, $t$ statistics are no longer valid for testing single hypotheses. Since a smaller standard error means a larger $t$ statistic, the usual $t$ statistics will often be too large when $\rho > 0$. The usual $F$ and LM statistics for testing multiple hypotheses are also invalid.

Exploring Further 12.1: Suppose that, rather than the AR(1) model, $u_t$ follows the MA(1) model $u_t = e_t + \alpha e_{t-1}$. Find $\mathrm{Var}(\hat{\beta}_1)$ and show that it is different from the usual formula if $\alpha \neq 0$.
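The understatement described by equation (12.4) can be seen in a small Monte Carlo. The sketch below is an editorial illustration, not the text's own code; the sample size, $\rho = .8$, intercept and slope values, and seed are all arbitrary choices, and statsmodels is used only for the OLS fits.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
n, reps, rho = 100, 2000, 0.8

def ar1_series(n, rho, rng):
    # Generate an AR(1) series with innovation variance one.
    e = rng.standard_normal(n)
    x = np.zeros(n)
    for t in range(1, n):
        x[t] = rho * x[t - 1] + e[t]
    return x

x = ar1_series(n, rho, rng)        # positively autocorrelated regressor, held fixed
X = sm.add_constant(x)

b1, usual_se = [], []
for _ in range(reps):
    u = ar1_series(n, rho, rng)    # AR(1) errors
    res = sm.OLS(1.0 + 2.0 * x + u, X).fit()
    b1.append(res.params[1])
    usual_se.append(res.bse[1])

print(np.std(b1))                  # actual sampling sd of the slope estimator
print(np.mean(usual_se))           # average usual OLS standard error: markedly smaller
```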
12.1c Goodness of Fit

Sometimes one sees the claim that serial correlation in the errors of a time series regression model invalidates our usual goodness-of-fit measures, R-squared and adjusted R-squared. Fortunately, this is not the case, provided the data are stationary and weakly dependent. To see why these measures are still valid, recall that we defined the population R-squared in a cross-sectional context to be $1 - \sigma_u^2/\sigma_y^2$ (see Section 6.3). This definition is still appropriate in the context of time series regressions with stationary, weakly dependent data: the variances of both the error and the dependent variable do not change over time. By the law of large numbers, $R^2$ and $\bar{R}^2$ both consistently estimate the population R-squared. The argument is essentially the same as in the cross-sectional case in the presence of heteroskedasticity (see Section 8.1). Because there is never an unbiased estimator of the population R-squared, it makes no sense to talk about bias in $R^2$ caused by serial correlation. All we can really say is that our goodness-of-fit measures are still consistent estimators of the population parameter. This argument does not go through if $\{y_t\}$ is an I(1) process because $\mathrm{Var}(y_t)$ grows with $t$; goodness of fit does not make much sense in this case. As we discussed in Section 10.5, trends in the mean of $y_t$, or seasonality, can and should be accounted for in computing an R-squared. Other departures from stationarity do not cause difficulty in interpreting $R^2$ and $\bar{R}^2$ in the usual ways.

12.1d Serial Correlation in the Presence of Lagged Dependent Variables

Beginners in econometrics are often warned of the dangers of serially correlated errors in the presence of lagged dependent variables. Almost every textbook on econometrics contains some form of the statement "OLS is inconsistent in the presence of lagged dependent variables and serially correlated errors." Unfortunately, as a general assertion, this statement is false. There is a version of the statement that is correct, but it is important to be very precise. To illustrate, suppose that the expected value of $y_t$ given $y_{t-1}$ is linear:

$$E(y_t|y_{t-1}) = \beta_0 + \beta_1 y_{t-1}, \qquad (12.5)$$
estimation of 126 when the errors ut also follow an AR1 model leads to inconsistent estimators However the correctness of this statement makes it no less wrongheaded We have to ask What would be the point in estimating the parameters in 126 when the errors fol low an AR1 model It is difficult to think of cases where this would be interesting At least in 125 the parameters tell us the expected value of yt given yt21 When we combine 126 and 121 we see that yt really follows a second order autoregressive model or AR2 model To see this write ut21 5 yt21 2 b0 2 b1yt22 and plug this into ut 5 rut21 1 et Then 126 can be rewritten as yt 5 b0 1 b1yt21 1 r1yt21 2 b0 2 b1yt222 1 et 5 b011 2 r2 1 1b1 1 r2yt21 2 rb1yt22 1 et 5 a0 1 a1yt21 1 a2yt22 1 et where a0 5 b011 2 r2 a1 5 b1 1 r and a2 5 2rb1 Given 128 it follows that E1yt0yt21 yt22 p2 5 E1yt0yt21 yt222 5 a0 1 a1yt21 1 a2yt22 129 This means that the expected value of yt given all past y depends on two lags of y It is equation 129 that we would be interested in using for any practical purpose including forecasting as we will see in Chapter 18 We are especially interested in the parameters aj Under the appropriate stability conditions for an AR2 modelwhich we will cover in Section 123OLS estimation of 129 pro duces consistent and asymptotically normal estimators of the aj The bottom line is that you need a good reason for having both a lagged dependent variable in a model and a particular model of serial correlation in the errors Often serial correlation in the errors of a dynamic model simply indicates that the dynamic regression function has not been completely specified in the previous example we should add yt22 to the equation In Chapter 18 we will see examples of models with lagged dependent variables where the errors are serially correlated and are also correlated with yt21 But even in these cases the errors do not follow an autoregressive process Copyright 2016 Cengage Learning All Rights Reserved May not be copied scanned or duplicated in whole or in part Due to electronic rights some third party content may be suppressed from the eBook andor eChapters Editorial review has deemed that any suppressed content does not materially affect the overall learning experience Cengage Learning reserves the right to remove additional content at any time if subsequent rights restrictions require it PART 2 Regression Analysis with Time Series Data 376 122 Testing for Serial Correlation In this section we discuss several methods of testing for serial correlation in the error terms in the multiple linear regression model yt 5 b0 1 b1xt1 1 p 1 bkxtk 1 ut We first consider the case when the regressors are strictly exogenous Recall that this requires the error ut to be uncorrelated with the regressors in all time periods see Section 103 so among other things it rules out models with lagged dependent variables 122a A t Test for AR1 Serial Correlation with Strictly Exogenous Regressors Although there are numerous ways in which the error terms in a multiple regression model can be serially correlated the most popular modeland the simplest to work withis the AR1 model in equations 121 and 122 In the previous section we explained the implications of performing OLS when the errors are serially correlated in general and we derived the variance of the OLS slope estimator in a simple regression model with AR1 errors We now show how to test for the presence of AR1 serial correlation The null hypothesis is that there is no serial correlation Therefore just as with tests 
12.2 Testing for Serial Correlation

In this section, we discuss several methods of testing for serial correlation in the error terms in the multiple linear regression model

$$y_t = \beta_0 + \beta_1 x_{t1} + \cdots + \beta_k x_{tk} + u_t.$$

We first consider the case when the regressors are strictly exogenous. Recall that this requires the error, $u_t$, to be uncorrelated with the regressors in all time periods (see Section 10.3), so, among other things, it rules out models with lagged dependent variables.

12.2a A t Test for AR(1) Serial Correlation with Strictly Exogenous Regressors

Although there are numerous ways in which the error terms in a multiple regression model can be serially correlated, the most popular model, and the simplest to work with, is the AR(1) model in equations (12.1) and (12.2). In the previous section, we explained the implications of performing OLS when the errors are serially correlated in general, and we derived the variance of the OLS slope estimator in a simple regression model with AR(1) errors. We now show how to test for the presence of AR(1) serial correlation. The null hypothesis is that there is no serial correlation. Therefore, just as with tests for heteroskedasticity, we assume the best and require the data to provide reasonably strong evidence that the ideal assumption of no serial correlation is violated.

We first derive a large-sample test under the assumption that the explanatory variables are strictly exogenous: the expected value of $u_t$, given the entire history of independent variables, is zero. In addition, in (12.1), we must assume that

$$E(e_t|u_{t-1}, u_{t-2}, \ldots) = 0 \qquad (12.10)$$

and

$$\mathrm{Var}(e_t|u_{t-1}) = \mathrm{Var}(e_t) = \sigma_e^2. \qquad (12.11)$$

These are standard assumptions in the AR(1) model (which follow when $\{e_t\}$ is an i.i.d. sequence), and they allow us to apply the large-sample results from Chapter 11 for dynamic regression.

As with testing for heteroskedasticity, the null hypothesis is that the appropriate Gauss-Markov assumption is true. In the AR(1) model, the null hypothesis that the errors are serially uncorrelated is

$$H_0: \rho = 0. \qquad (12.12)$$

How can we test this hypothesis? If the $u_t$ were observed, then, under (12.10) and (12.11), we could immediately apply the asymptotic normality results from Theorem 11.2 to the dynamic regression model

$$u_t = \rho u_{t-1} + e_t, \quad t = 2, \ldots, n. \qquad (12.13)$$

(Under the null hypothesis $\rho = 0$, $\{u_t\}$ is clearly weakly dependent.) In other words, we could estimate $\rho$ from the regression of $u_t$ on $u_{t-1}$, for all $t = 2, \ldots, n$, without an intercept, and use the usual $t$ statistic for $\hat{\rho}$. This does not work because the errors $u_t$ are not observed. Nevertheless, just as with testing for heteroskedasticity, we can replace $u_t$ with the corresponding OLS residual, $\hat{u}_t$. Since $\hat{u}_t$ depends on the OLS estimators $\hat{\beta}_0, \hat{\beta}_1, \ldots, \hat{\beta}_k$, it is not obvious that using $\hat{u}_t$ for $u_t$ in the regression has no effect on the distribution of the $t$ statistic. Fortunately, it turns out that, because of the strict exogeneity assumption, the large-sample distribution of the $t$ statistic is not affected by using the OLS residuals in place of the errors. [A proof is well beyond the scope of this text, but it follows from the work of Wooldridge (1991b).]

We can summarize the asymptotic test for AR(1) serial correlation very simply.

Testing for AR(1) Serial Correlation with Strictly Exogenous Regressors:
(i) Run the OLS regression of $y_t$ on $x_{t1}, \ldots, x_{tk}$ and obtain the OLS residuals, $\hat{u}_t$, for all $t = 1, 2, \ldots, n$.
(ii) Run the regression of
$$\hat{u}_t \text{ on } \hat{u}_{t-1}, \text{ for all } t = 2, \ldots, n, \qquad (12.14)$$
obtaining the coefficient $\hat{\rho}$ on $\hat{u}_{t-1}$ and its $t$ statistic, $t_{\hat{\rho}}$. (This regression may or may not contain an intercept; the $t$ statistic for $\hat{\rho}$ will be slightly affected, but it is asymptotically valid either way.)
(iii) Use $t_{\hat{\rho}}$ to test $H_0: \rho = 0$ against $H_1: \rho \neq 0$ in the usual way. (Actually, since $\rho > 0$ is often expected a priori, the alternative can be $H_1: \rho > 0$.) Typically, we conclude that serial correlation is a problem to be dealt with only if $H_0$ is rejected at the 5% level. As always, it is best to report the p-value for the test.

In deciding whether serial correlation needs to be addressed, we should remember the difference between practical and statistical significance. With a large sample size, it is possible to find serial correlation even though $\hat{\rho}$ is practically small; when $\hat{\rho}$ is close to zero, the usual OLS inference procedures will not be far off [see equation (12.4)]. Such outcomes are somewhat rare in time series applications because time series data sets are usually small.
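The two-step procedure translates directly into code. The helper below is an editorial sketch (the function name and the use of statsmodels are our choices, not the text's); it assumes the exogenous array X already contains a constant column. For the heteroskedasticity-robust version discussed shortly, fitting the auxiliary regression with `fit(cov_type="HC1")` replaces the usual t statistic with a robust one.

```python
import statsmodels.api as sm

def ar1_ttest(y, X):
    """t test for AR(1) serial correlation via regression (12.14).

    y and X are numpy arrays, with X including a constant.
    Returns (rho_hat, t_stat, p_value).
    """
    u = sm.OLS(y, X).fit().resid                         # step (i): OLS residuals
    aux = sm.OLS(u[1:], sm.add_constant(u[:-1])).fit()   # step (ii), with intercept
    return aux.params[1], aux.tvalues[1], aux.pvalues[1]
```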
Example 12.1. Testing for AR(1) Serial Correlation in the Phillips Curve

In Chapter 10, we estimated a static Phillips curve that explained the inflation-unemployment tradeoff in the United States (see Example 10.1). In Chapter 11, we studied a particular expectations augmented Phillips curve, where we assumed adaptive expectations (see Example 11.5). We now test the error term in each equation for serial correlation. Since the expectations augmented curve uses $\Delta inf_t = inf_t - inf_{t-1}$ as the dependent variable, we have one fewer observation.

For the static Phillips curve, the regression in (12.14) yields $\hat{\rho} = .573$, $t = 4.93$, and p-value $= .000$ (with 48 observations through 1996). This is very strong evidence of positive, first order serial correlation. One consequence of this is that the standard errors and $t$ statistics from Chapter 10 are not valid. By contrast, the test for AR(1) serial correlation in the expectations augmented curve gives $\hat{\rho} = -.036$, $t = -.287$, and p-value $= .775$ (with 47 observations): there is no evidence of AR(1) serial correlation in the expectations augmented Phillips curve.

Although the test from (12.14) is derived from the AR(1) model, the test can detect other kinds of serial correlation. Remember, $\hat{\rho}$ is a consistent estimator of the correlation between $u_t$ and $u_{t-1}$. Any serial correlation that causes adjacent errors to be correlated can be picked up by this test. On the other hand, it does not detect serial correlation where adjacent errors are uncorrelated, $\mathrm{Corr}(u_t, u_{t-1}) = 0$. (For example, $u_t$ and $u_{t-2}$ could be correlated.)

In using the usual $t$ statistic from (12.14), we must assume that the errors in (12.13) satisfy the appropriate homoskedasticity assumption, (12.11). In fact, it is easy to make the test robust to heteroskedasticity in $e_t$: we simply use the usual, heteroskedasticity-robust $t$ statistic from Chapter 8. For the static Phillips curve in Example 12.1, the heteroskedasticity-robust $t$ statistic is 4.03, which is smaller than the nonrobust $t$ statistic but still very significant. In Section 12.6, we further discuss heteroskedasticity in time series regressions, including its dynamic forms.

Exploring Further 12.2: How would you use regression (12.14) to construct an approximate 95% confidence interval for $\rho$?
something that requires the full set of classical linear model assumptions including normality of the error terms Unfortunately this distribution depends on the values of the independent variables It also depends on the sample size the number of regressors and whether the regression contains an intercept Although some econometrics packages tabulate critical values and pvalues for DW many do not In any case they depend on the full set of CLM assumptions Several econometrics texts report upper and lower bounds for the critical values that depend on the desired significance level the alternative hypothesis the number of observations and the number of regressors We assume that an intercept is included in the model Usually the DW test is com puted for the alternative H1 r 0 1217 From the approximation in 1216 r 0 implies that DW 2 and r 5 0 implies that DW 2 Thus to reject the null hypothesis 1212 in favor of 1217 we are looking for a value of DW that is significantly less than two Unfortunately because of the problems in obtaining the null distribution of DW we must compare DW with two sets of critical values These are usually labeled as dU for upper and dL for lower If DW dL then we reject H0 in favor of 1217 if DW dU we fail to reject H0 If dL DW dU the test is inconclusive As an example if we choose a 5 significance level with n 5 45 and k 5 4 dU 5 1720 and dL 5 1336 see Savin and White 1977 If DW 1336 we reject the null of no serial correlation at the 5 level if DW 172 we fail to reject H0 if 1336 DW 172 the test is inconclusive In Example 121 for the static Phillips curve DW is computed to be DW 5 80 We can obtain the lower 1 critical value from Savin and White 1977 for k 5 1 and n 5 50 dL 5 132 Therefore we reject the null of no serial correlation against the alternative of positive serial correlation at the 1 level Using the previous t test we can conclude that the pvalue equals zero to three decimal places For the expectations augmented Phillips curve DW 5 177 which is well within the failtoreject region at even the 5 level dU 5 159 The fact that an exact sampling distribution for DW can be tabulated is the only advantage that DW has over the t test from 1214 Given that the tabulated critical values are exactly valid only under the full set of CLM assumptions and that they can lead to a wide inconclusive region the Copyright 2016 Cengage Learning All Rights Reserved May not be copied scanned or duplicated in whole or in part Due to electronic rights some third party content may be suppressed from the eBook andor eChapters Editorial review has deemed that any suppressed content does not materially affect the overall learning experience Cengage Learning reserves the right to remove additional content at any time if subsequent rights restrictions require it CHAPTER 12 Serial Correlation and Heteroskedasticity in Time Series Regressions 379 practical disadvantages of the DW statistic are substantial The t statistic from 1214 is simple to compute and asymptotically valid without normally distributed errors The t statistic is also valid in the presence of heteroskedasticity that depends on the xtj Plus it is easy to make it robust to any form of heteroskedasticity 122c Testing for AR1 Serial Correlation without Strictly Exogenous Regressors When the explanatory variables are not strictly exogenous so that one or more xtj are correlated with ut21 neither the t test from regression 1214 nor the DurbinWatson statistic are valid even in large samples The leading case of nonstrictly exogenous 
12.2c Testing for AR(1) Serial Correlation without Strictly Exogenous Regressors

When the explanatory variables are not strictly exogenous, so that one or more $x_{tj}$ are correlated with $u_{t-1}$, neither the $t$ test from regression (12.14) nor the Durbin-Watson statistic are valid, even in large samples. The leading case of nonstrictly exogenous regressors occurs when the model contains a lagged dependent variable: $y_{t-1}$ and $u_{t-1}$ are obviously correlated. Durbin (1970) suggested two alternatives to the DW statistic when the model contains a lagged dependent variable and the other regressors are nonrandom (or, more generally, strictly exogenous). The first is called Durbin's h statistic. This statistic has a practical drawback in that it cannot always be computed, so we do not cover it here.

Durbin's alternative statistic is simple to compute and is valid when there are any number of nonstrictly exogenous explanatory variables. The test also works if the explanatory variables happen to be strictly exogenous.

Testing for Serial Correlation with General Regressors:
(i) Run the OLS regression of $y_t$ on $x_{t1}, \ldots, x_{tk}$ and obtain the OLS residuals, $\hat{u}_t$, for all $t = 1, 2, \ldots, n$.
(ii) Run the regression of
$$\hat{u}_t \text{ on } x_{t1}, x_{t2}, \ldots, x_{tk}, \hat{u}_{t-1}, \text{ for all } t = 2, \ldots, n, \qquad (12.18)$$
to obtain the coefficient $\hat{\rho}$ on $\hat{u}_{t-1}$ and its $t$ statistic, $t_{\hat{\rho}}$.
(iii) Use $t_{\hat{\rho}}$ to test $H_0: \rho = 0$ against $H_1: \rho \neq 0$ in the usual way (or use a one-sided alternative).

In equation (12.18), we regress the OLS residuals on all independent variables, including an intercept, and the lagged residual. The $t$ statistic on the lagged residual is a valid test of (12.12) in the AR(1) model (12.13) [when we add $\mathrm{Var}(u_t|\mathbf{x}_t, u_{t-1}) = \sigma^2$ under $H_0$]. Any number of lagged dependent variables may appear among the $x_{tj}$, and other nonstrictly exogenous explanatory variables are allowed as well. The inclusion of $x_{t1}, \ldots, x_{tk}$ explicitly allows for each $x_{tj}$ to be correlated with $u_{t-1}$, and this ensures that $t_{\hat{\rho}}$ has an approximate $t$ distribution in large samples. [The $t$ statistic from (12.14) ignores possible correlation between $x_{tj}$ and $u_{t-1}$, so it is not valid without strictly exogenous regressors.] Incidentally, because $\hat{u}_t = y_t - \hat{\beta}_0 - \hat{\beta}_1 x_{t1} - \cdots - \hat{\beta}_k x_{tk}$, it can be shown that the $t$ statistic on $\hat{u}_{t-1}$ is the same if $y_t$ is used in place of $\hat{u}_t$ as the dependent variable in (12.18).

The $t$ statistic from (12.18) is easily made robust to heteroskedasticity of unknown form [in particular, when $\mathrm{Var}(u_t|\mathbf{x}_t, u_{t-1})$ is not constant]: just use the heteroskedasticity-robust $t$ statistic on $\hat{u}_{t-1}$.
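Durbin's alternative test differs from the earlier sketch only in that the auxiliary regression includes the original regressors. A hedged editorial implementation, assuming numpy arrays with X including a constant column:

```python
import numpy as np
import statsmodels.api as sm

def ar1_test_general(y, X):
    """Durbin's alternative test via regression (12.18).

    Valid even when X contains lagged dependent variables.
    Returns (rho_hat, t_stat, p_value).
    """
    u = sm.OLS(y, X).fit().resid
    Z = np.column_stack((X[1:], u[:-1]))   # x_t1, ..., x_tk (incl. constant), u_{t-1}
    aux = sm.OLS(u[1:], Z).fit()           # use fit(cov_type="HC1") for a robust t
    return aux.params[-1], aux.tvalues[-1], aux.pvalues[-1]
```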
Example 12.2. Testing for AR(1) Serial Correlation in the Minimum Wage Equation

In Chapter 10 (see Example 10.9), we estimated the effect of the minimum wage on the Puerto Rican employment rate. We now check whether the errors appear to contain serial correlation, using the test that does not assume strict exogeneity of the minimum wage or GNP variables. We add the log of Puerto Rican real GNP to equation (10.38), as in Computer Exercise C3 in Chapter 10. We are assuming that the underlying stochastic processes are weakly dependent, but we allow them to contain a linear time trend (by including $t$ in the regression).

Letting $\hat{u}_t$ denote the OLS residuals, we run the regression of $\hat{u}_t$ on $\log(mincov_t)$, $\log(prgnp_t)$, $\log(usgnp_t)$, $t$, and $\hat{u}_{t-1}$, using the 37 available observations. The estimated coefficient on $\hat{u}_{t-1}$ is $\hat{\rho} = .481$ with $t = 2.89$ (two-sided p-value $= .007$). Therefore, there is strong evidence of AR(1) serial correlation in the errors, which means the $t$ statistics for the $\hat{\beta}_j$ that we obtained before are not valid for inference. Remember, though, the $\hat{\beta}_j$ are still consistent if $u_t$ is contemporaneously uncorrelated with each explanatory variable. Incidentally, if we use regression (12.14) instead, we obtain $\hat{\rho} = .417$ and $t = 2.63$, so the outcome of the test is similar in this case.

12.2d Testing for Higher Order Serial Correlation

The test from (12.18) is easily extended to higher orders of serial correlation. For example, suppose that we wish to test

$$H_0: \rho_1 = 0, \rho_2 = 0 \qquad (12.19)$$

in the AR(2) model

$$u_t = \rho_1 u_{t-1} + \rho_2 u_{t-2} + e_t.$$

This alternative model of serial correlation allows us to test for second order serial correlation. As always, we estimate the model by OLS and obtain the OLS residuals, $\hat{u}_t$. Then, we can run the regression of

$$\hat{u}_t \text{ on } x_{t1}, x_{t2}, \ldots, x_{tk}, \hat{u}_{t-1}, \text{ and } \hat{u}_{t-2}, \text{ for all } t = 3, \ldots, n,$$

to obtain the $F$ test for joint significance of $\hat{u}_{t-1}$ and $\hat{u}_{t-2}$. If these two lags are jointly significant at a small enough level, say, 5%, then we reject (12.19) and conclude that the errors are serially correlated.

More generally, we can test for serial correlation in the autoregressive model of order $q$:

$$u_t = \rho_1 u_{t-1} + \rho_2 u_{t-2} + \cdots + \rho_q u_{t-q} + e_t. \qquad (12.20)$$

The null hypothesis is

$$H_0: \rho_1 = 0, \rho_2 = 0, \ldots, \rho_q = 0. \qquad (12.21)$$

Testing for AR(q) Serial Correlation:
(i) Run the OLS regression of $y_t$ on $x_{t1}, \ldots, x_{tk}$ and obtain the OLS residuals, $\hat{u}_t$, for all $t = 1, 2, \ldots, n$.
(ii) Run the regression of
$$\hat{u}_t \text{ on } x_{t1}, x_{t2}, \ldots, x_{tk}, \hat{u}_{t-1}, \hat{u}_{t-2}, \ldots, \hat{u}_{t-q}, \text{ for all } t = (q+1), \ldots, n. \qquad (12.22)$$
(iii) Compute the $F$ test for joint significance of $\hat{u}_{t-1}, \hat{u}_{t-2}, \ldots, \hat{u}_{t-q}$ in (12.22). [The $F$ statistic with $y_t$ as the dependent variable in (12.22) can also be used, as it gives an identical answer.]

If the $x_{tj}$ are assumed to be strictly exogenous, so that each $x_{tj}$ is uncorrelated with $u_{t-1}, u_{t-2}, \ldots, u_{t-q}$, then the $x_{tj}$ can be omitted from (12.22). Including the $x_{tj}$ in the regression makes the test valid with or without the strict exogeneity assumption. The test requires the homoskedasticity assumption

$$\mathrm{Var}(u_t|\mathbf{x}_t, u_{t-1}, \ldots, u_{t-q}) = \sigma^2. \qquad (12.23)$$

A heteroskedasticity-robust version can be computed as described in Chapter 8.

An alternative to computing the $F$ test is to use the Lagrange multiplier (LM) form of the statistic. (We covered the LM statistic for testing exclusion restrictions in Chapter 5 for cross-sectional analysis.) The LM statistic for testing (12.21) is simply

$$LM = (n - q)R_{\hat{u}}^2, \qquad (12.24)$$

where $R_{\hat{u}}^2$ is just the usual R-squared from regression (12.22). Under the null hypothesis, $LM \overset{a}{\sim} \chi_q^2$. This is usually called the Breusch-Godfrey test for AR(q) serial correlation. The LM statistic also requires (12.23), but it can be made robust to heteroskedasticity. [For details, see Wooldridge (1991b).]
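The AR(q) test is a direct extension of the earlier sketches. The code below computes the F form via the auxiliary regression (12.22) and the LM form (12.24); it is an editorial illustration, with X again assumed to include a constant. Packaged versions of the Breusch-Godfrey test also exist; statsmodels, for example, provides `acorr_breusch_godfrey`, though implementations may use $nR_{\hat{u}}^2$ rather than $(n-q)R_{\hat{u}}^2$, a negligible difference in large samples.

```python
import numpy as np
import statsmodels.api as sm
from scipy import stats

def arq_test(y, X, q):
    """F and LM tests for AR(q) serial correlation via regression (12.22).

    Returns (F_stat, F_pval, LM_stat, LM_pval).
    """
    u = sm.OLS(y, X).fit().resid
    n = len(u)
    lags = np.column_stack([u[q - j:n - j] for j in range(1, q + 1)])
    Z = np.column_stack((X[q:], lags))
    aux = sm.OLS(u[q:], Z).fit()

    # F test for joint significance of the q lagged residuals.
    R = np.zeros((q, Z.shape[1]))
    R[:, X.shape[1]:] = np.eye(q)
    ftest = aux.f_test(R)

    # LM form, as in (12.24): (n - q) times the auxiliary R-squared.
    lm = (n - q) * aux.rsquared
    return float(ftest.fvalue), float(ftest.pvalue), lm, stats.chi2.sf(lm, q)
```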
Example 12.3. Testing for AR(3) Serial Correlation

In the event study of the barium chloride industry (see Example 10.5), we used monthly data, so we may wish to test for higher orders of serial correlation. For illustration purposes, we test for AR(3) serial correlation in the errors underlying equation (10.22). Using regression (12.22), we find the $F$ statistic for joint significance of $\hat{u}_{t-1}$, $\hat{u}_{t-2}$, and $\hat{u}_{t-3}$ is $F = 5.12$. Originally, we had $n = 131$, and we lose three observations in the auxiliary regression (12.22). Because we estimate 10 parameters in (12.22) for this example, the df in the $F$ statistic are 3 and 118. The p-value of the $F$ statistic is .0023, so there is strong evidence of AR(3) serial correlation.

With quarterly or monthly data that have not been seasonally adjusted, we sometimes wish to test for seasonal forms of serial correlation. For example, with quarterly data, we might postulate the autoregressive model

$$u_t = \rho_4 u_{t-4} + e_t. \qquad (12.25)$$

From the AR(1) serial correlation tests, it is pretty clear how to proceed. When the regressors are strictly exogenous, we can use a $t$ test on $\hat{u}_{t-4}$ in the regression of $\hat{u}_t$ on $\hat{u}_{t-4}$, for all $t = 5, \ldots, n$. A modification of the Durbin-Watson statistic is also available [see Wallis (1972)]. When the $x_{tj}$ are not strictly exogenous, we can use the regression in (12.18), with $\hat{u}_{t-4}$ replacing $\hat{u}_{t-1}$.

In Example 12.3, the data are monthly and are not seasonally adjusted. Therefore, it makes sense to test for correlation between $u_t$ and $u_{t-12}$. A regression of $\hat{u}_t$ on $\hat{u}_{t-12}$ yields $\hat{\rho}_{12} = -.187$ and p-value $= .028$, so there is evidence of negative seasonal autocorrelation. (Including the regressors changes things only modestly: $\hat{\rho}_{12} = -.170$ and p-value $= .052$.) This is somewhat unusual and does not have an obvious explanation.

12.3 Correcting for Serial Correlation with Strictly Exogenous Regressors

If we detect serial correlation after applying one of the tests in Section 12.2, we have to do something about it. If our goal is to estimate a model with complete dynamics, we need to respecify the model. In applications where our goal is not to estimate a fully dynamic model, we need to find a way to carry out statistical inference: as we saw in Section 12.1, the usual OLS test statistics are no longer valid. In this section, we begin with the important case of AR(1) serial correlation. The traditional approach to this problem assumes fixed regressors. What are actually needed are strictly exogenous regressors. Therefore, at a minimum, we should not use these corrections when the explanatory variables include lagged dependent variables.

Exploring Further 12.3: Suppose you have quarterly data and you want to test for the presence of first order or fourth order serial correlation. With strictly exogenous regressors, how would you proceed?

12.3a Obtaining the Best Linear Unbiased Estimator in the AR(1) Model

We assume the Gauss-Markov assumptions TS.1 through TS.4, but we relax Assumption TS.5. In particular, we assume that the errors follow the AR(1) model

$$u_t = \rho u_{t-1} + e_t, \text{ for all } t = 1, 2, \ldots. \qquad (12.26)$$

Remember that Assumption TS.3 implies that $u_t$ has a zero mean conditional on $X$. In the following analysis, we let the conditioning on $X$ be implied in order to simplify the notation. Thus, we write the variance of $u_t$ as

$$\mathrm{Var}(u_t) = \sigma_e^2/(1 - \rho^2). \qquad (12.27)$$

For simplicity, consider the case with a single explanatory variable:

$$y_t = \beta_0 + \beta_1 x_t + u_t, \text{ for all } t = 1, 2, \ldots, n.$$

Because the problem in this equation is serial correlation in the $u_t$, it makes sense to transform the equation to eliminate the serial correlation. For $t \geq 2$, we write

$$y_{t-1} = \beta_0 + \beta_1 x_{t-1} + u_{t-1}$$
$$y_t = \beta_0 + \beta_1 x_t + u_t.$$
Now, if we multiply this first equation by $\rho$ and subtract it from the second equation, we get

$$y_t - \rho y_{t-1} = (1 - \rho)\beta_0 + \beta_1(x_t - \rho x_{t-1}) + e_t, \quad t \geq 2,$$

where we have used the fact that $e_t = u_t - \rho u_{t-1}$. We can write this as

$$\tilde{y}_t = (1 - \rho)\beta_0 + \beta_1 \tilde{x}_t + e_t, \quad t \geq 2, \qquad (12.28)$$

where

$$\tilde{y}_t = y_t - \rho y_{t-1}, \quad \tilde{x}_t = x_t - \rho x_{t-1} \qquad (12.29)$$

are called the quasi-differenced data. (If $\rho = 1$, these are differenced data, but remember we are assuming $|\rho| < 1$.) The error terms in (12.28) are serially uncorrelated; in fact, this equation satisfies all of the Gauss-Markov assumptions. This means that, if we knew $\rho$, we could estimate $\beta_0$ and $\beta_1$ by regressing $\tilde{y}_t$ on $\tilde{x}_t$, provided we divide the estimated intercept by $(1 - \rho)$.

The OLS estimators from (12.28) are not quite BLUE because they do not use the first time period. This is easily fixed by writing the equation for $t = 1$ as

$$y_1 = \beta_0 + \beta_1 x_1 + u_1. \qquad (12.30)$$

Since each $e_t$ is uncorrelated with $u_1$, we can add (12.30) to (12.28) and still have serially uncorrelated errors. However, using (12.27), $\mathrm{Var}(u_1) = \sigma_e^2/(1 - \rho^2) > \sigma_e^2 = \mathrm{Var}(e_t)$. [Equation (12.27) clearly does not hold when $|\rho| \geq 1$, which is why we assume the stability condition.] Thus, we must multiply (12.30) by $(1 - \rho^2)^{1/2}$ to get errors with the same variance:

$$(1 - \rho^2)^{1/2} y_1 = (1 - \rho^2)^{1/2}\beta_0 + \beta_1(1 - \rho^2)^{1/2} x_1 + (1 - \rho^2)^{1/2} u_1$$

or

$$\tilde{y}_1 = (1 - \rho^2)^{1/2}\beta_0 + \beta_1 \tilde{x}_1 + \tilde{u}_1, \qquad (12.31)$$

where $\tilde{u}_1 = (1 - \rho^2)^{1/2} u_1$, $\tilde{y}_1 = (1 - \rho^2)^{1/2} y_1$, and so on. The error in (12.31) has variance $\mathrm{Var}(\tilde{u}_1) = (1 - \rho^2)\mathrm{Var}(u_1) = \sigma_e^2$, so we can use (12.31) along with (12.28) in an OLS regression. This gives the BLUE estimators of $\beta_0$ and $\beta_1$ under Assumptions TS.1 through TS.4 and the AR(1) model for $u_t$. This is another example of a generalized least squares (or GLS) estimator. We saw other GLS estimators in the context of heteroskedasticity in Chapter 8.

Adding more regressors changes very little. For $t \geq 2$, we use the equation

$$\tilde{y}_t = (1 - \rho)\beta_0 + \beta_1 \tilde{x}_{t1} + \cdots + \beta_k \tilde{x}_{tk} + e_t, \qquad (12.32)$$

where $\tilde{x}_{tj} = x_{tj} - \rho x_{t-1,j}$. For $t = 1$, we have $\tilde{y}_1 = (1 - \rho^2)^{1/2} y_1$, $\tilde{x}_{1j} = (1 - \rho^2)^{1/2} x_{1j}$, and the intercept is $(1 - \rho^2)^{1/2}\beta_0$. For given $\rho$, it is fairly easy to transform the data and to carry out OLS. Unless $\rho = 0$, the GLS estimator, that is, OLS on the transformed data, will generally be different from the original OLS estimator. The GLS estimator turns out to be BLUE, and, since the errors in the transformed equation are serially uncorrelated and homoskedastic, $t$ and $F$ statistics from the transformed equation are valid (at least asymptotically, and exactly if the errors $e_t$ are normally distributed).
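For known $\rho$, the GLS estimator is just OLS on the transformed data. A minimal editorial sketch of the transformation in (12.28) through (12.32); X must include a constant column, which becomes the $\tilde{x}_{t0}$ variable after transformation:

```python
import numpy as np
import statsmodels.api as sm

def quasi_difference(y, X, rho):
    """Prais-Winsten transformation for known rho.

    The first observation is weighted by (1 - rho**2)**0.5, as in (12.31);
    the constant column of X becomes (1 - rho) for t >= 2.
    """
    w = np.sqrt(1.0 - rho ** 2)
    y_t = np.concatenate(([w * y[0]], y[1:] - rho * y[:-1]))
    X_t = np.vstack((w * X[0], X[1:] - rho * X[:-1]))
    return y_t, X_t

# GLS for known rho (no extra constant is added; the transformed constant
# column already plays that role, and its coefficient is beta_0):
# y_t, X_t = quasi_difference(y, X, rho)
# gls = sm.OLS(y_t, X_t).fit()
```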
The cost of using $\hat{\rho}$ in place of $\rho$ is that the FGLS estimator has no tractable finite sample properties. In particular, it is not unbiased, although it is consistent when the data are weakly dependent. Further, even if $e_t$ in (12.32) is normally distributed, the $t$ and $F$ statistics are only approximately $t$ and $F$ distributed because of the estimation error in $\hat{\rho}$. This is fine for most purposes, although we must be careful with small sample sizes.

Since the FGLS estimator is not unbiased, we certainly cannot say it is BLUE. Nevertheless, it is asymptotically more efficient than the OLS estimator when the AR(1) model for serial correlation holds (and the explanatory variables are strictly exogenous). Again, this statement assumes that the time series are weakly dependent.

There are several names for FGLS estimation of the AR(1) model that come from different methods of estimating $\rho$ and different treatment of the first observation. Cochrane-Orcutt (CO) estimation omits the first observation and uses $\hat{\rho}$ from (12.14), whereas Prais-Winsten (PW) estimation uses the first observation in the previously suggested way. Asymptotically, it makes no difference whether or not the first observation is used, but many time series samples are small, so the differences can be notable in applications.

In practice, both the Cochrane-Orcutt and Prais-Winsten methods are used in an iterative scheme. That is, once the FGLS estimator is found using $\hat{\rho}$ from (12.14), we can compute a new set of residuals, obtain a new estimator of $\rho$ from (12.14), transform the data using the new estimate of $\rho$, and estimate (12.33) by OLS. We can repeat the whole process many times, until the estimate of $\rho$ changes by very little from the previous iteration. Many regression packages implement an iterative procedure automatically, so there is no additional work for us. It is difficult to say whether more than one iteration helps. It seems to be helpful in some cases, but, theoretically, the large-sample properties of the iterated estimator are the same as the estimator that uses only the first iteration. For details on these and other methods, see Davidson and MacKinnon (1993, Chapter 10).
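In practice, the iterations are automated. As one hedged example (reusing the y and x arrays from the sketch above), the GLSAR class in statsmodels runs exactly this kind of iterative feasible GLS for AR(1) errors; like Cochrane-Orcutt, it drops the first observation rather than applying the Prais-Winsten rescaling.

    import statsmodels.api as sm

    X = sm.add_constant(x)
    ar1 = sm.GLSAR(y, X, 1)               # the 1 is the AR order of the errors
    res = ar1.iterative_fit(maxiter=8)    # re-estimate rho and the betas until rho settles
    print(ar1.rho, res.params, res.bse)   # ar1.rho holds the final estimate of rho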
Example 12.4: Prais-Winsten Estimation in the Event Study.

Again using the data in BARIUM, we estimate the equation in Example 10.5 using iterated Prais-Winsten estimation. For comparison, we also present the OLS results in Table 12.1.

The coefficients that are statistically significant in the Prais-Winsten estimation do not differ by much from the OLS estimates [in particular, the coefficients on log(chempi), log(rtwex), and afdec6]. It is not surprising for statistically insignificant coefficients to change, perhaps markedly, across different estimation methods.

Notice how the standard errors in the second column are uniformly higher than the standard errors in column (1). This is common. The Prais-Winsten standard errors account for serial correlation; the OLS standard errors do not. As we saw in Section 12.1, the OLS standard errors usually understate the actual sampling variation in the OLS estimates and should not be relied upon when significant serial correlation is present. Therefore, the effect on Chinese imports after the International Trade Commission's decision is now less statistically significant than we thought ($t_{afdec6} = -1.69$).

Finally, an R-squared is reported for the PW estimation that is well below the R-squared for the OLS estimation in this case. However, these R-squareds should not be compared. For OLS, the R-squared, as usual, is based on the regression with the untransformed dependent and independent variables. For PW, the R-squared comes from the final regression of the transformed dependent variable on the transformed independent variables. It is not clear what this $R^2$ is actually measuring; nevertheless, it is traditionally reported.

Table 12.1 Dependent Variable: log(chnimp)

    Coefficient     OLS               Prais-Winsten
    log(chempi)      3.12   (0.48)     2.94   (0.63)
    log(gas)          .196  (.907)     1.05   (0.98)
    log(rtwex)        .983  (.400)     1.13   (0.51)
    befile6           .060  (.261)     -.016  (.322)
    affile6          -.032  (.264)     -.033  (.322)
    afdec6           -.565  (.286)     -.577  (.342)
    intercept       -17.80 (21.05)    -37.08 (22.78)
    rho-hat                             .293
    Observations      131               131
    R-squared         .305              .202

12.3c Comparing OLS and FGLS

In some applications of the Cochrane-Orcutt or Prais-Winsten methods, the FGLS estimates differ in practically important ways from the OLS estimates. (This was not the case in Example 12.4.) Typically, this has been interpreted as a verification of FGLS's superiority over OLS. Unfortunately, things are not so simple. To see why, consider the regression model

$$y_t = \beta_0 + \beta_1 x_t + u_t,$$

where the time series processes are stationary. Now, assuming that the law of large numbers holds, consistency of OLS for $\beta_1$ holds if

$$\text{Cov}(x_t, u_t) = 0. \tag{12.34}$$

Earlier, we asserted that FGLS was consistent under the strict exogeneity assumption, which is more restrictive than (12.34). In fact, it can be shown that the weakest assumption that must hold for FGLS to be consistent, in addition to (12.34), is that the sum of $x_{t-1}$ and $x_{t+1}$ is uncorrelated with $u_t$:

$$\text{Cov}[(x_{t-1} + x_{t+1}), u_t] = 0. \tag{12.35}$$

Practically speaking, consistency of FGLS requires $u_t$ to be uncorrelated with $x_{t-1}$, $x_t$, and $x_{t+1}$.

How can we show that condition (12.35) is needed along with (12.34)? The argument is simple if we assume $\rho$ is known and drop the first time period, as in Cochrane-Orcutt. (The argument when we use $\hat{\rho}$ is technically harder and yields no additional insights.) Since one observation cannot affect the asymptotic properties of an estimator, dropping it does not affect the argument. Now, with known $\rho$, the GLS estimator uses $x_t - \rho x_{t-1}$ as the regressor in an equation where $u_t - \rho u_{t-1}$ is the error. From Theorem 11.1, we know the key condition for consistency of OLS is that the error and the regressor are uncorrelated.
In this case, we need $\text{E}[(x_t - \rho x_{t-1})(u_t - \rho u_{t-1})] = 0$. If we expand the expectation, we get

$$\text{E}[(x_t - \rho x_{t-1})(u_t - \rho u_{t-1})] = \text{E}(x_t u_t) - \rho\,\text{E}(x_{t-1}u_t) - \rho\,\text{E}(x_t u_{t-1}) + \rho^2\,\text{E}(x_{t-1}u_{t-1}) = -\rho[\text{E}(x_{t-1}u_t) + \text{E}(x_t u_{t-1})],$$

because $\text{E}(x_t u_t) = \text{E}(x_{t-1}u_{t-1}) = 0$ by assumption (12.34). Now, under stationarity, $\text{E}(x_t u_{t-1}) = \text{E}(x_{t+1}u_t)$, because we are just shifting the time index one period forward. Therefore,

$$\text{E}(x_{t-1}u_t) + \text{E}(x_t u_{t-1}) = \text{E}[(x_{t-1} + x_{t+1})u_t],$$

and the last expectation is the covariance in equation (12.35) because $\text{E}(u_t) = 0$. We have shown that (12.35) is necessary, along with (12.34), for GLS to be consistent for $\beta_1$. (Of course, if $\rho = 0$, we do not need (12.35) because we are back to doing OLS.)

Our derivation shows that OLS and FGLS might give significantly different estimates because (12.35) fails. In this case, OLS, which is still consistent under (12.34), is preferred to FGLS (which is inconsistent). If $x$ has a lagged effect on $y$, or $x_{t+1}$ reacts to changes in $u_t$, FGLS can produce misleading results.

Because OLS and FGLS are different estimation procedures, we never expect them to give the same estimates. If they provide similar estimates of the $\beta_j$, then FGLS is preferred if there is evidence of serial correlation, because the estimator is more efficient and the FGLS test statistics are at least asymptotically valid. A more difficult problem arises when there are practical differences in the OLS and FGLS estimates: it is hard to determine whether such differences are statistically significant. The general method proposed by Hausman (1978) can be used, but it is beyond the scope of this text. The next example gives a case where OLS and FGLS are different in practically important ways.

Example 12.5: Static Phillips Curve.

Table 12.2 presents OLS and iterated Prais-Winsten estimates of the static Phillips curve from Example 10.1, using the observations through 1996.

Table 12.2 Dependent Variable: inf

    Coefficient     OLS              Prais-Winsten
    unem              .468  (.289)    -.716  (.313)
    intercept        1.424 (1.719)    8.296 (2.231)
    rho-hat                            .781
    Observations      49               49
    R-squared         .053             .136

The coefficient of interest is on unem, and it differs markedly between PW and OLS. Because the PW estimate is consistent with the inflation-unemployment tradeoff, our tendency is to focus on the PW estimates. In fact, these estimates are fairly close to what is obtained by first differencing both inf and unem (see Computer Exercise C4 in Chapter 11), which makes sense because the quasi-differencing used in PW with $\hat{\rho} = .781$ is similar to first differencing. It may just be that inf and unem are not related in levels, but they have a negative relationship in first differences.

Examples like the static Phillips curve can pose difficult problems for empirical researchers. On the one hand, if we are truly interested in a static relationship, and if unemployment and inflation are I(0) processes, then OLS produces consistent estimators without additional assumptions. But it could be that unemployment, inflation, or both have unit roots, in which case OLS need not have its usual desirable properties; we discuss this further in Chapter 18. In Example 12.5, FGLS gives more economically sensible estimates because it is similar to first differencing; FGLS has the advantage of approximately eliminating unit roots.
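The role of condition (12.35) can be illustrated with a small simulation (our own construction, not from the text or its data sets): the regressor below satisfies (12.34) but violates (12.35), because $x_t$ moves with the next innovation $e_{t+1}$, so $x_{t-1}$ is correlated with $u_t$. OLS stays centered near the true $\beta_1 = 1$; GLS with known $\rho$ does not.

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(1)
    n, reps, rho = 500, 500, 0.5
    ols_est, gls_est = [], []
    for _ in range(reps):
        e = rng.normal(size=n + 1)            # innovations e_1, ..., e_{n+1}
        u = np.empty(n)
        u[0] = e[0]
        for t in range(1, n):
            u[t] = rho * u[t - 1] + e[t]
        x = rng.normal(size=n) + 0.5 * e[1:]  # Cov(x_t, u_t) = 0, but Cov(x_{t-1}, u_t) != 0
        y = 1.0 * x + u                       # true beta1 = 1
        ols_est.append(sm.OLS(y, x).fit().params[0])
        # infeasible GLS with known rho, Cochrane-Orcutt form (first observation dropped)
        gls_est.append(sm.OLS(y[1:] - rho * y[:-1], x[1:] - rho * x[:-1]).fit().params[0])
    print(np.mean(ols_est), np.mean(gls_est))   # OLS near 1.0; GLS biased away from 1.0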
12.3d Correcting for Higher Order Serial Correlation

It is also possible to correct for higher orders of serial correlation. A general treatment is given in Harvey (1990). Here, we illustrate the approach for AR(2) serial correlation,

$$u_t = \rho_1 u_{t-1} + \rho_2 u_{t-2} + e_t,$$

where $\{e_t\}$ satisfies the assumptions stated for the AR(1) model. The stability conditions are more complicated now. They can be shown to be [see Harvey (1990)]

$$\rho_2 > -1, \quad \rho_2 - \rho_1 < 1, \quad \text{and} \quad \rho_1 + \rho_2 < 1.$$

For example, the model is stable if $\rho_1 = .8$ and $\rho_2 = -.3$; the model is unstable if $\rho_1 = .7$ and $\rho_2 = .4$.

Assuming the stability conditions hold, we can obtain the transformation that eliminates the serial correlation. In the simple regression model, this is easy when $t > 2$:

$$y_t - \rho_1 y_{t-1} - \rho_2 y_{t-2} = \beta_0(1 - \rho_1 - \rho_2) + \beta_1(x_t - \rho_1 x_{t-1} - \rho_2 x_{t-2}) + e_t$$

or

$$\tilde{y}_t = \beta_0(1 - \rho_1 - \rho_2) + \beta_1 \tilde{x}_t + e_t, \quad t = 3, 4, \dots, n. \tag{12.36}$$

If we know $\rho_1$ and $\rho_2$, we can easily estimate this equation by OLS after obtaining the transformed variables. Since we rarely know $\rho_1$ and $\rho_2$, we have to estimate them. As usual, we can use the OLS residuals, $\hat{u}_t$: obtain $\hat{\rho}_1$ and $\hat{\rho}_2$ from the regression of $\hat{u}_t$ on $\hat{u}_{t-1}$, $\hat{u}_{t-2}$, $t = 3, \dots, n$. [This is the same regression used to test for AR(2) serial correlation with strictly exogenous regressors.] Then, we use $\hat{\rho}_1$ and $\hat{\rho}_2$ in place of $\rho_1$ and $\rho_2$ to obtain the transformed variables. This gives one version of the FGLS estimator. If we have multiple explanatory variables, then each one is transformed by $\tilde{x}_{tj} = x_{tj} - \hat{\rho}_1 x_{t-1,j} - \hat{\rho}_2 x_{t-2,j}$ when $t > 2$ (see the sketch following this subsection).

The treatment of the first two observations is a little tricky. It can be shown that the dependent variable and each independent variable (including the intercept) should be transformed by

$$\tilde{z}_1 = \{(1 + \rho_2)[(1 - \rho_2)^2 - \rho_1^2]/(1 - \rho_2)\}^{1/2}\, z_1$$
$$\tilde{z}_2 = (1 - \rho_2^2)^{1/2}\, z_2 - [\rho_1(1 - \rho_2^2)^{1/2}/(1 - \rho_2)]\, z_1,$$

where $z_1$ and $z_2$ denote either the dependent or an independent variable at $t = 1$ and $t = 2$, respectively. We will not derive these transformations. Briefly, they eliminate the serial correlation between the first two observations and make their error variances equal to $\sigma_e^2$.

Fortunately, econometrics packages geared toward time series analysis easily estimate models with general AR(q) errors; we rarely need to directly compute the transformed variables ourselves.
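A hedged sketch of the AR(2) quasi-differencing for $t \ge 3$, continuing with the y and x arrays from the earlier simulated example; the placeholder values for $\hat{\rho}_1$ and $\hat{\rho}_2$ are ours and would come from the residual regression in practice. The special transformations for the first two observations are omitted here, since packages apply them automatically.

    import numpy as np
    import statsmodels.api as sm

    def ar2_quasi_diff(z, r1, r2):
        """Return z_t - r1*z_{t-1} - r2*z_{t-2} for t = 3, ..., n."""
        return z[2:] - r1 * z[1:-1] - r2 * z[:-2]

    rho1_hat, rho2_hat = 0.5, -0.2   # illustrative values only
    y_td = ar2_quasi_diff(y, rho1_hat, rho2_hat)
    x_td = ar2_quasi_diff(x, rho1_hat, rho2_hat)
    const = np.full(y_td.shape, 1 - rho1_hat - rho2_hat)   # transformed intercept in (12.36)
    res = sm.OLS(y_td, np.column_stack([const, x_td])).fit()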
12.4 Differencing and Serial Correlation

In Chapter 11, we presented differencing as a transformation for making an integrated process weakly dependent. There is another way to see the merits of differencing when dealing with highly persistent data. Suppose that we start with the simple regression model

$$y_t = \beta_0 + \beta_1 x_t + u_t, \quad t = 1, 2, \dots, \tag{12.37}$$

where $u_t$ follows the AR(1) process in (12.26). As we mentioned in Section 11.3, and as we will discuss more fully in Chapter 18, the usual OLS inference procedures can be very misleading when the variables $y_t$ and $x_t$ are integrated of order one, or I(1). In the extreme case where the errors $\{u_t\}$ in (12.37) follow a random walk, the equation makes no sense because, among other things, the variance of $u_t$ grows with $t$. It is more logical to difference the equation:

$$\Delta y_t = \beta_1 \Delta x_t + \Delta u_t, \quad t = 2, \dots, n. \tag{12.38}$$

If $u_t$ follows a random walk, then $e_t \equiv \Delta u_t$ has zero mean and a constant variance and is serially uncorrelated. Thus, assuming that $e_t$ and $\Delta x_t$ are uncorrelated, we can estimate (12.38) by OLS, where we lose the first observation.

Even if $u_t$ does not follow a random walk, but $\rho$ is positive and large, first differencing is often a good idea: it will eliminate most of the serial correlation. Of course, equation (12.38) is different from (12.37), but at least we can have more faith in the OLS standard errors and $t$ statistics in (12.38). Allowing for multiple explanatory variables does not change anything.

Example 12.6: Differencing the Interest Rate Equation.

In Example 10.2, we estimated an equation relating the three-month T-bill rate to inflation and the federal deficit [see equation (10.15)]. If we obtain the residuals from estimating (10.15) and regress them on a single lag, we obtain $\hat{\rho} = .623$ (.110), which is large and very statistically significant. Therefore, at a minimum, serial correlation is a problem in this equation.

If we difference the data and run the regression, we obtain

$$\widehat{\Delta i3}_t = .042 + .149\,\Delta inf_t - .181\,\Delta def_t \tag{12.39}$$
$$\qquad\;\;(.171)\quad(.092)\qquad\;\,(.148)$$
$$n = 55, \quad R^2 = .176, \quad \bar{R}^2 = .145.$$

The coefficients from this regression are very different from the equation in levels, suggesting either that the explanatory variables are not strictly exogenous or that one or more of the variables has a unit root. In fact, the correlation between $i3_t$ and $i3_{t-1}$ is about .885, which may indicate a problem with interpreting (10.15) as a meaningful regression. Plus, the regression in differences has essentially no serial correlation: a regression of $\hat{e}_t$ on $\hat{e}_{t-1}$ gives $\hat{\rho} = .072$ (.134). Because first differencing eliminates possible unit roots as well as serial correlation, we probably have more faith in the estimates and standard errors from (12.39) than from (10.15). The equation in differences shows that annual changes in interest rates are only weakly, positively related to annual changes in inflation, and the coefficient on $\Delta def_t$ is actually negative (though not statistically significant at even the 20% significance level against a two-sided alternative).

As we explained in Chapter 11, the decision of whether or not to difference is a tough one. But this discussion points out another benefit of differencing, which is that it removes serial correlation. We will come back to this issue in Chapter 18.
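A first-difference regression like (12.39) takes one line once the data are differenced. The sketch below continues with the simulated arrays from earlier and is only an illustration.

    import numpy as np
    import statsmodels.api as sm

    dy, dx = np.diff(y), np.diff(x)                    # Delta y_t and Delta x_t; one observation is lost
    diff_res = sm.OLS(dy, sm.add_constant(dx)).fit()   # (12.38) has no intercept; adding one allows drift in levels
    print(diff_res.params, diff_res.bse)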
12.5 Serial Correlation-Robust Inference after OLS

In recent years, it has become more popular to estimate models by OLS but to correct the standard errors for fairly arbitrary forms of serial correlation (and heteroskedasticity). Even though we know OLS will be inefficient, there are some good reasons for taking this approach. First, the explanatory variables may not be strictly exogenous. In this case, FGLS is not even consistent, let alone efficient. Second, in most applications of FGLS, the errors are assumed to follow an AR(1) model. It may be better to compute standard errors for the OLS estimates that are robust to more general forms of serial correlation.

To get the idea, consider equation (12.4), which is the variance of the OLS slope estimator in a simple regression model with AR(1) errors. We can estimate this variance very simply by plugging in our standard estimators of $\rho$ and $\sigma^2$. The only problems with this are that it assumes the AR(1) model holds and also assumes homoskedasticity. It is possible to relax both of these assumptions.

A general treatment of standard errors that are both heteroskedasticity- and serial correlation-robust is given in Davidson and MacKinnon (1993). Here, we provide a simple method to compute the robust standard error of any OLS coefficient. Our treatment here follows Wooldridge (1989). Consider the standard multiple linear regression model

$$y_t = \beta_0 + \beta_1 x_{t1} + \dots + \beta_k x_{tk} + u_t, \quad t = 1, 2, \dots, n, \tag{12.40}$$

which we have estimated by OLS. For concreteness, we are interested in obtaining a serial correlation-robust standard error for $\hat{\beta}_1$. This turns out to be fairly easy. Write $x_{t1}$ as a linear function of the remaining independent variables and an error term,

$$x_{t1} = \delta_0 + \delta_2 x_{t2} + \dots + \delta_k x_{tk} + r_t,$$

where the error $r_t$ has zero mean and is uncorrelated with $x_{t2}, x_{t3}, \dots, x_{tk}$.

[Exploring Further 12.4: Suppose, after estimating a model by OLS, that you estimate $\rho$ from regression (12.14) and you obtain $\hat{\rho} = .92$. What would you do about this?]

Then, it can be shown that the asymptotic variance of the OLS estimator $\hat{\beta}_1$ is

$$\text{AVar}(\hat{\beta}_1) = \left(\sum_{t=1}^{n} \text{E}(r_t^2)\right)^{-2} \text{Var}\left(\sum_{t=1}^{n} r_t u_t\right).$$

Under the no serial correlation Assumption TS.5′, $\{a_t \equiv r_t u_t\}$ is serially uncorrelated, so either the usual OLS standard errors (under homoskedasticity) or the heteroskedasticity-robust standard errors will be valid. But if TS.5′ fails, our expression for $\text{AVar}(\hat{\beta}_1)$ must account for the correlation between $a_t$ and $a_s$ when $t \ne s$. In practice, it is common to assume that, once the terms are farther apart than a few periods, the correlation is essentially zero. (Remember that under weak dependence, the correlation must be approaching zero, so this is a reasonable approach.)

Following the general framework of Newey and West (1987), Wooldridge (1989) shows that $\text{AVar}(\hat{\beta}_1)$ can be estimated as follows. Let "se($\hat{\beta}_1$)" denote the usual (but incorrect) OLS standard error and let $\hat{\sigma}$ be the usual standard error of the regression (or root mean squared error) from estimating (12.40) by OLS. Let $\hat{r}_t$ denote the residuals from the auxiliary regression of

$$x_{t1} \text{ on } x_{t2}, x_{t3}, \dots, x_{tk}, \tag{12.41}$$

including a constant, as usual. For a chosen integer $g > 0$, define

$$\hat{\nu} = \sum_{t=1}^{n} \hat{a}_t^2 + 2\sum_{h=1}^{g} [1 - h/(g+1)]\left(\sum_{t=h+1}^{n} \hat{a}_t \hat{a}_{t-h}\right), \tag{12.42}$$

where $\hat{a}_t = \hat{r}_t \hat{u}_t$, $t = 1, 2, \dots, n$. This looks somewhat complicated, but in practice it is easy to obtain. The integer $g$ in (12.42) controls how much serial correlation we are allowing in computing the standard error. Once we have $\hat{\nu}$, the serial correlation-robust standard error of $\hat{\beta}_1$ is simply

$$\text{se}(\hat{\beta}_1) = [\text{"se}(\hat{\beta}_1)\text{"}/\hat{\sigma}]^2 \sqrt{\hat{\nu}}. \tag{12.43}$$

In other words, we take the usual OLS standard error of $\hat{\beta}_1$, divide it by $\hat{\sigma}$, square the result, and then multiply by the square root of $\hat{\nu}$. This can be used to construct confidence intervals and $t$ statistics for $\hat{\beta}_1$.
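Here is one way to code (12.41) through (12.43) directly; this is a sketch under our own naming, and the helper assumes the design matrix X already includes a constant column.

    import numpy as np
    import statsmodels.api as sm

    def sc_robust_se(y, X, j, g):
        """SC-robust standard error for coefficient j of OLS of y on X, per (12.41)-(12.43)."""
        fit = sm.OLS(y, X).fit()
        u_hat = fit.resid                                # OLS residuals
        sigma_hat = np.sqrt(fit.mse_resid)               # standard error of the regression
        others = np.delete(X, j, axis=1)                 # remaining regressors, incl. the constant
        r_hat = sm.OLS(X[:, j], others).fit().resid      # auxiliary residuals from (12.41)
        a = r_hat * u_hat                                # a_hat_t = r_hat_t * u_hat_t
        v = np.sum(a**2)                                 # h = 0 term of (12.42)
        for h in range(1, g + 1):
            v += 2.0 * (1.0 - h / (g + 1.0)) * np.sum(a[h:] * a[:-h])
        return (fit.bse[j] / sigma_hat) ** 2 * np.sqrt(v)   # equation (12.43)

    # Usage on the simulated data from before, with g = 2 (column 0 is the constant):
    print(sc_robust_se(y, sm.add_constant(x), j=1, g=2))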
It is useful to see what $\hat{\nu}$ looks like in some simple cases. When $g = 1$,

$$\hat{\nu} = \sum_{t=1}^{n} \hat{a}_t^2 + \sum_{t=2}^{n} \hat{a}_t \hat{a}_{t-1}, \tag{12.44}$$

and when $g = 2$,

$$\hat{\nu} = \sum_{t=1}^{n} \hat{a}_t^2 + (4/3)\left(\sum_{t=2}^{n} \hat{a}_t \hat{a}_{t-1}\right) + (2/3)\left(\sum_{t=3}^{n} \hat{a}_t \hat{a}_{t-2}\right). \tag{12.45}$$

The larger that $g$ is, the more terms are included to correct for serial correlation. The purpose of the factor $[1 - h/(g+1)]$ in (12.42) is to ensure that $\hat{\nu}$ is in fact nonnegative; Newey and West (1987) verify this. We clearly need $\hat{\nu} > 0$, since $\hat{\nu}$ is estimating a variance and the square root of $\hat{\nu}$ appears in (12.43).

The standard error in (12.43) is also robust to arbitrary heteroskedasticity. (In the time series literature, the serial correlation-robust standard errors are sometimes called heteroskedasticity and autocorrelation consistent, or HAC, standard errors.) In fact, if we drop the second term in (12.42), then (12.43) becomes the usual heteroskedasticity-robust standard error that we discussed in Chapter 8 (without the degrees of freedom adjustment).

The theory underlying the standard error in (12.43) is technical and somewhat subtle. Remember, we started off by claiming we do not know the form of serial correlation. If this is the case, how can we select the integer $g$? Theory states that (12.43) works for fairly arbitrary forms of serial correlation, provided $g$ grows with sample size $n$. The idea is that, with larger sample sizes, we can be more flexible about the amount of correlation in (12.42). There has been much recent work on the relationship between $g$ and $n$, but we will not go into that here. For annual data, choosing a small $g$, such as $g = 1$ or $g = 2$, is likely to account for most of the serial correlation. For quarterly or monthly data, $g$ should probably be larger (such as $g = 4$ or $8$ for quarterly and $g = 12$ or $24$ for monthly), assuming that we have enough data. Newey and West (1987) recommend taking $g$ to be the integer part of $4(n/100)^{2/9}$; others have suggested the integer part of $n^{1/4}$. (The Newey-West suggestion is implemented by the econometrics program Eviews. For, say, $n = 50$, which is reasonable for annual postwar data from World War II, $g = 3$; the integer part of $n^{1/4}$ gives $g = 2$.)

We summarize how to obtain a serial correlation-robust standard error for $\hat{\beta}_1$. (Of course, since we can list any independent variable first, the following procedure works for computing a standard error for any slope coefficient.)

Serial Correlation-Robust Standard Error for $\hat{\beta}_1$:
(i) Estimate (12.40) by OLS, which yields "se($\hat{\beta}_1$)", $\hat{\sigma}$, and the OLS residuals $\{\hat{u}_t: t = 1, \dots, n\}$.
(ii) Compute the residuals $\{\hat{r}_t: t = 1, \dots, n\}$ from the auxiliary regression (12.41). Then, form $\hat{a}_t = \hat{r}_t \hat{u}_t$ for each $t$.
(iii) For your choice of $g$, compute $\hat{\nu}$ as in (12.42).
(iv) Compute se($\hat{\beta}_1$) from (12.43).

Empirically, the serial correlation-robust standard errors are typically larger than the usual OLS standard errors when there is serial correlation. This is true because, in most cases, the errors are positively serially correlated. However, it is possible to have substantial serial correlation in $\{u_t\}$ but to also have similarities in the usual and serial correlation-robust (SC-robust) standard errors of some coefficients: it is the sample autocorrelations of $\hat{a}_t = \hat{r}_t \hat{u}_t$ that determine the robust standard error for $\hat{\beta}_1$.
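In statsmodels, this kind of HAC (Newey-West) correction is automated for all coefficients at once through the fit call below; up to finite-sample adjustments, its Bartlett-kernel weights are the $[1 - h/(g+1)]$ factors in (12.42), with maxlags playing the role of $g$. A hedged illustration, again on the simulated data:

    import statsmodels.api as sm

    X = sm.add_constant(x)
    hac_res = sm.OLS(y, X).fit(cov_type="HAC", cov_kwds={"maxlags": 2})
    print(hac_res.bse)      # SC- and heteroskedasticity-robust standard errors
    print(hac_res.tvalues)  # t statistics based on the robust standard errors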
The use of SC-robust standard errors has somewhat lagged behind the use of standard errors robust only to heteroskedasticity, for several reasons. First, large cross sections, where the heteroskedasticity-robust standard errors will have good properties, are more common than large time series. The SC-robust standard errors can be poorly behaved when there is substantial serial correlation and the sample size is small (where small can even be as large as, say, 100). Second, since we must choose the integer $g$ in equation (12.42), computation of the SC-robust standard errors is not automatic. As mentioned earlier, some econometrics packages have automated the selection, but you still have to abide by the choice.

Another important reason that SC-robust standard errors are not yet routinely reported is that, in the presence of severe serial correlation, OLS can be very inefficient, especially in small sample sizes. After performing OLS and correcting the standard errors for serial correlation, we find the coefficients are often insignificant, or at least less significant than they were with the usual OLS standard errors.

If we are confident that the explanatory variables are strictly exogenous, yet are skeptical about the errors following an AR(1) process, we can still get estimators more efficient than OLS by using a standard FGLS estimator, such as Prais-Winsten or Cochrane-Orcutt. With substantial serial correlation, the quasi-differencing transformation used by PW and CO is likely to be better than doing nothing and just using OLS. But, if the errors do not follow an AR(1) model, then the standard errors reported from PW or CO estimation will be incorrect. Nevertheless, we can manually quasi-difference the data (after estimating $\hat{\rho}$), use pooled OLS on the transformed data, and then use SC-robust standard errors in the transformed equation. Computing an SC-robust standard error after quasi-differencing would ensure that any extra serial correlation is accounted for in statistical inference. In fact, the SC-robust standard errors probably work better after much serial correlation has been eliminated using quasi-differencing or some other transformation (such as that used for AR(2) serial correlation). Such an approach is analogous to using weighted least squares in the presence of heteroskedasticity but then computing standard errors that are robust to having the variance function incorrectly specified (see Section 8.4).

The SC-robust standard errors after OLS estimation are most useful when we have doubts about some of the explanatory variables being strictly exogenous, so that methods such as Prais-Winsten and Cochrane-Orcutt are not even consistent. It is also valid to use the SC-robust standard errors in models with lagged dependent variables, assuming, of course, that there is good reason for allowing serial correlation in such models.
Example 12.7: The Puerto Rican Minimum Wage.

We obtain an SC-robust standard error for the minimum wage effect in the Puerto Rican employment equation. In Example 12.2, we found pretty strong evidence of AR(1) serial correlation. As in that example, we use as additional controls log(usgnp), log(prgnp), and a linear time trend.

The OLS estimate of the elasticity of the employment rate with respect to the minimum wage is $\hat{\beta}_1 = -.2123$, and the usual OLS standard error is "se($\hat{\beta}_1$)" $= .0402$. The standard error of the regression is $\hat{\sigma} = .0328$. Further, using the previous procedure with $g = 2$ [see (12.45)], we obtain $\hat{\nu} = .000805$. This gives the SC-robust standard error as se($\hat{\beta}_1$) $= [(.0402/.0328)^2]\sqrt{.000805} \approx .0426$. Interestingly, the robust standard error is only slightly greater than the usual OLS standard error. The robust $t$ statistic is about $-4.98$, and so the estimated elasticity is still very statistically significant.

For comparison, the iterated PW estimate of $\beta_1$ is $-.1477$, with a standard error of .0458. Thus, the FGLS estimate is closer to zero than the OLS estimate, and we might suspect violation of the strict exogeneity assumption. Or, the difference in the OLS and FGLS estimates might be explainable by sampling error. It is very difficult to tell.
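The arithmetic of (12.43) in Example 12.7 is easy to verify by plugging in the reported numbers:

    # Numbers reported in Example 12.7
    se_ols, sigma_hat, v_hat = 0.0402, 0.0328, 0.000805
    robust_se = (se_ols / sigma_hat) ** 2 * v_hat ** 0.5
    print(round(robust_se, 4))             # 0.0426
    print(round(-0.2123 / robust_se, 2))   # robust t statistic, about -4.98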
Kiefer and Vogelsang (2005) provide a different way to obtain valid inference in the presence of arbitrary serial correlation. Rather than worry about the rate at which $g$ is allowed to grow as a function of $n$ (in order for the $t$ statistics to have asymptotic standard normal distributions), Kiefer and Vogelsang derive the large-sample distribution of the $t$ statistic when $b \equiv (g+1)/n$ is allowed to settle down to a nonzero fraction. (In the Newey-West setup, $g/n$ always converges to zero.) For example, when $b = 1$, $g = n - 1$, which means that we include every covariance term in equation (12.42). The resulting $t$ statistic does not have a large-sample standard normal distribution, but Kiefer and Vogelsang show that it does have an asymptotic distribution, and they tabulate the appropriate critical values. For a two-sided, 5% level test, the critical value is 4.771, and for a two-sided, 10% level test, the critical value is 3.764. Compared with the critical values from the standard normal distribution, we need a $t$ statistic substantially larger. But we do not have to worry about choosing the number of covariances in (12.42).

Before leaving this section, we note that it is possible to construct SC-robust, F-type statistics for testing multiple hypotheses, but these are too advanced to cover here. [See Wooldridge (1991b, 1995) and Davidson and MacKinnon (1993) for treatments.]

12.6 Heteroskedasticity in Time Series Regressions

We discussed testing and correcting for heteroskedasticity for cross-sectional applications in Chapter 8. Heteroskedasticity can also occur in time series regression models, and the presence of heteroskedasticity, while not causing bias or inconsistency in the $\hat{\beta}_j$, does invalidate the usual standard errors, $t$ statistics, and $F$ statistics. This is just as in the cross-sectional case.

In time series regression applications, heteroskedasticity often receives little, if any, attention: the problem of serially correlated errors is usually more pressing. Nevertheless, it is useful to briefly cover some of the issues that arise in applying tests and corrections for heteroskedasticity in time series regressions.

Because the usual OLS statistics are asymptotically valid under Assumptions TS.1′ through TS.5′, we are interested in what happens when the homoskedasticity assumption, TS.4′, does not hold. Assumption TS.3′ rules out misspecifications such as omitted variables and certain kinds of measurement error, while TS.5′ rules out serial correlation in the errors. It is important to remember that serially correlated errors cause problems that adjustments for heteroskedasticity are not able to address.

12.6a Heteroskedasticity-Robust Statistics

In studying heteroskedasticity for cross-sectional regressions, we noted how it has no bearing on the unbiasedness or consistency of the OLS estimators. Exactly the same conclusions hold in the time series case, as we can see by reviewing the assumptions needed for unbiasedness (Theorem 10.1) and consistency (Theorem 11.1).

In Section 8.2, we discussed how the usual OLS standard errors, $t$ statistics, and $F$ statistics can be adjusted to allow for the presence of heteroskedasticity of unknown form. These same adjustments work for time series regressions under Assumptions TS.1′, TS.2′, TS.3′, and TS.5′. Thus, provided the only assumption violated is the homoskedasticity assumption, valid inference is easily obtained in most econometric packages.

12.6b Testing for Heteroskedasticity

Sometimes, we wish to test for heteroskedasticity in time series regressions, especially if we are concerned about the performance of heteroskedasticity-robust statistics in relatively small sample sizes. The tests we covered in Chapter 8 can be applied directly, but with a few caveats. First, the errors $u_t$ should not be serially correlated; any serial correlation will generally invalidate a test for heteroskedasticity. Thus, it makes sense to test for serial correlation first, using a heteroskedasticity-robust test if heteroskedasticity is suspected. Then, after something has been done to correct for serial correlation, we can test for heteroskedasticity.

Second, consider the equation used to motivate the Breusch-Pagan test for heteroskedasticity:

$$u_t^2 = \delta_0 + \delta_1 x_{t1} + \dots + \delta_k x_{tk} + v_t, \tag{12.46}$$

where the null hypothesis is H$_0$: $\delta_1 = \delta_2 = \dots = \delta_k = 0$. For the $F$ statistic (with $\hat{u}_t^2$ replacing $u_t^2$ as the dependent variable) to be valid, we must assume that the errors $\{v_t\}$ are themselves homoskedastic (as in the cross-sectional case) and serially uncorrelated. These are implicitly assumed in computing all standard tests for heteroskedasticity, including the version of the White test we covered in Section 8.3. Assuming that the $\{v_t\}$ are serially uncorrelated rules out certain forms of dynamic heteroskedasticity, something we will treat in the next subsection.

If heteroskedasticity is found in the $u_t$ (and the $u_t$ are not serially correlated), then the heteroskedasticity-robust test statistics can be used. An alternative is to use weighted least squares, as in Section 8.4. The mechanics of weighted least squares for the time series case are identical to those for the cross-sectional case.

[Exploring Further 12.5: How would you compute the White test for heteroskedasticity in equation (12.47)?]
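The Breusch-Pagan regression in (12.46) is packaged in statsmodels. The sketch below, run on the earlier simulated y and x and so only an illustration, reports the F version of the test.

    import statsmodels.api as sm
    from statsmodels.stats.diagnostic import het_breuschpagan

    X = sm.add_constant(x)
    resid = sm.OLS(y, X).fit().resid
    lm_stat, lm_pval, f_stat, f_pval = het_breuschpagan(resid, X)   # regresses resid^2 on X
    print(f_stat, f_pval)   # only trustworthy if the errors are not serially correlated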
Example 12.8: Heteroskedasticity and the Efficient Markets Hypothesis.

In Example 11.4, we estimated the simple model

$$return_t = \beta_0 + \beta_1 return_{t-1} + u_t. \tag{12.47}$$

The EMH states that $\beta_1 = 0$. When we tested this hypothesis using the data in NYSE, we obtained $t_{\hat{\beta}_1} = 1.55$ with $n = 689$. With such a large sample, this is not much evidence against the EMH. Although the EMH states that the expected return given past observable information should be constant, it says nothing about the conditional variance. In fact, the Breusch-Pagan test for heteroskedasticity entails regressing the squared OLS residuals $\hat{u}_t^2$ on $return_{t-1}$:

$$\hat{u}_t^2 = 4.66 - 1.104\,return_{t-1} + residual_t \tag{12.48}$$
$$\qquad(0.43)\;\;(0.201)$$
$$n = 689, \quad R^2 = .042.$$

The $t$ statistic on $return_{t-1}$ is about $-5.5$, indicating strong evidence of heteroskedasticity. Because the coefficient on $return_{t-1}$ is negative, we have the interesting finding that volatility in stock returns is lower when the previous return was high, and vice versa. Therefore, we have found what is common in many financial studies: the expected value of stock returns does not depend on past returns, but the variance of returns does.

12.6c Autoregressive Conditional Heteroskedasticity

In recent years, economists have become interested in dynamic forms of heteroskedasticity. Of course, if $x_t$ contains a lagged dependent variable, then heteroskedasticity as in (12.46) is dynamic. But dynamic forms of heteroskedasticity can appear even in models with no dynamics in the regression equation. To see this, consider a simple static regression model,

$$y_t = \beta_0 + \beta_1 z_t + u_t,$$

and assume that the Gauss-Markov assumptions hold. This means that the OLS estimators are BLUE. The homoskedasticity assumption says that $\text{Var}(u_t|Z)$ is constant, where $Z$ denotes all $n$ outcomes of $z_t$. Even if the variance of $u_t$ given $Z$ is constant, there are other ways that heteroskedasticity can arise. Engle (1982) suggested looking at the conditional variance of $u_t$ given past errors (where the conditioning on $Z$ is left implicit). Engle suggested what is known as the autoregressive conditional heteroskedasticity (ARCH) model. The first-order ARCH model is

$$\text{E}(u_t^2|u_{t-1}, u_{t-2}, \dots) = \text{E}(u_t^2|u_{t-1}) = \alpha_0 + \alpha_1 u_{t-1}^2, \tag{12.49}$$

where we leave the conditioning on $Z$ implicit. This equation represents the conditional variance of $u_t$ given past $u_t$ only if $\text{E}(u_t|u_{t-1}, u_{t-2}, \dots) = 0$, which means that the errors are serially uncorrelated. Since conditional variances must be positive, this model only makes sense if $\alpha_0 > 0$ and $\alpha_1 \ge 0$; if $\alpha_1 = 0$, there are no dynamics in the variance equation.

It is instructive to write (12.49) as

$$u_t^2 = \alpha_0 + \alpha_1 u_{t-1}^2 + v_t, \tag{12.50}$$

where the expected value of $v_t$ (given $u_{t-1}, u_{t-2}, \dots$) is zero by definition. (However, the $v_t$ are not independent of past $u_t$ because of the constraint $v_t \ge -\alpha_0 - \alpha_1 u_{t-1}^2$.) Equation (12.50) looks like an autoregressive model in $u_t^2$ (hence the name ARCH). The stability condition for this equation is $\alpha_1 < 1$, just as in the usual AR(1) model. When $\alpha_1 > 0$, the squared errors contain (positive) serial correlation, even though the $u_t$ themselves do not.
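Testing for ARCH(1) amounts to estimating (12.50) with squared OLS residuals, as in the sketch below (continuing with resid from the previous block); statsmodels also packages Engle's LM version of the test.

    import statsmodels.api as sm
    from statsmodels.stats.diagnostic import het_arch

    u2 = resid**2
    arch_reg = sm.OLS(u2[1:], sm.add_constant(u2[:-1])).fit()   # (12.50) with residuals
    print(arch_reg.tvalues[1])            # t statistic on the lagged squared residual

    lm, lm_pval, f, f_pval = het_arch(resid, nlags=1)           # Engle's ARCH LM test
    print(lm_pval, f_pval)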
What implications does (12.50) have for OLS? Because we began by assuming the Gauss-Markov assumptions hold, OLS is BLUE. Further, even if $u_t$ is not normally distributed, we know that the usual OLS test statistics are asymptotically valid under Assumptions TS.1′ through TS.5′, which are satisfied by static and distributed lag models with ARCH errors.

If OLS still has desirable properties under ARCH, why should we care about ARCH forms of heteroskedasticity in static and distributed lag models? We should be concerned for two reasons. First, it is possible to get consistent (but not unbiased) estimators of the $\beta_j$ that are asymptotically more efficient than the OLS estimators. A weighted least squares procedure, based on estimating (12.50), will do the trick. A maximum likelihood procedure also works under the assumption that the errors $u_t$ have a conditional normal distribution. Second, economists in various fields have become interested in dynamics in the conditional variance. Engle's original application was to the variance of United Kingdom inflation, where he found that a larger magnitude of the error in the previous time period (larger $u_{t-1}^2$) was associated with a larger error variance in the current period. Since variance is often used to measure volatility, and volatility is a key element in asset pricing theories, ARCH models have become important in empirical finance.

ARCH models also apply when there are dynamics in the conditional mean. Suppose we have the dependent variable, $y_t$, a contemporaneous exogenous variable, $z_t$, and

$$\text{E}(y_t|z_t, y_{t-1}, z_{t-1}, y_{t-2}, \dots) = \beta_0 + \beta_1 z_t + \beta_2 y_{t-1} + \beta_3 z_{t-1},$$

so that, at most, one lag of $y$ and $z$ appears in the dynamic regression. The typical approach is to assume that $\text{Var}(y_t|z_t, y_{t-1}, z_{t-1}, y_{t-2}, \dots)$ is constant, as we discussed in Chapter 11. But this variance could follow an ARCH model:

$$\text{Var}(y_t|z_t, y_{t-1}, z_{t-1}, y_{t-2}, \dots) = \text{Var}(u_t|z_t, y_{t-1}, z_{t-1}, y_{t-2}, \dots) = \alpha_0 + \alpha_1 u_{t-1}^2,$$

where $u_t = y_t - \text{E}(y_t|z_t, y_{t-1}, z_{t-1}, y_{t-2}, \dots)$. As we know from Chapter 11, the presence of ARCH does not affect consistency of OLS, and the usual heteroskedasticity-robust standard errors and test statistics are valid. (Remember, these are valid for any form of heteroskedasticity, and ARCH is just one particular form of heteroskedasticity.)

If you are interested in the ARCH model and its extensions, see Bollerslev, Chou, and Kroner (1992) and Bollerslev, Engle, and Nelson (1994) for recent surveys.

Example 12.9: ARCH in Stock Returns.

In Example 12.8, we saw that there was heteroskedasticity in weekly stock returns. This heteroskedasticity is actually better characterized by the ARCH model in (12.50). If we compute the OLS residuals from (12.47), square these, and regress them on the lagged squared residual, we obtain

$$\hat{u}_t^2 = 2.95 + .337\,\hat{u}_{t-1}^2 + residual_t \tag{12.51}$$
$$\qquad(.44)\;\;(.036)$$
$$n = 688, \quad R^2 = .114.$$

The $t$ statistic on $\hat{u}_{t-1}^2$ is over nine, indicating strong ARCH. As we discussed earlier, a larger error at time $t - 1$ implies a larger variance in stock returns today.

It is important to see that, though the squared OLS residuals are autocorrelated, the OLS residuals themselves are not (as is consistent with the EMH). Regressing $\hat{u}_t$ on $\hat{u}_{t-1}$ gives $\hat{\rho} = .0014$ with $t_{\hat{\rho}} = .038$.
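A hedged sketch of the weighted least squares idea based on (12.50), continuing with u2 and arch_reg from the previous block: form fitted variances $\hat{h}_t$ and weight by $1/\hat{h}_t$. The fitted values must be positive for this to make sense, which is not guaranteed and should be checked.

    import statsmodels.api as sm

    h_hat = arch_reg.predict(sm.add_constant(u2[:-1]))   # alpha0_hat + alpha1_hat * u2_{t-1}
    assert (h_hat > 0).all(), "negative fitted variances: WLS weights are not usable"
    X = sm.add_constant(x)
    wls_res = sm.WLS(y[1:], X[1:], weights=1.0 / h_hat).fit()   # weights proportional to 1/Var
    print(wls_res.params, wls_res.bse)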
12.6d Heteroskedasticity and Serial Correlation in Regression Models

Nothing rules out the possibility of both heteroskedasticity and serial correlation being present in a regression model. If we are unsure, we can always use OLS and compute fully robust standard errors, as described in Section 12.5.

Much of the time, serial correlation is viewed as the most important problem, because it usually has a larger impact on standard errors and the efficiency of estimators than does heteroskedasticity. As we concluded in Section 12.2, obtaining tests for serial correlation that are robust to arbitrary heteroskedasticity is fairly straightforward. If we detect serial correlation using such a test, we can employ the Cochrane-Orcutt or Prais-Winsten transformation [see equation (12.32)] and, in the transformed equation, use heteroskedasticity-robust standard errors and test statistics. Or, we can even test for heteroskedasticity in (12.32) using the Breusch-Pagan or White tests.

Alternatively, we can model heteroskedasticity and serial correlation and correct for both through a combined weighted least squares AR(1) procedure. Specifically, consider the model

$$y_t = \beta_0 + \beta_1 x_{t1} + \dots + \beta_k x_{tk} + u_t$$
$$u_t = \sqrt{h_t}\,\nu_t \tag{12.52}$$
$$\nu_t = \rho \nu_{t-1} + e_t, \quad |\rho| < 1,$$

where the explanatory variables $X$ are independent of $e_t$ for all $t$, and $h_t$ is a function of the $x_{tj}$. The process $\{e_t\}$ has zero mean and constant variance $\sigma_e^2$ and is serially uncorrelated. Therefore, $\{\nu_t\}$ satisfies a stable AR(1) process. The error $u_t$ is heteroskedastic, in addition to containing serial correlation:

$$\text{Var}(u_t|x_t) = \sigma_\nu^2 h_t,$$

where $\sigma_\nu^2 = \sigma_e^2/(1 - \rho^2)$. But $\nu_t = u_t/\sqrt{h_t}$ is homoskedastic and follows a stable AR(1) model. Therefore, the transformed equation

$$y_t/\sqrt{h_t} = \beta_0(1/\sqrt{h_t}) + \beta_1(x_{t1}/\sqrt{h_t}) + \dots + \beta_k(x_{tk}/\sqrt{h_t}) + \nu_t \tag{12.53}$$

has AR(1) errors. Now, if we have a particular kind of heteroskedasticity in mind, that is, we know $h_t$, we can estimate (12.53) using standard CO or PW methods.

In most cases, we have to estimate $h_t$ first. The following method combines the weighted least squares method from Section 8.4 with the AR(1) serial correlation correction from Section 12.3.

Feasible GLS with Heteroskedasticity and AR(1) Serial Correlation:
(i) Estimate (12.52) by OLS and save the residuals, $\hat{u}_t$.
(ii) Regress $\log(\hat{u}_t^2)$ on $x_{t1}, \dots, x_{tk}$ (or on $\hat{y}_t$, $\hat{y}_t^2$) and obtain the fitted values, say $\hat{g}_t$.
(iii) Obtain the estimates of $h_t$: $\hat{h}_t = \exp(\hat{g}_t)$.
(iv) Estimate the transformed equation

$$\hat{h}_t^{-1/2} y_t = \hat{h}_t^{-1/2}\beta_0 + \beta_1 \hat{h}_t^{-1/2} x_{t1} + \dots + \beta_k \hat{h}_t^{-1/2} x_{tk} + \text{error}_t \tag{12.54}$$

by standard Cochrane-Orcutt or Prais-Winsten methods.

The FGLS estimators obtained from the procedure are asymptotically efficient, provided the assumptions in model (12.52) hold. More importantly, all standard errors and test statistics from the CO or PW estimation are asymptotically valid. If we allow the variance function to be misspecified, or allow the possibility that any serial correlation does not follow an AR(1) model, then we can apply quasi-differencing to (12.54), estimate the resulting equation by OLS, and then obtain the Newey-West standard errors. By doing so, we would be using a procedure that could be asymptotically efficient while ensuring that our inference is valid (asymptotically) if we have misspecified our model of either heteroskedasticity or serial correlation.
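Steps (i) through (iv) can be sketched as follows, again on the simulated y and x; the use of GLSAR for the AR(1) step is our choice, and, like Cochrane-Orcutt, it drops the first observation.

    import numpy as np
    import statsmodels.api as sm

    X = sm.add_constant(x)
    u_hat = sm.OLS(y, X).fit().resid                          # (i) OLS residuals
    g_hat = sm.OLS(np.log(u_hat**2), X).fit().fittedvalues    # (ii) variance regression
    h_hat = np.exp(g_hat)                                     # (iii) fitted variance factors
    w = 1.0 / np.sqrt(h_hat)
    # (iv) weight every column, including the constant, as in (12.54),
    # then apply the AR(1) correction to the weighted equation
    res = sm.GLSAR(w * y, X * w[:, None], 1).iterative_fit(8)
    print(res.params, res.bse)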
Summary

We have covered the important problem of serial correlation in the errors of multiple regression models. Positive correlation between adjacent errors is common, especially in static and finite distributed lag models. This causes the usual OLS standard errors and statistics to be misleading (although the $\hat{\beta}_j$ can still be unbiased, or at least consistent). Typically, the OLS standard errors underestimate the true uncertainty in the parameter estimates.

The most popular model of serial correlation is the AR(1) model. Using this as the starting point, it is easy to test for the presence of AR(1) serial correlation using the OLS residuals. An asymptotically valid $t$ statistic is obtained by regressing the OLS residuals on the lagged residuals, assuming the regressors are strictly exogenous and a homoskedasticity assumption holds. Making the test robust to heteroskedasticity is simple. The Durbin-Watson statistic is available under the classical linear model assumptions, but it can lead to an inconclusive outcome, and it has little to offer over the $t$ test. For models with a lagged dependent variable or other nonstrictly exogenous regressors, the standard $t$ test on $\hat{u}_{t-1}$ is still valid, provided all independent variables are included as regressors along with $\hat{u}_{t-1}$. We can use an $F$ or an LM statistic to test for higher order serial correlation.

In models with strictly exogenous regressors, we can use a feasible GLS procedure, Cochrane-Orcutt or Prais-Winsten, to correct for AR(1) serial correlation. This gives estimates that are different from the OLS estimates: the FGLS estimates are obtained from OLS on quasi-differenced variables. All of the usual test statistics from the transformed equation are asymptotically valid. Almost all regression packages have built-in features for estimating models with AR(1) errors.

Another way to deal with serial correlation, especially when the strict exogeneity assumption might fail, is to use OLS but to compute serial correlation-robust standard errors (that are also robust to heteroskedasticity). Many regression packages follow a method suggested by Newey and West (1987); it is also possible to use standard regression packages to obtain one standard error at a time.

Finally, we discussed some special features of heteroskedasticity in time series models. As in the cross-sectional case, the most important kind of heteroskedasticity is that which depends on the explanatory variables; this is what determines whether the usual OLS statistics are valid. The Breusch-Pagan and White tests covered in Chapter 8 can be applied directly, with the caveat that the errors should not be serially correlated. In recent years, economists, especially those who study the financial markets, have become interested in dynamic forms of heteroskedasticity. The ARCH model is the leading example.

Key Terms

AR(1) Serial Correlation; Autoregressive Conditional Heteroskedasticity (ARCH); Breusch-Godfrey Test; Cochrane-Orcutt (CO) Estimation; Durbin-Watson (DW) Statistic; Feasible GLS (FGLS); Prais-Winsten (PW) Estimation; Quasi-Differenced Data; Serial Correlation-Robust Standard Error; Weighted Least Squares.

Problems

1 When the errors in a regression model have AR(1) serial correlation, why do the OLS standard errors tend to underestimate the sampling variation in the $\hat{\beta}_j$? Is it always true that the OLS standard errors are too small?

2 Explain what is wrong with the following statement: "The Cochrane-Orcutt and Prais-Winsten methods are both used to obtain valid standard errors for the OLS estimates when there is serial correlation."
3 In Example 10.6, we used the data in FAIR to estimate a variant on Fair's model for predicting presidential election outcomes in the United States.
(i) What argument can be made for the error term in this equation being serially uncorrelated? (Hint: How often do presidential elections take place?)
(ii) When the OLS residuals from (10.23) are regressed on the lagged residuals, we obtain $\hat{\rho} = -.068$ and se($\hat{\rho}$) $= .240$. What do you conclude about serial correlation in the $u_t$?
(iii) Does the small sample size in this application worry you in testing for serial correlation?

4 True or false: "If the errors in a regression model contain ARCH, they must be serially correlated."

5 (i) In the enterprise zone event study in Computer Exercise C5 in Chapter 10, a regression of the OLS residuals on the lagged residuals produces $\hat{\rho} = .841$ and se($\hat{\rho}$) $= .053$. What implications does this have for OLS?
(ii) If you want to use OLS but also want to obtain a valid standard error for the EZ coefficient, what would you do?

6 In Example 12.8, we found evidence of heteroskedasticity in $u_t$ in equation (12.47). Thus, we compute the heteroskedasticity-robust standard errors [in brackets] along with the usual standard errors:

$$\widehat{return}_t = .180 + .059\,return_{t-1}$$
$$\qquad\;\;(.081)\quad(.038)$$
$$\qquad\;\;[.085]\quad[.069]$$
$$n = 689, \quad R^2 = .0035, \quad \bar{R}^2 = .0020.$$

What does using the heteroskedasticity-robust $t$ statistic do to the significance of $return_{t-1}$?

7 Consider a standard multiple linear regression model with time series data: $y_t = \beta_0 + \beta_1 x_{t1} + \dots + \beta_k x_{tk} + u_t$. Assume that Assumptions TS.1, TS.2, TS.3, and TS.4 all hold.
(i) Suppose we think that the errors $\{u_t\}$ follow an AR(1) model with parameter $\rho$, and so we apply the Prais-Winsten method. If the errors do not follow an AR(1) model (for example, suppose they follow an AR(2) model, or an MA(1) model), why will the usual Prais-Winsten standard errors be incorrect?
(ii) Can you think of a way to use the Newey-West procedure, in conjunction with Prais-Winsten estimation, to obtain valid standard errors? Be very specific about the steps you would follow. [Hint: It may help to study equation (12.32) and note that, if $\{u_t\}$ does not follow an AR(1) process, $e_t$ generally should be replaced by $u_t - \rho u_{t-1}$, where $\rho$ is the probability limit of the estimator $\hat{\rho}$. Now, is the error $\{u_t - \rho u_{t-1}\}$ serially uncorrelated in general? What can you do if it is not?]
(iii) Explain why your answer to part (ii) should not change if we drop Assumption TS.4.

Computer Exercises

C1 In Example 11.6, we estimated a finite DL model in first differences (changes): $cgfr_t = \gamma_0 + \delta_0 cpe_t + \delta_1 cpe_{t-1} + \delta_2 cpe_{t-2} + u_t$. Use the data in FERTIL3 to test whether there is AR(1) serial correlation in the errors.

C2 (i) Using the data in WAGEPRC, estimate the distributed lag model from Problem 5 in Chapter 11. Use regression (12.14) to test for AR(1) serial correlation.
(ii) Reestimate the model using iterated Cochrane-Orcutt estimation. What is your new estimate of the long-run propensity?
(iii) Using iterated CO, find the standard error for the LRP. (This requires you to estimate a modified equation.) Determine whether the estimated LRP is statistically different from one at the 5% level.

C3 (i) In part (i) of Computer Exercise C6 in Chapter 11, you were asked to estimate the accelerator model for inventory investment. Test this equation for AR(1) serial correlation.
(ii) If you find evidence of serial correlation, reestimate the equation by Cochrane-Orcutt and compare the results.

C4 (i) Use NYSE to estimate equation (12.48). Let $\hat{h}_t$ be the fitted values from this equation (the estimates of the conditional variance). How many $\hat{h}_t$ are negative?
(ii) Add $return^2_{t-1}$ to (12.48) and again compute the fitted values, $\hat{h}_t$. Are any $\hat{h}_t$ negative?
(iii) Use the $\hat{h}_t$ from part (ii) to estimate (12.47) by weighted least squares (as in Section 8.4). Compare your estimate of $\beta_1$ with that in equation (11.16). Test H$_0$: $\beta_1 = 0$ and compare the outcome when OLS is used.
(iv) Now, estimate (12.47) by WLS, using the estimated ARCH model in (12.51) to obtain the $\hat{h}_t$. Does this change your findings from part (iii)?

C5 Consider the version of Fair's model in Example 10.6. Now, rather than predicting the proportion of the two-party vote received by the Democrat, estimate a linear probability model for whether or not the Democrat wins.
(i) Use the binary variable demwins in place of demvote in (10.23) and report the results in standard form. Which factors affect the probability of winning? Use the data only through 1992.
(ii) How many fitted values are less than zero? How many are greater than one?
(iii) Use the following prediction rule: if $\widehat{demwins} > .5$, you predict the Democrat wins; otherwise, the Republican wins. Using this rule, determine how many of the 20 elections are correctly predicted by the model.
(iv) Plug in the values of the explanatory variables for 1996. What is the predicted probability that Clinton would win the election? Clinton did win; did you get the correct prediction?
(v) Use a heteroskedasticity-robust $t$ test for AR(1) serial correlation in the errors. What do you find?
(vi) Obtain the heteroskedasticity-robust standard errors for the estimates in part (i). Are there notable changes in any $t$ statistics?

C6 (i) In Computer Exercise C7 in Chapter 10, you estimated a simple relationship between consumption growth and growth in disposable income. Test the equation for AR(1) serial correlation, using CONSUMP.
(ii) In Computer Exercise C7 in Chapter 11, you tested the permanent income hypothesis by regressing the growth in consumption on one lag. After running this regression, test for heteroskedasticity by regressing the squared residuals on $gc_{t-1}$ and $gc^2_{t-1}$. What do you conclude?

C7 (i) For Example 12.4, using the data in BARIUM, obtain the iterative Cochrane-Orcutt estimates.
(ii) Are the Prais-Winsten and Cochrane-Orcutt estimates similar? Did you expect them to be?

C8 Use the data in TRAFFIC2 for this exercise.
(i) Run an OLS regression of prcfat on a linear time trend, monthly dummy variables, and the variables wkends, unem, spdlaw, and beltlaw. Test the errors for AR(1) serial correlation using the regression in equation (12.14). Does it make sense to use the test that assumes strict exogeneity of the regressors?
(ii) Obtain serial correlation- and heteroskedasticity-robust standard errors for the coefficients on spdlaw and beltlaw, using four lags in the Newey-West estimator. How does this affect the statistical significance of the two policy variables?
(iii) Now, estimate the model using iterative Prais-Winsten and compare the estimates with the OLS estimates. Are there important changes in the policy variable coefficients or their statistical significance?

C9 The file FISH contains 97 daily price and quantity observations on fish prices at the Fulton Fish Market in New York City. Use the variable log(avgprc) as the dependent variable.
(i) Regress log(avgprc) on four daily dummy variables, with Friday as the base. Include a linear time trend. Is there evidence that price varies systematically within a week?
(ii) Now, add the variables wave2 and wave3, which are measures of wave heights over the past several days. Are these variables individually significant? Describe a mechanism by which stormy seas would increase the price of fish.
(iii) What happened to the time trend when wave2 and wave3 were added to the regression? What must be going on?
(iv) Explain why all explanatory variables in the regression are safely assumed to be strictly exogenous.
(v) Test the errors for AR(1) serial correlation.
(vi) Obtain the Newey-West standard errors using four lags. What happens to the $t$ statistics on wave2 and wave3? Did you expect a bigger or smaller change compared with the usual OLS $t$ statistics?
(vii) Now, obtain the Prais-Winsten estimates for the model estimated in part (ii). Are wave2 and wave3 jointly statistically significant?

C10 Use the data in PHILLIPS to answer these questions.
(i) Using the entire data set, estimate the static Phillips curve equation $inf_t = \beta_0 + \beta_1 unem_t + u_t$ by OLS and report the results in the usual form.
(ii) Obtain the OLS residuals from part (i), $\hat{u}_t$, and obtain $\hat{\rho}$ from the regression $\hat{u}_t$ on $\hat{u}_{t-1}$. (It is fine to include an intercept in this regression.) Is there strong evidence of serial correlation?
(iii) Now, estimate the static Phillips curve model by iterative Prais-Winsten. Compare the estimate of $\beta_1$ with that obtained in Table 12.2. Is there much difference in the estimate when the later years are added?
(iv) Rather than using Prais-Winsten, use iterative Cochrane-Orcutt. How similar are the final estimates of $\rho$? How similar are the PW and CO estimates of $\beta_1$?

C11 Use the data in NYSE to answer these questions.
(i) Estimate the model in equation (12.47) and obtain the squared OLS residuals. Find the average, minimum, and maximum values of $\hat{u}_t^2$ over the sample.
(ii) Use the squared OLS residuals to estimate the following model of heteroskedasticity: $\text{Var}(u_t|return_{t-1}, return_{t-2}, \dots) = \text{Var}(u_t|return_{t-1}) = \delta_0 + \delta_1 return_{t-1} + \delta_2 return^2_{t-1}$. Report the estimated coefficients, the reported standard errors, the R-squared, and the adjusted R-squared.
(iii) Sketch the conditional variance as a function of the lagged return, $return_{t-1}$. For what value of $return_{t-1}$ is the variance the smallest, and what is the variance?
C11 Use the data in NYSE to answer these questions.
(i) Estimate the model in equation (12.47) and obtain the squared OLS residuals. Find the average, minimum, and maximum values of $\hat{u}^2_t$ over the sample.
(ii) Use the squared OLS residuals to estimate the following model of heteroskedasticity:
$$\mathrm{Var}(u_t \mid return_{t-1}, return_{t-2}, \ldots) = \mathrm{Var}(u_t \mid return_{t-1}) = \delta_0 + \delta_1 return_{t-1} + \delta_2 return^2_{t-1}.$$
Report the estimated coefficients, the reported standard errors, the R-squared, and the adjusted R-squared.
(iii) Sketch the conditional variance as a function of the lagged return, $return_{-1}$. For what value of $return_{-1}$ is the variance the smallest, and what is the variance?
(iv) For predicting the dynamic variance, does the model in part (ii) produce any negative variance estimates?
(v) Does the model in part (ii) seem to fit better or worse than the ARCH(1) model in Example 12.9? Explain.
(vi) To the ARCH(1) regression in equation (12.51), add the second lag, $\hat{u}^2_{t-2}$. Does this lag seem important? Does the ARCH(2) model fit better than the model in part (ii)?

C12 Use the data in INVEN for this exercise; see also Computer Exercise C6 in Chapter 11.
(i) Obtain the OLS residuals from the accelerator model $\Delta inven_t = \beta_0 + \beta_1 \Delta GDP_t + u_t$ and use the regression $\hat{u}_t$ on $\hat{u}_{t-1}$ to test for serial correlation. What is the estimate of $\rho$? How big a problem does serial correlation seem to be?
(ii) Estimate the accelerator model by PW, and compare the estimate of $\beta_1$ to the OLS estimate. Why do you expect them to be similar?

C13 Use the data in OKUN to answer this question; see also Computer Exercise C11 in Chapter 11.
(i) Estimate the equation $pcrgdp_t = \beta_0 + \beta_1 cunem_t + u_t$ and test the errors for AR(1) serial correlation, without assuming $\{cunem_t : t = 1, 2, \ldots\}$ is strictly exogenous. What do you conclude?
(ii) Regress the squared residuals, $\hat{u}^2_t$, on $cunem_t$ (this is the Breusch-Pagan test for heteroskedasticity in the simple regression case). What do you conclude?
(iii) Obtain the heteroskedasticity-robust standard error for the OLS estimate $\hat{\beta}_1$. Is it substantially different from the usual OLS standard error?

C14 Use the data in MINWAGE for this exercise, focusing on sector 232.
(i) Estimate the equation $gwage232_t = \beta_0 + \beta_1 gmwage_t + \beta_2 gcpi_t + u_t$ and test the errors for AR(1) serial correlation. Does it matter whether you assume $gmwage_t$ and $gcpi_t$ are strictly exogenous? What do you conclude overall?
(ii) Obtain the Newey-West standard error for the OLS estimates in part (i), using a lag of 12. How do the Newey-West standard errors compare to the usual OLS standard errors?
(iii) Now, obtain the heteroskedasticity-robust standard errors for OLS, and compare them with the usual standard errors and the Newey-West standard errors. Does it appear that serial correlation or heteroskedasticity is more of a problem in this application?
(iv) Use the Breusch-Pagan test in the original equation to verify that the errors exhibit strong heteroskedasticity.
(v) Add lags 1 through 12 of gmwage to the equation in part (i). Obtain the p-value for the joint F test for lags 1 through 12, and compare it with the p-value for the heteroskedasticity-robust test. How does adjusting for heteroskedasticity affect the significance of the lags?
(vi) Obtain the p-value for the joint significance test in part (v) using the Newey-West approach. What do you conclude now?
(vii) If you leave out the lags of gmwage, is the estimate of the long-run propensity much different?
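Newey-West (serial correlation- and heteroskedasticity-robust) standard errors are available in statsmodels through the HAC covariance option. A minimal sketch with the lag window g = 12 of C14(ii); `y` and `X` are hypothetical stand-ins for the series built from the data set.

```python
import statsmodels.api as sm

# cov_type="HAC" with maxlags=g reproduces the Newey-West estimator.
nw = sm.OLS(y, sm.add_constant(X)).fit(cov_type="HAC", cov_kwds={"maxlags": 12})
print(nw.bse)   # robust standard errors to compare with the usual OLS ones
```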
C15 Use the data in BARIUM to answer this question.
(i) In Table 12.1, the reported standard errors for OLS are uniformly below those of the corresponding standard errors for GLS (Prais-Winsten). Explain why comparing the OLS and GLS standard errors is flawed.
(ii) Reestimate the equation represented by the column labeled "OLS" in Table 12.1 by OLS, but now find the Newey-West standard errors using a window g = 4 (four months). How does the Newey-West standard error on lchempi compare to the usual OLS standard error? How does it compare to the PW standard error? Make the same comparisons for the afdec6 variable.
(iii) Redo part (ii), now using a window g = 12. What happens to the standard errors on lchempi and afdec6 when the window increases from 4 to 12?

C16 Use the data in APPROVAL to answer the following questions; see also Computer Exercise C14 in Chapter 11.
(i) Estimate the equation
$$approve_t = \beta_0 + \beta_1 lcpifood_t + \beta_2 lrgasprice_t + \beta_3 unemploy_t + \beta_4 sep11_t + \beta_5 iraqinvade_t + u_t$$
using first differencing and test the errors in the first-differenced (FD) equation for AR(1) serial correlation. In particular, let $\hat{e}_t$ be the OLS residuals in the FD estimation and regress $\hat{e}_t$ on $\hat{e}_{t-1}$; report the p-value of the test. What is the estimate of $\rho$?
(ii) Estimate the FD equation using Prais-Winsten. How does the estimate of $\beta_2$ compare with the OLS estimate on the FD equation? What about its statistical significance?
(iii) Return to estimating the FD equation by OLS. Now obtain the Newey-West standard errors using lags of one, four, and eight. Discuss the statistical significance of the estimate of $\beta_2$ using each of the three standard errors.

Part 3: Advanced Topics

We now turn to some more specialized topics that are not usually covered in a one-term introductory course. Some of these topics require few more mathematical skills than the multiple regression analysis did in Parts 1 and 2.

In Chapter 13, we show how to apply multiple regression to independently pooled cross sections. The issues raised are very similar to standard cross-sectional analysis, except that we can study how relationships change over time by including time dummy variables. We also illustrate how panel data sets can be analyzed in a regression framework. Chapter 14 covers more advanced panel data methods that are nevertheless used routinely in applied work.

Chapters 15 and 16 investigate the problem of endogenous explanatory variables. In Chapter 15, we introduce the method of instrumental variables as a way of solving the omitted variable problem as well as the measurement error problem. The method of two-stage least squares is used quite often in empirical economics and is indispensable for estimating simultaneous equation models, a topic we turn to in Chapter 16.

Chapter 17 covers some fairly advanced topics that are typically used in cross-sectional analysis, including models for limited dependent variables and methods for correcting sample selection bias. Chapter 18 heads in a different direction by covering some recent advances in time series econometrics that have proven to be useful in estimating dynamic relationships.

Chapter 19 should be helpful to students who must write either a term paper or some other paper in the applied social sciences. The chapter offers suggestions for how to select a topic, collect and analyze the data, and write the paper.
Chapter 13: Pooling Cross Sections across Time: Simple Panel Data Methods

Until now, we have covered multiple regression analysis using pure cross-sectional or pure time series data. Although these two cases arise often in applications, data sets that have both cross-sectional and time series dimensions are being used more and more often in empirical research. Multiple regression methods can still be used on such data sets. In fact, data with cross-sectional and time series aspects can often shed light on important policy questions. We will see several examples in this chapter.

We will analyze two kinds of data sets in this chapter. An independently pooled cross section is obtained by sampling randomly from a large population at different points in time (usually, but not necessarily, different years). For instance, in each year, we can draw a random sample on hourly wages, education, experience, and so on, from the population of working people in the United States. Or, in every other year, we draw a random sample on the selling price, square footage, number of bathrooms, and so on, of houses sold in a particular metropolitan area. From a statistical standpoint, these data sets have an important feature: they consist of independently sampled observations. This was also a key aspect in our analysis of cross-sectional data: among other things, it rules out correlation in the error terms across different observations.

An independently pooled cross section differs from a single random sample in that sampling from the population at different points in time likely leads to observations that are not identically distributed. For example, distributions of wages and education have changed over time in most countries. As we will see, this is easy to deal with in practice by allowing the intercept in a multiple regression
model, and in some cases the slopes, to change over time. We cover such models in Section 13.1. In Section 13.2, we discuss how pooling cross sections over time can be used to evaluate policy changes.

A panel data set, while having both a cross-sectional and a time series dimension, differs in some important respects from an independently pooled cross section. To collect panel data (sometimes called longitudinal data), we follow, or attempt to follow, the same individuals, families, firms, cities, states, or whatever, across time. For example, a panel data set on individual wages, hours, education, and other factors is collected by randomly selecting people from a population at a given point in time. Then, these same people are reinterviewed at several subsequent points in time. This gives us data on wages, hours, education, and so on, for the same group of people in different years. Panel data sets are fairly easy to collect for school districts, cities, counties, states, and countries, and policy analysis is greatly enhanced by using panel data sets; we will see some examples in the following discussion.

For the econometric analysis of panel data, we cannot assume that the observations are independently distributed across time. For example, unobserved factors (such as ability) that affect someone's wage in 1990 will also affect that person's wage in 1991; unobserved factors that affect a city's crime rate in 1985 will also affect that city's crime rate in 1990. For this reason, special models and methods have been developed to analyze panel data. In Sections 13.3, 13.4, and 13.5, we describe the straightforward method of differencing to remove time-constant, unobserved attributes of the units being studied. Because panel data methods are somewhat more advanced, we will rely mostly on intuition in describing the statistical properties of the estimation procedures, leaving detailed assumptions to the chapter appendix. We follow the same strategy in Chapter 14, which covers more complicated panel data methods.

13.1 Pooling Independent Cross Sections across Time

Many surveys of individuals, families, and firms are repeated at regular intervals, often each year. An example is the Current Population Survey (or CPS), which randomly samples households each year. (See, for example, CPS7885, which contains data from the 1978 and 1985 CPS.) If a random sample is drawn at each time period, pooling the resulting random samples gives us an independently pooled cross section.

One reason for using independently pooled cross sections is to increase the sample size. By pooling random samples drawn from the same population, but at different points in time, we can get more precise estimators and test statistics with more power. Pooling is helpful in this regard only insofar as the relationship between the dependent variable and at least some of the independent variables remains constant over time.

As mentioned in the introduction, using pooled cross sections raises only minor statistical complications. Typically, to reflect the fact that the population may have different distributions in different time periods, we allow the intercept to differ across periods, usually years. This is easily accomplished by including dummy variables for all but one year, where the earliest year in the sample is usually chosen as the base year. It is also possible that the error variance changes over time, something we discuss later.

Sometimes, the pattern of coefficients on the year dummy variables is itself of interest. For example, a demographer may be interested in the following question: After controlling for education, has the pattern of fertility among women over age 35 changed between 1972 and 1984? The following example illustrates how this question is simply answered by using multiple regression analysis with year dummy variables; a code sketch of the year-dummy device is given immediately below.
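As a concrete illustration of the year-dummy device, here is a minimal sketch using the statsmodels formula interface. The data frame `fertil` and its column names mirror the FERTIL1 variables used in Example 13.1 below, but the file name, its format, and the presence of a `year` column are assumptions.

```python
import pandas as pd
import statsmodels.formula.api as smf

fertil = pd.read_csv("fertil1.csv")   # assumed CSV export of FERTIL1

# C(year) creates a dummy variable for each year except the base (1972),
# letting the intercept differ across the even years 1972-1984.
model = smf.ols(
    "kids ~ educ + age + I(age**2) + black + east + northcen + west"
    " + farm + othrural + town + smcity + C(year)",
    data=fertil,
).fit()
print(model.summary())
```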
Example 13.1: Women's Fertility over Time

The data set in FERTIL1, which is similar to that used by Sander (1992), comes from the National Opinion Research Center's General Social Survey for the even years from 1972 to 1984, inclusively. We use these data to estimate a model explaining the total number of kids born to a woman (kids). One question of interest is: After controlling for other observable factors, what has happened to fertility rates over time? The factors we control for are years of education, age, race, region of the country where living at age 16, and living environment at age 16. The estimates are given in Table 13.1.

Table 13.1: Determinants of Women's Fertility. Dependent variable: kids. Standard errors in parentheses.

educ: −.128 (.018)
age: .532 (.138)
age²: −.0058 (.0016)
black: 1.076 (.174)
east: .217 (.133)
northcen: .363 (.121)
west: .198 (.167)
farm: −.053 (.147)
othrural: −.163 (.175)
town: .084 (.124)
smcity: .212 (.160)
y74: .268 (.173)
y76: −.097 (.179)
y78: −.069 (.182)
y80: −.071 (.183)
y82: −.522 (.172)
y84: −.545 (.175)
constant: −7.742 (3.052)
n = 1,129; R² = .1295; adjusted R² = .1162

The base year is 1972. The coefficients on the year dummy variables show a sharp drop in fertility in the early 1980s. For example, the coefficient on y82 implies that, holding education, age, and other factors fixed, a woman had on average .52 less children, or about one-half a child, in 1982 than in 1972. This is a very large drop: holding educ, age, and the other factors fixed, 100 women in 1982 are predicted to have about 52 fewer children than 100 comparable women in 1972. Since we are controlling for education, this drop is separate from the decline in fertility that is due to the increase in average education levels. (The average years of education are 12.2 for 1972 and 13.3 for 1984.) The coefficients on y82 and y84 represent drops in fertility for reasons that are not captured in the explanatory variables.

Given that the 1982 and 1984 year dummies are individually quite significant, it is not surprising that as a group the year dummies are jointly very significant: the R-squared for the regression without the year dummies is .1019, and this leads to $F(6, 1111) = 5.87$ and p-value ≈ 0.

Women with more education have fewer children, and the estimate is very statistically significant. Other things being equal, 100 women with a college education will have about 51 fewer children on average than 100 women with only a high school education: .128(4) = .512. Age has a diminishing effect on fertility. (The turning point in the quadratic is at about age = 46, by which time most women have finished having children.)

The model estimated in Table 13.1 assumes that the effect of each explanatory variable, particularly education, has remained constant. This may or may not be true; you will be asked to explore this issue in Computer Exercise C1.

Finally, there may be heteroskedasticity in the error term underlying the estimated equation. This can be dealt with using the methods in Chapter 8. There is one interesting difference here: now, the error variance may change over time even if it does not change with the values of educ, age, black, and so on. The heteroskedasticity-robust standard errors and test statistics are nevertheless valid. The Breusch-Pagan test would be obtained by regressing the squared OLS residuals on all of the independent variables in Table 13.1, including the year dummies. For the special case of the White statistic, the fitted values $\widehat{kids}$ and the squared fitted values are used as the independent variables, as always. A weighted least squares procedure should account for variances that possibly change over time. In the procedure discussed in Section 8.4, year dummies would be included in equation (8.32).
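A minimal sketch of the Breusch-Pagan test and the special-case White statistic just described, assuming `model` is a fitted statsmodels OLS results object such as the one from the earlier sketch:

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan

# Breusch-Pagan: squared residuals on all regressors (including year dummies).
lm_stat, lm_pval, f_stat, f_pval = het_breuschpagan(model.resid, model.model.exog)
print("BP F statistic:", f_stat, "p-value:", f_pval)

# White special case: squared residuals on fitted values and their squares.
fv = model.fittedvalues
aux = sm.OLS(model.resid**2, sm.add_constant(np.column_stack([fv, fv**2]))).fit()
print("White special-case F:", aux.fvalue, "p-value:", aux.f_pvalue)
```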
We can also interact a year dummy variable with key explanatory variables to see if the effect of that variable has changed over a certain time period. The next example examines how the return to education and the gender gap have changed from 1978 to 1985.

Exploring Further 13.1: In reading Table 13.1, someone claims that, if everything else is equal in the table, a black woman is expected to have one more child than a nonblack woman. Do you agree with this claim?

Example 13.2: Changes in the Return to Education and the Gender Wage Gap

A log(wage) equation (where wage is hourly wage) pooled across the years 1978 (the base year) and 1985 is

$$\log(wage) = \beta_0 + \delta_0 y85 + \beta_1 educ + \delta_1 y85 \cdot educ + \beta_2 exper + \beta_3 exper^2 + \beta_4 union + \beta_5 female + \delta_5 y85 \cdot female + u, \quad (13.1)$$

where most explanatory variables should by now be familiar. The variable union is a dummy variable equal to one if the person belongs to a union, and zero otherwise. The variable y85 is a dummy variable equal to one if the observation comes from 1985 and zero if it comes from 1978. There are 550 people in the sample in 1978 and a different set of 534 people in 1985.

The intercept for 1978 is $\beta_0$, and the intercept for 1985 is $\beta_0 + \delta_0$. The return to education in 1978 is $\beta_1$, and the return to education in 1985 is $\beta_1 + \delta_1$. Therefore, $\delta_1$ measures how the return to another year of education has changed over the seven-year period. Finally, in 1978, the log(wage) differential between women and men is $\beta_5$; the differential in 1985 is $\beta_5 + \delta_5$. Thus, we can test the null hypothesis that nothing has happened to the gender differential over this seven-year period by testing $H_0\colon \delta_5 = 0$. The alternative that the gender differential has been reduced is $H_1\colon \delta_5 > 0$. For simplicity, we have assumed that experience and union membership have the same effect on wages in both time periods.

Before we present the estimates, there is one other issue we need to address: namely, hourly wage here is in nominal (or current) dollars. Since nominal wages grow simply due to inflation, we are really interested in the effect of each explanatory variable on real wages. Suppose that we settle on measuring wages in 1978 dollars. This requires deflating 1985 wages to 1978 dollars. (Using the Consumer Price Index for the 1997 Economic Report of the President, the deflation factor is 107.6/65.2 ≈ 1.65.) Although we can easily divide each 1985 wage by 1.65, it turns out that this is not necessary, provided a 1985 year dummy is included in the regression and log(wage), as opposed to wage, is used as the dependent variable. Using real or nominal wage in a logarithmic functional form only affects the coefficient on the year dummy, y85. To see this, let P85 denote the deflation factor for 1985 wages (1.65 if we use the CPI). Then, the log of the real wage for each person i in the 1985 sample is
$$\log(wage_i / P85) = \log(wage_i) - \log(P85).$$

Now, while $wage_i$ differs across people, P85 does not. Therefore, log(P85) will be absorbed into the intercept for 1985. (This conclusion would change if, for example, we used a different price index for people living in different parts of the country.) The bottom line is that, for studying how the return to education or the gender gap has changed, we do not need to turn nominal wages into real wages in equation (13.1). Computer Exercise C2 asks you to verify this for the current example.

If we forget to allow different intercepts in 1978 and 1985, the use of nominal wages can produce seriously misleading results. If we use wage rather than log(wage) as the dependent variable, it is important to use the real wage and to include a year dummy.

The previous discussion generally holds when using dollar values for either the dependent or independent variables. Provided the dollar amounts appear in logarithmic form and dummy variables are used for all time periods (except, of course, the base period), the use of aggregate price deflators will only affect the intercepts; none of the slope estimates will change.

Now, we use the data in CPS7885 to estimate the equation

$$\widehat{\log(wage)} = .459 + .118\, y85 + .0747\, educ + .0185\, y85 \cdot educ + .0296\, exper - .00040\, exper^2 + .202\, union - .317\, female + .085\, y85 \cdot female$$
$$(.093)\ (.124)\ (.0067)\ (.0094)\ (.0036)\ (.00008)\ (.030)\ (.037)\ (.051) \quad (13.2)$$
$$n = 1,084,\ R^2 = .426,\ \bar{R}^2 = .422.$$

The return to education in 1978 is estimated to be about 7.5%; the return to education in 1985 is about 1.85 percentage points higher, or about 9.35%. Because the t statistic on the interaction term is .0185/.0094 ≈ 1.97, the difference in the return to education is statistically significant at the 5% level against a two-sided alternative.

What about the gender gap? In 1978, other things being equal, a woman earned about 31.7% less than a man (27.2% is the more accurate estimate). In 1985, the gap in log(wage) is −.317 + .085 = −.232. Therefore, the gender gap appears to have fallen from 1978 to 1985 by about 8.5 percentage points. The t statistic on the interaction term is about 1.67, which means it is significant at the 5% level against the positive one-sided alternative.

What happens if we interact all independent variables with y85 in equation (13.2)? This is identical to estimating two separate equations, one for 1978 and one for 1985. Sometimes, this is desirable. For example, in Chapter 7, we discussed a study by Krueger (1993), in which he estimated the return to using a computer on the job. Krueger estimates two separate equations, one using the 1984 CPS and the other using the 1989 CPS. By comparing how the return to education changes across time and whether or not computer usage is controlled for, he estimates that one-third to one-half of the observed increase in the return to education over the five-year period can be attributed to increased computer usage. [See Tables VIII and IX in Krueger (1993).]
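The interaction specification in equation (13.1) maps directly into a regression formula. A minimal sketch, assuming CPS7885 has been loaded into a data frame `cps` and that `lwage` is an assumed name for log(wage):

```python
import statsmodels.formula.api as smf

# y85:educ and y85:female let the education return and the gender gap
# differ between 1978 (the base year) and 1985, as in equation (13.1).
eq131 = smf.ols(
    "lwage ~ y85 + educ + y85:educ + exper + I(exper**2)"
    " + union + female + y85:female",
    data=cps,
).fit()
print(eq131.params)
```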
13.1a The Chow Test for Structural Change across Time

In Chapter 7, we discussed how the Chow test (which is simply an F test) can be used to determine whether a multiple regression function differs across two groups. We can apply that test to two different time periods as well. One form of the test obtains the sum of squared residuals from the pooled estimation as the restricted SSR. The unrestricted SSR is the sum of the SSRs for the two separately estimated time periods. The mechanics of computing the statistic are exactly as they were in Section 7.4. (A heteroskedasticity-robust version is also available; see Section 8.2.)

Example 13.2 suggests another way to compute the Chow test for two time periods: by interacting each variable with a year dummy for one of the two years and testing for joint significance of the year dummy and all of the interaction terms. Since the intercept in a regression model often changes over time (due to, say, inflation in the housing price example), this full-blown Chow test can detect such changes. It is usually more interesting to allow for an intercept difference and then to test whether certain slope coefficients change over time, as we did in Example 13.2.

A Chow test can also be computed for more than two time periods. Just as in the two-period case, it is usually more interesting to allow the intercepts to change over time and then test whether the slope coefficients have changed over time. We can test the constancy of slope coefficients generally by interacting all of the time-period dummies (except that defining the base group) with one, several, or all of the explanatory variables and testing the joint significance of the interaction terms. Computer Exercises C1 and C2 are examples.

For many time periods and explanatory variables, constructing a full set of interactions can be tedious. Alternatively, we can adapt the approach described in part (vi) of Computer Exercise C11 in Chapter 7. First, estimate the restricted model by doing a pooled regression allowing for different time intercepts; this gives $SSR_r$. Then, run a regression for each of the, say, T time periods and obtain the sum of squared residuals for each time period. The unrestricted sum of squared residuals is obtained as $SSR_{ur} = SSR_1 + SSR_2 + \cdots + SSR_T$. If there are k explanatory variables (not including the intercept or the time dummies) with T time periods, then we are testing $(T-1)k$ restrictions, and there are $T + Tk$ parameters estimated in the unrestricted model. So, if $n = n_1 + n_2 + \cdots + n_T$ is the total number of observations, then the df of the F test are $(T-1)k$ and $n - T - Tk$. We compute the F statistic as usual:

$$F = \frac{SSR_r - SSR_{ur}}{SSR_{ur}} \cdot \frac{n - T - Tk}{(T-1)k}.$$

Unfortunately, as with any F test based on sums of squared residuals or R-squareds, this test is not robust to heteroskedasticity (including changing variances across time). To obtain a heteroskedasticity-robust test, we must construct the interaction terms and do a pooled regression; a code sketch of the SSR form of the test follows.
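A minimal sketch of the SSR form of the Chow test across T time periods (the non-robust version). The data frame `df`, the column names, and the value k = 2 are placeholders for whatever pooled data set is at hand.

```python
import statsmodels.formula.api as smf

# Restricted model: pooled regression with a different intercept per period.
formula = "y ~ x1 + x2"
restricted = smf.ols(formula + " + C(year)", data=df).fit()

# Unrestricted SSR: sum of SSRs from one regression per time period.
ssr_ur = sum(smf.ols(formula, data=g).fit().ssr for _, g in df.groupby("year"))

T = df["year"].nunique()
k = 2                      # explanatory variables, excluding intercept and time dummies
n = len(df)
F = ((restricted.ssr - ssr_ur) / ssr_ur) * ((n - T - T * k) / ((T - 1) * k))
print("Chow F statistic:", F, "with df", (T - 1) * k, "and", n - T - T * k)
```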
13.2 Policy Analysis with Pooled Cross Sections

Pooled cross sections can be very useful for evaluating the impact of a certain event or policy. The following example of an event study shows how two cross-sectional data sets, collected before and after the occurrence of an event, can be used to determine the effect on economic outcomes.

Example 13.3: Effect of a Garbage Incinerator's Location on Housing Prices

Kiel and McClain (1995) studied the effect that a new garbage incinerator had on housing values in North Andover, Massachusetts. They used many years of data and a fairly complicated econometric analysis. We will use two years of data and some simplified models, but our analysis is similar.

The rumor that a new incinerator would be built in North Andover began after 1978, and construction began in 1981. The incinerator was expected to be in operation soon after the start of construction; the incinerator actually began operating in 1985. We will use data on prices of houses that sold in 1978 and another sample on those that sold in 1981. The hypothesis is that the price of houses located near the incinerator would fall relative to the price of more distant houses.

For illustration, we define a house to be near the incinerator if it is within three miles. [In Computer Exercise C3, you are instead asked to use the actual distance from the house to the incinerator, as in Kiel and McClain (1995).] We will start by looking at the dollar effect on housing prices. This requires us to measure price in constant dollars. We measure all housing prices in 1978 dollars, using the Boston housing price index. Let rprice denote the house price in real terms.

A naive analyst would use only the 1981 data and estimate a very simple model:

$$rprice = \gamma_0 + \gamma_1 nearinc + u, \quad (13.3)$$

where nearinc is a binary variable equal to one if the house is near the incinerator, and zero otherwise. Estimating this equation using the data in KIELMC gives

$$\widehat{rprice} = 101{,}307.5 - 30{,}688.27\, nearinc$$
$$(3{,}093.0)\quad (5{,}827.71) \quad (13.4)$$
$$n = 142,\ R^2 = .165.$$

Since this is a simple regression on a single dummy variable, the intercept is the average selling price for homes not near the incinerator, and the coefficient on nearinc is the difference in the average selling price between homes near the incinerator and those that are not. The estimate shows that the average selling price for the former group was $30,688.27 less than for the latter group. The t statistic is greater than five in absolute value, so we can strongly reject the hypothesis that the average value for homes near and far from the incinerator are the same.

Unfortunately, equation (13.4) does not imply that the siting of the incinerator is causing the lower housing values. In fact, if we run the same regression for 1978 (before the incinerator was even rumored), we obtain

$$\widehat{rprice} = 82{,}517.23 - 18{,}824.37\, nearinc$$
$$(2{,}653.79)\quad (4{,}744.59) \quad (13.5)$$
$$n = 179,\ R^2 = .082.$$

Therefore, even before there was any talk of an incinerator, the average value of a home near the site was $18,824.37 less than the average value of a home not near the site ($82,517.23); the difference is statistically significant, as well. This is consistent with the view that the incinerator was built in an area with lower housing values.

How, then, can we tell whether building a new incinerator depresses housing values? The key is to look at how the coefficient on nearinc changed between 1978 and 1981. The difference in average housing value was much larger in 1981 than in 1978 ($30,688.27 versus $18,824.37), even as a percentage of the average value of homes not near the incinerator site. The difference in the two coefficients on nearinc is

$$\hat{\delta}_1 = -30{,}688.27 - (-18{,}824.37) = -11{,}863.9.$$

This is our estimate of the effect of the incinerator on values of homes near the incinerator site. In empirical economics, $\hat{\delta}_1$ has become known as the difference-in-differences estimator because it can be expressed as
$$\hat{\delta}_1 = (\overline{rprice}_{81,nr} - \overline{rprice}_{81,fr}) - (\overline{rprice}_{78,nr} - \overline{rprice}_{78,fr}), \quad (13.6)$$

where nr stands for "near the incinerator site" and fr stands for "farther away from the site." In other words, $\hat{\delta}_1$ is the difference over time in the average difference of housing prices in the two locations.

To test whether $\hat{\delta}_1$ is statistically different from zero, we need to find its standard error by using a regression analysis. In fact, $\hat{\delta}_1$ can be obtained by estimating

$$rprice = \beta_0 + \delta_0 y81 + \beta_1 nearinc + \delta_1 y81 \cdot nearinc + u, \quad (13.7)$$

using the data pooled over both years. The intercept, $\beta_0$, is the average price of a home not near the incinerator in 1978. The parameter $\delta_0$ captures changes in all housing values in North Andover from 1978 to 1981. [A comparison of equations (13.4) and (13.5) shows that housing values in North Andover, relative to the Boston housing price index, increased sharply over this period.] The coefficient on nearinc, $\beta_1$, measures the location effect that is not due to the presence of the incinerator: as we saw in equation (13.5), even in 1978, homes near the incinerator site sold for less than homes farther away from the site.

The parameter of interest is on the interaction term y81·nearinc: $\delta_1$ measures the decline in housing values due to the new incinerator, provided we assume that houses both near and far from the site did not appreciate at different rates for other reasons.

The estimates of equation (13.7) are given in column (1) of Table 13.2. The only number we could not obtain from equations (13.4) and (13.5) is the standard error of $\hat{\delta}_1$. The t statistic on $\hat{\delta}_1$ is about −1.59, which is marginally significant against a one-sided alternative (p-value ≈ .057).

Kiel and McClain (1995) included various housing characteristics in their analysis of the incinerator siting. There are two good reasons for doing this. First, the kinds of homes selling near the incinerator in 1981 might have been systematically different than those selling near the incinerator in 1978; if so, it can be important to control for such characteristics. Second, even if the relevant house characteristics did not change, including them can greatly reduce the error variance, which can then shrink the standard error of $\hat{\delta}_1$. (See Section 6.3 for discussion.)

In column (2), we control for the age of the houses, using a quadratic. This substantially increases the R-squared (by reducing the residual variance). The coefficient on y81·nearinc is now much larger in magnitude, and its standard error is lower.

In addition to the age variables in column (2), column (3) controls for distance to the interstate in feet (intst), land area in feet (land), house area in feet (area), number of rooms (rooms), and number of baths (baths). This produces an estimate on y81·nearinc closer to that without any controls, but it yields a much smaller standard error: the t statistic for $\hat{\delta}_1$ is about −2.84. Therefore, we find a much more significant effect in column (3) than in column (1). The column (3) estimates are preferred because they control for the most factors and have the smallest standard errors (except in the constant, which is not important here). The fact that nearinc has a much smaller coefficient and is insignificant in column (3) indicates that the characteristics included in column (3) largely capture the housing characteristics that are most important for determining housing prices.

Table 13.2: Effects of Incinerator Location on Housing Prices. Dependent variable: rprice. Standard errors in parentheses.

Independent Variable | (1) | (2) | (3)
constant | 82,517.23 (2,726.91) | 89,116.54 (2,406.05) | 13,807.67 (11,166.59)
y81 | 18,790.29 (4,050.07) | 21,321.04 (3,443.63) | 13,928.48 (2,798.75)
nearinc | −18,824.37 (4,875.32) | −9,397.94 (4,812.22) | 3,780.34 (4,453.42)
y81·nearinc | −11,863.90 (7,456.65) | −21,920.27 (6,359.75) | −14,177.93 (4,987.27)
Other controls | No | age, age² | Full set
Observations | 321 | 321 | 321
R-squared | .174 | .414 | .660
For the purpose of introducing the method, we used the level of real housing prices in Table 13.2. It makes more sense to use log(price), or log(rprice), in the analysis in order to get an approximate percentage effect. The basic model becomes

$$\log(price) = \beta_0 + \delta_0 y81 + \beta_1 nearinc + \delta_1 y81 \cdot nearinc + u. \quad (13.8)$$

Now, $100 \cdot \delta_1$ is the approximate percentage reduction in housing value due to the incinerator. [Just as in Example 13.2, using log(price) versus log(rprice) only affects the coefficient on y81.] Using the same 321 pooled observations gives

$$\widehat{\log(price)} = 11.29 + .457\, y81 - .340\, nearinc - .063\, y81 \cdot nearinc$$
$$(.31)\ (.045)\ (.055)\ (.083) \quad (13.9)$$
$$n = 321,\ R^2 = .409.$$

The coefficient on the interaction term implies that, because of the new incinerator, houses near the incinerator lost about 6.3% in value. However, this estimate is not statistically different from zero. But when we use a full set of controls, as in column (3) of Table 13.2 (but with intst, land, and area appearing in logarithmic form), the coefficient on y81·nearinc becomes −.132 with a t statistic of about −2.53. Again, controlling for other factors turns out to be important. Using the logarithmic form, we estimate that houses near the incinerator were devalued by about 13.2%.
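Equation (13.7) is an ordinary OLS regression on the pooled sample, so the difference-in-differences estimate and its standard error come straight out of one fitted model. A minimal sketch, assuming KIELMC is available as a data frame `kielmc` with columns rprice, y81, and nearinc:

```python
import statsmodels.formula.api as smf

# y81:nearinc is the difference-in-differences term: its coefficient
# estimates delta_1 in (13.7), and its t statistic tests delta_1 = 0.
did = smf.ols("rprice ~ y81 + nearinc + y81:nearinc", data=kielmc).fit()
print(did.params["y81:nearinc"], did.bse["y81:nearinc"])
```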
The methodology used in the previous example has numerous applications, especially when the data arise from a natural experiment (or a quasi-experiment). A natural experiment occurs when some exogenous event, often a change in government policy, changes the environment in which individuals, families, firms, or cities operate. A natural experiment always has a control group, which is not affected by the policy change, and a treatment group, which is thought to be affected by the policy change. Unlike a true experiment, in which treatment and control groups are randomly and explicitly chosen, the control and treatment groups in natural experiments arise from the particular policy change. To control for systematic differences between the control and treatment groups, we need two years of data, one before the policy change and one after the change. Thus, our sample is usefully broken down into four groups: the control group before the change, the control group after the change, the treatment group before the change, and the treatment group after the change.

Call C the control group and T the treatment group, letting dT equal unity for those in the treatment group T, and zero otherwise. Then, letting d2 denote a dummy variable for the second (post-policy-change) time period, the equation of interest is

$$y = \beta_0 + \delta_0 d2 + \beta_1 dT + \delta_1 d2 \cdot dT + \text{other factors}, \quad (13.10)$$

where y is the outcome variable of interest. As in Example 13.3, $\delta_1$ measures the effect of the policy. Without other factors in the regression, $\hat{\delta}_1$ will be the difference-in-differences estimator:

$$\hat{\delta}_1 = (\bar{y}_{2,T} - \bar{y}_{2,C}) - (\bar{y}_{1,T} - \bar{y}_{1,C}), \quad (13.11)$$

where the bar denotes average, the first subscript denotes the year, and the second subscript denotes the group. The general difference-in-differences setup is shown in Table 13.3.

Table 13.3: Illustration of the Difference-in-Differences Estimator

Group | Before | After | After − Before
Control | $\beta_0$ | $\beta_0 + \delta_0$ | $\delta_0$
Treatment | $\beta_0 + \beta_1$ | $\beta_0 + \delta_0 + \beta_1 + \delta_1$ | $\delta_0 + \delta_1$
Treatment − Control | $\beta_1$ | $\beta_1 + \delta_1$ | $\delta_1$

Table 13.3 suggests that the parameter $\delta_1$, sometimes called the average treatment effect (because it measures the effect of the "treatment" or policy on the average outcome of y), can be estimated in two ways: (1) compute the differences in averages between the treatment and control groups in each time period, and then difference the results over time, as in equation (13.11); or (2) compute the change in averages over time for each of the treatment and control groups, and then difference these changes, which means we simply write $\hat{\delta}_1 = (\bar{y}_{2,T} - \bar{y}_{1,T}) - (\bar{y}_{2,C} - \bar{y}_{1,C})$. Naturally, the estimate $\hat{\delta}_1$ does not depend on how we do the differencing, as is seen by simple rearrangement.

When explanatory variables are added to equation (13.10) to control for the fact that the populations sampled may differ systematically over the two periods, the OLS estimate of $\delta_1$ no longer has the simple form of (13.11), but its interpretation is similar.

Example 13.4: Effect of Worker Compensation Laws on Weeks out of Work

Meyer, Viscusi, and Durbin (1995) (hereafter, MVD) studied the length of time (in weeks) that an injured worker receives workers' compensation. On July 15, 1980, Kentucky raised the cap on weekly earnings that were covered by workers' compensation. An increase in the cap has no effect on the benefit for low-income workers, but it makes it less costly for a high-income worker to stay on workers' compensation. Therefore, the control group is low-income workers, and the treatment group is high-income workers; high-income workers are defined as those who were subject to the pre-policy-change cap. Using random samples both before and after the policy change, MVD were able to test whether more generous workers' compensation causes people to stay out of work longer (everything else fixed). They started with a difference-in-differences analysis, using log(durat) as the dependent variable. Let afchnge be the dummy variable for observations after the policy change and highearn the dummy variable for high earners. Using the data in INJURY, the estimated equation, with standard errors in parentheses, is

$$\widehat{\log(durat)} = 1.126 + .0077\, afchnge + .256\, highearn + .191\, afchnge \cdot highearn$$
$$(0.031)\ (.0447)\ (.047)\ (.069) \quad (13.12)$$
$$n = 5{,}626,\ R^2 = .021.$$

Therefore, $\hat{\delta}_1 = .191$ (t = 2.77), which implies that the average length of time on workers' compensation for high earners increased by about 19% due to the increased earnings cap. The coefficient on afchnge is small and statistically insignificant: as is expected, the increase in the earnings cap has no effect on duration for low-income workers.
This is a good example of how we can get a fairly precise estimate of the effect of a policy change, even though we cannot explain much of the variation in the dependent variable. The dummy variables in (13.12) explain only 2.1% of the variation in log(durat). This makes sense: there are clearly many factors, including severity of the injury, that affect how long someone receives workers' compensation. Fortunately, we have a very large sample size, and this allows us to get a significant t statistic.

Exploring Further 13.2: What do you make of the coefficient and t statistic on highearn in equation (13.12)?

MVD also added a variety of controls for gender, marital status, age, industry, and type of injury. This allows for the fact that the kinds of people and types of injuries may differ systematically by earnings group across the two years. Controlling for these factors turns out to have little effect on the estimate of $\delta_1$. (See Computer Exercise C4.)

Sometimes, the two groups consist of people living in two neighboring states in the United States. For example, to assess the impact of changing cigarette taxes on cigarette consumption, we can obtain random samples from two states for two years. In State A, the control group, there was no change in the cigarette tax. In State B, the treatment group, the tax increased (or decreased) between the two years. The outcome variable would be a measure of cigarette consumption, and equation (13.10) can be estimated to determine the effect of the tax on cigarette consumption. For an interesting survey on the natural experiment methodology and several additional examples, see Meyer (1995).
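Because $\hat{\delta}_1$ in (13.11) is just a combination of four group averages, it can be computed directly from cell means and checked against the interaction coefficient from regression (13.10). A minimal sketch with hypothetical column names (`y` for the outcome, `d2` for the post-period dummy, `dT` for the treatment dummy); with INJURY, for instance, these would correspond to log(durat), afchnge, and highearn.

```python
# Four cell means, indexed by (dT, d2).
m = df.groupby(["dT", "d2"])["y"].mean()

# (treatment - control) after, minus (treatment - control) before.
did_hat = (m[1, 1] - m[0, 1]) - (m[1, 0] - m[0, 0])
print("difference-in-differences estimate:", did_hat)
```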
13.3 Two-Period Panel Data Analysis

We now turn to the analysis of the simplest kind of panel data: for a cross section of individuals, schools, firms, cities, or whatever, we have two years of data; call these t = 1 and t = 2. These years need not be adjacent, but t = 1 corresponds to the earlier year. For example, the file CRIME2 contains data on (among other things) crime and unemployment rates for 46 cities for 1982 and 1987. Therefore, t = 1 corresponds to 1982, and t = 2 corresponds to 1987.

What happens if we use the 1987 cross section and run a simple regression of crmrte on unem? We obtain

$$\widehat{crmrte} = 128.38 - 4.16\, unem$$
$$(20.76)\ (3.42)$$
$$n = 46,\ R^2 = .033.$$

If we interpret the estimated equation causally, it implies that an increase in the unemployment rate lowers the crime rate. This is certainly not what we expect. The coefficient on unem is not statistically significant at standard significance levels: at best, we have found no link between crime and unemployment rates.

As we have emphasized throughout this text, this simple regression equation likely suffers from omitted variable problems. One possible solution is to try to control for more factors, such as age distribution, gender distribution, education levels, law enforcement efforts, and so on, in a multiple regression analysis. But many factors might be hard to control for. In Chapter 9, we showed how including the crmrte from a previous year (in this case, 1982) can help to control for the fact that different cities have historically different crime rates. This is one way to use two years of data for estimating a causal effect.

An alternative way to use panel data is to view the unobserved factors affecting the dependent variable as consisting of two types: those that are constant and those that vary over time. Letting i denote the cross-sectional unit and t the time period, we can write a model with a single observed explanatory variable as

$$y_{it} = \beta_0 + \delta_0 d2_t + \beta_1 x_{it} + a_i + u_{it},\quad t = 1, 2. \quad (13.13)$$

In the notation $y_{it}$, i denotes the person, firm, city, and so on, and t denotes the time period. The variable $d2_t$ is a dummy variable that equals zero when t = 1 and one when t = 2; it does not change across i, which is why it has no i subscript. Therefore, the intercept for t = 1 is $\beta_0$, and the intercept for t = 2 is $\beta_0 + \delta_0$. Just as in using independently pooled cross sections, allowing the intercept to change over time is important in most applications. In the crime example, secular trends in the United States will cause crime rates in all U.S. cities to change, perhaps markedly, over a five-year period.

The variable $a_i$ captures all unobserved, time-constant factors that affect $y_{it}$. (The fact that $a_i$ has no t subscript tells us that it does not change over time.) Generically, $a_i$ is called an unobserved effect. It is also common in applied work to find $a_i$ referred to as a fixed effect, which helps us to remember that $a_i$ is fixed over time. The model in (13.13) is called an unobserved effects model or a fixed effects model. In applications, you might see $a_i$ referred to as unobserved heterogeneity as well (or individual heterogeneity, firm heterogeneity, city heterogeneity, and so on).

The error $u_{it}$ is often called the idiosyncratic error or time-varying error, because it represents unobserved factors that change over time and affect $y_{it}$. These are very much like the errors in a straight time series regression equation.

A simple unobserved effects model for city crime rates for 1982 and 1987 is

$$crmrte_{it} = \beta_0 + \delta_0 d87_t + \beta_1 unem_{it} + a_i + u_{it}, \quad (13.14)$$

where d87 is a dummy variable for 1987. Since i denotes different cities, we call $a_i$ an unobserved city effect or a city fixed effect: it represents all factors affecting city crime rates that do not change over time. Geographical features, such as the city's location in the United States, are included in $a_i$. Many other factors may not be exactly constant, but they might be roughly constant over a five-year period. These might include certain demographic features of the population (age, race, and education). Different cities may have their own methods for reporting crimes, and the people living in the cities might have different attitudes toward crime; these are typically slow to change. For historical reasons, cities can have very different crime rates, and historical factors are effectively captured by the unobserved effect $a_i$.

How should we estimate the parameter of interest, $\beta_1$, given two years of panel data? One possibility is just to pool the two
years and use OLS, essentially as in Section 13.1. This method has two drawbacks. The most important of these is that, in order for pooled OLS to produce a consistent estimator of $\beta_1$, we would have to assume that the unobserved effect, $a_i$, is uncorrelated with $x_{it}$. We can easily see this by writing (13.13) as

$$y_{it} = \beta_0 + \delta_0 d2_t + \beta_1 x_{it} + v_{it},\quad t = 1, 2, \quad (13.15)$$

where $v_{it} = a_i + u_{it}$ is often called the composite error. From what we know about OLS, we must assume that $v_{it}$ is uncorrelated with $x_{it}$ (where t = 1 or 2) for OLS to estimate $\beta_1$ and the other parameters consistently. This is true whether we use a single cross section or pool the two cross sections. Therefore, even if we assume that the idiosyncratic error $u_{it}$ is uncorrelated with $x_{it}$, pooled OLS is biased and inconsistent if $a_i$ and $x_{it}$ are correlated. The resulting bias in pooled OLS is sometimes called heterogeneity bias, but it is really just bias caused from omitting a time-constant variable.

Exploring Further 13.3: Suppose that $a_i$, $u_{i1}$, and $u_{i2}$ have zero means and are pairwise uncorrelated. Show that $\mathrm{Cov}(v_{i1}, v_{i2}) = \mathrm{Var}(a_i)$, so that the composite errors are positively serially correlated across time, unless $a_i = 0$. What does this imply about the usual OLS standard errors from pooled OLS estimation?

To illustrate what happens, we use the data in CRIME2 to estimate (13.14) by pooled OLS. Since there are 46 cities and two years for each city, there are 92 total observations:

$$\widehat{crmrte} = 93.42 + 7.94\, d87 + .427\, unem$$
$$(12.74)\ (7.98)\ (1.188) \quad (13.16)$$
$$n = 92,\ R^2 = .012.$$

(When reporting the estimated equation, we usually drop the i and t subscripts.) The coefficient on unem, though positive in (13.16), has a very small t statistic. Thus, using pooled OLS on the two years has not substantially changed anything from using a single cross section. This is not surprising, since using pooled OLS does not solve the omitted variables problem. (The standard errors in this equation are incorrect because of the serial correlation described in Exploring Further 13.3, but we ignore this since pooled OLS is not the focus here.)

In most applications, the main reason for collecting panel data is to allow for the unobserved effect, $a_i$, to be correlated with the explanatory variables. For example, in the crime equation, we want to allow the unmeasured city factors in $a_i$ that affect the crime rate also to be correlated with the
variable is differenced over time We can analyze 1317 using the methods we developed in Part 1 provided the key assumptions are satisfied The most important of these is that Dui is uncorrelated with Dxi This assumption holds if the idiosyncratic error at each time t uit is uncorrelated with the explanatory variable in both time periods This is another version of the strict exogeneity assumption that we encountered in Chapter 10 for time series models In particular this assumption rules out the case where xit is the lagged dependent variable yi t21 Unlike in Chapter 10 we allow xit to be correlated with unobservables that are constant over time When we obtain the OLS estimator of b1 from 1317 we call the resulting estimator the firstdifferenced estimator In the crime example assuming that Dui and Dunemi are uncorrelated may be reasonable but it can also fail For example suppose that law enforcement effort which is in the idiosyncratic error increases more in cities where the unemployment rate decreases This can cause negative correlation between Dui and Dunemi which would then lead to bias in the OLS estimator Naturally this problem can be overcome to some extent by including more factors in the equation something we will cover later As usual it is always possible that we have not accounted for enough timevarying factors Another crucial condition is that Dxi must have some variation across i This qualification fails if the explanatory variable does not change over time for any crosssectional observation or if it changes by the same amount for every observation This is not an issue in the crime rate example because the unemployment rate changes across time for almost all cities But if i denotes an individual and xit is a dummy variable for gender Dxi 5 0 for all i we clearly cannot estimate 1317 by OLS in this case This actually makes perfectly good sense since we allow ai to be correlated with xit we cannot hope to separate the effect of ai on yit from the effect of any variable that does not change over time The only other assumption we need to apply to the usual OLS statistics is that 1317 satisfies the homoskedasticity assumption This is reasonable in many cases and if it does not hold we know how to test and correct for heteroskedasticity using the methods in Chapter 8 It is sometimes fair to assume that 1317 fulfills all of the classical linear model assumptions The OLS estimators are unbiased and all statistical inference is exact in such cases When we estimate 1317 for the crime rate example we get Dcrmrte 5 1540 1 222 Dunem 14702 1882 1318 n 5 46 R2 5 127 which now gives a positive statistically significant relationship between the crime and unemployment rates Thus differencing to eliminate timeconstant effects makes a big difference in this example The Copyright 2016 Cengage Learning All Rights Reserved May not be copied scanned or duplicated in whole or in part Due to electronic rights some third party content may be suppressed from the eBook andor eChapters Editorial review has deemed that any suppressed content does not materially affect the overall learning experience Cengage Learning reserves the right to remove additional content at any time if subsequent rights restrictions require it CHAPTER 13 Pooling Cross Sections across Time Simple Panel Data Methods 415 intercept in 1318 also reveals something interesting Even if Dunem 5 0 we predict an increase in the crime rate crimes per 1000 people of 1540 This reflects a secular increase in crime rates throughout the United States 
The intercept in (13.18) also reveals something interesting. Even if $\Delta unem = 0$, we predict an increase in the crime rate (crimes per 1,000 people) of 15.40. This reflects a secular increase in crime rates throughout the United States from 1982 to 1987.

Even if we do not begin with the unobserved effects model (13.13), using differences across time makes intuitive sense. Rather than estimating a standard cross-sectional relationship (which may suffer from omitted variables, thereby making ceteris paribus conclusions difficult), equation (13.17) explicitly considers how changes in the explanatory variable over time affect the change in y over the same time period. Nevertheless, it is still very useful to have (13.13) in mind: it explicitly shows that we can estimate the effect of $x_{it}$ on $y_{it}$, holding $a_i$ fixed.

Although differencing two years of panel data is a powerful way to control for unobserved effects, it is not without cost. First, panel data sets are harder to collect than a single cross section, especially for individuals. We must use a survey and keep track of the individual for a follow-up survey. It is often difficult to locate some people for a second survey. For units such as firms, some will go bankrupt or merge with other firms. Panel data are much easier to obtain for schools, cities, counties, states, and countries.

Even if we have collected a panel data set, the differencing used to eliminate $a_i$ can greatly reduce the variation in the explanatory variables. While $x_{it}$ frequently has substantial variation in the cross section for each t, $\Delta x_i$ may not have much variation. We know from Chapter 3 that a little variation in $\Delta x_i$ can lead to a large standard error for $\hat{\beta}_1$ when estimating (13.17) by OLS. We can combat this by using a large cross section, but this is not always possible. Also, using longer differences over time is sometimes better than using year-to-year changes.

As an example, consider the problem of estimating the return to education, now using panel data on individuals for two years. The model for person i is

$$\log(wage_{it}) = \beta_0 + \delta_0 d2_t + \beta_1 educ_{it} + a_i + u_{it},\quad t = 1, 2,$$

where $a_i$ contains unobserved ability, which is probably correlated with $educ_{it}$. Again, we allow different intercepts across time to account for aggregate productivity gains (and inflation, if $wage_{it}$ is in nominal terms). Since, by definition, innate ability does not change over time, panel data methods seem ideally suited to estimate the return to education. The equation in first differences is

$$\Delta \log(wage_i) = \delta_0 + \beta_1 \Delta educ_i + \Delta u_i, \quad (13.19)$$

and we can estimate this by OLS. The problem is that we are interested in working adults, and for most employed individuals, education does not change over time. If only a small fraction of our sample has $\Delta educ_i$ different from zero, it will be difficult to get a precise estimator of $\beta_1$ from (13.19), unless we have a rather large sample size. In theory, using a first-differenced equation to estimate the return to education is a good idea, but it does not work very well with most currently available panel data sets.

Adding several explanatory variables causes no difficulties. We begin with the unobserved effects model

$$y_{it} = \beta_0 + \delta_0 d2_t + \beta_1 x_{it1} + \beta_2 x_{it2} + \cdots + \beta_k x_{itk} + a_i + u_{it} \quad (13.20)$$

for t = 1 and 2. This equation looks more complicated than it is because each explanatory variable has three subscripts. The first denotes the cross-sectional observation number, the second denotes the time period, and the third is just a variable label.
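With several regressors, the same differencing applies variable by variable. Below is a sketch for a data set stored with one record per person and "75"/"81" suffixes on each variable, the storage scheme Section 13.3a describes for SLP7581; the data frame name `slp`, and the assumption that every variable in the loop actually has both suffixed columns, are hypothetical.

```python
import statsmodels.formula.api as smf

# Build the change in each variable from the two per-person columns.
for v in ["slpnap", "totwrk", "educ", "marr", "yngkid", "gdhlth"]:
    slp[f"d{v}"] = slp[f"{v}81"] - slp[f"{v}75"]

# First-differenced regression corresponding to equation (13.20) with k = 5.
fd = smf.ols(
    "dslpnap ~ dtotwrk + deduc + dmarr + dyngkid + dgdhlth", data=slp
).fit()
print(fd.summary())
```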
An unobserved effects model for total minutes of sleeping per week is

slpnap_{it} = β₀ + δ₀d81_t + β₁totwrk_{it} + β₂educ_{it} + β₃marr_{it} + β₄yngkid_{it} + β₅gdhlth_{it} + a_i + u_{it}, t = 1, 2.

The unobserved effect, a_i, would be called an unobserved individual effect or an individual fixed effect. It is potentially important to allow a_i to be correlated with totwrk_{it}: the same factors (some biological) that cause people to sleep more or less (captured in a_i) are likely correlated with the amount of time spent working. Some people just have more energy, and this causes them to sleep less and work more. The variable educ is years of education, marr is a marriage dummy variable, yngkid is a dummy variable indicating the presence of a small child, and gdhlth is a "good health" dummy variable. Notice that we do not include gender or race (as we did in the cross-sectional analysis), since these do not change over time; they are part of a_i. Our primary interest is in β₁.

Differencing across the two years gives the estimable equation

Δslpnap_i = δ₀ + β₁Δtotwrk_i + β₂Δeduc_i + β₃Δmarr_i + β₄Δyngkid_i + β₅Δgdhlth_i + Δu_i.

Assuming that the change in the idiosyncratic error, Δu_i, is uncorrelated with the changes in all explanatory variables, we can get consistent estimators using OLS. This gives

Δslpnap-hat = −92.63 − .227 Δtotwrk − .024 Δeduc + 104.21 Δmarr + 94.67 Δyngkid + 87.58 Δgdhlth    (13.21)
              (45.87)  (.036)        (48.759)     (92.86)        (87.65)         (76.60)
n = 239, R² = .150.

The coefficient on totwrk indicates a tradeoff between sleeping and working: holding other factors fixed, one more hour of work is associated with .227(60) = 13.62 fewer minutes of sleeping. The t statistic (−6.31) is very significant. No other estimates, except the intercept, are statistically different from zero. The F test for joint significance of all variables except totwrk gives p-value = .49, which means they are jointly insignificant at any reasonable significance level and could be dropped from the equation.

The standard error on educ is especially large relative to the estimate. This is the phenomenon described earlier for the wage equation. In the sample of 239 people, 183 (76.6%) have no change in education over the six-year period; 90% of the people have a change in education of at most one year. As reflected by the extremely large standard error of β̂₂, there is not nearly enough variation in education to estimate β₂ with any precision. Anyway, β̂₂ is practically very small.
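For readers working along in software, here is a hedged sketch of the kind of joint test reported in Example 13.5. It assumes a DataFrame of first differences, diffs, with one row per person and hypothetical column names mirroring the example; none of this code comes from the text.

```python
# Sketch of the joint significance test from Example 13.5: estimate the
# first-differenced sleep equation, then test that every slope except the one
# on dtotwrk is zero. 'diffs' and its columns are hypothetical names for a
# DataFrame of first differences (one row per person).
import statsmodels.formula.api as smf

res = smf.ols("dslpnap ~ dtotwrk + deduc + dmarr + dyngkid + dgdhlth",
              data=diffs).fit()
print(res.f_test("deduc = 0, dmarr = 0, dyngkid = 0, dgdhlth = 0"))
# The text reports a p-value of .49 for this joint F test.
```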
Panel data can also be used to estimate finite distributed lag models. Even if we specify the equation for only two years, we need to collect more years of data to obtain the lagged explanatory variables. The following is a simple example.

Example 13.6 Distributed Lag of Crime Rate on Clear-Up Rate

Eide (1994) uses panel data from police districts in Norway to estimate a distributed lag model for crime rates. The single explanatory variable is the "clear-up percentage" (clrprc), the percentage of crimes that led to a conviction. The crime rate data are from the years 1972 and 1978. Following Eide, we lag clrprc for one and two years: it is likely that past clear-up rates have a deterrent effect on current crime. This leads to the following unobserved effects model for the two years:

log(crime_{it}) = β₀ + δ₀d78_t + β₁clrprc_{i,t−1} + β₂clrprc_{i,t−2} + a_i + u_{it}.

When we difference the equation and estimate it using the data in CRIME3, we get

Δlog(crime)-hat = .086 − .0040 Δclrprc_{−1} − .0132 Δclrprc_{−2}    (13.22)
                 (.064) (.0047)              (.0052)
n = 53, R² = .193, adjusted R² = .161.

The second lag is negative and statistically significant, which implies that a higher clear-up percentage two years ago would deter crime this year. In particular, a 10 percentage point increase in clrprc two years ago would lead to an estimated 13.2% drop in the crime rate this year. This suggests that using more resources for solving crimes and obtaining convictions can reduce crime in the future.

13-3a Organizing Panel Data

In using panel data in an econometric study, it is important to know how the data should be stored. We must be careful to arrange the data so that the different time periods for the same cross-sectional unit (person, firm, city, and so on) are easily linked. For concreteness, suppose that the data set is on cities for two different years. For most purposes, the best way to enter the data is to have two records for each city, one for each year: the first record for each city corresponds to the early year, and the second record is for the later year. These two records should be adjacent. Therefore, a data set for 100 cities and two years will contain 200 records. The first two records are for the first city in the sample, the next two records are for the second city, and so on. (See Table 1.5 in Chapter 1 for an example.) This makes it easy to construct the differences, to store these in the second record for each city, and to do a pooled cross-sectional analysis, which can be compared with the differencing estimation.

Most of the two-period panel data sets accompanying this text are stored in this way (for example, CRIME2, CRIME3, GPA3, LOWBRTH, and RENTAL). We use a direct extension of this scheme for panel data sets with more than two time periods.

A second way of organizing two periods of panel data is to have only one record per cross-sectional unit. This requires two entries for each variable, one for each time period. The panel data in SLP75_81 are organized in this way: each individual has data on the variables slpnap75, slpnap81, totwrk75, totwrk81, and so on. Creating the differences from 1975 to 1981 is easy. Other panel data sets with this structure are TRAFFIC1 and VOTE2. Putting the data in one record, however, does not allow a pooled OLS analysis using the two time periods on the original data. Also, this organizational method does not work for panel data sets with more than two time periods, a case we will consider in Section 13-5.
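The following sketch (assumptions only, with made-up variable names) illustrates the two storage schemes in pandas: one record per unit ("wide"), where differences come from subtracting column pairs, and two adjacent records per unit ("long"), which is the layout that also supports pooled OLS.

```python
# Sketch of the two storage schemes described above; all names and numbers
# are illustrative, not from the text's data sets.
import pandas as pd

# Scheme 2: one record per city, one column per variable-year pair.
wide = pd.DataFrame({"city": [1, 2],
                     "rent80": [300.0, 420.0],
                     "rent90": [360.0, 500.0]})
wide["drent"] = wide["rent90"] - wide["rent80"]   # difference across columns

# Scheme 1: two adjacent records per city (reshape wide -> long).
long = pd.wide_to_long(wide[["city", "rent80", "rent90"]],
                       stubnames="rent", i="city", j="year").reset_index()
long = long.sort_values(["city", "year"])  # records for each city are adjacent
```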
13-4 Policy Analysis with Two-Period Panel Data

Panel data sets are very useful for policy analysis and, in particular, program evaluation. In the simplest program evaluation setup, a sample of individuals, firms, cities, and so on is obtained in the first time period. Some of these units, those in the treatment group, then take part in a particular program in a later time period; the ones that do not are the control group. This is similar to the natural experiment literature discussed earlier, with one important difference: the same cross-sectional units appear in each time period.

As an example, suppose we wish to evaluate the effect of a Michigan job training program on worker productivity of manufacturing firms (see also Computer Exercise C3 in Chapter 9). Let scrap_{it} denote the scrap rate of firm i during year t (the number of items, per 100, that must be scrapped due to defects). Let grant_{it} be a binary indicator equal to one if firm i in year t received a job training grant. For the years 1987 and 1988, the model is

scrap_{it} = β₀ + δ₀y88_t + β₁grant_{it} + a_i + u_{it}, t = 1, 2,    (13.23)

where y88_t is a dummy variable for 1988 and a_i is the unobserved firm effect, or the firm fixed effect. The unobserved effect contains such factors as average employee ability, capital, and managerial skill; these are roughly constant over a two-year period. We are concerned about a_i being systematically related to whether a firm receives a grant. For example, administrators of the program might give priority to firms whose workers have lower skills. Or, the opposite problem could occur: to make the job training program appear effective, administrators may give the grants to employers with more productive workers. (Actually, in this particular program, grants were awarded on a first-come, first-served basis. But whether a firm applied early for a grant could be correlated with worker productivity.) In that case, an analysis using a single cross section, or just a pooling of the cross sections, will produce biased and inconsistent estimators.

Differencing to remove a_i gives

Δscrap_i = δ₀ + β₁Δgrant_i + Δu_i.    (13.24)

Therefore, we simply regress the change in the scrap rate on the change in the grant indicator. Because no firms received grants in 1987, grant_{i1} = 0 for all i, and so Δgrant_i = grant_{i2} − grant_{i1} = grant_{i2}, which simply indicates whether the firm received a grant in 1988. However, it is generally important to difference all variables (dummy variables included) because this is necessary for removing a_i in the unobserved effects model (13.23).

Estimating the first-differenced equation using the data in JTRAIN gives

Δscrap-hat = −.564 − .739 Δgrant
             (.405)  (.683)
n = 54, R² = .022.

Therefore, we estimate that having a job training grant lowered the scrap rate, on average, by .739. But the estimate is not statistically different from zero. We get stronger results by using log(scrap) and estimating the percentage effect:

Δlog(scrap)-hat = −.057 − .317 Δgrant
                  (.097)  (.164)
n = 54, R² = .067.

Having a job training grant is estimated to lower the scrap rate by about 27.2%. [We obtain this estimate from equation (7.10): exp(−.317) − 1 ≈ −.272.] The t statistic is about −1.93, which is marginally significant. By contrast, using pooled OLS of log(scrap) on y88 and grant gives β̂₁ = .057 (standard error = .431). Thus, we find no significant relationship between the scrap rate and the job training grant. Since this differs so much from the first-difference estimates, it suggests that firms that have lower-ability workers are more likely to receive a grant.
It is useful to study the program evaluation model more generally. Let y_{it} denote an outcome variable and let prog_{it} be a program participation dummy variable. The simplest unobserved effects model is

y_{it} = β₀ + δ₀d2_t + β₁prog_{it} + a_i + u_{it}.    (13.25)

If program participation only occurred in the second period, then the OLS estimator of β₁ in the differenced equation has a very simple representation:

β̂₁ = Δȳ_treat − Δȳ_control.    (13.26)

That is, we compute the average change in y over the two time periods for the treatment and control groups; then, β̂₁ is the difference of these. This is the panel data version of the difference-in-differences estimator in equation (13.11) for two pooled cross sections. With panel data, we have a potentially important advantage: we can difference y across time for the same cross-sectional units. This allows us to control for person-, firm-, or city-specific effects, as the model in (13.25) makes clear.

If program participation takes place in both periods, β̂₁ cannot be written as in (13.26), but we interpret it in the same way: it is the change in the average value of y due to program participation.

Controlling for time-varying factors does not change anything of significance. We simply difference those variables and include them along with Δprog. This allows us to control for time-varying variables that might be correlated with program designation. The same differencing method works for analyzing the effects of any policy that varies across city or state.
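As a quick numerical check on equation (13.26), the sketch below uses simulated data (nothing here comes from the text's examples) to verify that the OLS slope from regressing the change in y on the program dummy reproduces the difference in average changes between treatment and control groups.

```python
# Numerical check of equation (13.26): with treatment occurring only in the
# second period, the OLS slope from regressing dy on the treatment dummy
# equals the difference in average changes across groups. Data are simulated.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 200
treat = (rng.random(n) < 0.5).astype(int)
dy = 1.0 + 2.0 * treat + rng.normal(size=n)   # change in y from t = 1 to t = 2

df = pd.DataFrame({"dy": dy, "treat": treat})
beta1 = smf.ols("dy ~ treat", data=df).fit().params["treat"]
did = df.loc[df.treat == 1, "dy"].mean() - df.loc[df.treat == 0, "dy"].mean()
assert np.isclose(beta1, did)   # the two numbers agree exactly
```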
The following is a simple example.

Example 13.7 Effect of Drunk Driving Laws on Traffic Fatalities

Many states in the United States have adopted different policies in an attempt to curb drunk driving. Two types of laws that we will study here are open container laws, which make it illegal for passengers to have open containers of alcoholic beverages, and administrative per se laws, which allow courts to suspend licenses after a driver is arrested for drunk driving but before the driver is convicted. One possible analysis is to use a single cross section of states to regress driving fatalities (or those related to drunk driving) on dummy variable indicators for whether each law is present. This is unlikely to work well because states decide, through legislative processes, whether they need such laws. Therefore, the presence of laws is likely to be related to the average drunk driving fatalities in recent years. A more convincing analysis uses panel data over a time period where some states adopted new laws (and some states may have repealed existing laws). The file TRAFFIC1 contains data for 1985 and 1990 for all 50 states and the District of Columbia. The dependent variable is the number of traffic deaths per 100 million miles driven (dthrte). In 1985, 19 states had open container laws, while 22 states had such laws in 1990. In 1985, 21 states had per se laws; the number had grown to 29 by 1990. Using OLS after first differencing gives

Δdthrte-hat = −.497 − .420 Δopen − .151 Δadmn    (13.27)
              (.052)  (.206)      (.117)
n = 51, R² = .119.

The estimates suggest that adopting an open container law lowered the traffic fatality rate by .42, a nontrivial effect given that the average death rate in 1985 was 2.7, with a standard deviation of about .6. The estimate is statistically significant at the 5% level against a two-sided alternative. The administrative per se law has a smaller effect, and its t statistic is only −1.29; but the estimate is the sign we expect. The intercept in this equation shows that traffic fatalities fell substantially for all states over the five-year period, whether or not there were any law changes. The states that adopted an open container law over this period saw a further drop, on average, in fatality rates.

Other laws might also affect traffic fatalities, such as seat belt laws, motorcycle helmet laws, and maximum speed limits. In addition, we might want to control for age and gender distributions, as well as measures of how influential an organization such as Mothers Against Drunk Driving is in each state.

Exploring Further 13.4: In Example 13.7, Δadmn = −1 for the state of Washington. Explain what this means.

13-5 Differencing with More Than Two Time Periods

We can also use differencing with more than two time periods. For illustration, suppose we have N individuals and T = 3 time periods for each individual. A general fixed effects model is

y_{it} = δ₁ + δ₂d2_t + δ₃d3_t + β₁x_{it1} + … + β_k x_{itk} + a_i + u_{it}    (13.28)

for t = 1, 2, and 3. (The total number of observations is therefore 3N.) Notice that we now include two time-period dummies in addition to the intercept. It is a good idea to allow a separate intercept for each time period, especially when we have a small number of them. The base period, as always, is t = 1. The intercept for the second time period is δ₁ + δ₂, and so on. We are primarily interested in β₁, β₂, …, β_k. If the unobserved effect a_i is correlated with any of the explanatory variables, then using pooled OLS on the three years of data results in biased and inconsistent estimates.

The key assumption is that the idiosyncratic errors are uncorrelated with the explanatory variable in each time period:

Cov(x_{itj}, u_{is}) = 0, for all t, s, and j.    (13.29)

That is, the explanatory variables are strictly exogenous after we take out the unobserved effect, a_i. (The strict exogeneity assumption stated in terms of a zero conditional expectation is given in the chapter appendix.) Assumption (13.29) rules out cases where future explanatory variables react to current changes in the idiosyncratic errors, as must be the case if x_{itj} is a lagged dependent variable. If we have omitted an important time-varying variable, then (13.29) is generally violated. Measurement error in one or more explanatory variables can cause (13.29) to be false, just as in Chapter 9. In Chapters 15 and 16, we will discuss what can be done in such cases.

If a_i is correlated with x_{itj}, then x_{itj} will be correlated with the composite error, v_{it} = a_i + u_{it}, under (13.29). We can eliminate a_i by differencing adjacent periods. In the T = 3 case, we subtract time period one from time period two and time period two from time period three. This gives

Δy_{it} = δ₂Δd2_t + δ₃Δd3_t + β₁Δx_{it1} + … + β_kΔx_{itk} + Δu_{it}    (13.30)
for t = 2 and 3. We do not have a differenced equation for t = 1 because there is nothing to subtract from the t = 1 equation. Now, (13.30) represents two time periods for each individual in the sample. If this equation satisfies the classical linear model assumptions, then pooled OLS gives unbiased estimators, and the usual t and F statistics are valid for hypothesis testing. We can also appeal to asymptotic results. The important requirement for OLS to be consistent is that Δu_{it} is uncorrelated with Δx_{itj} for all j and t = 2 and 3. This is the natural extension from the two time period case.

Notice how (13.30) contains the differences in the year dummies, d2_t and d3_t. For t = 2, Δd2_t = 1 and Δd3_t = 0; for t = 3, Δd2_t = −1 and Δd3_t = 1. Therefore, (13.30) does not contain an intercept. This is inconvenient for certain purposes, including the computation of R-squared. Unless the time intercepts in the original model (13.28) are of direct interest (they rarely are), it is better to estimate the first-differenced equation with an intercept and a single time-period dummy, usually for the third period. In other words, the equation becomes

Δy_{it} = α₀ + α₃d3_t + β₁Δx_{it1} + … + β_kΔx_{itk} + Δu_{it}, for t = 2 and 3.

The estimates of the β_j are identical in either formulation.

With more than three time periods, things are similar. If we have the same T time periods for each of N cross-sectional units, we say that the data set is a balanced panel: we have the same time periods for all individuals, firms, cities, and so on. When T is small relative to N, we should include a dummy variable for each time period to account for secular changes that are not being modeled. Therefore, after first differencing, the equation looks like

Δy_{it} = α₀ + α₃d3_t + α₄d4_t + … + α_T dT_t + β₁Δx_{it1} + … + β_kΔx_{itk} + Δu_{it}, t = 2, 3, …, T,    (13.31)

where we have T − 1 time periods on each unit i for the first-differenced equation. The total number of observations is N(T − 1).

It is simple to estimate (13.31) by pooled OLS, provided the observations have been properly organized and the differencing carefully done. To facilitate first differencing, the data file should consist of NT records. The first T records are for the first cross-sectional observation, arranged chronologically; the second T records are for the second cross-sectional observation, arranged chronologically; and so on. Then, we compute the differences, with the change from t − 1 to t stored in the time t record. Therefore, the differences for t = 1 should be missing values for all N cross-sectional observations. Without doing this, you run the risk of using bogus observations in the regression analysis: an invalid observation is created when the last observation for, say, person i − 1 is subtracted from the first observation for person i. If you do the regression on the differenced data, and NT or NT − 1 observations are reported, then you forgot to set the t = 1 observations as missing.
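The sketch below (hypothetical column names: unit, year, y, x1, x2) follows the organization advice just given. Grouped differencing automatically leaves each unit's first year missing, so no bogus cross-unit differences enter the pooled OLS regression, which then matches the form of (13.31) with an intercept and year dummies.

```python
# Sketch of first differencing with T > 2. Grouped diffs leave the t = 1 rows
# as NaN, so invalid differences across units never enter the regression.
# The DataFrame 'df' and its column names are illustrative assumptions.
import pandas as pd
import statsmodels.formula.api as smf

df = df.sort_values(["unit", "year"])
for v in ["y", "x1", "x2"]:
    df["d" + v] = df.groupby("unit")[v].diff()   # NaN in each unit's first year

# Pooled OLS with an intercept and year dummies, as in (13.31); rows with
# missing first-year differences are excluded explicitly.
fd = df.dropna(subset=["dy", "dx1", "dx2"])
res = smf.ols("dy ~ C(year) + dx1 + dx2", data=fd).fit()
```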
When using more than two time periods, we must assume that Δu_{it} is uncorrelated over time for the usual standard errors and test statistics to be valid. This assumption is sometimes reasonable, but it does not follow if we assume that the original idiosyncratic errors, u_{it}, are uncorrelated over time (an assumption we will use in Chapter 14). In fact, if we assume the u_{it} are serially uncorrelated with constant variance, then the correlation between Δu_{it} and Δu_{i,t+1} can be shown to be −.5. If u_{it} follows a stable AR(1) model, then Δu_{it} will be serially correlated. Only when u_{it} follows a random walk will Δu_{it} be serially uncorrelated.

It is easy to test for serial correlation in the first-differenced equation. Let r_{it} = Δu_{it} denote the first difference of the original error. If r_{it} follows the AR(1) model r_{it} = ρr_{i,t−1} + e_{it}, then we can easily test H₀: ρ = 0. First, we estimate (13.31) by pooled OLS and obtain the residuals, r̂_{it}. Then, we run a simple pooled OLS regression of r̂_{it} on r̂_{i,t−1}, t = 3, …, T; i = 1, …, N, and compute a standard t test for the coefficient on r̂_{i,t−1}. (Or, we can make the t statistic robust to heteroskedasticity.) The coefficient ρ̂ on r̂_{i,t−1} is a consistent estimator of ρ. Because we are using the lagged residual, we lose another time period. For example, if we started with T = 3, the differenced equation has two time periods, and the test for serial correlation is just a cross-sectional regression of the residuals from the third time period on the residuals from the second time period. We will give an example later; a short code sketch also appears below.

We can correct for the presence of AR(1) serial correlation in r_{it} by using feasible GLS. Essentially, within each cross-sectional observation, we would use the Prais-Winsten transformation based on ρ̂ described in the previous paragraph. (We clearly prefer Prais-Winsten to Cochrane-Orcutt here, as dropping the first time period would now mean losing N cross-sectional observations.) Unfortunately, standard packages that perform AR(1) corrections for time series regressions will not work. Standard Prais-Winsten methods will treat the observations as if they followed an AR(1) process across i and t; this makes no sense, as we are assuming the observations are independent across i.

Corrections to the OLS standard errors that allow arbitrary forms of serial correlation (and heteroskedasticity) can be computed when N is large (and N should be notably larger than T). A detailed treatment of standard errors and test statistics that are robust to any forms of serial correlation and heteroskedasticity is beyond the scope of this text [see, for example, Wooldridge (2010), Chapter 10]. Nevertheless, such statistics are easy to compute in many econometrics software packages, and the appendix contains an intuitive discussion.

Exploring Further 13.5: Does serial correlation in Δu_{it} cause the first-differenced estimator to be biased and inconsistent? Why is serial correlation a concern?

If there is no serial correlation in the errors, the usual methods for dealing with heteroskedasticity are valid. We can use the Breusch-Pagan and White tests for heteroskedasticity from Chapter 8, and we can also compute robust standard errors.
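Here is the promised sketch of the AR(1) test on the first-differenced residuals. It assumes the DataFrame and fitted regression from the previous sketch (columns unit, year, dy, dx1, dx2), all of which are hypothetical names, not from the text.

```python
# Sketch of the AR(1) serial-correlation test described above: regress the FD
# residuals on their own lag (lagged within each unit) and examine the t
# statistic on the lagged residual. Assumes 'df' from the previous sketch.
import statsmodels.formula.api as smf

fd = df.dropna(subset=["dy", "dx1", "dx2"]).copy()
fd["r"] = smf.ols("dy ~ C(year) + dx1 + dx2", data=fd).fit().resid
fd["r_lag"] = fd.groupby("unit")["r"].shift(1)   # lag within unit only

artest = smf.ols("r ~ r_lag", data=fd.dropna(subset=["r_lag"])).fit()
print(artest.params["r_lag"], artest.tvalues["r_lag"])  # rho-hat and its t stat
```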
Differencing more than two years of panel data is very useful for policy analysis, as shown by the following example.

Example 13.8 Effect of Enterprise Zones on Unemployment Claims

Papke (1994) studied the effect of the Indiana enterprise zone (EZ) program on unemployment claims. She analyzed 22 cities in Indiana over the period from 1980 to 1988. Six enterprise zones were designated in 1984, and four more were assigned in 1985. Twelve of the cities in the sample did not receive an enterprise zone over this period; they served as the control group. A simple policy evaluation model is

log(uclms_{it}) = θ_t + β₁ez_{it} + a_i + u_{it},

where uclms_{it} is the number of unemployment claims filed during year t in city i. The parameter θ_t just denotes a different intercept for each time period. Generally, unemployment claims were falling statewide over this period, and this should be reflected in the different year intercepts. The binary variable ez_{it} is equal to one if city i at time t was an enterprise zone; we are interested in β₁. The unobserved effect a_i represents fixed factors that affect the economic climate in city i. Because enterprise zone designation was not determined randomly (enterprise zones are usually economically depressed areas), it is likely that ez_{it} and a_i are positively correlated: high a_i means higher unemployment claims, which lead to a higher chance of being given an EZ. Thus, we should difference the equation to eliminate a_i:

Δlog(uclms_{it}) = α₀ + α₁d82_t + … + α₇d88_t + β₁Δez_{it} + Δu_{it}.    (13.32)

The dependent variable in this equation, the change in log(uclms_{it}), is the approximate annual growth rate in unemployment claims from year t − 1 to t. We can estimate this equation for the years 1981 to 1988 using the data in EZUNEM; the total sample size is 22·8 = 176. The estimate of β₁ is β̂₁ = −.182 (standard error = .078). Therefore, it appears that the presence of an EZ causes about a 16.6% [exp(−.182) − 1 ≈ −.166] fall in unemployment claims. This is an economically large and statistically significant effect.

There is no evidence of heteroskedasticity in the equation: the Breusch-Pagan F test yields F = .85, p-value = .557. However, when we add the lagged OLS residuals to the differenced equation (and lose the year 1981), we get ρ̂ = −.197 (t = −2.44), so there is evidence of minimal negative serial correlation in the first-differenced errors. Unlike with positive serial correlation, the usual OLS standard errors may not greatly understate the correct standard errors when the errors are negatively correlated (see Section 12-1). Thus, the significance of the enterprise zone dummy variable will probably not be affected.
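A Breusch-Pagan test like the one reported in Example 13.8 is available in statsmodels. The sketch below assumes a fitted OLS result, res, from a first-differenced regression such as those in the earlier sketches; it is an illustration, not the text's own computation.

```python
# Sketch of the Breusch-Pagan heteroskedasticity test applied to a fitted
# first-differenced regression. 'res' is assumed to be a statsmodels OLS
# results object from one of the earlier sketches.
from statsmodels.stats.diagnostic import het_breuschpagan

lm, lm_pval, fstat, f_pval = het_breuschpagan(res.resid, res.model.exog)
print(f"BP F statistic = {fstat:.2f}, p-value = {f_pval:.3f}")
# Example 13.8 reports F = .85 with p-value = .557 for its FD equation.
```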
Example 13.9 County Crime Rates in North Carolina

Cornwell and Trumbull (1994) used data on 90 counties in North Carolina, for the years 1981 through 1987, to estimate an unobserved effects model of crime; the data are contained in CRIME4. Here, we estimate a simpler version of their model, and we difference the equation over time to eliminate a_i, the unobserved effect. (Cornwell and Trumbull use a different transformation, which we will cover in Chapter 14.) Various factors, including geographical location, attitudes toward crime, historical records, and reporting conventions, might be contained in a_i. The crime rate is the number of crimes per person, prbarr is the estimated probability of arrest, prbconv is the estimated probability of conviction (given an arrest), prbpris is the probability of serving time in prison (given a conviction), avgsen is the average sentence length served, and polpc is the number of police officers per capita. As is standard in criminometric studies, we use the logs of all variables to estimate elasticities. We also include a full set of year dummies to control for state trends in crime rates. We can use the years 1982 through 1987 to estimate the differenced equation. The quantities in parentheses are the usual OLS standard errors; the quantities in brackets are standard errors robust to both serial correlation and heteroskedasticity:

Δlog(crmrte)-hat = .008 − .100 d83 − .048 d84 − .005 d85 + .028 d86 + .041 d87
                  (.017) (.024)    (.024)     (.023)     (.024)     (.024)
                  [.014] [.022]    [.020]     [.025]     [.021]     [.024]
                  − .327 Δlog(prbarr) − .238 Δlog(prbconv) − .165 Δlog(prbpris)    (13.33)
                   (.030)              (.018)               (.026)
                   [.056]              [.040]               [.046]
                  − .022 Δlog(avgsen) + .398 Δlog(polpc)
                   (.022)              (.027)
                   [.026]              [.103]
n = 540, R² = .433, adjusted R² = .422.

The three probability variables (of arrest, conviction, and serving prison time) all have the expected sign, and all are statistically significant. For example, a 1% increase in the probability of arrest is predicted to lower the crime rate by about .33%. The average sentence variable shows a modest deterrent effect, but it is not statistically significant.

The coefficient on the police per capita variable is somewhat surprising and is a feature of most studies that seek to explain crime rates. Interpreted causally, it says that a 1% increase in police per capita increases crime rates by about .4%. (The usual t statistic is very large, almost 15.) It is hard to believe that having more police officers causes more crime. What is going on here? There are at least two possibilities. First, the crime rate variable is calculated from reported crimes. It might be that, when there are additional police, more crimes are reported. Second, the police variable might be endogenous in the equation for other reasons: counties may enlarge the police force when they expect crime rates to increase. In this case, (13.33) cannot be interpreted in a causal fashion. In Chapters 15 and 16, we will cover models and estimation methods that can account for this additional form of endogeneity.

The special case of the White test for heteroskedasticity in Section 8-3 gives F = 75.48 and p-value = .0000, so there is strong evidence of heteroskedasticity. (Technically, this test is not valid if there is also serial correlation, but it is strongly suggestive.) Testing for AR(1) serial correlation yields ρ̂ = −.233, t = −4.77, so negative serial correlation exists. The standard errors in brackets adjust for serial correlation and heteroskedasticity. We will not give the details of this; the calculations are similar to those described in Section 12-5 and are carried out by many econometric packages. [See Wooldridge (2010), Chapter 10 for more discussion.] No variables lose statistical significance, but the t statistics on the significant deterrent variables get notably smaller. For example, the t statistic on the probability of conviction variable goes from −13.22 (using the usual OLS standard error) to −6.10 (using the fully robust standard error). Equivalently, the confidence intervals constructed using the robust standard errors will, appropriately, be much wider than those based on the usual OLS standard errors.
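Standard errors robust to both heteroskedasticity and arbitrary within-unit serial correlation, analogous to the bracketed standard errors in (13.33), can be obtained by clustering the FD regression on the cross-sectional identifier. The sketch below reuses the hypothetical DataFrame from the earlier sketches; it is one common way to do this, not necessarily the computation used for (13.33).

```python
# Sketch of fully robust (cluster) standard errors for a first-differenced
# regression: cluster by cross-sectional unit. Assumes the hypothetical
# long-format DataFrame 'df' with differenced columns from earlier sketches.
import statsmodels.formula.api as smf

fd = df.dropna(subset=["dy", "dx1", "dx2"])
res_cl = smf.ols("dy ~ C(year) + dx1 + dx2", data=fd).fit(
    cov_type="cluster", cov_kwds={"groups": fd["unit"]})
print(res_cl.bse)   # cluster-robust standard errors; valid for large N, small T
```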
Naturally, we can apply the Chow test to panel data models estimated by first differencing. As in the case of pooled cross sections, we rarely want to test whether the intercepts are constant over time; for many reasons, we expect the intercepts to be different. Much more interesting is to test whether slope coefficients have changed over time, and we can easily carry out such tests by interacting the explanatory variables of interest with time-period dummy variables.

Interestingly, while we cannot estimate the slopes on variables that do not change over time, we can test whether the partial effects of time-constant variables have changed over time. As an illustration, suppose we observe three years of data on a random sample of people working in 2000, 2002, and 2004, and specify the model for the log of wage, lwage, as

lwage_{it} = β₀ + δ₁d02_t + δ₂d04_t + β₁female_i + γ₁d02_t·female_i + γ₂d04_t·female_i + z_{it}λ + a_i + u_{it},

where z_{it}λ is shorthand for other explanatory variables included in the model and their coefficients. When we first difference, we eliminate the intercept for 2000, β₀, and also the gender wage gap for 2000, β₁. However, the change in d02_t·female_i is (Δd02_t)·female_i, which does not drop out. Consequently, we can estimate how the wage gap has changed in 2002 and 2004 relative to 2000, and we can test whether γ₁ = 0, γ₂ = 0, or both. We might also ask whether the union wage premium has changed over time, in which case we include in the model union_{it}, d02_t·union_{it}, and d04_t·union_{it}. The coefficients on all of these explanatory variables can be estimated because union_{it} would presumably have some time variation.

If one tries to estimate a model containing interactions by differencing by hand, it can be a bit tricky. For example, in the previous equation with union status, we must simply difference the interaction terms, d02_t·union_{it} and d04_t·union_{it}. We cannot compute the proper differences as, say, d02_t·Δunion_{it} and d04_t·Δunion_{it}, or even by replacing d02_t and d04_t with their first differences. (A sketch of the correct construction follows below.) As a general comment, it is important to return to the original model and remember that the differencing is used to eliminate a_i. It is easiest to use a built-in command that allows first differencing as an option in panel data analysis. We will see some of the other options in Chapter 14.
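The promised sketch: build the interactions in levels first, then difference every variable (interactions included) within person. All names (person, year, female, lwage, union) are hypothetical, and the code only illustrates the construction, not a full estimation.

```python
# Sketch of differencing a model with time-dummy interactions: create the
# interactions in levels, then difference the interactions themselves within
# each person. Do NOT form d02 * d(union) or difference the dummies alone.
import pandas as pd

df = df.sort_values(["person", "year"])
df["d02"] = (df["year"] == 2002).astype(int)
df["d04"] = (df["year"] == 2004).astype(int)
df["d02_female"] = df["d02"] * df["female"]   # female is time-constant
df["d04_female"] = df["d04"] * df["female"]
df["d02_union"] = df["d02"] * df["union"]
df["d04_union"] = df["d04"] * df["union"]

# Difference everything, interactions included, so that a_i is eliminated.
for v in ["lwage", "union", "d02_female", "d04_female",
          "d02_union", "d04_union"]:
    df["d_" + v] = df.groupby("person")[v].diff()
```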
13-5a Potential Pitfalls in First Differencing Panel Data

In this and previous sections, we have argued that differencing panel data over time, in order to eliminate a time-constant unobserved effect, is a valuable method for obtaining causal effects. Nevertheless, differencing is not free of difficulties. We have already discussed potential problems with the method when the key explanatory variables do not vary much over time (and the method is useless for explanatory variables that never vary over time). Unfortunately, even when we do have sufficient time variation in the x_{itj}, first-differenced (FD) estimation can be subject to serious biases. We have already mentioned that strict exogeneity of the regressors is a critical assumption. Unfortunately, as discussed in Wooldridge (2010, Section 11.1), having more time periods generally does not reduce the inconsistency in the FD estimator when the regressors are not strictly exogenous (say, if y_{i,t−1} is included among the x_{itj}).

Another important drawback to the FD estimator is that it can be worse than pooled OLS if one or more of the explanatory variables is subject to measurement error, especially the classical errors-in-variables model discussed in Section 9-3. Differencing a poorly measured regressor reduces its variation relative to its correlation with the differenced error caused by classical measurement error, resulting in a potentially sizable bias. Solving such problems can be very difficult. See Section 15-8 and Wooldridge (2010), Chapter 11.

Summary

We have studied methods for analyzing independently pooled cross-sectional and panel data sets. Independent cross sections arise when different random samples are obtained in different time periods (usually years). OLS using pooled data is the leading method of estimation, and the usual inference procedures are available, including corrections for heteroskedasticity. (Serial correlation is not an issue because the samples are independent across time.) Because of the time series dimension, we often allow different time intercepts. We might also interact time dummies with certain key variables to see how they have changed over time. This is especially important in the policy evaluation literature for natural experiments.

Panel data sets are being used more and more in applied work, especially for policy analysis. These are data sets where the same cross-sectional units are followed over time. Panel data sets are most useful when controlling for time-constant unobserved features (of people, firms, cities, and so on) that we think might be correlated with the explanatory variables in our model. One way to remove the unobserved effect is to difference the data in adjacent time periods. Then, a standard OLS analysis on the differences can be used. Using two periods of data results in a cross-sectional regression of the differenced data. The usual inference procedures are asymptotically valid under homoskedasticity; exact inference is available under normality. For more than two time periods, we can use pooled OLS on the differenced data; we lose the first time period because of the differencing. In addition to homoskedasticity, we must assume that the differenced errors are serially uncorrelated in order to apply the usual t and F statistics. (The chapter appendix contains a careful listing of the assumptions.) Naturally, any variable that is constant over time drops out of the analysis.

Key Terms

Average Treatment Effect; Balanced Panel; Clustering; Composite Error; Difference-in-Differences Estimator; First-Differenced Equation; First-Differenced Estimator; Fixed Effect; Fixed Effects Model; Heterogeneity Bias; Idiosyncratic Error; Independently Pooled Cross Section; Longitudinal Data; Natural Experiment; Panel Data; Quasi-Experiment; Strict Exogeneity; Unobserved Effect; Unobserved Effects Model; Unobserved Heterogeneity; Year Dummy Variables.
Problems

1. In Example 13.1, assume that the averages of all factors other than educ have remained constant over time and that the average level of education is 12.2 for the 1972 sample and 13.3 in the 1984 sample. Using the estimates in Table 13.1, find the estimated change in average fertility between 1972 and 1984. (Be sure to account for the intercept change and the change in average education.)

2. Using the data in KIELMC, the following equations were estimated using the years 1978 and 1981:

log(price)-hat = 11.49 − .547 nearinc + .394 y81·nearinc
                 (.26)   (.058)         (.080)
n = 321, R² = .220

and

log(price)-hat = 11.18 + .563 y81 − .403 y81·nearinc
                 (.27)   (.044)     (.067)
n = 321, R² = .337.

Compare the estimates on the interaction term y81·nearinc with those from equation (13.9). Why are the estimates so different?

3. Why can we not use first differences when we have independent cross sections in two years (as opposed to panel data)?

4. If we think that β₁ is positive in (13.14) and that Δu_i and Δunem_i are negatively correlated, what is the bias in the OLS estimator of β₁ in the first-differenced equation? [Hint: Review equation (5.4).]

5. Suppose that we want to estimate the effect of several variables on annual saving and that we have a panel data set on individuals collected on January 31, 1990, and January 31, 1992. If we include a year dummy for 1992 and use first differencing, can we also include age in the original model? Explain.

6. In 1985, neither Florida nor Georgia had laws banning open alcohol containers in vehicle passenger compartments. By 1990, Florida had passed such a law, but Georgia had not.
(i) Suppose you can collect random samples of the driving-age population in both states, for 1985 and 1990. Let arrest be a binary variable equal to unity if a person was arrested for drunk driving during the year. Without controlling for any other factors, write down a linear probability model that allows you to test whether the open container law reduced the probability of being arrested for drunk driving. Which coefficient in your model measures the effect of the law?
(ii) Why might you want to control for other factors in the model? What might some of these factors be?
(iii) Now, suppose that you can only collect data for 1985 and 1990 at the county level for the two states. The dependent variable would be the fraction of licensed drivers arrested for drunk driving during the year. How does this data structure differ from the individual-level data described in part (i)? What econometric method would you use?

7. (i) Using the data in INJURY for Kentucky, we find the estimated equation when afchnge is dropped from (13.12) is

log(durat)-hat = 1.129 + .253 highearn + .198 afchnge·highearn
                 (.022)  (.042)          (.052)
n = 5,626, R² = .021.

Is it surprising that the estimate on the interaction is fairly close to that in (13.12)? Explain.
(ii) When afchnge is included but highearn is dropped, the result is

log(durat)-hat = 1.233 − .100 afchnge + .447 afchnge·highearn
                 (.023)  (.040)         (.050)
n = 5,626, R² = .016.

Why is the coefficient on the interaction term now so much larger than in (13.12)? [Hint: In equation (13.10), what is the assumption being made about the treatment and control groups if β₁ = 0?]
Computer Exercises

C1 Use the data in FERTIL1 for this exercise.
(i) In the equation estimated in Example 13.1, test whether living environment at age 16 has an effect on fertility. (The base group is large city.) Report the value of the F statistic and the p-value.
(ii) Test whether region of the country at age 16 (South is the base group) has an effect on fertility.
(iii) Let u be the error term in the population equation. Suppose you think that the variance of u changes over time (but not with educ, age, and so on). A model that captures this is u² = γ₀ + γ₁y74 + γ₂y76 + … + γ₆y84 + v. Using this model, test for heteroskedasticity in u. (Hint: Your F test should have 6 and 1,122 degrees of freedom.)
(iv) Add the interaction terms y74·educ, y76·educ, …, y84·educ to the model estimated in Table 13.1. Explain what these terms represent. Are they jointly significant?

C2 Use the data in CPS78_85 for this exercise.
(i) How do you interpret the coefficient on y85 in equation (13.2)? Does it have an interesting interpretation? (Be careful here; you must account for the interaction terms y85·educ and y85·female.)
(ii) Holding other factors fixed, what is the estimated percent increase in nominal wage for a male with 12 years of education? Propose a regression to obtain a confidence interval for this estimate. [Hint: To get the confidence interval, replace y85·educ with y85·(educ − 12); refer to Example 6.3.]
(iii) Reestimate equation (13.2), but let all wages be measured in 1978 dollars. In particular, define the real wage as rwage = wage for 1978 and as rwage = wage/1.65 for 1985. Now, use log(rwage) in place of log(wage) in estimating (13.2). Which coefficients differ from those in equation (13.2)?
(iv) Explain why the R-squared from your regression in part (iii) is not the same as in equation (13.2). (Hint: The residuals, and therefore the sum of squared residuals, from the two regressions are identical.)
(v) Describe how union participation changed from 1978 to 1985.
(vi) Starting with equation (13.2), test whether the union wage differential changed over time. (This should be a simple t test.)
(vii) Do your findings in parts (v) and (vi) conflict? Explain.

C3 Use the data in KIELMC for this exercise.
(i) The variable dist is the distance from each home to the incinerator site, in feet. Consider the model

log(price) = β₀ + δ₀y81 + β₁log(dist) + δ₁y81·log(dist) + u.

If building the incinerator reduces the value of homes closer to the site, what is the sign of δ₁? What does it mean if β₁ > 0?
(ii) Estimate the model from part (i) and report the results in the usual form. Interpret the coefficient on y81·log(dist). What do you conclude?
(iii) Add age, age², rooms, baths, log(intst), log(land), and log(area) to the equation. Now, what do you conclude about the effect of the incinerator on housing values?
(iv) Why is the coefficient on log(dist) positive and statistically significant in part (ii) but not in part (iii)? What does this say about the controls used in part (iii)?

C4 Use the data in INJURY for this exercise.
(i) Using the data for Kentucky, reestimate equation (13.12), adding as explanatory variables male, married, and a full set of industry and injury type dummy variables. How does the estimate on afchnge·highearn change when these other factors are controlled for? Is the estimate still statistically significant?
(ii) What do you make of the small R-squared from part (i)? Does this mean the equation is useless?
(iii) Estimate equation (13.12) using the data for Michigan. Compare the estimates on the interaction term for Michigan and Kentucky. Is the Michigan estimate statistically significant? What do you make of this?

C5 Use the data in RENTAL for this exercise. The data for the years 1980 and 1990 include rental prices and other variables for college towns. The idea is to see whether a stronger presence of students affects rental rates. The unobserved effects model is

log(rent_{it}) = β₀ + δ₀y90_t + β₁log(pop_{it}) + β₂log(avginc_{it}) + β₃pctstu_{it} + a_i + u_{it},

where pop is city population, avginc is average income, and pctstu is student population as a percentage of city population (during the school year).
(i) Estimate the equation by pooled OLS and report the results in standard form. What do you make of the estimate on the 1990 dummy variable? What do you get for β̂_pctstu?
(ii) Are the standard errors you report in part (i) valid? Explain.
(iii) Now, difference the equation and estimate by OLS. Compare your estimate of β_pctstu with that from part (i). Does the relative size of the student population appear to affect rental prices?
(iv) Obtain the heteroskedasticity-robust standard errors for the first-differenced equation in part (iii). Does this change your conclusions?

C6 Use CRIME3 for this exercise.
(i) In the model of Example 13.6, test the hypothesis H₀: β₁ = β₂. [Hint: Define θ₁ = β₁ − β₂ and write β₁ in terms of θ₁ and β₂. Substitute this into the equation and then rearrange. Do a t test on θ₁.]
(ii) If β₁ = β₂, show that the differenced equation can be written as

Δlog(crime_i) = δ₀ + δ₁Δavgclr_i + Δu_i,

where δ₁ = 2β₁ and avgclr_i = (clrprc_{i,−1} + clrprc_{i,−2})/2 is the average clear-up percentage over the previous two years.
(iii) Estimate the equation from part (ii). Compare the adjusted R-squared with that in (13.22). Which model would you finally use?

C7 Use GPA3 for this exercise. The data set is for 366 student-athletes from a large university for fall and spring semesters. [A similar analysis is in Maloney and McCormick (1993), but here we use a true panel data set.] Because you have two terms of data for each student, an unobserved effects model is appropriate. The primary question of interest is this: do athletes perform more poorly in school during the semester their sport is in season?
(i) Use pooled OLS to estimate a model with term GPA (trmgpa) as the dependent variable. The explanatory variables are spring, sat, hsperc, female, black, white, frstsem, tothrs, crsgpa, and season. Interpret the coefficient on season. Is it statistically significant?
(ii) Most of the athletes who play their sport only in the fall are football players. Suppose the ability levels of football players differ systematically from those of other athletes. If ability is not adequately captured by SAT score and high school percentile, explain why the pooled OLS estimators will be biased.
(iii) Now, use the data differenced across the two terms. Which variables drop out? Now, test for an in-season effect.
(iv) Can you think of one or more potentially important, time-varying variables that have been omitted from the analysis?

C8 VOTE2 includes panel data on House of Representatives elections in 1988 and 1990. Only winners from 1988 who are also running in 1990 appear in the sample; these are the incumbents. An unobserved effects model explaining the share of the incumbent's vote in terms of expenditures by both candidates is

vote_{it} = β₀ + δ₀d90_t + β₁log(inexp_{it}) + β₂log(chexp_{it}) + β₃incshr_{it} + a_i + u_{it},

where incshr_{it} is the incumbent's share of total campaign spending (in percentage form). The unobserved effect a_i contains characteristics of the incumbent, such as "quality", as well as things about the district that are constant. The incumbent's gender and party are constant over time, so these are subsumed in a_i. We are interested in the effect of campaign expenditures on election outcomes.
(i) Difference the given equation across the two years and estimate the differenced equation by OLS. Which variables are individually significant at the 5% level against a two-sided alternative?
(ii) In the equation from part (i), test for joint significance of Δlog(inexp) and Δlog(chexp). Report the p-value.
(iii) Reestimate the equation from part (i) using Δincshr as the only independent variable. Interpret the coefficient on Δincshr. For example, if the incumbent's share of spending increases by 10 percentage points, how is this predicted to affect the incumbent's share of the vote?
(iv) Redo part (iii), but now use only the pairs that have repeat challengers. [This allows us to control for characteristics of the challengers as well, which would be in a_i. Levitt (1994) conducts a much more extensive analysis.]

C9 Use CRIME4 for this exercise.
(i) Add the logs of each wage variable in the data set and estimate the model by first differencing. How does including these variables affect the coefficients on the criminal justice variables in Example 13.9?
(ii) Do the wage variables in (i) all have the expected sign? Are they jointly significant? Explain.

C10 For this exercise, we use JTRAIN to determine the effect of the job training grant on hours of job training per employee. The basic model for the three years is

hrsemp_{it} = β₀ + δ₁d88_t + δ₂d89_t + β₁grant_{it} + β₂grant_{i,t−1} + β₃log(employ_{it}) + a_i + u_{it}.

(i) Estimate the equation using first differencing. How many firms are used in the estimation? How many total observations would be used if each firm had data on all variables (in particular, hrsemp) for all three time periods?
(ii) Interpret the coefficient on grant and comment on its significance.
(iii) Is it surprising that grant₋₁ is insignificant? Explain.
(iv) Do larger firms train their employees more or less, on average? How big are the differences in training?

C11 The file MATHPNL contains panel data on school districts in Michigan for the years 1992 through 1998. It is the district-level analogue of the school-level data used by Papke (2005). The response variable of interest in this question is math4, the percentage of fourth graders in a district receiving a passing score on a standardized math test.
The key explanatory variable is rexpp, which is real expenditures per pupil in the district. (The amounts are in 1997 dollars.) The spending variable will appear in logarithmic form.
(i) Consider the static unobserved effects model

math4_{it} = δ₁y93_t + … + δ₆y98_t + β₁log(rexpp_{it}) + β₂log(enrol_{it}) + β₃lunch_{it} + a_i + u_{it},

where enrol_{it} is total district enrollment and lunch_{it} is the percentage of students in the district eligible for the school lunch program. (So lunch_{it} is a pretty good measure of the district-wide poverty rate.) Argue that β₁/10 is the percentage point change in math4_{it} when real per-student spending increases by roughly 10%.
(ii) Use first differencing to estimate the model in part (i). The simplest approach is to allow an intercept in the first-differenced equation and to include dummy variables for the years 1994 through 1998. Interpret the coefficient on the spending variable.
(iii) Now, add one lag of the spending variable to the model and reestimate using first differencing. Note that you lose another year of data, so you are only using changes starting in 1994. Discuss the coefficients and significance on the current and lagged spending variables.
(iv) Obtain heteroskedasticity-robust standard errors for the first-differenced regression in part (iii). How do these standard errors compare with those from part (iii) for the spending variables?
(v) Now, obtain standard errors robust to both heteroskedasticity and serial correlation. What does this do to the significance of the lagged spending variable?
(vi) Verify that the differenced errors r_{it} = Δu_{it} have negative serial correlation by carrying out a test of AR(1) serial correlation.
(vii) Based on a fully robust joint test, does it appear necessary to include the enrollment and lunch variables in the model?

C12 Use the data in MURDER for this exercise.
(i) Using the years 1990 and 1993, estimate the equation

mrdrte_{it} = δ₀ + δ₁d93_t + β₁exec_{it} + β₂unem_{it} + a_i + u_{it}, t = 1, 2,

by pooled OLS and report the results in the usual form. (Do not worry that the usual OLS standard errors are inappropriate because of the presence of a_i.) Do you estimate a deterrent effect of capital punishment?
(ii) Compute the FD estimates (use only the differences from 1990 to 1993; you should have 51 observations in the FD regression). Now, what do you conclude about a deterrent effect?
(iii) In the FD regression from part (ii), obtain the residuals, say ê_i. Run the Breusch-Pagan regression of ê_i² on Δexec_i, Δunem_i and compute the F test for heteroskedasticity. Do the same for the special case of the White test [that is, regress ê_i² on ŷ_i, ŷ_i², where the fitted values are from part (ii)]. What do you conclude about heteroskedasticity in the FD equation?
(iv) Run the same regression from part (ii), but obtain the heteroskedasticity-robust t statistics. What happens?
(v) Which t statistic on Δexec_i do you feel more comfortable relying on, the usual one or the heteroskedasticity-robust one? Why?
C13 Use the data in WAGEPAN for this exercise.
(i) Consider the unobserved effects model

lwage_{it} = β₀ + δ₁d81_t + … + δ₇d87_t + β₁educ_i + γ₁d81_t·educ_i + … + γ₇d87_t·educ_i + β₂union_{it} + a_i + u_{it},

where a_i is allowed to be correlated with educ_i and union_{it}. Which parameters can you estimate using first differencing?
(ii) Estimate the equation from part (i) by FD, and test the null hypothesis that the return to education has not changed over time.
(iii) Test the hypothesis from part (ii) using a fully robust test, that is, one that allows arbitrary heteroskedasticity and serial correlation in the FD errors, Δu_{it}. Does your conclusion change?
(iv) Now, allow the union differential to change over time (along with education) and estimate the equation by FD. What is the estimated union differential in 1980? What about 1987? Is the difference statistically significant?
(v) Test the null hypothesis that the union differential has not changed over time, and discuss your results in light of your answer to part (iv).

C14 Use the data in JTRAIN3 for this question.
(i) Estimate the simple regression model re78 = β₀ + β₁train + u, and report the results in the usual form. Based on this regression, does it appear that job training, which took place in 1976 and 1977, had a positive effect on real labor earnings in 1978?
(ii) Now, use the change in real labor earnings, cre = re78 − re75, as the dependent variable. (We need not difference train because we assume there was no job training prior to 1975. That is, if we define ctrain = train78 − train75, then ctrain = train78 because train75 = 0.) Now, what is the estimated effect of training? Discuss how it compares with the estimate in part (i).
(iii) Find the 95% confidence interval for the training effect using the usual OLS standard error and the heteroskedasticity-robust standard error, and describe your findings.

C15 The data set HAPPINESS contains independently pooled cross sections for the even years from 1994 through 2006, obtained from the General Social Survey. The dependent variable for this problem is a measure of "happiness," vhappy, which is a binary variable equal to one if the person reports being "very happy" (as opposed to just "pretty happy" or "not too happy").
(i) Which year has the largest number of observations? Which has the smallest? What is the percentage of people in the sample reporting they are "very happy"?
(ii) Regress vhappy on all of the year dummies, leaving out y94 so that 1994 is the base year. Compute a heteroskedasticity-robust statistic of the null hypothesis that the proportion of very happy people has not changed over time. What is the p-value of the test?
(iii) To the regression in part (ii), add the dummy variables occattend and regattend. Interpret their coefficients. (Remember, the coefficients are interpreted relative to a base group.) How would you summarize the effects of church attendance on happiness?
(iv) Define a variable, say highinc, equal to one if family income is above $25,000. (Unfortunately, the same threshold is used in each year, and so inflation is not accounted for. Also, $25,000 is hardly what one would consider "high income.") Include highinc, unem10, educ, and teens in the regression in part (iii). Is the coefficient on regattend affected much? What about its statistical significance?
significance v Discuss the signs magnitudes and statistical significance of the four new variables in part iv Do the estimates make sense vi Controlling for the factors in part iv do there appear to be differences in happiness by gender or race Justify your answer C16 Use the data in COUNTYMURDERS to answer this question The data set covers murders and execu tions capital punishment for 2197 counties in the United States i Find the average value of murdrate across all counties and years What is the standard deviation For what percentage of the sample is murdrate equal to zero ii How many observations have execs equal to zero What is the maximum value of execs Why is the average of execs so small iii Consider the model murdrateit 5 ut 1 b1execsit 1 b2execsi t21 1 b3percblackit 1 b4percmalei 1 b5perc1019 1 b6perc2029 1 ai 1 uit where ut represents a different intercept for each time period ai is the county fixed effect and uit is the idiosyncratic error What do we need to assume about ai and the execution variables in order for pooled OLS to consistently estimate the parameters in particular b1 and b2 iv Apply OLS to the equation from part ii and report the estimates of b1 and b2 along with the usual pooled OLS standard errors Do you estimate that executions have a deterrent effect on murders What do you think is happening v Even if the pooled OLS estimators are consistent do you trust the standard errors obtained from part iv Explain vi Now estimate the equation in part iii using first differencing to remove ai What are the new estimates of b1 and b2 Are they very different from the estimates from part iv vii Using the estimates from part vi can you say there is evidence of a statistically significant deterrent effect of capital punishment on the murder rate If possible in addition to the usual OLS standard errors use those that are robust to any kind of serial correlation or heteroskedasticity in the FD errors Copyright 2016 Cengage Learning All Rights Reserved May not be copied scanned or duplicated in whole or in part Due to electronic rights some third party content may be suppressed from the eBook andor eChapters Editorial review has deemed that any suppressed content does not materially affect the overall learning experience Cengage Learning reserves the right to remove additional content at any time if subsequent rights restrictions require it PART 3 Advanced Topics 432 APPEndix 13A 13A1 Assumptions for Pooled OLS Using First Differences In this appendix we provide careful statements of the assumptions for the firstdifferencing estima tor Verification of these claims is somewhat involved but it can be found in Wooldridge 2010 Chapter 10 Assumption FD1 For each i the model is yit 5 b1xit1 1 p 1 bkxitk 1 ai 1 uit t 5 1 p T where the bj are the parameters to estimate and ai is the unobserved effect Assumption FD2 We have a random sample from the cross section Assumption FD3 Each explanatory variable changes over time for at least some i and no perfect linear relationships exist among the explanatory variables For the next assumption it is useful to let Xi denote the explanatory variables for all time periods for crosssectional observation i thus Xi contains xitj t 5 1 p T j 5 1 p k Assumption FD4 For each t the expected value of the idiosyncratic error given the explanatory variables in all time periods and the unobserved effect is zero E1uit0Xi ai2 5 0 When Assumption FD4 holds we sometimes say that the xitj are strictly exogenous conditional on the unobserved effect The idea is that 
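Several of these exercises ask for first-differenced (FD) estimates. The following is a minimal sketch of the FD mechanics on simulated data; all names here are hypothetical stand-ins, not variables from the data sets above, and real applications would of course use the panel structure of MATHPNL, MURDER, and so on.

    import numpy as np

    # Simulated balanced panel, stacked by unit: row i*T + t is unit i, period t.
    rng = np.random.default_rng(0)
    N, T = 500, 4
    x = rng.normal(size=(N * T, 2))
    a = np.repeat(rng.normal(size=N), T)              # unobserved effect a_i
    y = x @ np.array([1.0, -0.5]) + a + rng.normal(size=N * T)

    # First differences within each unit (drop each unit's first period).
    keep = np.tile(np.arange(T) > 0, N)
    dy = (y - np.roll(y, 1))[keep]
    dx = (x - np.roll(x, 1, axis=0))[keep]

    # Pooled OLS on the differenced data; a_i has been removed.
    beta_fd, *_ = np.linalg.lstsq(dx, dy, rcond=None)
    print(beta_fd)                                    # close to (1.0, -0.5)

An intercept and year dummies can be appended to dx to absorb aggregate time effects, as the exercises suggest.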
Appendix 13A

13A.1 Assumptions for Pooled OLS Using First Differences

In this appendix, we provide careful statements of the assumptions for the first-differencing estimator. Verification of these claims is somewhat involved, but it can be found in Wooldridge (2010, Chapter 10).

Assumption FD.1. For each i, the model is

y_it = β1 x_it1 + … + βk x_itk + a_i + u_it, t = 1, …, T,

where the βj are the parameters to estimate and a_i is the unobserved effect.

Assumption FD.2. We have a random sample from the cross section.

Assumption FD.3. Each explanatory variable changes over time (for at least some i), and no perfect linear relationships exist among the explanatory variables.

For the next assumption, it is useful to let X_i denote the explanatory variables for all time periods for cross-sectional observation i; thus, X_i contains x_itj, t = 1, …, T, j = 1, …, k.

Assumption FD.4. For each t, the expected value of the idiosyncratic error given the explanatory variables in all time periods and the unobserved effect is zero: E(u_it | X_i, a_i) = 0.

When Assumption FD.4 holds, we sometimes say that the x_itj are strictly exogenous conditional on the unobserved effect. The idea is that, once we control for a_i, there is no correlation between the x_isj and the remaining idiosyncratic error, u_it, for all s and t.

As stated, Assumption FD.4 is stronger than necessary. We use this form of the assumption because it emphasizes that we are interested in the equation

E(y_it | X_i, a_i) = E(y_it | x_it, a_i) = β1 x_it1 + … + βk x_itk + a_i,

so that the βj measure partial effects of the observed explanatory variables, holding fixed, or controlling for, the unobserved effect a_i. Nevertheless, an important implication of FD.4, and one that is sufficient for the unbiasedness of the FD estimator, is E(Δu_it | X_i) = 0, t = 2, …, T. In fact, for consistency we can simply assume that Δx_itj is uncorrelated with Δu_it for all t = 2, …, T and j = 1, …, k. [See Wooldridge (2010, Chapter 10) for further discussion.]

Under these first four assumptions, the first-difference estimators are unbiased. The key assumption is FD.4, which is strict exogeneity of the explanatory variables. Under these same assumptions, we can also show that the FD estimator is consistent with a fixed T and as N → ∞ (and perhaps more generally).

The next two assumptions ensure that the standard errors and test statistics resulting from pooled OLS on the first differences are asymptotically valid.

Assumption FD.5. The variance of the differenced errors, conditional on all explanatory variables, is constant: Var(Δu_it | X_i) = σ², t = 2, …, T.

Assumption FD.6. For all t ≠ s, the differences in the idiosyncratic errors are uncorrelated (conditional on all explanatory variables): Cov(Δu_it, Δu_is | X_i) = 0, t ≠ s.

Assumption FD.5 ensures that the differenced errors, Δu_it, are homoskedastic. Assumption FD.6 states that the differenced errors are serially uncorrelated, which means that the u_it follow a random walk across time (see Chapter 11). Under Assumptions FD.1 through FD.6, the FD estimator of the βj is the best linear unbiased estimator (conditional on the explanatory variables).

Assumption FD.7. Conditional on X_i, the Δu_it are independent and identically distributed normal random variables.

When we add Assumption FD.7, the FD estimators are normally distributed, and the t and F statistics from pooled OLS on the differences have exact t and F distributions. Without FD.7, we can rely on the usual asymptotic approximations.
13A.2 Computing Standard Errors Robust to Serial Correlation and Heteroskedasticity of Unknown Form

Because the FD estimator is consistent as N → ∞ under Assumptions FD.1 through FD.4, it would be very handy to have a simple method of obtaining proper standard errors and test statistics that allow for any kind of serial correlation or heteroskedasticity in the FD errors, e_it = Δu_it. Fortunately, provided N is moderately large and T is not too large, fully robust standard errors and test statistics are readily available. As mentioned in the text, a detailed treatment is above the level of this text. The technical arguments combine the insights described in Chapters 8 and 12, where statistics robust to heteroskedasticity and serial correlation are discussed. Actually, there is one important advantage with panel data: because we have a large cross section, we can allow unrestricted serial correlation in the errors {e_it}, provided T is not too large. We can contrast this situation with the Newey-West approach in Section 12.5, where the estimated covariances must be downweighted as the observations get farther apart in time.

The general approach to obtaining fully robust standard errors and test statistics in the context of panel data is known as clustering, and ideas have been borrowed from the cluster sampling literature. The idea is that each cross-sectional unit is defined as a cluster of observations over time, and arbitrary correlation (serial correlation) and changing variances are allowed within each cluster. Because of the relationship to cluster sampling, many econometric software packages have options for clustering standard errors and test statistics. Most commands look something like

regress cy cx1 cx2 … cxk, cluster(id)

where id is a variable containing unique identifiers for the cross-sectional units and the "c" before each variable denotes "change." The option cluster(id) at the end of the regress command tells the software to report all standard errors and test statistics, including t statistics and F-type statistics, so that they are valid in large cross sections with any kind of serial correlation or heteroskedasticity. Reporting such statistics is very common in modern empirical work with panel data. Often the corrected standard errors will be substantially larger than either the usual standard errors or those that only correct for heteroskedasticity. The larger standard errors better reflect the sampling error in the pooled OLS coefficients.
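To make the clustering idea concrete, here is a hand-rolled sketch of the cluster-robust ("sandwich") variance matrix for pooled OLS coefficients. The generic regress command above is software-specific; this computes the same kind of quantity directly. The function name is ours, not a library routine, and it omits the finite-sample degrees-of-freedom adjustment most packages apply.

    import numpy as np

    def cluster_robust_se(X, resid, ids):
        # Sandwich formula: (X'X)^(-1) [ sum_i X_i' e_i e_i' X_i ] (X'X)^(-1),
        # where the sum runs over clusters (cross-sectional units).
        bread = np.linalg.inv(X.T @ X)
        meat = np.zeros((X.shape[1], X.shape[1]))
        for g in np.unique(ids):
            score = X[ids == g].T @ resid[ids == g]
            meat += np.outer(score, score)
        V = bread @ meat @ bread
        return np.sqrt(np.diag(V))

Within each cluster, the products e_it e_is enter the middle matrix unrestricted, which is exactly what allows arbitrary serial correlation and heteroskedasticity inside a unit.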
Chapter 14  Advanced Panel Data Methods

In this chapter, we focus on two methods for estimating unobserved effects panel data models that are at least as common as first differencing. Although these methods are somewhat harder to describe and implement, several econometrics packages support them.

In Section 14.1, we discuss the fixed effects estimator, which, like first differencing, uses a transformation to remove the unobserved effect a_i prior to estimation. Any time-constant explanatory variables are removed along with a_i.

The random effects estimator in Section 14.2 is attractive when we think the unobserved effect is uncorrelated with all the explanatory variables. If we have good controls in our equation, we might believe that any leftover neglected heterogeneity only induces serial correlation in the composite error term, but it does not cause correlation between the composite errors and the explanatory variables. Estimation of random effects models by generalized least squares is fairly easy and is routinely done by many econometrics packages.

Section 14.3 introduces the relatively new correlated random effects approach, which provides a synthesis of fixed effects and random effects methods and has been shown to be practically very useful. In Section 14.4, we show how panel data methods can be applied to other data structures, including matched pairs and cluster samples.

14.1 Fixed Effects Estimation

First differencing is just one of the many ways to eliminate the fixed effect, a_i. An alternative method, which works better under certain assumptions, is called the fixed effects transformation. To see what this method involves, consider a model with a single explanatory variable: for each i,

y_it = β1 x_it + a_i + u_it, t = 1, 2, …, T.   (14.1)

Now, for each i, average this equation over time. We get

ȳ_i = β1 x̄_i + a_i + ū_i,   (14.2)

where ȳ_i = T⁻¹ Σ_{t=1}^{T} y_it, and so on. Because a_i is fixed over time, it appears in both (14.1) and (14.2). If we subtract (14.2) from (14.1) for each t, we wind up with

y_it − ȳ_i = β1 (x_it − x̄_i) + (u_it − ū_i), t = 1, 2, …, T,

or

ÿ_it = β1 ẍ_it + ü_it, t = 1, 2, …, T,   (14.3)

where ÿ_it = y_it − ȳ_i is the time-demeaned data on y, and similarly for ẍ_it and ü_it. The fixed effects transformation is also called the within transformation. The important thing about equation (14.3) is that the unobserved effect, a_i, has disappeared. This suggests that we should estimate (14.3) by pooled OLS. A pooled OLS estimator that is based on the time-demeaned variables is called the fixed effects estimator or the within estimator. The latter name comes from the fact that OLS on (14.3) uses the time variation in y and x within each cross-sectional observation.

The between estimator is obtained as the OLS estimator on the cross-sectional equation (14.2), where we include an intercept, β0: we use the time averages for both y and x and then run a cross-sectional regression. We will not study the between estimator in detail because it is biased when a_i is correlated with x̄_i (see Problem 2). If we think a_i is uncorrelated with x_it, it is better to use the random effects estimator, which we cover in Section 14.2. The between estimator ignores important information on how the variables change over time.

Adding more explanatory variables to the equation causes few changes. The original unobserved effects model is

y_it = β1 x_it1 + β2 x_it2 + … + βk x_itk + a_i + u_it, t = 1, 2, …, T.   (14.4)

We simply use the time-demeaning on each explanatory variable (including things like time-period dummies) and then do a pooled OLS regression using all time-demeaned variables. The general time-demeaned equation for each i is

ÿ_it = β1 ẍ_it1 + β2 ẍ_it2 + … + βk ẍ_itk + ü_it, t = 1, 2, …, T,   (14.5)

which we estimate by pooled OLS.

Under a strict exogeneity assumption on the explanatory variables, the fixed effects estimator is unbiased: roughly, the idiosyncratic error u_it should be uncorrelated with each explanatory variable across all time periods. (See the chapter appendix for precise statements of the assumptions.) The fixed effects estimator allows for arbitrary correlation between a_i and the explanatory variables in any time period, just as with first differencing. Because of this, any explanatory variable that is constant over time for all i gets swept away by the fixed effects transformation: ẍ_it = 0 for all i and t if x_it is constant across t. Therefore, we cannot include variables such as gender or a city's distance from a river.

Exploring Further 14.1
Suppose that, in a family savings equation for the years 1990, 1991, and 1992, we let kids_it denote the number of children in family i for year t. If the number of kids is constant over this three-year period for most families in the sample, what problems might this cause for estimating the effect that the number of kids has on savings?
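A minimal sketch of the within transformation in equations (14.3) and (14.5), using the same kind of stacked arrays as in the FD sketch earlier (names are ours): demean y and each x by unit, then run pooled OLS on the demeaned data.

    import numpy as np

    def within_transform(Z, ids):
        # Subtract each unit's time average from its observations (eq. 14.3).
        Zd = np.asarray(Z, dtype=float).copy()
        for g in np.unique(ids):
            Zd[ids == g] -= Zd[ids == g].mean(axis=0)
        return Zd

    # Fixed effects = pooled OLS on the time-demeaned data, e.g.:
    # beta_fe, *_ = np.linalg.lstsq(within_transform(x, ids),
    #                               within_transform(y, ids), rcond=None)

Note that no intercept appears in the demeaned regression; the transformation eliminates it along with every other time-constant variable.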
The other assumptions needed for a straight OLS analysis to be valid are that the errors u_it are homoskedastic and serially uncorrelated (across t); see the appendix to this chapter.

There is one subtle point in determining the degrees of freedom for the fixed effects estimator. When we estimate the time-demeaned equation (14.5) by pooled OLS, we have NT total observations and k independent variables. [Notice that there is no intercept in (14.5); it is eliminated by the fixed effects transformation.] Therefore, we should apparently have NT − k degrees of freedom. This calculation is incorrect. For each cross-sectional observation i, we lose one df because of the time-demeaning. In other words, for each i, the demeaned errors ü_it add up to zero when summed across t, so we lose one degree of freedom. (There is no such constraint on the original idiosyncratic errors, u_it.) Therefore, the appropriate degrees of freedom is df = NT − N − k = N(T − 1) − k. Fortunately, modern regression packages that have a fixed effects estimation feature properly compute the df. But if we have to do the time-demeaning and the estimation by pooled OLS ourselves, we need to correct the standard errors and test statistics.

Example 14.1  Effect of Job Training on Firm Scrap Rates

We use the data for three years, 1987, 1988, and 1989, on the 54 firms that reported scrap rates in each year. No firms received grants prior to 1988; in 1988, 19 firms received grants; in 1989, 10 different firms received grants. Therefore, we must also allow for the possibility that the additional job training in 1988 made workers more productive in 1989. This is easily done by including a lagged value of the grant indicator. We also include year dummies for 1988 and 1989. The results are given in Table 14.1.

Table 14.1  Fixed Effects Estimation of the Scrap Rate Equation

Dependent variable: log(scrap)

Independent Variables    Coefficient    Standard Error
d88                      −.080          .109
d89                      −.247          .133
grant                    −.252          .151
grant_{−1}               −.422          .210

Observations: 162
Degrees of freedom: 104
R-squared: .201

We have reported the results in a way that emphasizes the need to interpret the estimates in light of the unobserved effects model, (14.4). We are explicitly controlling for the unobserved, time-constant effects in a_i. The time-demeaning allows us to estimate the βj, but (14.5) is not the best equation for interpreting the estimates.
Interestingly, the estimated lagged effect of the training grant is substantially larger than the contemporaneous effect: job training has an effect at least one year later. Because the dependent variable is in logarithmic form, obtaining a grant in 1988 is predicted to lower the firm scrap rate in 1989 by about 34.4% [exp(−.422) − 1 ≈ −.344]; the coefficient on grant_{−1} is significant at the 5% level against a two-sided alternative. The coefficient on grant is significant at the 10% level, and the size of the coefficient is hardly trivial. Notice the df is obtained as N(T − 1) − k = 54(3 − 1) − 4 = 104.

The coefficient on d89 indicates that the scrap rate was substantially lower in 1989 than in the base year, 1987, even in the absence of job training grants. Thus, it is important to allow for these aggregate effects. If we omitted the year dummies, the secular increase in worker productivity would be attributed to the job training grants. Table 14.1 shows that, even after controlling for aggregate trends in productivity, the job training grants had a large estimated effect.

Finally, it is crucial to allow for the lagged effect in the model. If we omit grant_{−1}, then we are assuming that the effect of job training does not last into the next year. The estimate on grant when we drop grant_{−1} is −.082 (t = −.65); this is much smaller and statistically insignificant.

Exploring Further 14.2
Under the Michigan program, if a firm received a grant in one year, it was not eligible for a grant the following year. What does this imply about the correlation between grant and grant_{−1}?

When estimating an unobserved effects model by fixed effects, it is not clear how we should compute a goodness-of-fit measure. The R-squared given in Table 14.1 is based on the within transformation: it is the R-squared obtained from estimating (14.5). Thus, it is interpreted as the amount of time variation in the y_it that is explained by the time variation in the explanatory variables. Other ways of computing R-squared are possible, one of which we discuss later.

Although time-constant variables cannot be included by themselves in a fixed effects model, they can be interacted with variables that change over time and, in particular, with year dummy variables. For example, in a wage equation where education is constant over time for each individual in our sample, we can interact education with each year dummy to see how the return to education has changed over time. But we cannot use fixed effects to estimate the return to education in the base period, which means we cannot estimate the return to education in any period; we can only see how the return to education in each year differs from that in the base period. (Section 14.3 describes an approach that allows coefficients on time-constant variables to be estimated while preserving the fixed effects nature of the analysis.)

When we include a full set of year dummies (that is, year dummies for all years but the first), we cannot estimate the effect of any variable whose change across time is constant. An example is years of experience in a panel data set where each person works in every year, so that experience always increases by one in each year for every person in the sample. The presence of a_i accounts for differences across people in their years of experience in the initial time period. But then the effect of a one-year increase in experience cannot be distinguished from the aggregate time effects, because experience increases by the same amount for everyone. This would also be true if, in place of separate year dummies, we used a linear time trend: for each person, experience cannot be distinguished from a linear trend.
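As an illustration of interacting a time-constant variable with year dummies, the sketch below builds columns like d81·educ, …, d87·educ (used in the next example) from stacked year and educ arrays; the array and function names here are hypothetical, and the base year is omitted to avoid perfect collinearity with the year dummies.

    import numpy as np

    def year_educ_interactions(year, educ, base_year=1980):
        # One column per non-base year: 1[year == y] * educ.
        years = [y for y in np.unique(year) if y != base_year]
        return np.column_stack([(year == y).astype(float) * educ
                                for y in years])

These columns survive the within transformation because they vary over time even though educ itself does not.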
Example 14.2  Has the Return to Education Changed over Time?

The data in WAGEPAN are from Vella and Verbeek (1998). Each of the 545 men in the sample worked in every year from 1980 through 1987. Some variables in the data set change over time: experience, marital status, and union status are the three important ones. Other variables do not change: race and education are the key examples. If we use fixed effects (or first differencing), we cannot include race, education, or experience in the equation. However, we can include interactions of educ with year dummies for 1981 through 1987 to test whether the return to education was constant over this time period. We use log(wage) as the dependent variable, dummy variables for marital and union status, a full set of year dummies, and the interaction terms d81·educ, d82·educ, …, d87·educ.

The estimates on these interaction terms are all positive, and they generally get larger for more recent years. The largest coefficient of .030 is on d87·educ, with t = 2.48. In other words, the return to education is estimated to be about 3 percentage points larger in 1987 than in the base year, 1980. (We do not have an estimate of the return to education in the base year for the reasons given earlier.) The other significant interaction term is d86·educ (coefficient = .027, t = 2.23). The estimates on the earlier years are smaller and insignificant at the 5% level against a two-sided alternative. If we do a joint F test for significance of all seven interaction terms, we get p-value = .28: this gives an example where a set of variables is jointly insignificant even though some variables are individually significant. [The df for the F test are 7 and 3,799; the second of these comes from N(T − 1) − k = 545(8 − 1) − 16 = 3,799.] Generally, the results are consistent with an increase in the return to education over this period.

14.1a The Dummy Variable Regression

A traditional view of the fixed effects approach is to assume that the unobserved effect, a_i, is a parameter to be estimated for each i. Thus, in equation (14.4), a_i is the intercept for person i (or firm i, city i, and so on) that is to be estimated along with the βj. (Clearly, we cannot do this with a single cross section: there would be N + k parameters to estimate with only N observations. We need at least two time periods.) The way we estimate an intercept for each i is to put in a dummy variable for each cross-sectional observation, along with the explanatory variables (and probably dummy variables for each time period). This method is usually called the dummy variable regression. Even when N is not very large (say, N = 54, as in Example 14.1), this results in many explanatory variables: in most cases, too many to explicitly carry out the regression. Thus, the dummy variable method is not very practical for panel data sets with many cross-sectional observations.

Nevertheless, the dummy variable regression has some interesting features. Most importantly, it gives us exactly the same estimates of the βj that we would obtain from the regression on time-demeaned data, and the standard errors and other major statistics are identical. Therefore, the fixed effects estimator can be obtained by the dummy variable regression.
One benefit of the dummy variable regression is that it properly computes the degrees of freedom directly. This is a minor advantage now that many econometrics packages have programmed fixed effects options.

The R-squared from the dummy variable regression is usually rather high. This occurs because we are including a dummy variable for each cross-sectional unit, which explains much of the variation in the data. For example, if we estimate the unobserved effects model in Example 13.8 by fixed effects using the dummy variable regression (which is possible with N = 22), then R² = .933. We should not get too excited about this large R-squared: it is not surprising that we can explain much of the variation in unemployment claims using both year and city dummies. Just as in Example 13.8, the estimate on the EZ dummy variable is more important than R².

The R-squared from the dummy variable regression can be used to compute F tests in the usual way, assuming, of course, that the classical linear model assumptions hold (see the chapter appendix). In particular, we can test the joint significance of all of the cross-sectional dummies (N − 1, since one unit is chosen as the base group). The unrestricted R-squared is obtained from the regression with all of the cross-sectional dummies; the restricted R-squared omits these. In the vast majority of applications, the dummy variables will be jointly significant.

Occasionally, the estimated intercepts, say â_i, are of interest. This is the case if we want to study the distribution of the â_i across i, or if we want to pick a particular firm or city to see whether its â_i is above or below the average value in the sample. These estimates are directly available from the dummy variable regression, but they are rarely reported by packages that have fixed effects routines (for the practical reason that there are so many â_i). After fixed effects estimation with N of any size, the â_i are pretty easy to compute:

â_i = ȳ_i − β̂1 x̄_i1 − … − β̂k x̄_ik, i = 1, …, N,   (14.6)

where the overbar refers to the time averages and the β̂j are the fixed effects estimates. For example, if we have estimated a model of crime while controlling for various time-varying factors, we can obtain â_i for a city to see whether the unobserved fixed effects that contribute to crime are above or below average.
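Equation (14.6) is easy to code directly. A sketch (the helper name is ours, not a package routine), assuming stacked arrays as before:

    import numpy as np

    def fe_intercepts(y, X, ids, beta_fe):
        # a_i-hat = ybar_i - xbar_i' beta_hat for each unit (eq. 14.6).
        units = np.unique(ids)
        a_hat = np.array([y[ids == g].mean()
                          - X[ids == g].mean(axis=0) @ beta_fe
                          for g in units])
        return units, a_hat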
Some econometrics packages that support fixed effects estimation report an "intercept," which can cause confusion in light of our earlier claim that the time-demeaning eliminates all time-constant variables, including an overall intercept [see equation (14.5)]. Reporting an overall intercept in fixed effects (FE) estimation arises from viewing the a_i as parameters to estimate. Typically, the intercept reported is the average across i of the â_i. In other words, the overall intercept is actually the average of the individual-specific intercepts, which is an unbiased, consistent estimator of α = E(a_i).

In most studies, the β̂j are of interest, and so the time-demeaned equations are used to obtain these estimates. Further, it is usually best to view the a_i as omitted variables that we control for through the within transformation. The sense in which the a_i can be estimated is generally weak. In fact, even though â_i is unbiased (under Assumptions FE.1 through FE.4 in the chapter appendix), it is not consistent with a fixed T as N → ∞. The reason is that, as we add each additional cross-sectional observation, we add a new a_i. No information accumulates on each a_i when T is fixed. With larger T, we can get better estimates of the a_i, but most panel data sets are of the large-N-and-small-T variety.

14.1b Fixed Effects or First Differencing?

So far, setting aside pooled OLS, we have seen two competing methods for estimating unobserved effects models. One involves differencing the data, and the other involves time-demeaning. How do we know which one to use?

We can eliminate one case immediately: when T = 2, the FE and FD estimates, as well as all test statistics, are identical, and so it does not matter which we use. Of course, the equivalence between the FE and FD estimates requires that we estimate the same model in each case. In particular, as we discussed in Chapter 13, it is natural to include an intercept in the FD equation; this intercept is actually the intercept for the second time period in the original model written for the two time periods. Therefore, FE estimation must include a dummy variable for the second time period in order to be identical to the FD estimates that include an intercept.

With T = 2, FD has the advantage of being straightforward to implement in any econometrics or statistical package that supports basic data manipulation, and it is easy to compute heteroskedasticity-robust statistics after FD estimation (because when T = 2, FD estimation is just a cross-sectional regression).

When T ≥ 3, the FE and FD estimators are not the same. Since both are unbiased under Assumptions FE.1 through FE.4, we cannot use unbiasedness as a criterion. Further, both are consistent (with T fixed as N → ∞) under FE.1 through FE.4. For large N and small T, the choice between FE and FD hinges on the relative efficiency of the estimators, and this is determined by the serial correlation in the idiosyncratic errors, u_it. (We will assume homoskedasticity of the u_it, since efficiency comparisons require homoskedastic errors.)

When the u_it are serially uncorrelated, fixed effects is more efficient than first differencing (and the standard errors reported from fixed effects are valid). Since the unobserved effects model is typically stated (sometimes only implicitly) with serially uncorrelated idiosyncratic errors, the FE estimator is used more than the FD estimator. But we should remember that this assumption can be false. In many applications, we can expect the unobserved factors that change over time to be serially correlated. If u_it follows a random walk (which means that there is very substantial, positive serial correlation), then the difference Δu_it is serially uncorrelated, and first differencing is better. In many cases, the u_it exhibit some positive serial correlation, but perhaps not as much as a random walk. Then, we cannot easily compare the efficiency of the FE and FD estimators.
It is difficult to test whether the u_it are serially uncorrelated after FE estimation: we can estimate the time-demeaned errors, ü_it, but not the u_it. However, in Section 13.3, we showed how to test whether the differenced errors, Δu_it, are serially uncorrelated. If this seems to be the case, FD can be used. If there is substantial negative serial correlation in the Δu_it, FE is probably better. It is often a good idea to try both: if the results are not sensitive, so much the better.

When T is large, and especially when N is not very large (for example, N = 20 and T = 30), we must exercise caution in using the fixed effects estimator. Although exact distributional results hold for any N and T under the classical fixed effects assumptions, inference can be very sensitive to violations of the assumptions when N is small and T is large. In particular, if we are using unit root processes (see Chapter 11), the spurious regression problem can arise. First differencing has the advantage of turning an integrated time series process into a weakly dependent process. Therefore, if we apply first differencing, we can appeal to the central limit theorem even in cases where T is larger than N. Normality in the idiosyncratic errors is not needed, and heteroskedasticity and serial correlation can be dealt with as we touched on in Chapter 13. Inference with the fixed effects estimator is potentially more sensitive to nonnormality, heteroskedasticity, and serial correlation in the idiosyncratic errors.

Like the first difference estimator, the fixed effects estimator can be very sensitive to classical measurement error in one or more explanatory variables. However, if each x_itj is uncorrelated with u_it, but the strict exogeneity assumption is otherwise violated (for example, a lagged dependent variable is included among the regressors, or there is feedback between u_it and future outcomes of the explanatory variable), then the FE estimator likely has substantially less bias than the FD estimator (unless T = 2). The important theoretical fact is that the bias in the FD estimator does not depend on T, while that for the FE estimator tends to zero at the rate 1/T. [See Wooldridge (2010, Section 10.7) for details.]

Generally, it is difficult to choose between FE and FD when they give substantively different results. It makes sense to report both sets of results and to try to determine why they differ.
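One practical way to act on this advice is the Section 13.3 style check on the FD residuals: regress ê_it on its own within-unit lag and look at the slope. A sketch under the usual stacked-array assumptions (the helper name is ours): if the levels errors u_it were serially uncorrelated, the differenced errors have first-order autocorrelation of about −.5, so a slope near −.5 points toward FE, while a slope near zero is what random-walk-like u_it would produce, favoring FD.

    import numpy as np

    def fd_resid_ar1(ehat, ids):
        # Slope from regressing FD residuals on their within-unit lag.
        # Assumes rows are sorted by unit, then by time.
        lag_ok = np.r_[False, ids[1:] == ids[:-1]]   # rows with a valid lag
        e_t = ehat[lag_ok]
        e_lag = ehat[np.flatnonzero(lag_ok) - 1]
        return (e_lag @ e_t) / (e_lag @ e_lag)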
14.1c Fixed Effects with Unbalanced Panels

Some panel data sets, especially on individuals or firms, have missing years for at least some cross-sectional units in the sample. In this case, we call the data set an unbalanced panel. The mechanics of fixed effects estimation with an unbalanced panel are not much more difficult than with a balanced panel. If T_i is the number of time periods for cross-sectional unit i, we simply use these T_i observations in doing the time-demeaning. The total number of observations is then T_1 + T_2 + … + T_N. As in the balanced case, one degree of freedom is lost for every cross-sectional observation due to the time-demeaning. Any regression package that does fixed effects makes the appropriate adjustment for this loss. The dummy variable regression also goes through in exactly the same way as with a balanced panel, and the df is appropriately obtained.

It is easy to see that units for which we have only a single time period play no role in a fixed effects analysis. The time-demeaning for such observations yields all zeros, which are not used in the estimation. (If T_i is at most two for all i, we can use first differencing: if T_i = 1 for any i, we do not have two periods to difference.)

The more difficult issue with an unbalanced panel is determining why the panel is unbalanced. With cities and states, for example, data on key variables are sometimes missing for certain years. Provided the reason we have missing data for some i is not correlated with the idiosyncratic errors, u_it, the unbalanced panel causes no problems. When we have data on individuals, families, or firms, things are trickier. Imagine, for example, that we obtain a random sample of manufacturing firms in 1990, and we are interested in testing how unionization affects firm profitability. Ideally, we can use a panel data analysis to control for unobserved worker and management characteristics that affect profitability and might also be correlated with the fraction of the firm's workforce that is unionized. If we collect data again in subsequent years, some firms may be lost because they have gone out of business or have merged with other companies. If so, we probably have a nonrandom sample in subsequent time periods. The question is: If we apply fixed effects to the unbalanced panel, when will the estimators be unbiased (or at least consistent)?

If the reason a firm leaves the sample (called attrition) is correlated with the idiosyncratic error (those unobserved factors that change over time and affect profits), then the resulting sample selection problem (see Chapter 9) can cause biased estimators. This is a serious consideration in this example. Nevertheless, one useful thing about a fixed effects analysis is that it does allow attrition to be correlated with a_i, the unobserved effect. The idea is that, with the initial sampling, some units are more likely to drop out of the survey, and this is captured by a_i.

Example 14.3  Effect of Job Training on Firm Scrap Rates

We add two variables to the analysis in Table 14.1: log(sales_it) and log(employ_it), where sales is annual firm sales and employ is the number of employees. Three of the 54 firms drop out of the analysis entirely because they do not have sales or employment data. Five additional observations are lost due to missing data on one or both of these variables for some years, leaving us with n = 148. Using fixed effects on the unbalanced panel does not change the basic story, although the estimated grant effect gets larger: β̂_grant = −.297, t_grant = −1.89; β̂_grant−1 = −.536, t_grant−1 = −2.389.

Solving general attrition problems in panel data is complicated and beyond the scope of this text. [See, for example, Wooldridge (2010, Chapter 19).]

14.2 Random Effects Models

We begin with the same unobserved effects model as before,

y_it = β0 + β1 x_it1 + … + βk x_itk + a_i + u_it,   (14.7)

where we explicitly include an intercept so that we can make the assumption that the unobserved effect, a_i, has zero mean (without loss of generality). We would usually allow for time dummies among the explanatory variables as well. In using fixed effects or first differencing, the goal is to eliminate a_i because it is thought to be correlated with one or more of the x_itj. But suppose we think a_i is uncorrelated with each explanatory variable in all time periods.
Then, using a transformation to eliminate a_i results in inefficient estimators.

Equation (14.7) becomes a random effects model when we assume that the unobserved effect a_i is uncorrelated with each explanatory variable:

Cov(x_itj, a_i) = 0, t = 1, 2, …, T; j = 1, 2, …, k.   (14.8)

In fact, the ideal random effects assumptions include all of the fixed effects assumptions plus the additional requirement that a_i is independent of all explanatory variables in all time periods. (See the chapter appendix for the actual assumptions used.) If we think the unobserved effect a_i is correlated with any explanatory variables, we should use first differencing or fixed effects.

Under (14.8), along with the random effects assumptions, how should we estimate the βj? It is important to see that, if we believe that a_i is uncorrelated with the explanatory variables, the βj can be consistently estimated by using a single cross section: there is no need for panel data at all. But using a single cross section disregards much useful information in the other time periods. We can also use the data in a pooled OLS procedure: just run OLS of y_it on the explanatory variables (and probably the time dummies). This, too, produces consistent estimators of the βj under the random effects assumption. But it ignores a key feature of the model. If we define the composite error term as v_it = a_i + u_it, then (14.7) can be written as

y_it = β0 + β1 x_it1 + … + βk x_itk + v_it.   (14.9)

Because a_i is in the composite error in each time period, the v_it are serially correlated across time. In fact, under the random effects assumptions,

Corr(v_it, v_is) = σ²_a / (σ²_a + σ²_u), t ≠ s,

where σ²_a = Var(a_i) and σ²_u = Var(u_it). This (necessarily positive) serial correlation in the error term can be substantial, and, because the usual pooled OLS standard errors ignore this correlation, they will be incorrect, as will the usual test statistics. In Chapter 12, we showed how generalized least squares can be used to estimate models with autoregressive serial correlation. We can also use GLS to solve the serial correlation problem here. For the procedure to have good properties, we should have large N and relatively small T. We assume that we have a balanced panel, although the method can be extended to unbalanced panels.

Deriving the GLS transformation that eliminates serial correlation in the errors requires sophisticated matrix algebra [see, for example, Wooldridge (2010, Chapter 10)]. But the transformation itself is simple. Define

θ = 1 − [σ²_u / (σ²_u + T σ²_a)]^{1/2},   (14.10)

which is between zero and one. Then, the transformed equation turns out to be

y_it − θȳ_i = β0(1 − θ) + β1(x_it1 − θx̄_i1) + … + βk(x_itk − θx̄_ik) + (v_it − θv̄_i),   (14.11)

where the overbar again denotes the time averages. This is a very interesting equation, as it involves quasi-demeaned data on each variable. The fixed effects estimator subtracts the time averages from the corresponding variable. The random effects transformation subtracts a fraction of that time average, where the fraction depends on σ²_u, σ²_a, and the number of time periods, T. The GLS estimator is simply the pooled OLS estimator of equation (14.11). It is hardly obvious that the errors in (14.11) are serially uncorrelated, but they are (see Problem 3).
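To get a feel for equation (14.10), here is a tiny numeric illustration with made-up variance values: θ grows toward one as T (or σ²_a relative to σ²_u) grows, which is why RE approaches FE in those cases.

    import numpy as np

    sigma_u2, sigma_a2 = 1.0, 0.5                 # hypothetical variances
    for T in (2, 5, 20):
        theta = 1 - np.sqrt(sigma_u2 / (sigma_u2 + T * sigma_a2))
        print(T, round(theta, 3))                 # 0.293, 0.465, 0.698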
The transformation in (14.11) allows for explanatory variables that are constant over time, and this is one advantage of random effects (RE) over either fixed effects or first differencing. This is possible because RE assumes that the unobserved effect is uncorrelated with all explanatory variables, whether the explanatory variables are fixed over time or not. Thus, in a wage equation, we can include a variable such as education even if it does not change over time. But we are assuming that education is uncorrelated with a_i, which contains ability and family background. In many applications, the whole reason for using panel data is to allow the unobserved effect to be correlated with the explanatory variables.

The parameter θ is never known in practice, but it can always be estimated. There are different ways to do this, which may be based on pooled OLS or fixed effects, for example. Generally, θ̂ takes the form θ̂ = 1 − {1/[1 + T(σ̂²_a/σ̂²_u)]}^{1/2}, where σ̂²_a is a consistent estimator of σ²_a and σ̂²_u is a consistent estimator of σ²_u. These estimators can be based on the pooled OLS or fixed effects residuals. One possibility is

σ̂²_a = [NT(T − 1)/2 − (k + 1)]⁻¹ Σ_{i=1}^{N} Σ_{t=1}^{T−1} Σ_{s=t+1}^{T} v̂_it v̂_is,

where the v̂_it are the residuals from estimating (14.9) by pooled OLS. Given this, we can estimate σ²_u by using σ̂²_u = σ̂²_v − σ̂²_a, where σ̂²_v is the square of the usual standard error of the regression from pooled OLS. [See Wooldridge (2010, Chapter 10) for additional discussion of these estimators.]

Many econometrics packages support estimation of random effects models and automatically compute some version of θ̂. The feasible GLS estimator that uses θ̂ in place of θ is called the random effects estimator. Under the random effects assumptions in the chapter appendix, the estimator is consistent (not unbiased) and asymptotically normally distributed as N gets large with fixed T. The properties of the random effects (RE) estimator with small N and large T are largely unknown, although it has certainly been used in such situations.
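The quasi-demeaning in (14.11) is mechanical once θ̂ is in hand. A sketch (the helper name is ours); note that a constant regressor passed through the transformation becomes 1 − θ, which is exactly the β0(1 − θ) term in (14.11), so the constant column should be transformed along with everything else.

    import numpy as np

    def quasi_demean(Z, ids, theta):
        # Subtract theta times the unit time average (eq. 14.11).
        Zq = np.asarray(Z, dtype=float).copy()
        for g in np.unique(ids):
            Zq[ids == g] -= theta * Zq[ids == g].mean(axis=0)
        return Zq

    # RE = pooled OLS of quasi_demean(y, ids, theta_hat) on
    # quasi_demean(X_with_constant, ids, theta_hat).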
Equation (14.11) allows us to relate the RE estimator to both pooled OLS and fixed effects. Pooled OLS is obtained when θ = 0, and FE is obtained when θ = 1. In practice, the estimate θ̂ is never zero or one. But if θ̂ is close to zero, the RE estimates will be close to the pooled OLS estimates. This is the case when the unobserved effect, a_i, is relatively unimportant (because it has small variance relative to σ²_u). It is more common for σ²_a to be large relative to σ²_u, in which case θ̂ will be closer to unity. As T gets large, θ̂ tends to one, and this makes the RE and FE estimates very similar.

We can gain more insight on the relative merits of random effects versus fixed effects by writing the quasi-demeaned error in equation (14.11) as v_it − θv̄_i = (1 − θ)a_i + u_it − θū_i. This simple expression makes it clear that, in the transformed equation, the unobserved effect is weighted by (1 − θ). Although correlation between a_i and one or more x_itj causes inconsistency in the random effects estimation, we see that the correlation is attenuated by the factor (1 − θ). As θ → 1, the bias term goes to zero, as it must, because the RE estimator tends to the FE estimator. If θ is close to zero, we are leaving a larger fraction of the unobserved effect in the error term, and, as a consequence, the asymptotic bias of the RE estimator will be larger.

In applications of FE and RE, it is usually informative also to compute the pooled OLS estimates. Comparing the three sets of estimates can help us determine the nature of the biases caused by leaving the unobserved effect, a_i, entirely in the error term (as does pooled OLS) or partially in the error term (as does the RE transformation). But we must remember that, even if a_i is uncorrelated with all explanatory variables in all time periods, the pooled OLS standard errors and test statistics are generally invalid: they ignore the often substantial serial correlation in the composite errors, v_it = a_i + u_it. As we mentioned in Chapter 13 (see Example 13.9), it is possible to compute standard errors and test statistics that are robust to arbitrary serial correlation (and heteroskedasticity) in v_it, and popular statistics packages often allow this option. [See, for example, Wooldridge (2010, Chapter 10).]

Example 14.4  A Wage Equation Using Panel Data

We again use the data in WAGEPAN to estimate a wage equation for men. We use three methods: pooled OLS, random effects, and fixed effects. In the first two methods, we can include educ and race dummies (black and hispan), but these drop out of the fixed effects analysis. The time-varying variables are exper, exper², union, and married. As we discussed in Section 14.1, exper is dropped in the FE analysis (although exper² remains). Each regression also contains a full set of year dummies. The estimation results are in Table 14.2.

Table 14.2  Three Different Estimators of a Wage Equation

Dependent variable: log(wage)

Independent Variables    Pooled OLS       Random Effects    Fixed Effects
educ                     .091 (.005)      .092 (.011)       (dropped)
black                    −.139 (.024)     −.139 (.048)      (dropped)
hispan                   .016 (.021)      .022 (.043)       (dropped)
exper                    .067 (.014)      .106 (.015)       (dropped)
exper²                   −.0024 (.0008)   −.0047 (.0007)    −.0052 (.0007)
married                  .108 (.016)      .064 (.017)       .047 (.018)
union                    .182 (.017)      .106 (.018)       .080 (.019)

The coefficients on educ, black, and hispan are similar for the pooled OLS and random effects estimations. The pooled OLS standard errors are the usual OLS standard errors, and these underestimate the true standard errors because they ignore the positive serial correlation; we report them here for comparison only. The experience profile is somewhat different, and both the marriage and union premiums fall notably in the random effects estimation.

When we eliminate the unobserved effect entirely by using fixed effects, the marriage premium falls to about 4.7%, although it is still statistically significant. The drop in the marriage premium is consistent with the idea that men who are more able (as captured by a higher unobserved effect, a_i) are more likely to be married.
Therefore, in the pooled OLS estimation, a large part of the marriage premium reflects the fact that men who are married would earn more even if they were not married. The remaining 4.7% has at least two possible explanations: (1) marriage really makes men more productive, or (2) employers pay married men a premium because marriage is a signal of stability. We cannot distinguish between these two hypotheses.

The estimate of θ for the random effects estimation is θ̂ = .643, which helps explain why, on the time-varying variables, the RE estimates lie closer to the FE estimates than to the pooled OLS estimates.

Exploring Further 14.3
The union premium estimated by fixed effects is about 10 percentage points lower than the OLS estimate. What does this strongly suggest about the correlation between union and the unobserved effect?

14.2a Random Effects or Fixed Effects?

Because fixed effects allows arbitrary correlation between a_i and the x_itj, while random effects does not, FE is widely thought to be a more convincing tool for estimating ceteris paribus effects. Still, random effects is applied in certain situations. Most obviously, if the key explanatory variable is constant over time, we cannot use FE to estimate its effect on y. For example, in Table 14.2, we must rely on the RE (or pooled OLS) estimate of the return to education. Of course, we can only use random effects because we are willing to assume the unobserved effect is uncorrelated with all explanatory variables. Typically, if one uses random effects, as many time-constant controls as possible are included among the explanatory variables. (With an FE analysis, it is not necessary to include such controls.) RE is preferred to pooled OLS because RE is generally more efficient.

If our interest is in a time-varying explanatory variable, is there ever a case to use RE rather than FE? Yes, but situations in which Cov(x_itj, a_i) = 0 should be considered the exception rather than the rule. If the key policy variable is set experimentally (say, each year, children are randomly assigned to classes of different sizes), then random effects would be appropriate for estimating the effect of class size on performance. Unfortunately, in most cases the regressors are themselves outcomes of choice processes and likely to be correlated with individual preferences and abilities as captured by a_i.

It is still fairly common to see researchers apply both random effects and fixed effects, and then formally test for statistically significant differences in the coefficients on the time-varying explanatory variables. (So, in Table 14.2, these would be the coefficients on exper², married, and union.) Hausman (1978) first proposed such a test, and some econometrics packages routinely compute the Hausman test under the full set of random effects assumptions listed in the appendix to this chapter. The idea is that one uses the random effects estimates unless the Hausman test rejects (14.8). In practice, a failure to reject means either that the RE and FE estimates are sufficiently close so that it does not matter which is used, or the sampling variation is so large in the FE estimates that one cannot conclude practically significant differences are statistically significant. In the latter case, one is left to wonder whether there is enough information in the data to provide precise estimates of the coefficients. A rejection using the Hausman test is taken to mean that the key RE assumption, (14.8), is false, and then the FE estimates are used.
Naturally, as in all applications of statistical inference, one should distinguish between a practically significant difference and a statistically significant difference; Wooldridge (2010, Chapter 10) contains further discussion. In the next section, we discuss an alternative, computationally simpler approach to choosing between the RE and FE approaches.

A final word of caution. In reading empirical work, you may find that some authors decide on FE versus RE estimation based on whether the a_i are properly viewed as parameters to estimate or as random variables. Such considerations are usually wrongheaded. In this chapter, we have treated the a_i as random variables in the unobserved effects model (14.7), regardless of how we decide to estimate the βj. As we have emphasized, the key issue that determines whether we use FE or RE is whether we can plausibly assume a_i is uncorrelated with all x_itj. Nevertheless, in some applications of panel data methods, we cannot treat our sample as a random sample from a large population, especially when the unit of observation is a large geographical unit (say, states or provinces). Then, it often makes sense to think of each a_i as a separate intercept to estimate for each cross-sectional unit. In this case, we use fixed effects. (Remember, using FE is mechanically the same as allowing a different intercept for each cross-sectional unit.) Fortunately, whether or not we engage in the philosophical debate about the nature of a_i, FE is almost always much more convincing than RE for policy analysis using aggregated data.

14.3 The Correlated Random Effects Approach

In applications where it makes sense to view the a_i (unobserved effects) as being random variables, along with the observed variables we draw, there is an alternative to fixed effects that still allows a_i to be correlated with the observed explanatory variables. To describe the approach, consider again the simple model in equation (14.1), with a single time-varying explanatory variable x_it. Rather than assume a_i is uncorrelated with {x_it: t = 1, 2, …, T} (which is the random effects approach), or take away time averages to remove a_i (the fixed effects approach), we might instead model correlation between a_i and {x_it: t = 1, 2, …, T}. Because a_i is, by definition, constant over time, allowing it to be correlated with the average level of the x_it has a certain appeal. More specifically, let x̄_i = T⁻¹ Σ_{t=1}^{T} x_it be the time average, as before. Suppose we assume the simple linear relationship

a_i = α + γx̄_i + r_i,   (14.12)

where we assume r_i is uncorrelated with each x_it. Because x̄_i is a linear function of the x_it,

Cov(x̄_i, r_i) = 0.   (14.13)

Equations (14.12) and (14.13) imply that a_i and x̄_i are correlated whenever γ ≠ 0.

The correlated random effects (CRE) approach uses (14.12) in conjunction with (14.1): substituting the former in the latter gives

y_it = βx_it + α + γx̄_i + r_i + u_it = α + βx_it + γx̄_i + r_i + u_it.   (14.14)

Equation (14.14) is interesting because it still has a composite error term, r_i + u_it, consisting of a time-constant unobservable, r_i, and the idiosyncratic shocks, u_it. Importantly, assumption (14.8) holds when we replace a_i with r_i. Also, because u_it is assumed to be uncorrelated with x_is, all s and t, u_it is also uncorrelated with x̄_i. All of these assumptions add up to random effects estimation of the equation

y_it = α + βx_it + γx̄_i + r_i + u_it,   (14.15)

which is like the usual equation underlying RE estimation, with the important addition of the time-average variable, x̄_i.
It is the addition of x̄_i that controls for the correlation between a_i and the sequence {x_it: t = 1, 2, …, T}. What is left over, r_i, is uncorrelated with the x_it.

In most econometrics packages, it is easy to compute the unit-specific time averages, x̄_i. Assuming we have done that for each cross-sectional unit i, what can we expect to happen if we apply RE to equation (14.15)? Notice that estimation of (14.15) gives α̂_CRE, β̂_CRE, and γ̂_CRE (the CRE estimators). As far as β̂_CRE goes, the answer is a bit anticlimactic. It can be shown [see, for example, Wooldridge (2010, Chapter 10)] that

β̂_CRE = β̂_FE,   (14.16)

where β̂_FE denotes the FE estimator from equation (14.3). In other words, adding the time average x̄_i and using random effects is the same as subtracting the time averages and using pooled OLS.

Even though (14.15) is not needed to obtain β̂_FE, the equivalence of the CRE and FE estimates of β provides a nice interpretation of FE: it controls for the average level, x̄_i, when measuring the partial effect of x_it on y_it. As an example, suppose that x_it is a tax rate on firm profits in county i in year t, and y_it is some measure of county-level economic output. By including x̄_i, the average tax rate in the county over the T years, we are allowing for systematic differences between historically high-tax and low-tax counties, differences that may also affect economic output.

We can also use equation (14.15) to see why the FE estimators are often much less precise than the RE estimators. If we set γ = 0 in equation (14.15), then we obtain the usual RE estimator of β, β̂_RE. This means that correlation between x_it and x̄_i has no bearing on the variance of the RE estimator. By contrast, we know from multiple regression analysis in Chapter 3 that correlation between x_it and x̄_i (that is, multicollinearity) can result in a higher variance for β̂_FE. Sometimes, the variance is much higher, particularly when there is little variation in x_it across t, in which case x_it and x̄_i tend to be highly (positively) correlated. In the limiting case where there is no variation across time for any i, the correlation is perfect, and FE fails to provide an estimate of β.

Apart from providing a synthesis of the FE and RE approaches, are there other reasons to consider the CRE approach, even if it simply delivers the usual FE estimate of β? Yes, at least two. First, the CRE approach provides a simple, formal way of choosing between the FE and RE approaches. As we just discussed, the RE approach sets γ = 0, while FE estimates γ. Because we have γ̂_CRE and its standard error (obtained from RE estimation of (14.15)), we can construct a t test of H0: γ = 0 against H1: γ ≠ 0. (The appendix discusses how to make this test robust to heteroskedasticity and serial correlation in {u_it}.) If we reject H0 at a sufficiently small significance level, we reject RE in favor of FE. As usual, especially with a large cross section, it is important to distinguish between a statistical rejection and economically important differences.
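A sketch of the CRE recipe (names are ours): append the unit time averages to the regressors, estimate the augmented equation, and inspect the coefficients on the averages. In the balanced case, even pooled OLS on the augmented equation reproduces the FE slopes (the Mundlak result behind (14.16)); RE estimation of the same equation does as well, and a (robust) test that the time-average coefficients are zero is the RE-versus-FE check described above.

    import numpy as np

    def add_time_averages(X, ids):
        # Augment [1, x_it] with xbar_i, as in equation (14.15).
        Xbar = np.empty_like(X, dtype=float)
        for g in np.unique(ids):
            Xbar[ids == g] = X[ids == g].mean(axis=0)
        return np.column_stack([np.ones(len(X)), X, Xbar])

    # coefs, *_ = np.linalg.lstsq(add_time_averages(x, ids), y, rcond=None)
    # The slopes on the x columns match the FE estimates; the slopes on
    # the xbar columns estimate gamma.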
A second reason to study the CRE approach is that it provides a way to include time-constant explanatory variables in what is effectively a fixed effects analysis. For example, let $z_i$ be a variable that does not change over time; it could be gender, say, or an IQ test score determined in childhood. We can easily augment (14.15) to include $z_i$:

$$y_{it} = \alpha + \beta x_{it} + \gamma \bar{x}_i + \delta z_i + r_i + u_{it}, \qquad (14.17)$$

where we do not change the notation for the error term, which no longer includes $z_i$. If we estimate this expanded equation by RE, it can still be shown that the estimate of $\beta$ is the FE estimate from (14.1). In fact, once we include $\bar{x}_i$, we can include any other time-constant variables in the equation, estimate it by RE, and obtain $\hat{\beta}_{FE}$ as the coefficient on $x_{it}$. In addition, we obtain an estimate of $\delta$, although the estimate should be interpreted with caution because it does not necessarily estimate a causal effect of $z_i$ on $y_{it}$.

The same CRE strategy can be applied to models with many time-varying explanatory variables and many time-constant variables. When the equation augmented with the time averages is estimated by RE, the coefficients on the time-varying variables are identical to the FE estimates. As a practical note, when the panel is balanced, there is no need to include the time averages of variables that change over time but not across $i$, the leading case being time period dummies. With $T$ time periods, the time average of a time period dummy is just $1/T$, a constant for all $i$ and $t$; clearly, it makes no sense to add a bunch of constants to an equation that already has an intercept. If the panel data set is unbalanced, then the average of variables such as time dummies can change across $i$: it will depend on how many periods we have for cross-sectional unit $i$. In such cases, the time averages of any variable that changes over time must be included.

Computer Exercise 14 in this chapter illustrates how the CRE approach can be applied to the balanced panel data set in AIRFARE, and how one can test RE versus FE in the CRE framework.

14.3a Unbalanced Panels

The correlated random effects approach also can be applied to unbalanced panels, but some care is required. In order to obtain an estimator that reproduces the fixed effects estimates on the time-varying explanatory variables, one must be careful in constructing the time averages. In particular, for $y$ or any $x_j$, a time period contributes to the time average, $\bar{y}_i$ or $\bar{x}_{ij}$, only if data on all of $(y_{it}, x_{it1}, \ldots, x_{itk})$ are observed. One way to depict the situation is to define a dummy variable, $s_{it}$, which equals one when a complete set of data on $(y_{it}, x_{it1}, \ldots, x_{itk})$ is observed; if any element is missing (including, of course, if the entire time period is missing), then $s_{it} = 0$. The notion of a selection indicator is discussed in more detail in Chapter 17. With this definition, the appropriate time average of $\{y_{it}\}$ can be written as

$$\bar{y}_i = T_i^{-1} \sum_{t=1}^{T} s_{it} y_{it},$$

where $T_i$ is the total number of complete time periods for cross-sectional observation $i$. In other words, we only average over the time periods that have a complete set of data.
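As a concrete illustration, here is one way to build such time averages with pandas (a sketch; the DataFrame df, the identifier column id, and the variable names are hypothetical). The selection indicator $s_{it}$ is simply a complete-case flag for each row:

```python
import pandas as pd

cols = ['y', 'x1', 'x2']                     # all variables in the equation
s_it = df[cols].notna().all(axis=1)          # s_it = 1 iff the full row is observed

# Average each variable over complete periods only, so each unit uses its own T_i
for c in cols:
    df[c + '_bar'] = df[c].where(s_it).groupby(df['id']).transform('mean')
```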
Another subtle point is that when time period dummies (or any other variables that change only across $t$ and not $i$) are included in the model, we must now include their time averages, unlike in the balanced case, where the time averages are just constants. For example, if $\{w_t: t = 1, \ldots, T\}$ is an aggregate time variable, such as a time dummy or a linear time trend, then

$$\bar{w}_i = T_i^{-1} \sum_{t=1}^{T} s_{it} w_t.$$

Because of the unbalanced nature of the panel, $\bar{w}_i$ almost always varies somewhat across $i$, unless the exact same time periods are missing for all cross-sectional units. As with variables that actually change across $i$ and $t$, the time averages of aggregate time effects are easy to obtain in many software packages.

The mechanics of the random effects estimator also change somewhat when we have an unbalanced panel, and this is true whether we use the traditional random effects estimator or the CRE version. Namely, the parameter $\theta$ in equation (14.10), used in equation (14.11) to obtain the quasi-demeaned data, depends on $i$ through the number of time periods observed for unit $i$: specifically, simply replace $T$ in equation (14.10) with $T_i$. (A small numerical sketch of this unit-specific $\theta_i$ appears at the end of this subsection.) Econometrics packages that support random effects estimation recognize this difference when using unbalanced panels, and so nothing special needs to be done from a user's perspective.

The bottom line is that, once the time averages have been properly obtained, using an equation such as (14.17) is the same as in the balanced case. We can still use a test of statistical significance on the set of time averages to choose between fixed effects and pure random effects, and the CRE approach still allows us to include time-constant variables.

As with fixed effects estimation, a key issue is understanding why the panel data set is unbalanced. In the pure random effects case, the selection indicator $s_{it}$ cannot be correlated with the composite error in equation (14.7), $a_i + u_{it}$, in any time period; otherwise, as discussed in Wooldridge (2010, Chapter 19), the RE estimator is inconsistent. As discussed in Section 14.1, the FE estimator allows for arbitrary correlation between the selection indicator $s_{it}$ and the fixed effect $a_i$. Therefore, the FE estimator is more robust in the context of unbalanced panels. And, as we already know, FE allows arbitrary correlation between time-varying explanatory variables and $a_i$.
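To make the earlier point about the quasi-demeaning parameter concrete: with $T$ replaced by $T_i$, equation (14.10) becomes $\theta_i = 1 - [\sigma_u^2/(\sigma_u^2 + T_i\sigma_a^2)]^{1/2}$. A tiny sketch (the variance values are chosen arbitrarily for illustration) shows how $\theta_i$ rises with the number of observed periods:

```python
import numpy as np

def theta(sigma2_u, sigma2_a, T_i):
    """Quasi-demeaning parameter from eq. (14.10), with T replaced by T_i."""
    return 1.0 - np.sqrt(sigma2_u / (sigma2_u + T_i * sigma2_a))

# A unit observed for more periods is demeaned more heavily (theta closer to 1)
for T_i in (2, 4, 8):
    print(T_i, theta(1.0, 0.5, T_i))
```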
14.4 Applying Panel Data Methods to Other Data Structures

The various panel data methods can be applied to certain data structures that do not involve time. For example, it is common in demography to use siblings (sometimes twins) to account for unobserved family and background characteristics. Usually we want to allow the unobserved family effect, which is common to all siblings within a family, to be correlated with observed explanatory variables. If those explanatory variables vary across siblings within a family, differencing across sibling pairs, or, more generally, using the within transformation within a family, is preferred as an estimation method. By removing the unobserved effect, we eliminate potential bias caused by confounding family background characteristics. Implementing fixed effects on such data structures is rather straightforward in regression packages that support FE estimation.

As an example, Geronimus and Korenman (1992) used pairs of sisters to study the effects of teen childbearing on future economic outcomes. When the outcome is income relative to needs, something that depends on the number of children, the model is

$$\log(incneeds_{fs}) = \beta_0 + \delta_0 sister2_s + \beta_1 teenbrth_{fs} + \beta_2 age_{fs} + \text{other factors} + a_f + u_{fs}, \qquad (14.18)$$

where $f$ indexes family and $s$ indexes a sister within the family. The intercept for the first sister is $\beta_0$, and the intercept for the second sister is $\beta_0 + \delta_0$. The variable of interest is $teenbrth_{fs}$, a binary variable equal to one if sister $s$ in family $f$ had a child while a teenager. The variable $age_{fs}$ is the current age of sister $s$ in family $f$; Geronimus and Korenman also used some other controls. The unobserved variable $a_f$, which changes only across family, is an unobserved family effect or a family fixed effect. The main concern in the analysis is that $teenbrth$ is correlated with the family effect. If so, an OLS analysis that pools across families and sisters gives a biased estimator of the effect of teenage motherhood on economic outcomes. Solving this problem is simple: within each family, difference (14.18) across sisters to get

$$\Delta\log(incneeds) = \delta_0 + \beta_1 \Delta teenbrth + \beta_2 \Delta age + \cdots + \Delta u; \qquad (14.19)$$

this removes the family effect, $a_f$, and the resulting equation can be estimated by OLS. Notice that there is no time element here: the differencing is across sisters within a family. Also, we have allowed for differences in intercepts across sisters in (14.18), which leads to a nonzero intercept in the differenced equation, (14.19). If, in entering the data, the order of the sisters within each family is essentially random, the estimated intercept should be close to zero. But even in such cases it does not hurt to include an intercept in (14.19), and having the intercept allows for the fact that, say, the first sister listed might always be the neediest.

Using 129 sister pairs from the 1982 National Longitudinal Survey of Young Women, Geronimus and Korenman first estimated $\beta_1$ by pooled OLS, obtaining $-.33$ or $-.26$, where the second estimate comes from controlling for family background variables such as parents' education; both estimates are very statistically significant (see Table 3 in Geronimus and Korenman (1992)). Therefore, teenage motherhood appears to have a rather large impact on future family income. However, when the differenced equation is estimated, the coefficient on $teenbrth$ is $-.08$, which is small and statistically insignificant. This suggests that it is largely a woman's family background that affects her future income, rather than teenage childbearing. Geronimus and Korenman looked at several other outcomes and two other data sets; in some cases, the within-family estimates were economically large and statistically significant. They also showed how the effects disappear entirely when the sisters' education levels are controlled for.

Exploring Further 14.4: When using the differencing method, does it make sense to include dummy variables for the mother's and father's race in (14.18)? Explain.
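A sketch of the within-family differencing in (14.19), assuming pandas and statsmodels are available (the DataFrame df and all column names are hypothetical; each family is assumed to contribute exactly two rows, ordered by a sister indicator):

```python
import statsmodels.api as sm

# Difference each variable across the two sisters within a family
d = (df.sort_values(['family', 'sister'])
       .groupby('family')[['lincneeds', 'teenbrth', 'age']]
       .diff()               # second sister minus first; first row per family is NaN
       .dropna())

# OLS on the differenced data; the intercept plays the role of delta_0 in (14.19)
res = sm.OLS(d['lincneeds'], sm.add_constant(d[['teenbrth', 'age']])).fit()
print(res.params)
```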
Ashenfelter and Krueger (1994) used the differencing methodology to estimate the return to education. They obtained a sample of 149 identical twins and collected information on earnings, education, and other variables. Identical twins were used because they should have the same underlying ability, which can be differenced away by using twin differences, rather than OLS on the pooled data. Because identical twins are the same in age, gender, and race, these factors all drop out of the differenced equation. Therefore, Ashenfelter and Krueger regressed the difference in log(earnings) on the difference in education and estimated the return to education to be about 9.2% ($t = 3.83$). Interestingly, this is actually larger than the pooled OLS estimate of 8.4% (which controls for gender, age, and race). Ashenfelter and Krueger also estimated the equation by random effects and obtained 8.7% as the return to education (see Table 5 in their paper). The random effects analysis is mechanically the same as the panel data case with two time periods.

The samples used by Geronimus and Korenman (1992) and by Ashenfelter and Krueger (1994) are examples of matched pairs samples. More generally, fixed and random effects methods can be applied to a cluster sample. A cluster sample has the same appearance as a cross-sectional data set, but there is an important difference: clusters of units are sampled from a population of clusters, rather than sampling individuals from the population of individuals. In the previous examples, each family is sampled from the population of families, and then we obtain data on at least two family members; therefore, each family is a cluster.

As another example, suppose we are interested in modeling individual pension plan participation decisions. One might obtain a random sample of working individuals (say, from the United States), but it is also common to sample firms from a population of firms. Once the firms are sampled, one might collect information on all workers or a subset of workers within each firm. In either case, the resulting data set is a cluster sample because sampling was first at the firm level. Unobserved firm-level characteristics (along with observed firm characteristics) are likely to be present in participation decisions, and this within-firm correlation must be accounted for.

Fixed effects estimation is preferred when we think the unobserved cluster effect, an example of which is $a_i$ in (14.12), is correlated with one or more of the explanatory variables. Then, we can only include explanatory variables that vary, at least somewhat, within clusters. The cluster sizes are rarely the same, so we are effectively using fixed effects methods for unbalanced panels.

Educational data on student outcomes can also come in the form of a cluster sample, where a sample of schools is obtained from the population of schools, and then information on students within each school is obtained. Each school acts as a cluster, and allowing a school effect to be correlated with key explanatory variables (say, whether a student participates in a state-sponsored tutoring program) is likely to be important. Because the rate at which students are tutored likely varies by school, it is probably a good idea to use fixed effects estimation. One often sees authors use, as a shorthand, "I included school fixed effects in the analysis."

The correlated random effects approach can be applied immediately to cluster samples because, for the purposes of estimation, a cluster sample acts like an unbalanced panel. Now, the averages that are added to the equation are within-cluster averages, for example, averages within schools. The only difference with panel data is that the notion of serial correlation in idiosyncratic errors is not relevant.
Nevertheless, as discussed in Wooldridge (2010, Chapter 20), there are still good reasons for using cluster-robust standard errors, whether one uses fixed effects or correlated random effects.

In some cases, the key explanatory variables, often policy variables, change only at the level of the cluster, not within the cluster. In such cases, the fixed effects approach is not applicable. For example, we may be interested in the effects of measured teacher quality on student performance, where each cluster is an elementary school classroom. Because all students within a cluster have the same teacher, eliminating a class effect also eliminates any observed measures of teacher quality. If we have good controls in the equation, we may be justified in applying random effects on the unbalanced cluster sample. As with panel data, the key requirement for RE to produce convincing estimates is that the explanatory variables are uncorrelated with the unobserved cluster effect. Most econometrics packages allow random effects estimation on unbalanced clusters without much effort.

Pooled OLS is also commonly applied to cluster samples when eliminating a cluster effect via fixed effects is infeasible or undesirable. However, as with panel data, the usual OLS standard errors are incorrect unless there is no cluster effect, and so robust standard errors that allow cluster correlation (and heteroskedasticity) should be used. Some regression packages have simple commands to correct the standard errors and the usual test statistics for general within-cluster correlation, as well as heteroskedasticity; these are the same corrections that work for pooled OLS on panel data sets, which we reported in Example 13.9. As an example, Papke (1999) estimates linear probability models for the continuation of defined benefit pension plans based on whether firms adopted defined contribution plans. Because there is likely to be a firm effect that induces correlation across different plans within the same firm, Papke corrects the usual OLS standard errors for cluster sampling, as well as for heteroskedasticity in the linear probability model.
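In statsmodels, for example, the correction is a one-line option on fit(). The sketch below is hypothetical (the DataFrame df, the outcome and regressors, and the firm identifier are invented names), but the cov_type='cluster' mechanism is the standard one:

```python
import statsmodels.formula.api as smf

# Pooled OLS with standard errors clustered at the firm level, in the spirit
# of correcting for a common firm effect across plans within the same firm
res = smf.ols('y ~ x1 + x2', data=df).fit(
    cov_type='cluster', cov_kwds={'groups': df['firm_id']})
print(res.bse)   # cluster-robust standard errors
```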
Before ending this section, some final comments are in order. Given the readily available tools of fixed effects, random effects, and cluster-robust inference, it is tempting to find reasons to use clustering methods where none may exist. For example, if a set of data is obtained from a random sample from the population, then there is usually no reason to account for cluster effects in computing standard errors after OLS estimation. The fact that the units can be put into groups ex post, that is, after the random sample has been obtained, is not a reason to make inference robust to cluster correlation.

To illustrate this point, suppose that, out of the population of fourth-grade students in the United States, a random sample of 50,000 is obtained; these data are properly studied using standard methods for cross-sectional regression. It may be tempting to group the students by, say, the 50 states plus the District of Columbia (assuming a state identifier is included) and then treat the data as a cluster sample. But this would be wrong, and clustering the standard errors at the state level can produce standard errors that are systematically too large. (Or they might be too small, because the asymptotic theory underlying cluster sampling assumes that we have many clusters with each cluster size being relatively small.) In any case, a simple thought experiment shows that clustering cannot be correct. For example, if we know the county of residence for each student, why not cluster at the county level? Or, at a coarser level, we can divide the United States into four census regions and treat those as the clusters; this would give a different set of standard errors that do not have any theoretical justification. Taking this argument to its extreme, one could argue that we have one cluster, the entire United States, in which case the clustered standard errors would not be defined and inference would be impossible. The confusion comes about because the clusters are defined ex post, that is, after the random sample is obtained. In a true cluster sample, the clusters are first drawn from a population of clusters, and then individuals are drawn from the clusters.

One might use clustering methods if, say, a district-level variable is created after the random sample is collected and then used in the student-level equation. This can create unobserved cluster correlation within each district. Recall that the fixed effects estimator, in this case at the district level, is the same as putting in district-level averages. Thus, one might want to account for cluster correlation at the district level in addition to using fixed effects. As shown by Stock and Watson (2008) in the context of panel data, with large cluster sizes the resulting cluster correlation is generally unimportant, but with small cluster sizes one should use the cluster-robust standard errors.

Summary

In this chapter, we have continued our discussion of panel data methods, studying the fixed effects and random effects estimators, and we also described the correlated random effects approach as a unifying framework. Compared with first differencing, the fixed effects estimator is efficient when the idiosyncratic errors are serially uncorrelated (as well as homoskedastic), and we make no assumptions about correlation between the unobserved effect, $a_i$, and the explanatory variables. As with first differencing, any time-constant explanatory variables drop out of the analysis. Fixed effects methods apply immediately to unbalanced panels, but we must assume that the reasons some time periods are missing are not systematically related to the idiosyncratic errors.

The random effects estimator is appropriate when the unobserved effect is thought to be uncorrelated with all the explanatory variables. Then, $a_i$ can be left in the error term, and the resulting serial correlation over time can be handled by generalized least squares estimation. Conveniently, feasible GLS can be obtained by a pooled regression on quasi-demeaned data. The value of the estimated transformation parameter, $\hat{\theta}$, indicates whether the estimates are likely to be closer to the pooled OLS or the fixed effects estimates.
If the full set of random effects assumptions holds, the random effects estimator is asymptotically (as $N$ gets large with $T$ fixed) more efficient than pooled OLS, first differencing, or fixed effects (which are all unbiased, consistent, and asymptotically normal).

The correlated random effects approach to panel data models has become more popular in recent years, primarily because it allows a simple test for choosing between FE and RE, and it allows one to incorporate time-constant variables in an equation that delivers the FE estimates of the time-varying variables.

Finally, the panel data methods studied in Chapters 13 and 14 can be used when working with matched pairs or cluster samples. Differencing or the within transformation eliminates the cluster effect. If the cluster effect is uncorrelated with the explanatory variables, pooled OLS can be used, but the standard errors and test statistics should be adjusted for cluster correlation. Random effects estimation is also a possibility.

Key Terms

Cluster Effect; Cluster Sample; Clustering; Composite Error Term; Correlated Random Effects; Dummy Variable Regression; Fixed Effects Estimator; Fixed Effects Transformation; Matched Pairs Samples; Quasi-Demeaned Data; Random Effects Estimator; Random Effects Model; Time-Demeaned Data; Unbalanced Panel; Unobserved Effects Model; Within Estimator; Within Transformation

Problems

1. Suppose that the idiosyncratic errors in (14.4), $\{u_{it}: t = 1, 2, \ldots, T\}$, are serially uncorrelated with constant variance, $\sigma_u^2$. Show that the correlation between adjacent differences, $\Delta u_{it}$ and $\Delta u_{i,t+1}$, is $-.5$. Therefore, under the ideal FE assumptions, first differencing induces negative serial correlation of a known value.

2. With a single explanatory variable, the equation used to obtain the between estimator is
$$\bar{y}_i = \beta_0 + \beta_1 \bar{x}_i + a_i + \bar{u}_i,$$
where the overbar represents the average over time. We can assume that $E(a_i) = 0$ because we have included an intercept in the equation. Suppose that $\bar{u}_i$ is uncorrelated with $\bar{x}_i$, but $\mathrm{Cov}(x_{it}, a_i) = \sigma_{xa}$ for all $t$ (and $i$, because of random sampling in the cross section).
(i) Letting $\tilde{\beta}_1$ be the between estimator, that is, the OLS estimator using the time averages, show that
$$\mathrm{plim}\ \tilde{\beta}_1 = \beta_1 + \sigma_{xa}/\mathrm{Var}(\bar{x}_i),$$
where the probability limit is defined as $N \to \infty$. [Hint: See equations (5.5) and (5.6).]
(ii) Assume further that the $x_{it}$, for all $t = 1, 2, \ldots, T$, are uncorrelated with constant variance $\sigma_x^2$. Show that $\mathrm{plim}\ \tilde{\beta}_1 = \beta_1 + T(\sigma_{xa}/\sigma_x^2)$.
(iii) If the explanatory variables are not very highly correlated across time, what does part (ii) suggest about whether the inconsistency in the between estimator is smaller when there are more time periods?

3. In a random effects model, define the composite error $v_{it} = a_i + u_{it}$, where $a_i$ is uncorrelated with $u_{it}$ and the $u_{it}$ have constant variance $\sigma_u^2$ and are serially uncorrelated. Define $e_{it} = v_{it} - \theta \bar{v}_i$, where $\theta$ is given in (14.10).
(i) Show that $E(e_{it}) = 0$.
(ii) Show that $\mathrm{Var}(e_{it}) = \sigma_u^2$, $t = 1, \ldots, T$.
(iii) Show that, for $t \neq s$, $\mathrm{Cov}(e_{it}, e_{is}) = 0$.
4. In order to determine the effects of collegiate athletic performance on applicants, you collect data on applications for a sample of Division I colleges for 1985, 1990, and 1995.
(i) What measures of athletic success would you include in an equation? What are some of the timing issues?
(ii) What other factors might you control for in the equation?
(iii) Write an equation that allows you to estimate the effects of athletic success on the percentage change in applications. How would you estimate this equation? Why would you choose this method?

5. Suppose that, for one semester, you can collect the following data on a random sample of college juniors and seniors for each class taken: a standardized final exam score, percentage of lectures attended, a dummy variable indicating whether the class is within the student's major, cumulative grade point average prior to the start of the semester, and SAT score.
(i) Why would you classify this data set as a cluster sample? Roughly, how many observations would you expect for the typical student?
(ii) Write a model, similar to equation (14.18), that explains final exam performance in terms of attendance and the other characteristics. Use $s$ to subscript student and $c$ to subscript class. Which variables do not change within a student?
(iii) If you pool all of the data and use OLS, what are you assuming about unobserved student characteristics that affect performance and attendance rate? What roles do SAT score and prior GPA play in this regard?
(iv) If you think SAT score and prior GPA do not adequately capture student ability, how would you estimate the effect of attendance on final exam performance?

6. Using the "cluster" option in the econometrics package Stata 11, the fully robust standard errors for the pooled OLS estimates in Table 14.2, that is, robust to serial correlation and heteroskedasticity in the composite errors, $\{v_{it}: t = 1, \ldots, T\}$, are obtained as se($\hat{\beta}_{educ}$) = .011, se($\hat{\beta}_{black}$) = .051, se($\hat{\beta}_{hispan}$) = .039, se($\hat{\beta}_{exper}$) = .020, se($\hat{\beta}_{exper^2}$) = .0010, se($\hat{\beta}_{married}$) = .026, and se($\hat{\beta}_{union}$) = .027.
(i) How do these standard errors generally compare with the nonrobust ones, and why?
(ii) How do the robust standard errors for pooled OLS compare with the standard errors for RE? Does it seem to matter whether the explanatory variable is time-constant or time-varying?
(iii) When the fully robust standard errors for the RE estimates are computed, Stata 11 reports the following (where we look at only the coefficients on the time-varying variables): se($\hat{\beta}_{exper}$) = .016, se($\hat{\beta}_{expersq}$) = .0008, se($\hat{\beta}_{married}$) = .019, and se($\hat{\beta}_{union}$) = .021. (These are robust to any kind of serial correlation or heteroskedasticity in the idiosyncratic errors, $\{u_{it}: t = 1, \ldots, T\}$, as well as heteroskedasticity in $a_i$.) How do the robust standard errors generally compare with the usual RE standard errors reported in Table 14.2? What conclusion might you draw?
(iv) Comparing the four standard errors in part (iii) with their pooled OLS counterparts, what do you make of the fact that the robust RE standard errors are all below the robust pooled OLS standard errors?

7. The data in CENSUS2000 is a random sample of individuals from the United States. Here, we are interested in estimating a simple regression model relating the log of weekly income, lweekinc, to schooling, educ. There are 29,501 observations. Associated with each individual is a state identifier (state) for the 50 states plus the District of Columbia. A less coarse geographic identifier is puma, which takes on 610 different values indicating geographic regions smaller than a state.
Running the simple regression of lweekinc on educ gives a slope coefficient equal to .1083 (to four decimal places). The heteroskedasticity-robust standard error is about .0024. The standard error clustered at the puma level is about .0027, and the standard error clustered at the state level is about .0033. For computing a confidence interval, which of these standard errors is the most reliable? Explain.

Computer Exercises

C1. Use the data in RENTAL for this exercise. The data on rental prices and other variables for college towns are for the years 1980 and 1990. The idea is to see whether a stronger presence of students affects rental rates. The unobserved effects model is
$$\log(rent_{it}) = \beta_0 + \delta_0 y90_t + \beta_1 \log(pop_{it}) + \beta_2 \log(avginc_{it}) + \beta_3 pctstu_{it} + a_i + u_{it},$$
where pop is city population, avginc is average income, and pctstu is student population as a percentage of city population (during the school year).
(i) Estimate the equation by pooled OLS and report the results in standard form. What do you make of the estimate on the 1990 dummy variable? What do you get for $\hat{\beta}_{pctstu}$?
(ii) Are the standard errors you report in part (i) valid? Explain.
(iii) Now, difference the equation and estimate by OLS. Compare your estimate of $\beta_{pctstu}$ with that from part (i). Does the relative size of the student population appear to affect rental prices?
(iv) Estimate the model by fixed effects to verify that you get identical estimates and standard errors to those in part (iii).

C2. Use CRIME4 for this exercise.
(i) Reestimate the unobserved effects model for crime in Example 13.9 but use fixed effects rather than differencing. Are there any notable sign or magnitude changes in the coefficients? What about statistical significance?
(ii) Add the logs of each wage variable in the data set and estimate the model by fixed effects. How does including these variables affect the coefficients on the criminal justice variables in part (i)?
(iii) Do the wage variables in part (ii) all have the expected sign? Explain. Are they jointly significant?

C3. For this exercise, we use JTRAIN to determine the effect of the job training grant on hours of job training per employee. The basic model for the three years is
$$hrsemp_{it} = \beta_0 + \delta_1 d88_t + \delta_2 d89_t + \beta_1 grant_{it} + \beta_2 grant_{i,t-1} + \beta_3 \log(employ_{it}) + a_i + u_{it}.$$
(i) Estimate the equation using fixed effects. How many firms are used in the FE estimation? How many total observations would be used if each firm had data on all variables (in particular, hrsemp) for all three years?
(ii) Interpret the coefficient on grant and comment on its significance.
(iii) Is it surprising that $grant_{-1}$ is insignificant? Explain.
(iv) Do larger firms provide their employees with more or less training, on average? How big are the differences? (For example, if a firm has 10% more employees, what is the change in average hours of training?)

C4. In Example 13.8, we used the unemployment claims data from Papke (1994) to estimate the effect of enterprise zones on unemployment claims. Papke also uses a model that allows each city to have its own time trend:
$$\log(uclms_{it}) = a_i + c_i t + \beta_1 ez_{it} + u_{it},$$
where $a_i$ and $c_i$ are both unobserved effects. This allows for more heterogeneity across cities.
(i) Show that, when the previous equation is first differenced, we obtain
$$\Delta\log(uclms_{it}) = c_i + \beta_1 \Delta ez_{it} + \Delta u_{it}, \quad t = 2, \ldots, T.$$
Notice that the differenced equation contains a fixed effect, $c_i$.
(ii) Estimate the differenced equation by fixed effects. What is the estimate of $\beta_1$? Is it very different from the estimate obtained in Example 13.8? Is the effect of enterprise zones still statistically significant?
(iii) Add a full set of year dummies to the estimation in part (ii). What happens to the estimate of $\beta_1$?

C5. (i) In the wage equation in Example 14.4, explain why dummy variables for occupation might be important omitted variables for estimating the union wage premium.
(ii) If every man in the sample stayed in the same occupation from 1981 through 1987, would you need to include the occupation dummies in a fixed effects estimation? Explain.
(iii) Using the data in WAGEPAN, include eight of the occupation dummy variables in the equation and estimate the equation using fixed effects. Does the coefficient on union change by much? What about its statistical significance?

C6. Add the interaction term $union_{it} \cdot t$ to the equation estimated in Table 14.2 to see if wage growth depends on union status. Estimate the equation by random and fixed effects and compare the results.

C7. Use the state-level data on murder rates and executions in MURDER for the following exercise.
(i) Consider the unobserved effects model
$$mrdrte_{it} = \eta_t + \beta_1 exec_{it} + \beta_2 unem_{it} + a_i + u_{it},$$
where $\eta_t$ simply denotes different year intercepts and $a_i$ is the unobserved state effect. If past executions of convicted murderers have a deterrent effect, what should be the sign of $\beta_1$? What sign do you think $\beta_2$ should have? Explain.
(ii) Using just the years 1990 and 1993, estimate the equation from part (i) by pooled OLS. Ignore the serial correlation problem in the composite errors. Do you find any evidence for a deterrent effect?
(iii) Now, using 1990 and 1993, estimate the equation by fixed effects. (You may use first differencing since you are only using two years of data.) Is there evidence of a deterrent effect? How strong?
(iv) Compute the heteroskedasticity-robust standard error for the estimation in part (ii).
(v) Find the state that has the largest number for the execution variable in 1993. (The variable exec is total executions in 1991, 1992, and 1993.) How much bigger is this value than the next highest value?
(vi) Estimate the equation using first differencing, dropping Texas from the analysis. Compute the usual and heteroskedasticity-robust standard errors. Now, what do you find? What is going on?
(vii) Use all three years of data and estimate the model by fixed effects. Include Texas in the analysis. Discuss the size and statistical significance of the deterrent effect compared with only using 1990 and 1993.

C8. Use the data in MATHPNL for this exercise. You will do a fixed effects version of the first differencing done in Computer Exercise 11 in Chapter 13. The model of interest is
$$math4_{it} = \delta_1 y94_t + \cdots + \delta_5 y98_t + \gamma_1 \log(rexpp_{it}) + \gamma_2 \log(rexpp_{i,t-1}) + c_1 \log(enrol_{it}) + c_2 lunch_{it} + a_i + u_{it},$$
where the first available year (the base year) is 1993 because of the lagged spending variable.
(i) Estimate the model by pooled OLS and report the usual standard errors. You should include an intercept along with the year dummies to allow $a_i$ to have a nonzero expected value. What are the estimated effects of the spending variables? Obtain the OLS residuals, $\hat{v}_{it}$.
(ii) Is the sign of the $lunch_{it}$ coefficient what you expected? Interpret the magnitude of the coefficient. Would you say that the district poverty rate has a big effect on test pass rates?
(iii) Compute a test for AR(1) serial correlation using the regression of $\hat{v}_{it}$ on $\hat{v}_{i,t-1}$. You should use the years 1994 through 1998 in the regression. Verify that there is strong positive serial correlation and discuss why.
(iv) Now, estimate the equation by fixed effects. Is the lagged spending variable still significant?
(v) Why do you think, in the fixed effects estimation, the enrollment and lunch program variables are jointly insignificant?
(vi) Define the total, or long-run, effect of spending as $\theta_1 = \gamma_1 + \gamma_2$. Use the substitution $\gamma_1 = \theta_1 - \gamma_2$ to obtain a standard error for $\hat{\theta}_1$. [Hint: Standard fixed effects estimation using $\log(rexpp_{it})$ and $z_{it} = \log(rexpp_{i,t-1}) - \log(rexpp_{it})$ as explanatory variables should do it.]

C9. The file PENSION contains information on participant-directed pension plans for U.S. workers. Some of the observations are for couples within the same family, so this data set constitutes a small cluster sample (with cluster sizes of two).
(i) Ignoring the clustering by family, use OLS to estimate the model
$$pctstck = \beta_0 + \beta_1 choice + \beta_2 prftshr + \beta_3 female + \beta_4 age + \beta_5 educ + \beta_6 finc25 + \beta_7 finc35 + \beta_8 finc50 + \beta_9 finc75 + \beta_{10} finc100 + \beta_{11} finc101 + \beta_{12} wealth89 + \beta_{13} stckin89 + \beta_{14} irain89 + u,$$
where the variables are defined in the data set. The variable of most interest is choice, which is a dummy variable equal to one if the worker has a choice in how to allocate pension funds among different investments. What is the estimated effect of choice? Is it statistically significant?
(ii) Are the income, wealth, stock holding, and IRA holding control variables important? Explain.
(iii) Determine how many different families there are in the data set.
(iv) Now, obtain the standard errors for OLS that are robust to cluster correlation within a family. Do they differ much from the usual OLS standard errors? Are you surprised?
(v) Estimate the equation by differencing across only the spouses within a family. Why do the explanatory variables asked about in part (ii) drop out in the first-differenced estimation?
(vi) Are any of the remaining explanatory variables in part (v) significant? Are you surprised?

C10. Use the data in AIRFARE for this exercise. We are interested in estimating the model
$$\log(fare_{it}) = \eta_t + \beta_1 concen_{it} + \beta_2 \log(dist_i) + \beta_3 [\log(dist_i)]^2 + a_i + u_{it}, \quad t = 1, \ldots, 4,$$
where $\eta_t$ means that we allow for different year intercepts.
(i) Estimate the above equation by pooled OLS, being sure to include year dummies. If $\Delta concen = .10$, what is the estimated percentage increase in fare?
(ii) What is the usual OLS 95% confidence interval for $\beta_1$? Why is it probably not reliable? If you have access to a statistical package that computes fully robust standard errors, find the fully robust 95% CI for $\beta_1$. Compare it to the usual CI and comment.
(iii) Describe what is happening with the quadratic in log(dist). In particular, for what value of dist does the relationship between log(fare) and dist become positive? [Hint: First figure out the turning point value for log(dist), and then exponentiate.] Is the turning point outside the range of the data?
(iv) Now, estimate the equation using random effects. How does the estimate of $\beta_1$ change?
(v) Now, estimate the equation using fixed effects. What is the FE estimate of $\beta_1$? Why is it fairly similar to the RE estimate? (Hint: What is $\hat{\theta}$ for RE estimation?)
(vi) Name two characteristics of a route (other than distance between stops) that are captured by $a_i$. Might these be correlated with $concen_{it}$?
(vii) Are you convinced that higher concentration on a route increases airfares? What is your best estimate?

C11. This question assumes that you have access to a statistical package that computes standard errors robust to arbitrary serial correlation and heteroskedasticity for panel data methods.
(i) For the pooled OLS estimates in Table 14.1, obtain the standard errors that allow for arbitrary serial correlation (in the composite errors, $v_{it} = a_i + u_{it}$) and heteroskedasticity. How do the robust standard errors for educ, married, and union compare with the nonrobust ones?
(ii) Now, obtain the robust standard errors for the fixed effects estimates that allow arbitrary serial correlation and heteroskedasticity in the idiosyncratic errors, $u_{it}$. How do these compare with the nonrobust FE standard errors?
(iii) For which method, pooled OLS or FE, is adjusting the standard errors for serial correlation more important? Why?

C12. Use the data in ELEM94_95 to answer this question. The data are on elementary schools in Michigan. In this exercise, we view the data as a cluster sample, where each school is part of a district "cluster."
(i) What are the smallest and largest number of schools in a district? What is the average number of schools per district?
(ii) Using pooled OLS (that is, pooling across all 1,848 schools), estimate a model relating lavgsal to bs, lenrol, lstaff, and lunch (see also Computer Exercise 11 from Chapter 9). What are the coefficient and standard error on bs?
(iii) Obtain the standard errors that are robust to cluster correlation within district (and also heteroskedasticity). What happens to the t statistic for bs?
(iv) Still using pooled OLS, drop the four observations with bs > .5 and obtain $\hat{\beta}_{bs}$ and its cluster-robust standard error. Now, is there much evidence for a salary-benefits tradeoff?
(v) Estimate the equation by fixed effects, allowing for a common district effect for schools within a district. Again, drop the observations with bs > .5. Now, what do you conclude about the salary-benefits tradeoff?
(vi) In light of your estimates from parts (iv) and (v), discuss the importance of allowing teacher compensation to vary systematically across districts via a district fixed effect.

C13. The data set DRIVING includes state-level panel data for the 48 continental U.S. states from 1980 through 2004, for a total of 25 years. Various driving laws are indicated in the data set, including the alcohol level at which drivers are considered legally intoxicated.
There are also indicators for per se laws, where licenses can be revoked without a trial, and seat belt laws. Some economics and demographic variables are also included.
(i) How is the variable totfatrte defined? What is the average of this variable in the years 1980, 1992, and 2004? Run a regression of totfatrte on dummy variables for the years 1981 through 2004, and describe what you find. Did driving become safer over this period? Explain.
(ii) Add the variables bac08, bac10, perse, sbprim, sbsecon, sl70plus, gdl, perc14_24, unem, and vehicmilespc to the regression from part (i). Interpret the coefficients on bac08 and bac10. Do per se laws have a negative effect on the fatality rate? What about having a primary seat belt law? (Note that if a law was enacted sometime within a year, the fraction of the year is recorded in place of the zero-one indicator.)
(iii) Reestimate the model from part (ii) using fixed effects (at the state level). How do the coefficients on bac08, bac10, perse, and sbprim compare with the pooled OLS estimates? Which set of estimates do you think is more reliable?
(iv) Suppose that vehicmilespc, the number of miles driven per capita, increases by 1,000. Using the FE estimates, what is the estimated effect on totfatrte? Be sure to interpret the estimate as if explaining to a layperson.
(v) If there is serial correlation or heteroskedasticity in the idiosyncratic errors of the model, then the standard errors in part (iii) are invalid. If possible, use cluster-robust standard errors for the fixed effects estimates. What happens to the statistical significance of the policy variables in part (iii)?

C14. Use the data set in AIRFARE to answer this question. The estimates can be compared with those in Computer Exercise 10 in this chapter.
(i) Compute the time averages of the variable concen; call these concenbar. How many different time averages can there be? Report the smallest and the largest.
(ii) Estimate the equation
$$lfare_{it} = \beta_0 + \delta_1 y98_t + \delta_2 y99_t + \delta_3 y00_t + \beta_1 concen_{it} + \beta_2 ldist_i + \beta_3 ldistsq_i + \gamma_1 concenbar_i + a_i + u_{it}$$
by random effects. Verify that $\hat{\beta}_1$ is identical to the FE estimate computed in C10.
(iii) If you drop ldist and ldistsq from the estimation in part (ii), but still include $concenbar_i$, what happens to the estimate of $\beta_1$? What happens to the estimate of $\gamma_1$?
(iv) Using the equation in part (ii) and the usual RE standard error, test $H_0: \gamma_1 = 0$ against the two-sided alternative. Report the p-value. What do you conclude about RE versus FE for estimating $\beta_1$ in this application?
(v) If possible, for the test in part (iv), obtain a t statistic (and, therefore, p-value) that is robust to arbitrary serial correlation and heteroskedasticity. Does this change the conclusion reached in part (iv)?

C15. Use the data in COUNTYMURDERS to answer this question. The data set covers murders and executions (capital punishment) for 2,197 counties in the United States. See also Computer Exercise C16 in Chapter 13.
(i) Consider the model
$$murdrate_{it} = \theta_t + \delta_0 execs_{it} + \delta_1 execs_{i,t-1} + \delta_2 execs_{i,t-2} + \delta_3 execs_{i,t-3} + \beta_5 percblack_{it} + \beta_6 percmale_{it} + \beta_7 perc1019_{it} + \beta_8 perc2029_{it} + a_i + u_{it},$$
where $\theta_t$ represents a different intercept for each time period, $a_i$ is the county fixed effect, and $u_{it}$ is the idiosyncratic error. Why does it make sense to include lags of the key variable, execs, in the equation?
(ii) Apply OLS to the equation from part (i) and report the estimates of $\delta_0$, $\delta_1$, $\delta_2$, and $\delta_3$, along with the usual pooled OLS standard errors. Do you estimate that executions have a deterrent effect on murders? Provide an explanation that involves $a_i$.
(iii) Now, estimate the equation in part (i) using fixed effects to remove $a_i$. What are the new estimates of the $\delta_j$? Are they very different from the estimates from part (ii)?
(iv) Obtain the long-run propensity from the estimates in part (iii). Using the usual FE standard errors, is the LRP statistically different from zero?
(v) If possible, obtain the standard errors for the FE estimates that are robust to arbitrary heteroskedasticity and serial correlation in the $\{u_{it}\}$. What happens to the statistical significance of the $\hat{\delta}_j$? What about the estimated LRP?

Appendix 14A

14A.1 Assumptions for Fixed and Random Effects

In this appendix, we provide statements of the assumptions for fixed and random effects estimation. We also provide a discussion of the properties of the estimators under different sets of assumptions. Verification of these claims is somewhat involved, but can be found in Wooldridge (2010, Chapter 10).

Assumption FE.1: For each $i$, the model is
$$y_{it} = \beta_1 x_{it1} + \cdots + \beta_k x_{itk} + a_i + u_{it}, \quad t = 1, \ldots, T,$$
where the $\beta_j$ are the parameters to estimate and $a_i$ is the unobserved effect.

Assumption FE.2: We have a random sample from the cross section.

Assumption FE.3: Each explanatory variable changes over time (for at least some $i$), and no perfect linear relationships exist among the explanatory variables.

Assumption FE.4: For each $t$, the expected value of the idiosyncratic error given the explanatory variables in all time periods and the unobserved effect is zero: $E(u_{it} \mid X_i, a_i) = 0$.

Under these first four assumptions, which are identical to the assumptions for the first-differencing estimator, the fixed effects estimator is unbiased. Again, the key is the strict exogeneity assumption, FE.4. Under these same assumptions, the FE estimator is consistent with a fixed $T$ as $N \to \infty$.

Assumption FE.5: $\mathrm{Var}(u_{it} \mid X_i, a_i) = \mathrm{Var}(u_{it}) = \sigma_u^2$, for all $t = 1, \ldots, T$.

Assumption FE.6: For all $t \neq s$, the idiosyncratic errors are uncorrelated (conditional on all explanatory variables and $a_i$): $\mathrm{Cov}(u_{it}, u_{is} \mid X_i, a_i) = 0$.

Under Assumptions FE.1 through FE.6, the fixed effects estimator of the $\beta_j$ is the best linear unbiased estimator. Since the FD estimator is linear and unbiased, it is necessarily worse than the FE estimator. The assumption that makes FE better than FD is FE.6, which implies that the idiosyncratic errors are serially uncorrelated.

Assumption FE.7: Conditional on $X_i$ and $a_i$, the $u_{it}$ are independent and identically distributed as Normal$(0, \sigma_u^2)$.

Assumption FE.7 implies FE.4, FE.5, and FE.6, but it is stronger because it assumes a normal distribution for the idiosyncratic errors. If we add FE.7, the FE estimator is normally distributed, and $t$ and $F$ statistics have exact $t$ and $F$ distributions. Without FE.7, we can rely on asymptotic approximations; but, without making special assumptions, these approximations require large $N$ and small $T$.
The ideal random effects assumptions include FE.1, FE.2, FE.4, FE.5, and FE.6. (FE.7 could be added, but it gains us little in practice because we have to estimate $\theta$.) Because we are only subtracting a fraction of the time averages, we can now allow time-constant explanatory variables. So, FE.3 is replaced with the following assumption:

Assumption RE.1: There are no perfect linear relationships among the explanatory variables.

The cost of allowing time-constant regressors is that we must add assumptions about how the unobserved effect, $a_i$, is related to the explanatory variables.

Assumption RE.2: In addition to FE.4, the expected value of $a_i$ given all explanatory variables is constant: $E(a_i \mid X_i) = \beta_0$.

This is the assumption that rules out correlation between the unobserved effect and the explanatory variables, and it is the key distinction between fixed effects and random effects. Because we are assuming $a_i$ is uncorrelated with all elements of $x_{it}$, we can include time-constant explanatory variables. (Technically, the quasi-time-demeaning only removes a fraction of the time average, and not the whole time average.) We allow for a nonzero expectation for $a_i$ in stating Assumption RE.2, so that the model under the random effects assumptions contains an intercept, $\beta_0$, as in equation (14.7). Remember, we would typically include a set of time-period intercepts, too, with the first year acting as the base year.

We also need to impose homoskedasticity on $a_i$, as follows:

Assumption RE.3: In addition to FE.5, the variance of $a_i$ given all explanatory variables is constant: $\mathrm{Var}(a_i \mid X_i) = \sigma_a^2$.

Under the six random effects assumptions (FE.1, FE.2, RE.1, RE.2, RE.3, and FE.6), the RE estimator is consistent and asymptotically normally distributed as $N$ gets large for fixed $T$. Actually, consistency and asymptotic normality follow under the first four assumptions, but without the last two assumptions the usual RE standard errors and test statistics would not be valid. In addition, under the six RE assumptions, the RE estimators are asymptotically efficient. This means that, in large samples, the RE estimators will have smaller standard errors than the corresponding pooled OLS estimators (when the proper, robust standard errors are used for pooled OLS). For coefficients on time-varying explanatory variables (the only ones estimable by FE), the RE estimator is more efficient than the FE estimator, often much more efficient. But FE is not meant to be efficient under the RE assumptions; FE is intended to be robust to correlation between $a_i$ and the $x_{itj}$. As often happens in econometrics, there is a tradeoff between robustness and efficiency. See Wooldridge (2010, Chapter 10) for verification of the claims made here.
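The efficiency claim can be checked by simulation. Below is a minimal sketch (the data-generating process, sample sizes, and variance values are all invented for illustration) that draws repeated balanced panels satisfying the RE assumptions and compares the sampling spread of the within (FE) estimator with that of the GLS (RE) estimator, using the true $\theta$ from (14.10) for simplicity rather than an estimated one:

```python
import numpy as np

rng = np.random.default_rng(2)
N, T, reps = 200, 4, 500
beta, sig_a, sig_u = 1.0, 1.0, 1.0
theta = 1.0 - np.sqrt(sig_u**2 / (sig_u**2 + T * sig_a**2))   # eq. (14.10)

fe_draws, re_draws = [], []
for _ in range(reps):
    # x has a persistent unit component, so there is real between variation
    x = rng.normal(size=(N, 1)) + rng.normal(size=(N, T))
    a = sig_a * rng.normal(size=(N, 1))            # a_i independent of x: RE holds
    y = beta * x + a + sig_u * rng.normal(size=(N, T))

    xd = x - x.mean(axis=1, keepdims=True)         # FE: full time-demeaning
    yd = y - y.mean(axis=1, keepdims=True)
    fe_draws.append((xd * yd).sum() / (xd**2).sum())

    xq = x - theta * x.mean(axis=1, keepdims=True) # RE: quasi-demeaning
    yq = y - theta * y.mean(axis=1, keepdims=True)
    xq, yq = xq - xq.mean(), yq - yq.mean()        # absorb the intercept
    re_draws.append((xq * yq).sum() / (xq**2).sum())

print(np.std(fe_draws), np.std(re_draws))          # RE spread is visibly smaller
```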
14A.2 Inference Robust to Serial Correlation and Heteroskedasticity for Fixed Effects and Random Effects

One of the key assumptions for performing inference using the FE, RE, and even the CRE approach to panel data models is the assumption of no serial correlation in the idiosyncratic errors, $\{u_{it}: t = 1, \ldots, T\}$ (see Assumption FE.6). Of course, heteroskedasticity can also be an issue, but this, too, is ruled out for standard inference (see Assumption FE.5). As discussed in the appendix to Chapter 13, the same issues can arise with first-differencing estimation when we have $T \geq 3$ time periods.

Fortunately, as with FD estimation, there are now simple solutions for fully robust inference: inference that is robust to arbitrary violations of Assumptions FE.5 and FE.6 and, when applying the RE or CRE approaches, to violations of the corresponding homoskedasticity and no-serial-correlation assumptions. As with FD estimation, the general approach to obtaining fully robust standard errors and test statistics is known as clustering. Now, however, the clustering is applied to a different equation. For example, for FE estimation, the clustering is applied to the time-demeaned equation (14.5). For RE estimation, the clustering gets applied to the quasi-time-demeaned equation (14.11). A similar comment holds for CRE, but there the time averages are included as separate explanatory variables.

The details, which can be found in Wooldridge (2010, Chapter 10), are too advanced for this text. But understanding the purpose of clustering is not: if possible, we should compute standard errors, confidence intervals, and test statistics that are valid in large cross sections under the weakest set of assumptions. The FE estimator requires only Assumptions FE.1 to FE.4 for unbiasedness and consistency (as $N \to \infty$ with $T$ fixed). Thus, a careful researcher at least checks whether inference made robust to serial correlation and heteroskedasticity in the errors affects inference. Experience shows that it often does.

Applying cluster-robust inference to account for serial correlation within a panel data context is easily justified when $N$ is substantially larger than $T$. Under certain restrictions on the time series dependence of the sort discussed in Chapter 11, cluster-robust inference for the fixed effects estimator can be justified when $T$ is of a similar magnitude as $N$, provided both are not small; this follows from the work by Hansen (2007). Generally, clustering is not theoretically justified when $N$ is small and $T$ is large.

Computing the cluster-robust statistics after FE or RE estimation is simple in many econometrics packages, often only requiring an option of the form "cluster(id)" appended to the end of the FE and RE estimation commands, where, as in the FD case, id refers to a cross-section identifier. Similar comments hold when applying FE or RE to cluster samples, with the cluster identifier playing the role of id.
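For instance, with the linearmodels package (a sketch under the assumption that the package is available and that df is a long-format DataFrame indexed by entity and time; all names are hypothetical), fully robust FE inference is one option away:

```python
from linearmodels.panel import PanelOLS

# FE estimation with standard errors clustered by cross-sectional unit,
# robust to arbitrary serial correlation and heteroskedasticity in u_it
fe = PanelOLS.from_formula('y ~ x + EntityEffects', data=df).fit(
    cov_type='clustered', cluster_entity=True)
print(fe.std_errors)
```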
Chapter 15: Instrumental Variables Estimation and Two Stage Least Squares

In this chapter, we further study the problem of endogenous explanatory variables in multiple regression models. In Chapter 3, we derived the bias in the OLS estimators when an important variable is omitted; in Chapter 5, we showed that OLS is generally inconsistent under omitted variables. Chapter 9 demonstrated that omitted variables bias can be eliminated (or at least mitigated) when a suitable proxy variable is given for an unobserved explanatory variable. Unfortunately, suitable proxy variables are not always available.

In the previous two chapters, we explained how fixed effects estimation or first differencing can be used with panel data to estimate the effects of time-varying independent variables in the presence of time-constant omitted variables. Although such methods are very useful, we do not always have access to panel data. Even if we can obtain panel data, it does us little good if we are interested in the effect of a variable that does not change over time: first differencing or fixed effects estimation eliminates time-constant explanatory variables. In addition, the panel data methods that we have studied so far do not solve the problem of time-varying omitted variables that are correlated with the explanatory variables.

In this chapter, we take a different approach to the endogeneity problem. You will see how the method of instrumental variables (IV) can be used to solve the problem of endogeneity of one or more explanatory variables. The method of two stage least squares (2SLS or TSLS) is second in popularity only to ordinary least squares for estimating linear equations in applied econometrics.

We begin by showing how IV methods can be used to obtain consistent estimators in the presence of omitted variables. IV can also be used to solve the errors-in-variables problem, at least under certain assumptions. The next chapter will demonstrate how to estimate simultaneous equations models using IV methods.

Our treatment of instrumental variables estimation closely follows our development of ordinary least squares in Part 1, where we assumed that we had a random sample from an underlying population. This is a desirable starting point because, in addition to simplifying the notation, it emphasizes that the important assumptions for IV estimation are stated in terms of the underlying population (just as with OLS). As we showed in Part 2, OLS can be applied to time series data, and the same is true of instrumental variables methods. Section 15.7 discusses some special issues that arise when IV methods are applied to time series data. In Section 15.8, we cover applications to pooled cross sections and panel data.

15.1 Motivation: Omitted Variables in a Simple Regression Model

When faced with the prospect of omitted variables bias (or unobserved heterogeneity), we have so far discussed three options: (1) we can ignore the problem and suffer the consequences of biased and inconsistent estimators; (2) we can try to find and use a suitable proxy variable for the unobserved variable; or (3) we can assume that the omitted variable does not change over time and use the fixed effects or first-differencing methods from Chapters 13 and 14.
be satisfactory if the estimates are coupled with the direction of the biases for the key parameters. For example, if we can say that the estimator of a positive parameter, say, the effect of job training on subsequent wages, is biased toward zero, and we have found a statistically significant positive estimate, we have still learned something: job training has a positive effect on wages, and it is likely that we have underestimated the effect. Unfortunately, the opposite case, where our estimates may be too large in magnitude, often occurs, which makes it very difficult for us to draw any useful conclusions.

The proxy variable solution discussed in Section 9.2 can also produce satisfying results, but it is not always possible to find a good proxy. This approach attempts to solve the omitted variable problem by replacing the unobservable with one or more proxy variables.

Another approach leaves the unobserved variable in the error term, but rather than estimating the model by OLS, it uses an estimation method that recognizes the presence of the omitted variable. This is what the method of instrumental variables does.

For illustration, consider the problem of unobserved ability in a wage equation for working adults. A simple model is

log(wage) = β0 + β1educ + β2abil + e,

where e is the error term. In Chapter 9, we showed how, under certain assumptions, a proxy variable such as IQ can be substituted for ability, and then a consistent estimator of β1 is available from the regression of log(wage) on educ, IQ. Suppose, however, that a proxy variable is not available (or does not have the properties needed to produce a consistent estimator of β1). Then, we put abil into the error term, and we are left with the simple regression model

log(wage) = β0 + β1educ + u,    (15.1)

where u contains abil. Of course, if equation (15.1) is estimated by OLS, a biased and inconsistent estimator of β1 results if educ and abil are correlated.

It turns out that we can still use equation (15.1) as the basis for estimation, provided we can find an instrumental variable for educ. To describe this approach, the simple regression model is written as

y = β0 + β1x + u,    (15.2)

where we think that x and u are correlated (have nonzero covariance):

Cov(x, u) ≠ 0.    (15.3)

The method of instrumental variables works whether or not x and u are correlated, but, for reasons we will see later, OLS should be used if x is uncorrelated with u.

In order to obtain consistent estimators of β0 and β1 when x and u are correlated, we need some additional information. The information comes by way of a new variable that satisfies certain properties. Suppose that we have an observable variable z that satisfies these two assumptions: (1) z is uncorrelated with u, that is,

Cov(z, u) = 0;    (15.4)

(2) z is correlated with x, that is,

Cov(z, x) ≠ 0.    (15.5)

Then, we call z an instrumental variable for x, or sometimes simply an instrument for x.

The requirement that the instrument z satisfies (15.4) is summarized by saying "z is exogenous in equation (15.2)," and so we often refer to (15.4) as instrument exogeneity. In the context of omitted
variables, instrument exogeneity means that z should have no partial effect on y (after x and omitted variables have been controlled for), and z should be uncorrelated with the omitted variables. Equation (15.5) means that z must be related, either positively or negatively, to the endogenous explanatory variable x. This condition is sometimes referred to as instrument relevance (as in "z is relevant for explaining variation in x").

There is a very important difference between the two requirements for an instrumental variable. Because (15.4) involves the covariance between z and the unobserved error u, we cannot generally hope to test this assumption: in the vast majority of cases, we must maintain Cov(z, u) = 0 by appealing to economic behavior or introspection. (In unusual cases, we might have an observable proxy variable for some factor contained in u, in which case we can check to see if z and the proxy variable are roughly uncorrelated. Of course, if we have a good proxy for an important element of u, we might just add the proxy as an explanatory variable and estimate the expanded equation by ordinary least squares. See Section 9.2.)

By contrast, the condition that z is correlated with x (in the population) can be tested, given a random sample from the population. The easiest way to do this is to estimate a simple regression between x and z. In the population, we have

x = π0 + π1z + v.    (15.6)

Then, because π1 = Cov(z, x)/Var(z), assumption (15.5) holds if, and only if, π1 ≠ 0. Thus, we should be able to reject the null hypothesis

H0: π1 = 0    (15.7)

against the two-sided alternative H1: π1 ≠ 0, at a sufficiently small significance level (say, 5% or 1%). If this is the case, then we can be fairly confident that (15.5) holds.
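The relevance check in (15.6)–(15.7) takes only a few lines in any regression package. Purely as an illustration, here is a minimal sketch in Python; the simulated data-generating process and every numeric value below are our own assumptions, not from the text:

```python
import numpy as np
import statsmodels.api as sm

# Simulated population satisfying (15.4) and (15.5): z is independent of u
# (instrument exogeneity) but helps determine x (instrument relevance);
# x is endogenous because it also depends on u.
rng = np.random.default_rng(42)
n = 1000
z = rng.normal(size=n)
u = rng.normal(size=n)
x = 0.5 * z + 0.8 * u + rng.normal(size=n)
y = 1.0 + 2.0 * x + u                       # true beta1 = 2

# First-stage regression (15.6): regress x on z and test H0: pi1 = 0.
first_stage = sm.OLS(x, sm.add_constant(z)).fit()
print(first_stage.params)      # [pi0_hat, pi1_hat]; pi1_hat should be near 0.5
print(first_stage.tvalues[1])  # t statistic for the hypothesis in (15.7)
```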
For the log(wage) equation in (15.1), an instrumental variable z for educ must be (1) uncorrelated with ability (and any other unobserved factors affecting wage) and (2) correlated with education. Something such as the last digit of an individual's Social Security Number almost certainly satisfies the first requirement: it is uncorrelated with ability because it is determined randomly. However, it is precisely because of the randomness of the last digit of the SSN that it is not correlated with education, either; therefore, it makes a poor instrumental variable for educ because it violates the instrument relevance requirement in equation (15.5).

What we have called a proxy variable for the omitted variable makes a poor IV, for the opposite reason. For example, in the log(wage) example with omitted ability, a proxy variable for abil should be as highly correlated as possible with abil. An instrumental variable must be uncorrelated with abil. Therefore, while IQ is a good candidate as a proxy variable for abil, it is not a good instrumental variable for educ because it violates the instrument exogeneity requirement in equation (15.4).

Whether other possible instrumental variable candidates satisfy the exogeneity requirement in (15.4) is less clear-cut. In wage equations, labor economists have used family background variables as IVs for education. For example, mother's education (motheduc) is positively correlated with child's education, as can be seen by collecting a sample of data on working people and running a simple regression of educ on motheduc. Therefore, motheduc satisfies equation (15.5). The problem is that mother's education might also be correlated with child's ability (through mother's ability and, perhaps, quality of nurturing at an early age), in which case (15.4) fails.

Another IV choice for educ in (15.1) is number of siblings while growing up (sibs). Typically, having more siblings is associated with lower average levels of education. Thus, if number of siblings is uncorrelated with ability, it can act as an instrumental variable for educ.

As a second example, consider the problem of estimating the causal effect of skipping classes on final exam score. In a simple regression framework, we have

score = β0 + β1skipped + u,    (15.8)

where score is the final exam score and skipped is the total number of lectures missed during the semester. We certainly might be worried that skipped is correlated with other factors in u: more able, highly motivated students might miss fewer classes. Thus, a simple regression of score on skipped may not give us a good estimate of the causal effect of missing classes.

What might be a good IV for skipped? We need something that has no direct effect on score and is not correlated with student ability and motivation. At the same time, the IV must be correlated with skipped. One option is to use distance between living quarters and campus. Some students at a large university will commute to campus, which may increase the likelihood of missing lectures (due to bad weather, oversleeping, and so on). Thus, skipped may be positively correlated with distance; this can be checked by regressing skipped on distance and doing a t test, as described earlier.

Is distance uncorrelated with u? In the simple regression model (15.8), some factors in u may be correlated with distance. For example, students from low-income families may live off campus; if income affects student performance, this could cause distance to be correlated with u. Section 15.2 shows how to use IV in the context of multiple regression, so that other factors affecting score can be included directly in the model. Then, distance might be a good IV for skipped. An IV approach may not be necessary at all if a good proxy exists for student ability, such as cumulative GPA prior to the semester.

There is a final point worth emphasizing before we turn to the mechanics of IV estimation: namely, in using the simple regression in equation (15.6) to test (15.7), it is important to take note of the sign (and even magnitude) of π̂1, and not just its statistical significance. Arguments for why a variable z makes a good IV candidate for an endogenous explanatory variable x should include a discussion about the nature of the relationship between x and z. For example, due to genetics and background influences, it makes sense that child's education (x) and mother's education (z) are positively correlated. If, in your sample of data, you find that they are actually negatively correlated (that is, π̂1 < 0), then your use of mother's education as an IV for child's education is likely to be unconvincing. And this has nothing to do with whether condition (15.4) is likely to hold. In the example of measuring whether skipping classes has an effect on test performance, one should find a positive, statistically significant relationship between skipped and distance in order to justify using distance as an IV for skipped; a negative relationship would be difficult to justify and would suggest that there are important omitted variables driving a negative
correlation: variables that might themselves have to be included in the model (15.8).

We now demonstrate that the availability of an instrumental variable can be used to estimate consistently the parameters in equation (15.2). In particular, we show that assumptions (15.4) and (15.5) serve to identify the parameter β1. Identification of a parameter in this context means that we can write β1 in terms of population moments that can be estimated using a sample of data. To write β1 in terms of population covariances, we use equation (15.2): the covariance between z and y is

Cov(z, y) = β1Cov(z, x) + Cov(z, u).

Now, under assumption (15.4), Cov(z, u) = 0, and under assumption (15.5), Cov(z, x) ≠ 0. Thus, we can solve for β1 as

β1 = Cov(z, y)/Cov(z, x).    (15.9)

(Notice how this simple algebra fails if z and x are uncorrelated, that is, if Cov(z, x) = 0.) Equation (15.9) shows that β1 is the population covariance between z and y, divided by the population covariance between z and x, which shows that β1 is identified. Given a random sample, we estimate the population quantities by the sample analogs. After canceling the sample sizes in the numerator and denominator, we get the instrumental variables (IV) estimator of β1:

β̂1 = Σi (zi − z̄)(yi − ȳ) / Σi (zi − z̄)(xi − x̄).    (15.10)

Given a sample of data on x, y, and z, it is simple to obtain the IV estimator in (15.10). The IV estimator of β0 is simply β̂0 = ȳ − β̂1x̄, which looks just like the OLS intercept estimator except that the slope estimator, β̂1, is now the IV estimator. It is no accident that when z = x we obtain the OLS estimator of β1. In other words, when x is exogenous, it can be used as its own IV, and the IV estimator is then identical to the OLS estimator.

A simple application of the law of large numbers shows that the IV estimator is consistent for β1: plim(β̂1) = β1, provided assumptions (15.4) and (15.5) are satisfied. If either assumption fails, the IV estimators are not consistent (more on this later). One feature of the IV estimator is that, when x and u are in fact correlated, so that instrumental variables estimation is actually needed, it is essentially never unbiased. This means that, in small samples, the IV estimator can have a substantial bias, which is one reason why large samples are preferred.
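The sample-analog formula in (15.10) is easy to compute directly. As a hypothetical illustration (the data-generating process below is our own, not from the text), a short Python sketch:

```python
import numpy as np

def iv_estimates(y, x, z):
    """IV slope from equation (15.10), plus the intercept ybar - b1*xbar."""
    zd = z - z.mean()
    b1 = np.sum(zd * (y - y.mean())) / np.sum(zd * (x - x.mean()))
    return y.mean() - b1 * x.mean(), b1

rng = np.random.default_rng(42)
n = 1000
z = rng.normal(size=n)
u = rng.normal(size=n)
x = 0.5 * z + 0.8 * u + rng.normal(size=n)   # endogenous: Cov(x, u) > 0
y = 1.0 + 2.0 * x + u                        # true beta0 = 1, beta1 = 2

print(iv_estimates(y, x, z))                 # IV estimates: close to (1, 2)
xd = x - x.mean()
print(np.sum(xd * (y - y.mean())) / np.sum(xd * xd))  # OLS slope: biased upward
```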
When discussing the application of instrumental variables, it is important to be careful with language. Like OLS, IV is an estimation method. It makes little sense to refer to "an instrumental variables model," just as the phrase "OLS model" makes little sense. As we know, a model is an equation, such as (15.8), which is a special case of the generic model in equation (15.2). When we have a model such as (15.2), we can choose to estimate the parameters of that model in many different ways. Prior to this chapter, we focused primarily on OLS, but, for example, we also know from Chapter 8 that one can use weighted least squares as an alternative estimation method (and there are unlimited possibilities for the weights). If we have an instrumental variable candidate z for x, then we can instead apply instrumental variables estimation. It is certainly true that the estimation method we apply is motivated by the model and assumptions we make about that model. But the estimators are well defined and exist apart from any underlying model or assumptions: remember, an estimator is simply a rule for combining data. The bottom line is that, while we probably know what a researcher means when using a phrase such as "I estimated an IV model," such language betrays a lack of understanding about the difference between a model and an estimation method.

15.1a Statistical Inference with the IV Estimator

Given the similar structure of the IV and OLS estimators, it is not surprising that the IV estimator has an approximate normal distribution in large sample sizes. To perform inference on β1, we need a standard error that can be used to compute t statistics and confidence intervals. The usual approach is to impose a homoskedasticity assumption, just as in the case of OLS. Now, the homoskedasticity assumption is stated conditional on the instrumental variable, z, not the endogenous explanatory variable, x. Along with the previous assumptions on u, x, and z, we add

E(u²|z) = σ² = Var(u).    (15.11)

It can be shown that, under (15.4), (15.5), and (15.11), the asymptotic variance of β̂1 is

σ²/(n·σx²·ρ²x,z),    (15.12)

where σx² is the population variance of x, σ² is the population variance of u, and ρ²x,z is the square of the population correlation between x and z. (This correlation tells us how highly correlated x and z are in the population.) As with the OLS estimator, the asymptotic variance of the IV estimator decreases to zero at the rate of 1/n, where n is the sample size.

Equation (15.12) is interesting for two reasons. First, it provides a way to obtain a standard error for the IV estimator. All quantities in (15.12) can be consistently estimated given a random sample. To estimate σx², we simply compute the sample variance of xi; to estimate ρ²x,z, we can run the regression of xi on zi to obtain the R-squared, say, R²x,z. Finally, to estimate σ², we can use the IV residuals,

ûi = yi − β̂0 − β̂1xi,  i = 1, 2, …, n,

where β̂0 and β̂1 are the IV estimates. A consistent estimator of σ² looks just like the estimator of σ² from a simple OLS regression:

σ̂² = [1/(n − 2)] Σi ûi²,

where it is standard to use the degrees of freedom correction (even though this has little effect as the sample size grows). The asymptotic standard error of β̂1 is the square root of the estimated asymptotic variance, the latter of which is given by

σ̂²/(SSTx·R²x,z),    (15.13)

where SSTx is the total sum of squares of the xi. (Recall that the sample variance of xi is SSTx/n, and so the sample sizes cancel to give us (15.13).) The resulting standard error can be used to construct either t statistics for hypotheses involving β1 or confidence intervals for β1. β̂0 also has a standard error that we do not present here. Any modern econometrics package computes the standard error after any IV estimation; there is rarely any reason to perform the calculations by hand.
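Although, as just noted, there is rarely a reason to do this by hand, spelling out (15.13) in a few lines makes each ingredient concrete. A sketch under the homoskedasticity assumption (15.11); the function name and variables are ours:

```python
import numpy as np

def iv_standard_error(y, x, z):
    """Asymptotic standard error of the IV slope, equation (15.13)."""
    zd, xd = z - z.mean(), x - x.mean()
    b1 = np.sum(zd * (y - y.mean())) / np.sum(zd * xd)
    b0 = y.mean() - b1 * x.mean()
    uhat = y - b0 - b1 * x                      # IV residuals
    sigma2 = np.sum(uhat ** 2) / (y.size - 2)   # sigma2-hat, with df correction
    sst_x = np.sum(xd ** 2)                     # total sum of squares of x
    r2_xz = np.corrcoef(x, z)[0, 1] ** 2        # R-squared from regressing x on z
    return np.sqrt(sigma2 / (sst_x * r2_xz))
```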
A second reason equation (15.12) is interesting is that it allows us to compare the asymptotic variances of the IV and the OLS estimators (when x and u are uncorrelated). Under the Gauss-Markov assumptions, the variance of the OLS estimator is σ²/SSTx, while the comparable formula for the IV estimator is σ²/(SSTx·R²x,z); they differ only in that R²x,z appears in the denominator of the IV variance. Because an R-squared is always less than one, the IV variance is always larger than the OLS variance (when OLS is valid). If R²x,z is small, then the IV variance can be much larger than the OLS variance. Remember, R²x,z measures the strength of the linear relationship between x and z in the sample. If x and z are only slightly correlated, R²x,z can be small, and this can translate into a very large sampling variance for the IV estimator. The more highly correlated z is with x, the closer R²x,z is to one, and the smaller is the variance of the IV estimator. In the case that z = x, R²x,z = 1, and we get the OLS variance, as expected.

The previous discussion highlights an important cost of performing IV estimation when x and u are uncorrelated: the asymptotic variance of the IV estimator is always larger, and sometimes much larger, than the asymptotic variance of the OLS estimator.

Example 15.1 Estimating the Return to Education for Married Women

We use the data on married working women in MROZ to estimate the return to education in the simple regression model

log(wage) = β0 + β1educ + u.    (15.14)

For comparison, we first obtain the OLS estimates:

log(wage) = −.185 + .109 educ
            (.185)  (.014)    (15.15)
n = 428, R² = .118.

The estimate for β1 implies an almost 11% return for another year of education.

Next, we use father's education (fatheduc) as an instrumental variable for educ. We have to maintain that fatheduc is uncorrelated with u. The second requirement is that educ and fatheduc are correlated. We can check this very easily using a simple regression of educ on fatheduc (using only the working women in the sample):

educ = 10.24 + .269 fatheduc
       (.28)   (.029)    (15.16)
n = 428, R² = .173.

The t statistic on fatheduc is 9.28, which indicates that educ and fatheduc have a statistically significant positive correlation. In fact, fatheduc explains about 17% of the variation in educ in the sample. Using fatheduc as an IV for educ gives

log(wage) = .441 + .059 educ
            (.446)  (.035)    (15.17)
n = 428, R² = .093.

The IV estimate of the return to education is 5.9%, which is barely more than one-half of the OLS estimate. This suggests that the OLS estimate is too high and is consistent with omitted ability bias. But we should remember that these are estimates from just one sample: we can never know whether .109 is above the true return to education, or whether .059 is closer to the true return to education. Further, the standard error of the IV estimate is two and one-half times as large as the OLS standard error (this is expected, for the reasons we gave earlier). The 95% confidence interval for β1 using OLS is much tighter than that using the IV; in fact, the IV confidence interval actually contains the OLS estimate. Therefore, although the differences between (15.15) and (15.17) are practically large, we cannot say whether the difference is statistically significant. We will show how to test this in Section 15.5.
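Estimates like (15.17) are a single command in most packages. A hypothetical Python sketch using the third-party linearmodels package; the CSV file name, and the assumption that the MROZ variables are available under these column names, are ours:

```python
import numpy as np
import pandas as pd
from linearmodels.iv import IV2SLS   # pip install linearmodels

# Hypothetical file; keep only the working women with observed wages.
df = pd.read_csv("mroz.csv").dropna(subset=["wage", "educ", "fatheduc"])

dep = np.log(df["wage"])
exog = pd.DataFrame({"const": 1.0}, index=df.index)   # intercept only
res = IV2SLS(dep, exog, df[["educ"]], df[["fatheduc"]]).fit(cov_type="unadjusted")
print(res.params)   # the educ coefficient should be near .059, as in (15.17)
```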
In the previous example, the estimated return to education using IV was less than that using OLS, which corresponds to our expectations. But this need not have been the case, as the following example demonstrates.

Example 15.2 Estimating the Return to Education for Men

We now use WAGE2 to estimate the return to education for men. We use the variable sibs (number of siblings) as an instrument for educ. These are negatively correlated, as we can verify from a simple regression:

educ = 14.14 − .228 sibs
       (.11)   (.030)
n = 935, R² = .057.

This equation implies that every sibling is associated with, on average, about .23 less of a year of education. If we assume that sibs is uncorrelated with the error term in (15.14), then the IV estimator is consistent. Estimating equation (15.14), using sibs as an IV for educ, gives

log(wage) = 5.13 + .122 educ
            (.36)  (.026)
n = 935.

(The R-squared is computed to be negative, so we do not report it; a discussion of R-squared in the context of IV estimation follows.) For comparison, the OLS estimate of β1 is .059, with a standard error of .006. Unlike in the previous example, the IV estimate is now much higher than the OLS estimate. While we do not know whether the difference is statistically significant, this does not mesh with the omitted ability bias from OLS. It could be that sibs is also correlated with ability: more siblings means, on average, less parental attention, which could result in lower ability. Another interpretation is that the OLS estimator is biased toward zero because of measurement error in educ. This is not entirely convincing because, as we discussed in Section 9.3, educ is unlikely to satisfy the classical errors-in-variables model.

In the previous examples, the endogenous explanatory variable (educ) and the instrumental variables (fatheduc, sibs) have quantitative meaning. But nothing prevents the explanatory variable or IV from being binary variables. Angrist and Krueger (1991), in their simplest analysis, came up with a clever binary instrumental variable for educ, using census data on men in the United States. Let frstqrt be equal to one if the man was born in the first quarter of the year, and zero otherwise. It seems that the error term in (15.14), and, in particular, ability, should be unrelated to quarter of birth. But frstqrt also needs to be correlated with educ. It turns out that years of education do differ systematically in the population based on quarter of birth. Angrist and Krueger argued persuasively that this is due to compulsory school attendance laws in effect in all states. Briefly, students born early in the year typically begin school at an older age. Therefore, they reach the compulsory schooling age (16 in most states) with somewhat less education than students who begin school at a younger age. For students who finish high school, Angrist and Krueger verified that there is no relationship between years of education and quarter of birth.

Because years of education varies only slightly across quarter of birth, which means R²x,z in (15.13) is very small, Angrist and Krueger needed a very large sample size to get a
reasonably precise IV estimate. Using 247,199 men born between 1920 and 1929, the OLS estimate of the return to education was .0801 (standard error .0004), and the IV estimate was .0715 (.0219); these are reported in Table III of Angrist and Krueger's paper. Note how large the t statistic is for the OLS estimate (about 200), whereas the t statistic for the IV estimate is only 3.26. Thus, the IV estimate is statistically different from zero, but its confidence interval is much wider than that based on the OLS estimate.

An interesting finding by Angrist and Krueger is that the IV estimate does not differ much from the OLS estimate. In fact, using men born in the next decade, the IV estimate is somewhat higher than the OLS estimate. One could interpret this as showing that there is no omitted ability bias when wage equations are estimated by OLS. However, the Angrist and Krueger paper has been criticized on econometric grounds. As discussed by Bound, Jaeger, and Baker (1995), it is not obvious that season of birth is unrelated to unobserved factors that affect wage. As we will explain in the next subsection, even a small amount of correlation between z and u can cause serious problems for the IV estimator.

For policy analysis, the endogenous explanatory variable is often a binary variable. For example, Angrist (1990) studied the effect that being a veteran of the Vietnam War had on lifetime earnings. A simple model is

log(earns) = β0 + β1veteran + u,    (15.18)

where veteran is a binary variable. The problem with estimating this equation by OLS is that there may be a self-selection problem, as we mentioned in Chapter 7: perhaps people who get the most out of the military choose to join, or the decision to join is correlated with other characteristics that affect earnings. These will cause veteran and u to be correlated.

Angrist pointed out that the Vietnam draft lottery provided a natural experiment (see also Chapter 13) that created an instrumental variable for veteran. Young men were given lottery numbers that determined whether they would be called to serve in Vietnam. Because the numbers given were eventually randomly assigned, it seems plausible that draft lottery number is uncorrelated with the error term, u. But those with a low enough number had to serve in Vietnam, so that the probability of being a veteran is correlated with lottery number. If both of these assertions are true, draft lottery number is a good IV candidate for veteran.

It is also possible to have a binary endogenous explanatory variable and a binary instrumental variable. See Problem 1 for an example.

15.1b Properties of IV with a Poor Instrumental Variable

We have already seen that, though IV is consistent when z and u are uncorrelated and z and x have any positive or negative correlation, IV estimates can have large standard errors, especially if z and x are only weakly correlated. Weak correlation between z and x can have even more serious consequences: the IV estimator can have a large asymptotic bias even if z and u are only moderately correlated. We can see this by
studying the probability limit of the IV estimator when z and u are possibly correlated. Letting β̂1,IV denote the IV estimator, we can write

plim β̂1,IV = β1 + [Corr(z, u)/Corr(z, x)]·(σu/σx),    (15.19)

where σu and σx are the standard deviations of u and x in the population, respectively. The interesting part of this equation involves the correlation terms. It shows that, even if Corr(z, u) is small, the inconsistency in the IV estimator can be very large if Corr(z, x) is also small. Thus, even if we focus only on consistency, it is not necessarily better to use IV than OLS if the correlation between z and u is smaller than that between x and u. Using the fact that Corr(x, u) = Cov(x, u)/(σx·σu), along with equation (5.3), we can write the plim of the OLS estimator, call it β̂1,OLS, as

plim β̂1,OLS = β1 + Corr(x, u)·(σu/σx).    (15.20)

Comparing these formulas shows that it is possible for the directions of the asymptotic biases to be different for IV and OLS. For example, suppose Corr(x, u) > 0, Corr(z, x) > 0, and Corr(z, u) < 0. Then, the IV estimator has a downward bias, whereas the OLS estimator has an upward bias (asymptotically). In practice, this situation is probably rare. More problematic is when the direction of the bias is the same and the correlation between z and x is small. For concreteness, suppose x and z are both positively correlated with u and Corr(z, x) > 0. Then, the asymptotic bias in the IV estimator is less than that for OLS only if Corr(z, u)/Corr(z, x) < Corr(x, u). If Corr(z, x) is small, then a seemingly small correlation between z and u can be magnified and make IV worse than OLS, even if we restrict attention to bias. For example, if Corr(z, x) = .2, Corr(z, u) must be less than one-fifth of Corr(x, u) before IV has less asymptotic bias than OLS. In many applications, the correlation between the instrument and x is less than .2. Unfortunately, because we rarely have an idea about the relative magnitudes of Corr(z, u) and Corr(x, u), we can never know for sure which estimator has the largest asymptotic bias (unless, of course, we assume Corr(z, u) = 0).

[Exploring Further 15.1] If some men who were assigned low draft lottery numbers obtained additional schooling to reduce the probability of being drafted, is lottery number a good instrument for veteran in (15.18)?

In the Angrist and Krueger (1991) example mentioned earlier, where x is years of schooling and z is a binary variable indicating quarter of birth, the correlation between z and x is very small. Bound, Jaeger, and Baker (1995) discussed reasons why quarter of birth and u might be somewhat correlated. From equation (15.19), we see that this can lead to a substantial bias in the IV estimator.
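Formulas (15.19) and (15.20) are easy to explore numerically. The correlations below are hypothetical values chosen only to illustrate how a weak instrument magnifies a small exogeneity violation:

```python
# Asymptotic biases implied by (15.19) and (15.20); all inputs hypothetical.
corr_xu = 0.2      # endogeneity of x
corr_zx = 0.2      # weak instrument
sd_ratio = 1.0     # sigma_u / sigma_x

ols_bias = corr_xu * sd_ratio                      # equation (15.20)
for corr_zu in (0.02, 0.05):
    iv_bias = (corr_zu / corr_zx) * sd_ratio       # equation (15.19)
    print(corr_zu, iv_bias, ols_bias)
# With Corr(z,u) = .02, the IV bias (0.10) beats OLS (0.20); with .05, the
# IV bias (0.25) is worse, matching the one-fifth rule in the text.
```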
When z and x are not correlated at all, things are especially bad, whether or not z is uncorrelated with u. The following example illustrates why we should always check to see if the endogenous explanatory variable is correlated with the IV candidate.

Example 15.3 Estimating the Effect of Smoking on Birth Weight

In Chapter 6, we estimated the effect of cigarette smoking on child birth weight. Without other explanatory variables, the model is

log(bwght) = β0 + β1packs + u,    (15.21)

where packs is the number of packs smoked by the mother per day. We might worry that packs is correlated with other health factors, or the availability of good prenatal care, so that packs and u might be correlated. A possible instrumental variable for packs is the average price of cigarettes in the state of residence, cigprice. We will assume that cigprice and u are uncorrelated (even though state support for health care could be correlated with cigarette taxes).

If cigarettes are a typical consumption good, basic economic theory suggests that packs and cigprice are negatively correlated, so that cigprice can be used as an IV for packs. To check this, we regress packs on cigprice, using the data in BWGHT:

packs = .067 + .0003 cigprice
        (.103)  (.0008)
n = 1,388, R² = .0000, R̄² = −.0006.

This indicates no relationship between smoking during pregnancy and cigarette prices, which is perhaps not too surprising, given the addictive nature of cigarette smoking.

Because packs and cigprice are not correlated, we should not use cigprice as an IV for packs in (15.21). But what happens if we do? The IV results would be

log(bwght) = 4.45 + 2.99 packs
             (.91)  (8.70)
n = 1,388

(the reported R-squared is negative). The coefficient on packs is huge and of an unexpected sign. The standard error is also very large, so packs is not significant. But the estimates are meaningless because cigprice fails the one requirement of an IV that we can always test: assumption (15.5).

The previous example shows that IV estimation can produce strange results when the instrument relevance condition, Corr(z, x) ≠ 0, fails. Of practically greater interest is the so-called problem of weak instruments, which is loosely defined as the problem of low (but not zero) correlation between z and x. In a particular application, it is difficult to define how low is too low, but recent theoretical research, supplemented by simulation studies, has shed considerable light on the issue. Staiger and Stock (1997) formalized the problem of weak instruments by modeling the correlation between z and x as a function of the sample size; in particular, the correlation is assumed to shrink to zero at the rate 1/√n. Not surprisingly, the asymptotic distribution of the instrumental variables estimator is different compared with the usual asymptotics, where the correlation is assumed to be fixed and nonzero. One of the implications of the Staiger-Stock work is that the usual statistical inference, based on t statistics and the standard normal distribution, can be seriously misleading. We discuss this further in Section 15.3.

15.1c Computing R-Squared after IV Estimation

Most regression packages compute an R-squared after IV estimation, using the standard formula: R² = 1 − SSR/SST, where SSR is the sum of squared IV residuals and SST is the total sum of squares of y. Unlike in the case of OLS, the R-squared from IV estimation can be negative because SSR for IV can actually be larger than SST. Although it does not really hurt to report the R-squared for IV estimation, it is not very useful, either. When x and u are correlated, we cannot decompose the variance of y into β1²Var(x) + Var(u), and so the R-squared has no natural interpretation. In addition, as we will discuss in Section 15.3, these R-squareds cannot be used in the usual way to compute F tests of joint restrictions.

If our goal was to produce the largest R-squared, we would always use OLS. IV methods are intended to provide better estimates of the ceteris paribus effect of x on y when x and u are correlated; goodness-of-fit is not a factor. A high R-squared resulting from OLS is of little comfort if we cannot consistently estimate β1.
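To see a negative R-squared arise, one can rerun the simple IV computation with a deliberately irrelevant instrument. This is a hypothetical simulation in the spirit of Example 15.3; nothing below comes from the BWGHT data:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 1000
u = rng.normal(size=n)
x = 0.8 * u + rng.normal(size=n)
y = 1.0 + 0.5 * x + u
z = rng.normal(size=n)      # "instrument" with Corr(z, x) = 0, as in Example 15.3

zd, xd = z - z.mean(), x - x.mean()
b1 = np.sum(zd * (y - y.mean())) / np.sum(zd * xd)  # ratio of two near-zero terms
b0 = y.mean() - b1 * x.mean()
ssr = np.sum((y - b0 - b1 * x) ** 2)   # sum of squared IV residuals
sst = np.sum((y - y.mean()) ** 2)
print(b1, 1 - ssr / sst)   # b1 is pure noise; the R-squared is typically
                           # negative here (the exact value depends on the draw)
```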
15.2 IV Estimation of the Multiple Regression Model

The IV estimator for the simple regression model is easily extended to the multiple regression case. We begin with the case where only one of the explanatory variables is correlated with the error. In fact, consider a standard linear model with two explanatory variables:

y1 = β0 + β1y2 + β2z1 + u1.    (15.22)

We call this a structural equation to emphasize that we are interested in the βj, which simply means that the equation is supposed to measure a causal relationship. We use a new notation here to distinguish endogenous from exogenous variables. The dependent variable y1 is clearly endogenous, as it is correlated with u1. The variables y2 and z1 are the explanatory variables, and u1 is the error. As usual, we assume that the expected value of u1 is zero: E(u1) = 0. We use z1 to indicate that this variable is exogenous in (15.22) (z1 is uncorrelated with u1). We use y2 to indicate that this variable is suspected of being correlated with u1. We do not specify why y2 and u1 are correlated, but, for now, it is best to think of u1 as containing an omitted variable correlated with y2. The notation in equation (15.22) originates in simultaneous equations models (which we cover in Chapter 16), but we use it more generally to easily distinguish exogenous from endogenous explanatory variables in a multiple regression model.

An example of (15.22) is

log(wage) = β0 + β1educ + β2exper + u1,    (15.23)

where y1 = log(wage), y2 = educ, and z1 = exper. In other words, we assume that exper is exogenous in (15.23), but we allow that educ, for the usual reasons, is correlated with u1.

We know that if (15.22) is estimated by OLS, all of the estimators will be biased and inconsistent. Thus, we follow the strategy suggested in the previous section and seek an instrumental variable for y2. Since z1 is assumed to be uncorrelated with u1, can we use z1 as an instrument for y2, assuming y2 and z1 are correlated? The answer is no. Since z1 itself appears as an explanatory variable in (15.22), it cannot serve as an instrumental variable for y2. We need another exogenous variable, call it z2, that does not appear in (15.22). Therefore, key assumptions are that z1 and z2 are uncorrelated with u1; we also assume that u1 has zero expected value, which is without loss of generality when the equation contains an intercept:

E(u1) = 0, Cov(z1, u1) = 0, and Cov(z2, u1) = 0.    (15.24)

Given the zero mean assumption, the latter two assumptions are equivalent to E(z1u1) = E(z2u1) = 0, and so the method of moments approach suggests obtaining estimators β̂0, β̂1, and β̂2 by
solving the sample counterparts of (15.24):

Σi (yi1 − β̂0 − β̂1yi2 − β̂2zi1) = 0,
Σi zi1(yi1 − β̂0 − β̂1yi2 − β̂2zi1) = 0,    (15.25)
Σi zi2(yi1 − β̂0 − β̂1yi2 − β̂2zi1) = 0.

This is a set of three linear equations in the three unknowns β̂0, β̂1, and β̂2, and it is easily solved given the data on y1, y2, z1, and z2. The estimators are called instrumental variables estimators. If we think y2 is exogenous and we choose z2 = y2, equations (15.25) are exactly the first order conditions for the OLS estimators; see equations (3.13).

We still need the instrumental variable z2 to be correlated with y2, but the sense in which these two variables must be correlated is complicated by the presence of z1 in equation (15.22). We now need to state the assumption in terms of partial correlation. The easiest way to state the condition is to write the endogenous explanatory variable as a linear function of the exogenous variables and an error term:

y2 = π0 + π1z1 + π2z2 + v2,    (15.26)

where, by construction, E(v2) = 0, Cov(z1, v2) = 0, and Cov(z2, v2) = 0, and the πj are unknown parameters. The key identification condition, along with (15.24), is that

π2 ≠ 0.    (15.27)

In other words, after partialling out z1, y2 and z2 are still correlated. This correlation can be positive or negative, but it cannot be zero. Testing (15.27) is easy: we estimate (15.26) by OLS and use a t test (possibly making it robust to heteroskedasticity). We should always test this assumption. Unfortunately, we cannot test that z1 and z2 are uncorrelated with u1; hopefully, we can make the case based on economic reasoning or introspection.

[Exploring Further 15.2] Suppose we wish to estimate the effect of marijuana usage on college grade point average. For the population of college seniors at a university, let daysused denote the number of days in the past month on which a student smoked marijuana, and consider the structural equation

colGPA = β0 + β1daysused + β2SAT + u.

(i) Let percHS denote the percentage of a student's high school graduating class that reported regular use of marijuana. If this is an IV candidate for daysused, write the reduced form for daysused. Do you think (15.27) is likely to be true? (ii) Do you think percHS is truly exogenous in the structural equation? What problems might there be?
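Returning to the estimation problem: the system (15.25) is linear in (β̂0, β̂1, β̂2), so the IV estimates can be computed with a single matrix solve. A self-contained sketch on simulated data (the data-generating process and all numbers are our own illustration, not from the text):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 2000
z1 = rng.normal(size=n)                    # included exogenous variable
z2 = rng.normal(size=n)                    # instrument excluded from (15.22)
u1 = rng.normal(size=n)
y2 = 1.0 + 0.7 * z1 + 0.9 * z2 + 0.6 * u1 + rng.normal(size=n)  # endogenous
y1 = 2.0 + 1.5 * y2 - 1.0 * z1 + u1        # structural equation like (15.22)

# Moment conditions (15.25): for each w in {1, z1, z2},
# sum_i w_i*(y_i1 - b0 - b1*y_i2 - b2*z_i1) = 0, i.e., W'(y1 - X b) = 0.
W = np.column_stack([np.ones(n), z1, z2])  # constant plus exogenous variables
X = np.column_stack([np.ones(n), y2, z1])  # regressors in the structural model
b = np.linalg.solve(W.T @ X, W.T @ y1)
print(b)   # approximately [2.0, 1.5, -1.0]
```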
Equation (15.26) is an example of a reduced form equation, which means that we have written an endogenous variable in terms of exogenous variables. This name comes from simultaneous equations models (which we study in the next chapter), but it is a useful concept whenever we have an endogenous explanatory variable. The name helps distinguish it from the structural equation (15.22).

Adding more exogenous explanatory variables to the model is straightforward. Write the structural model as

y1 = β0 + β1y2 + β2z1 + … + βkzk−1 + u1,    (15.28)

where y2 is thought to be correlated with u1. Let zk be a variable not in (15.28) that is also exogenous. Therefore, we assume that

E(u1) = 0, Cov(zj, u1) = 0, j = 1, …, k.    (15.29)

Under (15.29), z1, …, zk−1 are the exogenous variables appearing in (15.28). In effect, these act as their own instrumental variables in estimating the βj in (15.28). The special case of k = 2 is given in the equations in (15.25); along with z2, z1 appears in the set of moment conditions used to obtain the IV estimates. More generally, z1, …, zk−1 are used in the moment conditions along with the instrumental variable for y2, zk.

The reduced form for y2 is

y2 = π0 + π1z1 + … + πk−1zk−1 + πkzk + v2,    (15.30)

and we need some partial correlation between zk and y2:

πk ≠ 0.    (15.31)

Under (15.29) and (15.31), zk is a valid IV for y2. (We do not care about the remaining πj in (15.30); some or all of them could be zero.) A minor additional assumption is that there are no perfect linear relationships among the exogenous variables; this is analogous to the assumption of no perfect collinearity in the context of OLS. For standard statistical inference, we need to assume homoskedasticity of u1. We give a careful statement of these assumptions in a more general setting in Section 15.3.

Example 15.4 Using College Proximity as an IV for Education

Card (1995) used wage and education data for a sample of men in 1976 to estimate the return to education. He used a dummy variable for whether someone grew up near a four-year college (nearc4) as an instrumental variable for education. In a log(wage) equation, he included other standard controls: experience, a black dummy variable, dummy variables for living in an SMSA and living in the South, and a full set of regional dummy variables and an SMSA dummy for where the man was living in 1966. In order for nearc4 to be a valid instrument, it must be uncorrelated with the error term in the wage equation (we assume this) and it must be partially correlated with educ. To check the latter requirement, we regress educ on nearc4 and all of the exogenous variables appearing in the equation. That is, we estimate the reduced form for educ. Using the data in CARD, we obtain, in condensed form,

educ = 16.64 + .320 nearc4 − .413 exper + …
       (.24)   (.088)        (.034)
n = 3,010, R² = .477.

We are interested in the coefficient and t statistic on nearc4. The coefficient implies that, in 1976, other things being fixed (experience, race, region, and so on), people who lived near a college in 1966 had, on average, about one-third of a year more education than those who did not grow up near a college. The t statistic on nearc4 is 3.64, which gives a p-value that is zero in the first three decimals.

As discussed earlier, we should not make anything of the smaller R-squared in the IV estimation; by definition, the OLS R-squared will always be larger because OLS minimizes the sum of squared residuals. Therefore, if nearc4 is uncorrelated with unobserved factors in the error term, we can use nearc4 as an IV for educ. The OLS and IV estimates are given in Table 15.1. Like the OLS standard errors, the reported IV standard errors employ a degrees-of-freedom adjustment in estimating the error variance. (In some statistical packages, the degrees-of-freedom adjustment is the default; in others, it is not.) Interestingly, the IV estimate of the return to education is almost twice as large as the OLS estimate, but the standard error of the IV estimate is over 18 times larger than the OLS
standard error. The 95% confidence interval for the IV estimate is between .024 and .239, which is a very wide range. The presence of larger confidence intervals is a price we must pay to get a consistent estimator of the return to education when we think educ is endogenous.

Table 15.1  Dependent Variable: log(wage)

Explanatory Variables     OLS              IV
educ                      .075 (.003)      .132 (.055)
exper                     .085 (.007)      .108 (.024)
exper²                   −.0023 (.0003)   −.0023 (.0003)
black                    −.199 (.018)     −.147 (.054)
smsa                      .136 (.020)      .112 (.032)
south                    −.148 (.026)     −.145 (.027)
Observations              3,010            3,010
R-squared                 .300             .238

Other controls: smsa66, reg662–reg669.

It is worth noting, especially for studying the effects of policy interventions, that a reduced form equation exists for y1, too. In the context of equation (15.28), with zk an IV for y2, the reduced form for y1 always has the form

y1 = γ0 + γ1z1 + … + γkzk + e1,    (15.32)

where γj = βj+1 + β1πj for j < k, γk = β1πk, and e1 = u1 + β1v2, as can be verified by plugging (15.30) into (15.28) and rearranging. Because the zj are exogenous in (15.32), the γj can be consistently estimated by OLS. In other words, we regress y1 on all of the exogenous variables, including zk, the IV for y2. Only if we want to estimate β1 in (15.28) do we need to apply IV.

When y2 is a zero-one variable denoting participation, and zk is a zero-one variable representing eligibility for program participation (which is hopefully either randomized across individuals or, at most, a function of the other exogenous variables z1, …, zk−1, such as income), the coefficient γk has an interesting interpretation. Rather than an estimate of the effect of the program itself, it is an estimate of the effect of offering the program. Unlike β1 in (15.28), which measures the effect of the program itself, γk accounts for the possibility that some units made eligible will choose not to participate. In the program evaluation literature, γk is an example of an intention-to-treat parameter: it measures the effect of being made eligible and not the effect of actual participation. The intention-to-treat coefficient, γk = β1πk, depends on the effect of participating, β1, and the change (typically, increase) in the probability of participating due to being eligible, πk. When y2 is binary, equation (15.30) is a linear probability model, and therefore πk measures the ceteris paribus change in the probability that y2 = 1 as zk switches from zero to one.
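The relationship γk = β1πk in (15.32) is easy to verify by simulation. In the sketch below (all parameter values are our own hypothetical choices), OLS of y1 on the exogenous variables recovers β1πk as the coefficient on zk:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(5)
n = 5000
z1, zk = rng.normal(size=n), rng.normal(size=n)
u1, v2 = rng.normal(size=n), rng.normal(size=n)
beta1, pik = 1.5, 0.8
y2 = 0.4 * z1 + pik * zk + v2 + 0.5 * u1   # reduced form in the spirit of (15.30)
y1 = 1.0 + beta1 * y2 + 0.7 * z1 + u1      # structural equation as in (15.28)

# OLS of y1 on the exogenous variables estimates the gammas in (15.32).
Z = sm.add_constant(np.column_stack([z1, zk]))
gammas = sm.OLS(y1, Z).fit().params
print(gammas[2], beta1 * pik)   # coefficient on zk is close to beta1*pik = 1.2
```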
15.3 Two Stage Least Squares

In the previous section, we assumed that we had a single endogenous explanatory variable (y2), along with one instrumental variable for y2. It often happens that we have more than one exogenous variable that is excluded from the structural model and might be correlated with y2, which means they are valid IVs for y2. In this section, we discuss how to use multiple instrumental variables.

15.3a A Single Endogenous Explanatory Variable

Consider again the structural model (15.22), which has one endogenous and one exogenous explanatory variable. Suppose now that we have two exogenous variables excluded from (15.22): z2 and z3. Our assumptions that z2 and z3 do not appear in (15.22) and are uncorrelated with the error u1 are known as exclusion restrictions.

If z2 and z3 are both correlated with y2, we could just use each as an IV, as in the previous section. But then we would have two IV estimators, and neither of these would, in general, be efficient. Since each of z1, z2, and z3 is uncorrelated with u1, any linear combination is also uncorrelated with u1, and therefore any linear combination of the exogenous variables is a valid IV. To find the best IV, we choose the linear combination that is most highly correlated with y2. This turns out to be given by the reduced form equation for y2. Write

y2 = π0 + π1z1 + π2z2 + π3z3 + v2,    (15.33)

where E(v2) = 0, Cov(z1, v2) = 0, Cov(z2, v2) = 0, and Cov(z3, v2) = 0.

Then, the best IV for y2 (under the assumptions given in the chapter appendix) is the linear combination of the zj in (15.33), which we call y2*:

y2* = π0 + π1z1 + π2z2 + π3z3.    (15.34)

For this IV not to be perfectly correlated with z1, we need at least one of π2 or π3 to be different from zero:

π2 ≠ 0 or π3 ≠ 0.    (15.35)

This is the key identification assumption, once we assume the zj are all exogenous. (The value of π1 is irrelevant.) The structural equation (15.22) is not identified if π2 = 0 and π3 = 0. We can test H0: π2 = 0 and π3 = 0 against (15.35) using an F statistic.

A useful way to think of (15.33) is that it breaks y2 into two pieces. The first is y2*; this is the part of y2 that is uncorrelated with the error term, u1. The second piece is v2, and this part is possibly correlated with u1, which is why y2 is possibly endogenous.

Given data on the zj, we can compute y2* for each observation, provided we know the population parameters πj. This is never true in practice. Nevertheless, as we saw in the previous section, we can
The fitted value y2 is the estimated version of yp 2 and yp 2 is uncorrelated with u1 Therefore 2SLS first purges y2 of its correla tion with u1 before doing the OLS regression in 1538 We can show this by plugging y2 5 yp 2 1 v2 into 1522 y1 5 b0 1 b1yp 2 1 b2z1 1 u1 1 b1v2 1539 Now the composite error u1 1 b1v2 has zero mean and is uncorrelated with yp 2 and z1 which is why the OLS regression in 1538 works Most econometrics packages have special commands for 2SLS so there is no need to perform the two stages explicitly In fact in most cases you should avoid doing the second stage manually as the standard errors and test statistics obtained in this way are not valid The reason is that the error term in 1539 includes v2 but the standard errors involve the variance of u1 only Any regression software that supports 2SLS asks for the dependent variable the list of explanatory variables both exogenous and endogenous and the entire list of instrumental variables that is all exogenous vari ables The output is typically quite similar to that for OLS In model 1528 with a single IV for y2 the IV estimator from Section 152 is identical to the 2SLS estimator Therefore when we have one IV for each endogenous explanatory variable we can call the estimation method IV or 2SLS Adding more exogenous variables changes very little For example suppose the wage equation is log1wage2 5 b0 1 b1educ 1 b2exper 1 b3exper2 1 u1 1540 where u1 is uncorrelated with both exper and exper2 Suppose that we also think mothers and fathers educations are uncorrelated with u1 Then we can use both of these as IVs for educ The reduced form equation for educ is educ 5 p0 1 p1exper 1 p2exper2 1 p3 motheduc 1 p4 fatheduc 1 v2 1541 and identification requires that p3 2 0 or p4 2 0 or both of course Copyright 2016 Cengage Learning All Rights Reserved May not be copied scanned or duplicated in whole or in part Due to electronic rights some third party content may be suppressed from the eBook andor eChapters Editorial review has deemed that any suppressed content does not materially affect the overall learning experience Cengage Learning reserves the right to remove additional content at any time if subsequent rights restrictions require it CHAPTER 15 Instrumental Variables Estimation and Two Stage Least Squares 477 ExamplE 155 Return to Education for Working Women We estimate equation 1540 using the data in MROZ First we test H0 p3 5 0 p4 5 0 in 1541 using an F test The result is F 5 12476 and pvalue 5 0000 As expected educ is partially cor related with parents education When we estimate 1540 by 2SLS we obtain in equation form log1wage2 5 048 1 061 educ 1 044 exper 2 0009 exper2 14002 10312 10132 100042 n 5 428 R2 5 136 The estimated return to education is about 61 compared with an OLS estimate of about 108 Because of its relatively large standard error the 2SLS estimate is barely statistically significant at the 5 level against a twosided alternative The assumptions needed for 2SLS to have the desired large sample properties are given in the chapter appendix but it is useful to briefly summarize them here If we write the structural equation as in 1528 y1 5 b0 1 b1y2 1 b2z1 1 p 1 bkzk21 1 u1 1542 then we assume each zj to be uncorrelated with u1 In addition we need at least one exogenous vari able not in 1542 that is partially correlated with y2 This ensures consistency For the usual 2SLS standard errors and t statistics to be asymptotically valid we also need a homoskedasticity assump tion the variance of the structural error u1 cannot 
depend on any of the exogenous variables For time series applications we need more assumptions as we will see in Section 157 153b Multicollinearity and 2SLS In Chapter 3 we introduced the problem of multicollinearity and showed how correlation among regres sors can lead to large standard errors for the OLS estimates Multicollinearity can be even more serious with 2SLS To see why the asymptotic variance of the 2SLS estimator of b1 can be approximated as s23SST211 2 R 2 22 4 1543 where s2 5 Var1u12 SST2 is the total variation in y2 and R 2 2 is the Rsquared from a regression of y2 on all other exogenous variables appearing in the structural equation There are two reasons why the variance of the 2SLS estimator is larger than that for OLS First y2 by construction has less variation than y2 Remember Total sum of squares 5 explained sum of squares 1 residual sum of squares the variation in y2 is the total sum of squares while the variation in y2 is the explained sum of squares from the first stage regression Second the correlation between y2 and the exogenous variables in 1542 is often much higher than the correlation between y2 and these variables This essentially defines the multicollinearity problem in 2SLS As an illustration consider Example 154 When educ is regressed on the exogenous variables in Table 151 not including nearc4 Rsquared 5 475 this is a moderate degree of multicollinearity but the important thing is that the OLS standard error on b educ is quite small When we obtain the first stage fitted values educ and regress these on the exogenous variables in Table 151 Rsquared 5 995 which indicates a very high degree of multicollinearity between educ and the remaining exogenous variables in the table This high Rsquared is not too surprising because educ is a function of all the exogenous variables in Table 151 plus nearc4 Equation 1543 shows that an R 2 2 close to one can result in a very large standard error for the 2SLS estimator But as with OLS a large sample size can help offset a large R 2 2 Copyright 2016 Cengage Learning All Rights Reserved May not be copied scanned or duplicated in whole or in part Due to electronic rights some third party content may be suppressed from the eBook andor eChapters Editorial review has deemed that any suppressed content does not materially affect the overall learning experience Cengage Learning reserves the right to remove additional content at any time if subsequent rights restrictions require it PART 3 Advanced Topics 478 153c Detecting Weak Instruments In Section 151 we briefly discussed the problem of weak instruments We focused on equation 1519 which demonstrates how a small correlation between the instrument and error can lead to very large inconsistency and therefore bias if the instrument z also has little correlation with the explanatory variable x The same problem can arise in the context of the multiple equation model in equation 1542 whether we have one instrument for y2 or more instruments than we need We also mentioned the findings of Staiger and Stock 1997 and we now discuss the practical implications of this research in a bit more depth Importantly Staiger and Stock study the case of where all instrumental variables are exogenous With the exogeneity requirement satisfied by the instruments they focus on the case where the instruments are weakly correlated with y2 and they study the validity of standard errors confidence intervals and t statistics involving the coefficient b1 on y2 The mecha nism they used to model weak correlation 
15.3c Detecting Weak Instruments

In Section 15.1, we briefly discussed the problem of weak instruments. We focused on equation (15.19), which demonstrates how a small correlation between the instrument and the error can lead to very large inconsistency (and therefore bias) if the instrument, z, also has little correlation with the explanatory variable, x. The same problem can arise in the context of the multiple equation model in (15.42), whether we have one instrument for y2 or more instruments than we need.

We also mentioned the findings of Staiger and Stock (1997), and we now discuss the practical implications of this research in a bit more depth. Importantly, Staiger and Stock study the case where all instrumental variables are exogenous. With the exogeneity requirement satisfied by the instruments, they focus on the case where the instruments are weakly correlated with y2, and they study the validity of standard errors, confidence intervals, and t statistics involving the coefficient β1 on y2. The mechanism they used to model weak correlation led to an important finding: even with very large sample sizes, the 2SLS estimator can be biased and can have a distribution that is very different from standard normal.

Building on Staiger and Stock (1997), Stock and Yogo (2005) (SY for short) proposed methods for detecting situations where weak instruments will lead to substantial bias and distorted statistical inference. Conveniently, Stock and Yogo obtained rules concerning the size of the t statistic (with one instrument) or the F statistic (with more than one instrument) from the first-stage regression. The theory is much too involved to pursue here. Instead, we describe some simple rules of thumb proposed by Stock and Yogo that are easy to implement.

The key implication of the SY work is that one needs more than just a statistical rejection of the null hypothesis in the first stage regression at the usual significance levels. For example, in equation (15.6), it is not enough to reject the null hypothesis stated in (15.7) at the 5% significance level. Using bias calculations for the instrumental variables estimator, SY recommend that one can proceed with the usual IV inference if the first-stage t statistic has absolute value larger than √10 ≈ 3.2. Readers will recognize this value as being well above 1.96, the critical value we would use for a standard 5% significance level (two-sided). This same rule of thumb applies in the multiple regression model with a single endogenous explanatory variable, y2, and a single instrumental variable, zk: the t statistic for testing hypothesis (15.31) should be at least 3.2 in absolute value.

SY cover the case of 2SLS, too. In this case, we must focus on the first-stage F statistic for exclusion of the instrumental variables for y2, and the SY rule is F > 10. Notice that this is the same rule based on the t statistic when there is only one instrument, as t² = F. For example, consider equation (15.34), where we have two instruments for y2, z2 and z3. Then, the F statistic for the null hypothesis H0: π2 = 0, π3 = 0 should satisfy F > 10. Remember, this is not the overall F statistic for all of the exogenous variables in (15.34); we test only the coefficients on the proposed IVs for y2, that is, the exogenous variables that do not appear in (15.22). In Example 15.5, the relevant F statistic is 124.76, which is well above 10, implying that we do not have to worry about weak instruments. (Of course, the exogeneity of the parents' education variables is in doubt.)

The rule of thumb of requiring the F statistic to be larger than 10 works well in most models and is easy to remember. However, like all rules of thumb involving statistical inference, it makes no sense to use 10 as a knife-edge cutoff. For example, one can probably proceed if F = 9.94, as it is pretty close to 10; the rule of thumb should be used as a guideline. SY have more detailed suggestions for cases where there are many instruments for y2, say, five or more; the interested reader is referred to the SY paper. Most empirical researchers adopt 10 as the target value.
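Computing the relevant first-stage statistic takes one restricted and one unrestricted regression. The sketch below (simulated data and names of our own choosing) forms the F statistic for excluding two proposed instruments, which is the quantity compared against the SY threshold of 10.

```python
# Sketch: first-stage F statistic for the excluded instruments (SY rule: F > 10).
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
n = 800
z1 = rng.normal(size=n)              # included exogenous variable
z2, z3 = rng.normal(size=(2, n))     # two proposed instruments for y2
y2 = 0.4 * z1 + 0.25 * z2 + 0.2 * z3 + rng.normal(size=n)

# Unrestricted first stage: y2 on z1, z2, z3; restricted: y2 on z1 only.
ur = sm.OLS(y2, sm.add_constant(np.column_stack([z1, z2, z3]))).fit()
r = sm.OLS(y2, sm.add_constant(z1)).fit()

q, df_ur = 2, n - 4                  # 2 exclusion restrictions; 4 estimated params
F = ((r.ssr - ur.ssr) / q) / (ur.ssr / df_ur)
print(F)   # proceed with the usual IV inference only if F is comfortably above 10
```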
15.3d Multiple Endogenous Explanatory Variables

Two stage least squares can also be used in models with more than one endogenous explanatory variable. For example, consider the model

y1 = β0 + β1y2 + β2y3 + β3z1 + β4z2 + β5z3 + u1,   [15.44]

where E(u1) = 0 and u1 is uncorrelated with z1, z2, and z3. The variables y2 and y3 are endogenous explanatory variables: each may be correlated with u1.

To estimate (15.44) by 2SLS, we need at least two exogenous variables that do not appear in (15.44) but that are correlated with y2 and y3. Suppose we have two excluded exogenous variables, say, z4 and z5. Then, from our analysis of a single endogenous explanatory variable, we need either z4 or z5 to appear in each reduced form for y2 and y3. (As before, we can use F statistics to test this.) Although this is necessary for identification, unfortunately, it is not sufficient. Suppose that z4 appears in each reduced form, but z5 appears in neither. Then, we do not really have two exogenous variables partially correlated with y2 and y3, and two stage least squares will not produce consistent estimators of the βj.

Generally, when we have more than one endogenous explanatory variable in a regression model, identification can fail in several complicated ways. But we can easily state a necessary condition for identification, which is called the order condition.

Order Condition for Identification of an Equation. We need at least as many excluded exogenous variables as there are included endogenous explanatory variables in the structural equation. The order condition is simple to check, as it only involves counting endogenous and exogenous variables.

The sufficient condition for identification is called the rank condition. We have seen special cases of the rank condition before, for example, in the discussion surrounding equation (15.35). A general statement of the rank condition requires matrix algebra and is beyond the scope of this text. [See Wooldridge (2010, Chapter 5).] It is even more difficult to obtain diagnostics for weak instruments.

15.3e Testing Multiple Hypotheses after 2SLS Estimation

We must be careful when testing multiple hypotheses in a model estimated by 2SLS. It is tempting to use either the sum of squared residuals or the R-squared form of the F statistic, as we learned with OLS in Chapter 4. The fact that the R-squared in 2SLS can be negative suggests that the usual way of computing F statistics might not be appropriate; this is the case. In fact, if we use the 2SLS residuals to compute the SSRs for both the restricted and unrestricted models, there is no guarantee that SSRr ≥ SSRur; if the reverse is true, the F statistic would be negative.

It is possible to combine the sum of squared residuals from the second stage regression [such as (15.38)] with SSRur to obtain a statistic with an approximate F distribution in large samples. Because many econometrics packages have simple-to-use test commands that can be used to test multiple hypotheses after 2SLS estimation, we omit the details. Davidson and MacKinnon (1993) and Wooldridge (2010, Chapter 5) contain discussions of how to compute F-type statistics for 2SLS.
15.4 IV Solutions to Errors-in-Variables Problems

In the previous sections, we presented the use of instrumental variables as a way to solve the omitted variables problem, but they can also be used to deal with the measurement error problem. As an illustration, consider the model

y = β0 + β1x1* + β2x2 + u,   [15.45]

where y and x2 are observed but x1* is not. Let x1 be an observed measurement of x1*: x1 = x1* + e1, where e1 is the measurement error. In Chapter 9, we showed that correlation between x1 and e1 causes OLS, where x1 is used in place of x1*, to be biased and inconsistent. We can see this by writing

y = β0 + β1x1 + β2x2 + (u − β1e1).   [15.46]

If the classical errors-in-variables (CEV) assumptions hold, the bias in the OLS estimator of β1 is toward zero. Without further assumptions, we can do nothing about this.

Exploring Further 15.3
The following model explains violent crime rates, at the city level, in terms of a binary variable for whether gun control laws exist and other controls:
violent = β0 + β1guncontrol + β2unem + β3popul + β4percblck + β5age18_21 + …
Some researchers have estimated similar equations using variables such as the number of National Rifle Association members in the city and the number of subscribers to gun magazines as instrumental variables for guncontrol [see, for example, Kleck and Patterson (1993)]. Are these convincing instruments?

In some cases, we can use an IV procedure to solve the measurement error problem. In (15.45), we assume that u is uncorrelated with x1*, x1, and x2; in the CEV case, we assume that e1 is uncorrelated with x1* and x2. These imply that x2 is exogenous in (15.46), but that x1 is correlated with e1. What we need is an IV for x1. Such an IV must be correlated with x1, uncorrelated with u, so that it can be excluded from (15.45), and uncorrelated with the measurement error, e1.

One possibility is to obtain a second measurement on x1*, say, z1. Because it is x1* that affects y, it is only natural to assume that z1 is uncorrelated with u. If we write z1 = x1* + a1, where a1 is the measurement error in z1, then we must assume that a1 and e1 are uncorrelated. In other words, x1 and z1 both mismeasure x1*, but their measurement errors are uncorrelated. Certainly, x1 and z1 are correlated through their dependence on x1*, so we can use z1 as an IV for x1.

Where might we get two measurements on a variable? Sometimes, when a group of workers is asked for their annual salary, their employers can provide a second measure. For married couples, each spouse can independently report the level of savings or family income. In the Ashenfelter and Krueger (1994) study cited in Section 14.3, each twin was asked about his or her sibling's years of education; this gives a second measure that can be used as an IV for self-reported education in a wage equation. (Ashenfelter and Krueger combined differencing and IV to account for the omitted ability problem as well; more on this in Section 15.8.) Generally, though, having two measures of an explanatory variable is rare.

An alternative is to use other exogenous variables as IVs for a potentially mismeasured variable. For example, our use of motheduc and fatheduc as IVs for educ in Example 15.5 can serve this purpose. If we think that educ = educ* + e1, then the IV estimates in Example 15.5 do not suffer from measurement error if motheduc and fatheduc are uncorrelated with the measurement error, e1. This is probably more reasonable than assuming motheduc and fatheduc are uncorrelated with ability, which is contained in u in (15.45).
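A quick simulation makes the attenuation-versus-IV point concrete. The sketch below is our own toy design (one true regressor measured twice with independent errors; all numbers are invented): OLS is biased toward zero, while the second measurement, used as an IV, recovers the true coefficient.

```python
# Sketch: classical measurement error, with a second noisy report used as an IV.
import numpy as np

rng = np.random.default_rng(3)
n = 5000
x_star = rng.normal(size=n)                   # true, unobserved regressor
y = 1.0 + 2.0 * x_star + rng.normal(size=n)   # true beta1 = 2.0
x1 = x_star + rng.normal(size=n)              # first noisy measurement
z1 = x_star + rng.normal(size=n)              # second noisy measurement (the IV);
                                              # its error is independent of x1's error

# OLS of y on x1: attenuated toward zero under the CEV assumptions
c = np.cov(x1, y)
b_ols = c[0, 1] / c[0, 0]
# Simple IV estimator with one instrument: Cov(z1, y) / Cov(z1, x1)
b_iv = np.cov(z1, y)[0, 1] / np.cov(z1, x1)[0, 1]
print(b_ols, b_iv)   # roughly 1.0 vs. 2.0 with these error variances
```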
IV methods can also be adapted when using things like test scores to control for unobserved characteristics. In Section 9.2, we showed that, under certain assumptions, proxy variables can be used to solve the omitted variables problem. In Example 9.3, we used IQ as a proxy variable for unobserved ability. This simply entails adding IQ to the model and performing an OLS regression. But there is an alternative that works when IQ does not fully satisfy the proxy variable assumptions. To illustrate, write a wage equation as

log(wage) = β0 + β1educ + β2exper + β3exper² + abil + u,   [15.47]

where we again have the omitted ability problem. But we have two test scores that are indicators of ability. We assume that the scores can be written as

test1 = γ1abil + e1

and

test2 = δ1abil + e2,

where γ1 > 0 and δ1 > 0. Since it is ability that affects wage, we can assume that test1 and test2 are uncorrelated with u. If we write abil in terms of the first test score and plug the result into (15.47), we get

log(wage) = β0 + β1educ + β2exper + β3exper² + α1test1 + (u − α1e1),   [15.48]

where α1 = 1/γ1. Now, if we assume that e1 is uncorrelated with all the explanatory variables in (15.47), including abil, then e1 and test1 must be correlated. [Notice that educ is not endogenous in (15.48); however, test1 is.] This means that estimating (15.48) by OLS will produce inconsistent estimators of the βj (and α1). Under the assumptions we have made, test1 does not satisfy the proxy variable assumptions.

If we assume that e2 is also uncorrelated with all the explanatory variables in (15.47) and that e1 and e2 are uncorrelated, then e1 is uncorrelated with the second test score, test2. Therefore, test2 can be used as an IV for test1.

Example 15.6  Using Two Test Scores as Indicators of Ability

We use the data in WAGE2 to implement the preceding procedure, where IQ plays the role of the first test score and KWW (knowledge of the world of work) is the second test score. The explanatory variables are the same as in Example 9.3: educ, exper, tenure, married, south, urban, and black. Rather than adding IQ and doing OLS, as in column (2) of Table 9.2, we add IQ and use KWW as its instrument. The coefficient on educ is .025 (se = .017). This is a low estimate, and it is not statistically different from zero. This is a puzzling finding, and it suggests that one of our assumptions fails; perhaps e1 and e2 are correlated.
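Here is a hedged sketch of the indicator approach in (15.47)-(15.48) on simulated data (all parameter values and names are invented for illustration). With one endogenous regressor and one instrument, the just-identified IV estimator can be computed directly as (Z'X)^(-1) Z'y.

```python
# Sketch: a second ability indicator (test2) instruments the first (test1).
import numpy as np

rng = np.random.default_rng(4)
n = 4000
abil = rng.normal(size=n)
educ = 12 + 0.5 * abil + rng.normal(size=n)    # educ may depend on ability
test1 = 1.0 * abil + rng.normal(size=n)        # gamma1 = 1
test2 = 0.8 * abil + rng.normal(size=n)        # delta1 = 0.8; e2 independent of e1
lwage = 1.0 + 0.08 * educ + 0.5 * abil + rng.normal(size=n)  # true return: 0.08

const = np.ones(n)
X = np.column_stack([const, educ, test1])      # test1 stands in for abil
Z = np.column_stack([const, educ, test2])      # test2 instruments for test1

b_iv = np.linalg.solve(Z.T @ X, Z.T @ lwage)   # just-identified IV estimator
b_ols = np.linalg.lstsq(X, lwage, rcond=None)[0]
print(b_ols[1], b_iv[1])  # IV coefficient on educ should be near 0.08; OLS is biased
```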
15.5 Testing for Endogeneity and Testing Overidentifying Restrictions

In this section, we describe two important tests in the context of instrumental variables estimation.

15.5a Testing for Endogeneity

The 2SLS estimator is less efficient than OLS when the explanatory variables are exogenous; as we have seen, the 2SLS estimates can have very large standard errors. Therefore, it is useful to have a test for endogeneity of an explanatory variable that shows whether 2SLS is even necessary. Obtaining such a test is rather simple.

To illustrate, suppose we have a single suspected endogenous variable,

y1 = β0 + β1y2 + β2z1 + β3z2 + u1,   [15.49]

where z1 and z2 are exogenous. We have two additional exogenous variables, z3 and z4, which do not appear in (15.49). If y2 is uncorrelated with u1, we should estimate (15.49) by OLS. How can we test this? Hausman (1978) suggested directly comparing the OLS and 2SLS estimates and determining whether the differences are statistically significant. After all, both OLS and 2SLS are consistent if all variables are exogenous. If 2SLS and OLS differ significantly, we conclude that y2 must be endogenous (maintaining that the zj are exogenous).

It is a good idea to compute OLS and 2SLS to see if the estimates are practically different. To determine whether the differences are statistically significant, it is easier to use a regression test. This is based on estimating the reduced form for y2, which in this case is

y2 = π0 + π1z1 + π2z2 + π3z3 + π4z4 + v2.   [15.50]

Now, since each zj is uncorrelated with u1, y2 is uncorrelated with u1 if, and only if, v2 is uncorrelated with u1; this is what we wish to test. Write u1 = δ1v2 + e1, where e1 is uncorrelated with v2 and has zero mean. Then, u1 and v2 are uncorrelated if, and only if, δ1 = 0. The easiest way to test this is to include v2 as an additional regressor in (15.49) and to do a t test. There is only one problem with implementing this: v2 is not observed, because it is the error term in (15.50). Because we can estimate the reduced form for y2 by OLS, we can obtain the reduced form residuals, v̂2. Therefore, we estimate

y1 = β0 + β1y2 + β2z1 + β3z2 + δ1v̂2 + error   [15.51]

by OLS and test H0: δ1 = 0 using a t statistic. If we reject H0 at a small significance level, we conclude that y2 is endogenous, because v2 and u1 are correlated.

Testing for Endogeneity of a Single Explanatory Variable:
(i) Estimate the reduced form for y2 by regressing it on all exogenous variables (including those in the structural equation and the additional IVs). Obtain the residuals, v̂2.
(ii) Add v̂2 to the structural equation (which includes y2) and test for significance of v̂2 using an OLS regression. If the coefficient on v̂2 is statistically different from zero, we conclude that y2 is indeed endogenous. We might want to use a heteroskedasticity-robust t test.

Example 15.7  Return to Education for Working Women

We can test for endogeneity of educ in (15.40) by obtaining the residuals v̂2 from estimating the reduced form (15.41), using only working women, and including these in (15.40). When we do this, the coefficient on v̂2 is δ̂1 = .058 and t = 1.67. This is moderate evidence of positive correlation between u1 and v2. It is probably a good idea to report both estimates, because the 2SLS estimate of the return to education (6.1%) is well below the OLS estimate (10.8%).

An interesting feature of the regression from step (ii) of the test for endogeneity is that the coefficient estimates on all explanatory variables (except, of course, v̂2) are identical to the 2SLS estimates. For example, estimating (15.51) by OLS produces the same β̂j as estimating (15.49) by 2SLS. One benefit of this equivalence is that it provides an easy check on whether you have done the proper regression in testing for endogeneity. But it also gives a different, useful interpretation of 2SLS: adding v̂2 to the original equation as an explanatory variable, and applying OLS, clears up the endogeneity of y2. So, when we start by estimating (15.49) by OLS, we can quantify the importance of allowing y2 to be endogenous by seeing how much β̂1 changes when v̂2 is added to the equation. Irrespective of the outcome of the statistical tests, we can see whether the change in β̂1 is expected and is practically significant.
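The two-step procedure translates directly into code. Below is a minimal sketch on simulated data (the design is ours): the reduced-form residuals are added to the structural equation, and the t statistic on them is the endogeneity test. As noted above, the other coefficients in that augmented regression also equal the 2SLS estimates.

```python
# Sketch: regression-based test for endogeneity of y2, as in (15.50)-(15.51).
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(5)
n = 1000
z1, z2, z3, z4 = rng.normal(size=(4, n))
u1 = rng.normal(size=n)
v2 = 0.6 * u1 + rng.normal(size=n)          # endogeneity: Corr(u1, v2) > 0
y2 = 0.5 * z1 + 0.5 * z2 + 0.7 * z3 + 0.7 * z4 + v2
y1 = 1.0 + 1.0 * y2 + 0.5 * z1 - 0.5 * z2 + u1

# Step (i): reduced form for y2 on all exogenous variables; keep residuals.
Zall = sm.add_constant(np.column_stack([z1, z2, z3, z4]))
v2_hat = sm.OLS(y2, Zall).fit().resid

# Step (ii): add v2_hat to the structural equation and use its t statistic.
Xaug = sm.add_constant(np.column_stack([y2, z1, z2, v2_hat]))
res = sm.OLS(y1, Xaug).fit()
print(res.tvalues[-1])   # large |t| -> evidence that y2 is endogenous
print(res.params[:4])    # these coefficients equal the 2SLS estimates
```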
We can also test for endogeneity of multiple explanatory variables. For each suspected endogenous variable, we obtain the reduced form residuals, as in part (i). Then, we test for joint significance of these residuals in the structural equation, using an F test. Joint significance indicates that at least one suspected explanatory variable is endogenous. The number of exclusion restrictions tested is the number of suspected endogenous explanatory variables.

15.5b Testing Overidentification Restrictions

When we introduced the simple instrumental variables estimator in Section 15.1, we emphasized that an instrument must satisfy two requirements: it must be uncorrelated with the error (exogeneity) and correlated with the endogenous explanatory variable (relevance). We have now seen that, even in models with additional explanatory variables, the second requirement can be tested using a t test (with just one instrument) or an F test (when there are multiple instruments). In the context of the simple IV estimator, we noted that the exogeneity requirement cannot be tested. However, if we have more instruments than we need, we can effectively test whether some of them are uncorrelated with the structural error.

As a specific example, again consider equation (15.49) with two instrumental variables for y2, z3 and z4. (Remember, z1 and z2 essentially act as their own instruments.) Because we have two instruments for y2, we can estimate (15.49) using, say, only z3 as an IV for y2; let β̌1 be the resulting IV estimator of β1. Then, we can estimate (15.49) using only z4 as an IV for y2; call this IV estimator β̃1. If all zj are exogenous, and if z3 and z4 are each partially correlated with y2, then β̌1 and β̃1 are both consistent for β1. Therefore, if our logic for choosing the instruments is sound, β̌1 and β̃1 should differ only by sampling error. Hausman (1978) proposed basing a test of whether z3 and z4 are both exogenous on the difference β̌1 − β̃1. Shortly, we will provide a simpler way to obtain a valid test, but, before doing so, we should understand how to interpret the outcome of the test.

If we conclude that β̌1 and β̃1 are statistically different from one another, then we have no choice but to conclude that either z3, z4, or both fail the exogeneity requirement. Unfortunately, we cannot know which is the case (unless we simply assert from the beginning that, say, z3 is exogenous). For example, if y2 denotes years of schooling in a log wage equation, z3 is mother's education, and z4 is father's education, a statistically significant difference in the two IV estimators implies that one or both of the parents' education variables are correlated with u1 in (15.49). Certainly, rejecting that one's instruments are exogenous is serious and requires a new approach.
But the more serious, and subtle, problem in comparing IV estimates is that they may be similar even though both instruments fail the exogeneity requirement. In the previous example, it seems likely that if mother's education is positively correlated with u1, then so is father's education. Therefore, the two IV estimates may be similar even though each is inconsistent. In effect, because the IVs in this example are chosen using similar reasoning, their separate use in IV procedures may very well lead to similar estimates that are nevertheless both inconsistent. The point is that we should not feel especially comfortable if our IV procedures pass the Hausman test.

Another problem with comparing two IV estimates is that they may often seem practically different yet, statistically, we cannot reject the null hypothesis that they are consistent for the same population parameter. For example, in estimating (15.40) by IV using motheduc as the only instrument, the coefficient on educ is .049 (.037). If we use only fatheduc as the IV for educ, the coefficient on educ is .070 (.034). [Perhaps not surprisingly, the estimate using both parents' education as IVs is in between these two, .061 (.031).] For policy purposes, the difference between 5% and 7% for the estimated return to a year of schooling is substantial. Yet, as shown in Example 15.8, the difference is not statistically significant.

The procedure of comparing different IV estimates of the same parameter is an example of testing overidentifying restrictions. The general idea is that we have more instruments than we need to estimate the parameters consistently. In the previous example, we had one more instrument than we need, and this results in one overidentifying restriction that can be tested. In the general case, suppose that we have q more instruments than we need. For example, with one endogenous explanatory variable, y2, and three proposed instruments for y2, we have q = 3 − 1 = 2 overidentifying restrictions. When q is two or more, comparing several IV estimates is cumbersome. Instead, we can easily compute a test statistic based on the 2SLS residuals. The idea is that, if all instruments are exogenous, the 2SLS residuals should be uncorrelated with the instruments, up to sampling error. But if there are k + 1 parameters and k + 1 + q instruments, the 2SLS residuals have a zero mean and are identically uncorrelated with k linear combinations of the instruments. (This algebraic fact contains, as a special case, the fact that the OLS residuals have a zero mean and are uncorrelated with the k explanatory variables.) Therefore, the test checks whether the 2SLS residuals are correlated with q linear functions of the instruments, and we need not decide on the functions; the test does that for us automatically.

The following regression-based test is valid when the homoskedasticity assumption, listed as Assumption 2SLS.5 in the chapter appendix, holds.

Testing Overidentifying Restrictions:
(i) Estimate the structural equation by 2SLS and obtain the 2SLS residuals, û1.
(ii) Regress û1 on all exogenous variables. Obtain the R-squared, say, R1².
(iii) Under the null hypothesis that all IVs are uncorrelated with u1, nR1² ~ᵃ χ²_q, where q is the number of instrumental variables from outside the model minus the total number of endogenous explanatory variables. If nR1² exceeds (say) the 5% critical value in the χ²_q distribution, we reject H0 and conclude that at least some of the IVs are not exogenous.
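The three steps are mechanical, with one trap worth flagging: the 2SLS residuals must be computed from the original y2, not from the first-stage fitted values. The sketch below (our own simulated design, continuing the earlier ones) implements the nR1² test.

```python
# Sketch: regression-based test of overidentifying restrictions.
# Here q = 2 instruments - 1 endogenous variable = 1.
import numpy as np
import statsmodels.api as sm
from scipy import stats

rng = np.random.default_rng(6)
n = 1000
z1, z2, z3, z4 = rng.normal(size=(4, n))
u1 = rng.normal(size=n)
y2 = 0.5 * z1 + 0.5 * z2 + 0.7 * z3 + 0.7 * z4 + 0.6 * u1 + rng.normal(size=n)
y1 = 1.0 + 1.0 * y2 + 0.5 * z1 - 0.5 * z2 + u1

# 2SLS by hand: first stage, second stage, then residuals using the
# ORIGINAL y2 (not y2_hat) -- these are the 2SLS residuals u1_hat.
Zall = sm.add_constant(np.column_stack([z1, z2, z3, z4]))
y2_hat = sm.OLS(y2, Zall).fit().fittedvalues
b = sm.OLS(y1, sm.add_constant(np.column_stack([y2_hat, z1, z2]))).fit().params
u1_hat = y1 - b[0] - b[1] * y2 - b[2] * z1 - b[3] * z2

# Regress u1_hat on ALL exogenous variables; n * R^2 ~ chi2(q) under H0.
r2 = sm.OLS(u1_hat, Zall).fit().rsquared
nR2, q = n * r2, 1
print(nR2, stats.chi2.sf(nR2, q))   # small nR2 / large p-value: IVs pass the test
```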
Example 15.8  Return to Education for Working Women

When we use motheduc and fatheduc as IVs for educ in (15.40), we have a single overidentifying restriction. Regressing the 2SLS residuals, û1, on exper, exper², motheduc, and fatheduc produces R1² = .0009. Therefore, nR1² = 428(.0009) = .3852, which is a very small value in a χ²_1 distribution (p-value = .535). Therefore, the parents' education variables pass the overidentification test. When we add husband's education to the IV list, we get two overidentifying restrictions, and nR1² = 1.11 (p-value = .574). Subject to the preceding cautions, it seems reasonable to add huseduc to the IV list, as this reduces the standard error of the 2SLS estimate: the 2SLS estimate on educ using all three instruments is .080 (se = .022), so this makes educ much more significant than when huseduc is not used as an IV (β̂educ = .061, se = .031).

When q = 1, a natural question is: How does the test obtained from the regression-based procedure compare with a test based on directly comparing the estimates? In fact, the two procedures are asymptotically the same. As a practical matter, it makes sense to compute the two IV estimates to see how they differ. More generally, when q ≥ 2, one can compare the 2SLS estimates using all IVs to the IV estimates using single instruments. By doing so, one can see if the various IV estimates are practically different, whether or not the overidentification test rejects or fails to reject.

In the previous example, we alluded to a general fact about 2SLS: under the standard 2SLS assumptions, adding instruments to the list improves the asymptotic efficiency of the 2SLS estimator. But this requires that any new instruments are in fact exogenous (otherwise, 2SLS will not even be consistent), and it is only an asymptotic result. With the typical sample sizes available, adding too many instruments, that is, increasing the number of overidentifying restrictions, can cause severe biases in 2SLS. A detailed discussion would take us too far afield. A nice illustration is given by Bound, Jaeger, and Baker (1995), who argue that the 2SLS estimates of the return to education obtained by Angrist and Krueger (1991), using many instrumental variables, are likely to be seriously biased even with hundreds of thousands of observations.

The overidentification test can be used whenever we have more instruments than we need. If we have just enough instruments, the model is said to be just identified, and the R-squared in part (ii) will be identically zero. As we mentioned earlier, we cannot test exogeneity of the instruments in the just identified case.

The test can be made robust to heteroskedasticity of arbitrary form; for details, see Wooldridge (2010, Chapter 5).
15.6 2SLS with Heteroskedasticity

Heteroskedasticity in the context of 2SLS raises essentially the same issues as with OLS. Most importantly, it is possible to obtain standard errors and test statistics that are asymptotically robust to heteroskedasticity of arbitrary and unknown form. In fact, expression (8.4) continues to be valid if the r̂ij are obtained as the residuals from regressing x̂ij on the other x̂ih, where the "^" denotes fitted values from the first stage regressions (for endogenous explanatory variables). Wooldridge (2010, Chapter 5) contains more details. Some software packages do this routinely.

We can also test for heteroskedasticity, using an analog of the Breusch-Pagan test that we covered in Chapter 8. Let û denote the 2SLS residuals, and let z1, z2, …, zm denote all the exogenous variables (including those used as IVs for the endogenous explanatory variables). Then, under reasonable assumptions [spelled out, for example, in Wooldridge (2010, Chapter 5)], an asymptotically valid statistic is the usual F statistic for joint significance in a regression of û² on z1, z2, …, zm. The null hypothesis of homoskedasticity is rejected if the zj are jointly significant.

If we apply this test to Example 15.8, using motheduc, fatheduc, and huseduc as instruments for educ, we obtain F(5,422) = 2.53 and p-value = .029. This is evidence of heteroskedasticity at the 5% level. We might want to compute heteroskedasticity-robust standard errors to account for this.

If we know how the error variance depends on the exogenous variables, we can use a weighted 2SLS procedure, essentially the same as in Section 8.4. After estimating a model for Var(u|z1, z2, …, zm), we divide the dependent variable, the explanatory variables, and all the instrumental variables for observation i by √ĥi, where ĥi denotes the estimated variance. (The constant, which is both an explanatory variable and an IV, is divided by √ĥi; see Section 8.4.) Then, we apply 2SLS on the transformed equation using the transformed instruments.
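The test is one auxiliary regression. A hedged sketch (the residuals here are simulated stand-ins; in practice they would come from a 2SLS fit like the ones above):

```python
# Sketch: Breusch-Pagan-style test after 2SLS -- regress squared 2SLS residuals
# on all exogenous variables and use the overall F statistic.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(7)
n = 1000
# Pretend Zall (constant + all exogenous variables) and u1_hat (2SLS residuals)
# come from a previous 2SLS step; here they are simulated for illustration.
Zall = sm.add_constant(rng.normal(size=(n, 4)))
u1_hat = rng.normal(size=n) * (1.0 + 0.5 * np.abs(Zall[:, 1]))  # heteroskedastic

aux = sm.OLS(u1_hat ** 2, Zall).fit()
print(aux.fvalue, aux.f_pvalue)  # reject homoskedasticity if jointly significant
```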
15.7 Applying 2SLS to Time Series Equations

When we apply 2SLS to time series data, many of the considerations that arose for OLS in Chapters 10, 11, and 12 are relevant. Write the structural equation for each time period as

yt = β0 + β1xt1 + … + βkxtk + ut,   [15.52]

where one or more of the explanatory variables xtj might be correlated with ut. Denote the set of exogenous variables by zt1, …, ztm:

E(ut) = 0,
Cov(ztj, ut) = 0, j = 1, …, m.

Any exogenous explanatory variable is also a ztj. For identification, it is necessary that m ≥ k (we have at least as many exogenous variables as explanatory variables).

The mechanics of 2SLS are identical for time series or cross-sectional data, but for time series data the statistical properties of 2SLS depend on the trending and correlation properties of the underlying sequences. In particular, we must be careful to include trends if we have trending dependent or explanatory variables. Since a time trend is exogenous, it can always serve as its own instrumental variable. The same is true of seasonal dummy variables, if monthly or quarterly data are used. Series that have strong persistence (have unit roots) must be used with care, just as with OLS. Often, differencing the equation is warranted before estimation, and this applies to the instruments as well.

Under analogs of the assumptions in Chapter 11 for the asymptotic properties of OLS, 2SLS using time series data is consistent and asymptotically normally distributed. In fact, if we replace the explanatory variables with the instrumental variables in stating the assumptions, we only need to add the identification assumptions for 2SLS. For example, the homoskedasticity assumption is stated as

E(ut² | zt1, …, ztm) = σ²,   [15.53]

and the no serial correlation assumption is stated as

E(ut·us | zt, zs) = 0 for all t ≠ s,   [15.54]

where zt denotes all exogenous variables at time t. A full statement of the assumptions is given in the chapter appendix. We will provide examples of 2SLS for time series problems in Chapter 16; see also Computer Exercise C4.

Exploring Further 15.4
A model to test the effect of growth in government spending on growth in output is
gGDPt = β0 + β1gGOVt + β2INVRATt + β3gLABt + ut,
where g indicates growth, GDP is real gross domestic product, GOV is real government spending, INVRAT is the ratio of gross domestic investment to GDP, and LAB is the size of the labor force. [See equation (6) in Ram (1986).] Under what assumptions would a dummy variable indicating whether the president in year t − 1 is a Republican be a suitable IV for gGOVt?

As in the case of OLS, the no serial correlation assumption can often be violated with time series data. Fortunately, it is very easy to test for AR(1) serial correlation. If we write ut = ρut−1 + et and plug this into equation (15.52), we get

yt = β0 + β1xt1 + … + βkxtk + ρut−1 + et, t ≥ 2.   [15.55]

To test H0: ρ = 0, we must replace ut−1 with the 2SLS residuals, ût−1. Further, if xtj is endogenous in (15.52), then it is endogenous in (15.55), so we still need to use an IV. Because et is uncorrelated with all past values of ut, ût−1 can be used as its own instrument.

Testing for AR(1) Serial Correlation after 2SLS:
(i) Estimate (15.52) by 2SLS and obtain the 2SLS residuals, ût.
(ii) Estimate

yt = β0 + β1xt1 + … + βkxtk + ρût−1 + errort, t = 2, …, n,

by 2SLS, using the same instruments from part (i), in addition to ût−1. Use the t statistic on ρ̂ to test H0: ρ = 0.

As with the OLS version of this test from Chapter 12, the t statistic only has asymptotic justification, but it tends to work well in practice. A heteroskedasticity-robust version can be used to guard against heteroskedasticity. Further, lagged residuals can be added to the equation to test for higher forms of serial correlation, using a joint F test.
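A minimal sketch of the AR(1) test, on a simulated series of our own design (one endogenous regressor, one strictly exogenous instrument). Point estimates from the manual stages match 2SLS; for exact inference, use a 2SLS routine, per the earlier caveat.

```python
# Sketch: testing for AR(1) serial correlation after 2SLS in a time series equation.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(8)
T = 300
z = rng.normal(size=T)                 # strictly exogenous instrument
e = rng.normal(size=T)
u = np.zeros(T)
for t in range(1, T):                  # AR(1) errors with rho = 0.5
    u[t] = 0.5 * u[t - 1] + e[t]
x = 0.8 * z + 0.5 * u + rng.normal(size=T)   # x is endogenous
y = 1.0 + 1.0 * x + u

# Step (i): 2SLS of y on x using z; keep the 2SLS residuals.
x_hat = sm.OLS(x, sm.add_constant(z)).fit().fittedvalues
b = sm.OLS(y, sm.add_constant(x_hat)).fit().params
u_hat = y - b[0] - b[1] * x

# Step (ii): re-estimate with u_hat(t-1) added; x is instrumented by z as
# before, and u_hat(t-1) acts as its own instrument.
xh = sm.OLS(x[1:], sm.add_constant(np.column_stack([z[1:], u_hat[:-1]]))).fit().fittedvalues
res = sm.OLS(y[1:], sm.add_constant(np.column_stack([xh, u_hat[:-1]]))).fit()
print(res.tvalues[-1])   # large |t| suggests AR(1) serial correlation
```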
What happens if we detect serial correlation? Some econometrics packages will compute standard errors that are robust to fairly general forms of serial correlation and heteroskedasticity. This is a nice, simple way to go if your econometrics package does this; the computations are very similar to those in Section 12.5 for OLS. See Wooldridge (1995) for formulas and other computational methods.

An alternative is to use the AR(1) model and correct for serial correlation. The procedure is similar to that for OLS and places additional restrictions on the instrumental variables. The quasi-differenced equation is the same as in equation (12.32):

ỹt = β0(1 − ρ) + β1x̃t1 + … + βkx̃tk + et, t ≥ 2,   [15.56]

where x̃tj = xtj − ρxt−1,j. (We can use the t = 1 observation just as in Section 12.3, but we omit that for simplicity here.) The question is: What can we use as instrumental variables? It seems natural to use the quasi-differenced instruments, z̃tj = ztj − ρzt−1,j. This only works, however, if in (15.52) the original error ut is uncorrelated with the instruments at times t, t − 1, and t + 1. That is, the instrumental variables must be strictly exogenous in (15.52). This rules out lagged dependent variables as IVs, for example. It also eliminates cases where future movements in the IVs react to current and past changes in the error, ut.

2SLS with AR(1) Errors:
(i) Estimate (15.52) by 2SLS and obtain the 2SLS residuals, ût, t = 1, 2, …, n.
(ii) Obtain ρ̂ from the regression of ût on ût−1, t = 2, …, n, and construct the quasi-differenced variables ỹt = yt − ρ̂yt−1, x̃tj = xtj − ρ̂xt−1,j, and z̃tj = ztj − ρ̂zt−1,j for t ≥ 2. (Remember, in most cases, some of the IVs will also be explanatory variables.)
(iii) Estimate (15.56), where ρ is replaced with ρ̂, by 2SLS, using the z̃tj as the instruments.

Assuming that (15.56) satisfies the 2SLS assumptions in the chapter appendix, the usual 2SLS test statistics are asymptotically valid. We can also use the first time period, as in Prais-Winsten estimation of the model with exogenous explanatory variables. The transformed variables in the first time period (the dependent variable, explanatory variables, and instrumental variables) are obtained simply by multiplying all first-period values by (1 − ρ̂²)^(1/2). (See also Section 12.3.)
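Steps (i) through (iii) are easy to chain together. The sketch below (same invented design as the previous block, dropping the first observation for simplicity) quasi-differences the variables and the instrument and re-runs 2SLS.

```python
# Sketch: 2SLS with AR(1) errors via quasi-differencing, steps (i)-(iii).
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(9)
T = 300
z = rng.normal(size=T)                 # strictly exogenous instrument
e = rng.normal(size=T)
u = np.zeros(T)
for t in range(1, T):
    u[t] = 0.5 * u[t - 1] + e[t]
x = 0.8 * z + 0.5 * u + rng.normal(size=T)
y = 1.0 + 1.0 * x + u

# Step (i): 2SLS residuals
x_hat = sm.OLS(x, sm.add_constant(z)).fit().fittedvalues
b = sm.OLS(y, sm.add_constant(x_hat)).fit().params
u_hat = y - b[0] - b[1] * x

# Step (ii): rho_hat from regressing u_hat on its lag (no constant),
# then quasi-difference the variables and the instrument
rho = sm.OLS(u_hat[1:], u_hat[:-1]).fit().params[0]
y_t = y[1:] - rho * y[:-1]
x_t = x[1:] - rho * x[:-1]
z_t = z[1:] - rho * z[:-1]

# Step (iii): 2SLS on the transformed equation with the transformed instrument
xq_hat = sm.OLS(x_t, sm.add_constant(z_t)).fit().fittedvalues
res = sm.OLS(y_t, sm.add_constant(xq_hat)).fit()
print(rho, res.params)   # intercept estimates beta0*(1 - rho); slope estimates beta1
```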
15.8 Applying 2SLS to Pooled Cross Sections and Panel Data

Applying instrumental variables methods to independently pooled cross sections raises no new difficulties. As with models estimated by OLS, we should often include time period dummy variables to allow for aggregate time effects. These dummy variables are exogenous, because the passage of time is exogenous, and so they act as their own instruments.

Example 15.9  Effect of Education on Fertility

In Example 13.1, we used the pooled cross section in FERTIL1 to estimate the effect of education on women's fertility, controlling for various other factors. As in Sander (1992), we allow for the possibility that educ is endogenous in the equation. As instrumental variables for educ, we use mother's and father's education levels (meduc, feduc). The 2SLS estimate of βeduc is −.153 (se = .039), compared with the OLS estimate −.128 (se = .018). The 2SLS estimate shows a somewhat larger effect of education on fertility, but the 2SLS standard error is over twice as large as the OLS standard error. (In fact, the 95% confidence interval based on 2SLS easily contains the OLS estimate.) The OLS and 2SLS estimates of βeduc are not statistically different, as can be seen by testing for endogeneity of educ as in Section 15.5: when the reduced form residual, v̂2, is included with the other regressors in Table 13.1 (including educ), its t statistic is .702, which is not significant at any reasonable level. Therefore, in this case, we conclude that the difference between 2SLS and OLS could be entirely due to sampling error.

Instrumental variables estimation can be combined with panel data methods, particularly first differencing, to estimate parameters consistently in the presence of unobserved effects and endogeneity in one or more time-varying explanatory variables. The following simple example illustrates this combination of methods.

Example 15.10  Job Training and Worker Productivity

Suppose we want to estimate the effect of another hour of job training on worker productivity. For the two years 1987 and 1988, consider the simple panel data model

log(scrapit) = β0 + δ0d88t + β1hrsempit + ai + uit, t = 1, 2,

where scrapit is firm i's scrap rate in year t, and hrsempit is hours of job training per employee. As usual, we allow different year intercepts and a constant, unobserved firm effect, ai.

For the reasons discussed in Section 13.2, we might be concerned that hrsempit is correlated with ai, the latter of which contains unmeasured worker ability. As before, we difference to remove ai:

Δlog(scrapi) = δ0 + β1Δhrsempi + Δui.   [15.57]

Normally, we would estimate this equation by OLS. But what if Δui is correlated with Δhrsempi? For example, a firm might hire more skilled workers, while at the same time reducing the level of job training. In this case, we need an instrumental variable for Δhrsempi. Generally, such an IV would be hard to find, but we can exploit the fact that some firms received job training grants in 1988. If we assume that grant designation is uncorrelated with Δui, something that is reasonable because the grants were given at the beginning of 1988, then Δgranti is valid as an IV, provided Δhrsemp and Δgrant are correlated. Using the data in JTRAIN differenced between 1987 and 1988, the first stage regression is

Δhrsemp^ = .51 + 27.88 Δgrant
           (1.56)  (3.13)
n = 45, R² = .392.

This confirms that the change in hours of job training per employee is strongly positively related to receiving a job training grant in 1988. In fact, receiving a job training grant increased per-employee training by almost 28 hours, and grant designation accounted for almost 40% of the variation in Δhrsemp. Two stage least squares estimation of (15.57) gives

Δlog(scrap)^ = −.033 − .014 Δhrsemp
              (.127)   (.008)
n = 45, R² = .016.

This means that 10 more hours of job training per worker are estimated to reduce the scrap rate by about 14%. (For the firms in the sample, the average amount of job training in 1988 was about 17 hours per worker, with a minimum of zero and a maximum of 88.) For comparison, OLS estimation of (15.57) gives β̂1 = −.0076 (se = .0045), so the 2SLS estimate of β1 is almost twice as large in magnitude and is slightly more statistically significant.
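The difference-then-instrument logic is compact in code. Below is a hedged sketch in the spirit of Example 15.10, on simulated two-period data of our own design (not the JTRAIN file): differencing removes the firm effect, and grant receipt instruments the change in training.

```python
# Sketch: first-differencing to remove the firm effect, then IV on the
# differenced equation (grant receipt as the instrument).
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(10)
n = 500
a = rng.normal(size=n)                     # unobserved firm effect
grant = (rng.uniform(size=n) < 0.4).astype(float)  # grant received in year 2 only
u1, u2 = rng.normal(size=(2, n))
d_u = u2 - u1
hrs1 = 15 - 2.0 * a + rng.normal(size=n)
hrs2 = hrs1 + 25.0 * grant - 3.0 * d_u + rng.normal(size=n)  # training reacts to d_u
lscrap1 = 2.0 - 0.02 * hrs1 + a + u1
lscrap2 = 1.5 - 0.02 * hrs2 + a + u2       # year-intercept shift; true beta1 = -0.02

# Differencing removes a_i; OLS on the differences is still biased here
# because d_hrsemp is correlated with d_u, so we instrument with grant.
d_y = lscrap2 - lscrap1
d_h = hrs2 - hrs1
ols = sm.OLS(d_y, sm.add_constant(d_h)).fit()
h_hat = sm.OLS(d_h, sm.add_constant(grant)).fit().fittedvalues   # first stage
iv = sm.OLS(d_y, sm.add_constant(h_hat)).fit()
print(ols.params[1], iv.params[1])   # OLS biased; IV slope near -0.02
```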
When T ≥ 3, the differenced equation may contain serial correlation. The same test and correction for AR(1) serial correlation from Section 15.7 can be used, where all regressions are pooled across i as well as t. Because we do not want to lose an entire time period, the Prais-Winsten transformation should be used for the initial time period.

Unobserved effects models containing lagged dependent variables also require IV methods for consistent estimation. The reason is that, after differencing, Δyi,t−1 is correlated with Δuit, because yi,t−1 and ui,t−1 are correlated. We can use two or more lags of y as IVs for Δyi,t−1. [See Wooldridge (2010, Chapter 11) for details.]

Instrumental variables after differencing can be used on matched pairs samples as well. Ashenfelter and Krueger (1994) differenced the wage equation across twins to eliminate unobserved ability:

log(wage2) − log(wage1) = δ0 + β1(educ2,2 − educ1,1) + (u2 − u1),

where educ1,1 is years of schooling for the first twin as reported by the first twin, and educ2,2 is years of schooling for the second twin as reported by the second twin. To account for possible measurement error in the self-reported schooling measures, Ashenfelter and Krueger used (educ2,1 − educ1,2) as an IV for (educ2,2 − educ1,1), where educ2,1 is years of schooling for the second twin as reported by the first twin, and educ1,2 is years of schooling for the first twin as reported by the second twin. The IV estimate of β1 is .167 (t = 3.88), compared with the OLS estimate on the first differences of .092 (t = 3.83) [see Ashenfelter and Krueger (1994, Table 3)].

Summary

In Chapter 15, we have introduced the method of instrumental variables as a way to estimate the parameters in a linear model consistently when one or more explanatory variables are endogenous. An instrumental variable must have two properties: (1) it must be exogenous, that is, uncorrelated with the error term of the structural equation; (2) it must be partially correlated with the endogenous explanatory variable. Finding a variable with these two properties is usually challenging.

The method of two stage least squares, which allows for more instrumental variables than we have explanatory variables, is used routinely in the empirical social sciences. When used properly, it can allow us to estimate ceteris paribus effects in the presence of endogenous explanatory variables. This is true in cross-sectional, time series, and panel data applications. But when instruments are poor, which means they are correlated with the error term, only weakly correlated with the endogenous explanatory variable, or both, then 2SLS can be worse than OLS.

When we have valid instrumental variables, we can test whether an explanatory variable is endogenous, using the test in Section 15.5. In addition, though we can never test whether all IVs are exogenous, we can test that at least some of them are, assuming that we have more instruments than we need for consistent estimation (that is, the model is overidentified). Heteroskedasticity and serial correlation can be tested for and dealt with using methods similar to the case of models with exogenous explanatory variables.

In this chapter, we used omitted variables and measurement error to illustrate the method of instrumental variables. IV methods are also indispensable for simultaneous equations models, which we will cover in Chapter 16.

Key Terms

Endogenous Explanatory Variables; Errors-in-Variables; Exclusion Restrictions; Exogenous Explanatory Variables; Exogenous Variables; Identification; Instrument; Instrumental Variable; Instrumental Variables (IV) Estimator; Instrument Exogeneity; Instrument Relevance; Natural Experiment; Omitted Variables; Order Condition; Overidentifying Restrictions; Rank Condition; Reduced Form Equation; Structural Equation; Two Stage Least Squares (2SLS) Estimator; Weak Instruments.

Problems

1  Consider a simple model to estimate the effect of personal computer (PC) ownership on college grade point average for graduating seniors at a large public university:
GPA = β0 + β1PC + u,

where PC is a binary variable indicating PC ownership.
(i) Why might PC ownership be correlated with u?
(ii) Explain why PC is likely to be related to parents' annual income. Does this mean parental income is a good IV for PC? Why or why not?
(iii) Suppose that, four years ago, the university gave grants to buy computers to roughly one-half of the incoming students, and the students who received grants were randomly chosen. Carefully explain how you would use this information to construct an instrumental variable for PC.

2  Suppose that you wish to estimate the effect of class attendance on student performance, as in Example 6.3. A basic model is

stndfnl = β0 + β1atndrte + β2priGPA + β3ACT + u,

where the variables are defined as in Chapter 6.
(i) Let dist be the distance from the students' living quarters to the lecture hall. Do you think dist is uncorrelated with u?
(ii) Assuming that dist and u are uncorrelated, what other assumption must dist satisfy to be a valid IV for atndrte?
(iii) Suppose, as in equation (6.18), we add the interaction term priGPA·atndrte:

stndfnl = β0 + β1atndrte + β2priGPA + β3ACT + β4priGPA·atndrte + u.

If atndrte is correlated with u, then, in general, so is priGPA·atndrte. What might be a good IV for priGPA·atndrte? [Hint: If E(u|priGPA, ACT, dist) = 0, as happens when priGPA, ACT, and dist are all exogenous, then any function of priGPA and dist is uncorrelated with u.]

3  Consider the simple regression model y = β0 + β1x + u, and let z be a binary instrumental variable for x. Use (15.10) to show that the IV estimator β̂1 can be written as

β̂1 = (ȳ1 − ȳ0)/(x̄1 − x̄0),

where ȳ0 and x̄0 are the sample averages of yi and xi over the part of the sample with zi = 0, and where ȳ1 and x̄1 are the sample averages of yi and xi over the part of the sample with zi = 1. This estimator, known as a grouping estimator, was first suggested by Wald (1940).

4  Suppose that, for a given state in the United States, you wish to use annual time series data to estimate the effect of the state-level minimum wage on the employment of those 18 to 25 years old (EMP). A simple model is

gEMPt = β0 + β1gMINt + β2gPOPt + β3gGSPt + β4gGDPt + ut,

where MINt is the minimum wage, in real dollars; POPt is the population from 18 to 25 years old; GSPt is gross state product; and GDPt is U.S. gross domestic product. The g prefix indicates the growth rate from year t − 1 to year t, which would typically be approximated by the difference in the logs.
(i) If we are worried that the state chooses its minimum wage partly based on unobserved (to us) factors that affect youth employment, what is the problem with OLS estimation?
(ii) Let USMINt be the U.S. minimum wage, which is also measured in real terms. Do you think gUSMINt is uncorrelated with ut?
(iii) By law, any state's minimum wage must be at least as large as the U.S. minimum. Explain why this makes gUSMINt a potential IV candidate for gMINt.

5  Refer to equations (15.19) and (15.20). Assume that σu = σx, so that the population variation in the error term is the same as it is in x. Suppose that the instrumental variable, z, is slightly correlated with u: Corr(z, u) = .1.
Suppose also that z and x have a somewhat stronger correlation: Corr(z, x) = .2.
(i) What is the asymptotic bias in the IV estimator?
(ii) How much correlation would have to exist between x and u before OLS has more asymptotic bias than 2SLS?

6  (i) In the model with one endogenous explanatory variable, one exogenous explanatory variable, and one extra exogenous variable, take the reduced form for y2, (15.26), and plug it into the structural equation (15.22). This gives the reduced form for y1:

y1 = α0 + α1z1 + α2z2 + v1.

Find the αj in terms of the βj and the πj.
(ii) Find the reduced form error, v1, in terms of u1, v2, and the parameters.
(iii) How would you consistently estimate the αj?

7  The following is a simple model to measure the effect of a school choice program on standardized test performance [see Rouse (1998) for motivation and Computer Exercise C11 for an analysis of a subset of Rouse's data]:

score = β0 + β1choice + β2faminc + u1,

where score is the score on a statewide test, choice is a binary variable indicating whether a student attended a choice school in the last year, and faminc is family income. The IV for choice is grant, the dollar amount granted to students to use for tuition at choice schools. The grant amount differed by family income level, which is why we control for faminc in the equation.
(i) Even with faminc in the equation, why might choice be correlated with u1?
(ii) If, within each income class, the grant amounts were assigned randomly, is grant uncorrelated with u1?
(iii) Write the reduced form equation for choice. What is needed for grant to be partially correlated with choice?
(iv) Write the reduced form equation for score. Explain why this is useful. (Hint: How do you interpret the coefficient on grant?)

8  Suppose you want to test whether girls who attend a girls' high school do better in math than girls who attend coed schools. You have a random sample of senior high school girls from a state in the United States, and score is the score on a standardized math test. Let girlhs be a dummy variable indicating whether a student attends a girls' high school.
(i) What other factors would you control for in the equation? (You should be able to reasonably collect data on these factors.)
(ii) Write an equation relating score to girlhs and the other factors you listed in part (i).
(iii) Suppose that parental support and motivation are unmeasured factors in the error term in part (ii). Are these likely to be correlated with girlhs? Explain.
(iv) Discuss the assumptions needed for the number of girls' high schools within a 20-mile radius of a girl's home to be a valid IV for girlhs.
(v) Suppose that, when you estimate the reduced form for girlhs, you find that the coefficient on numghs (the number of girls' high schools within a 20-mile radius) is negative and statistically significant. Would you feel comfortable proceeding with IV estimation where numghs is used as an IV for girlhs? Explain.

9  Suppose that, in equation (15.8), you do not have a good instrumental variable candidate for skipped.
But you have two other pieces of information on students: combined SAT score and cumulative GPA prior to the semester. What would you do instead of IV estimation?

10  In a recent article, Evans and Schwab (1995) studied the effects of attending a Catholic high school on the probability of attending college. For concreteness, let college be a binary variable equal to unity if a student attends college, and zero otherwise. Let CathHS be a binary variable equal to one if the student attends a Catholic high school. A linear probability model is

college = β0 + β1CathHS + other factors + u,

where the other factors include gender, race, family income, and parental education.
(i) Why might CathHS be correlated with u?
(ii) Evans and Schwab have data on a standardized test score taken when each student was a sophomore. What can be done with this variable to improve the ceteris paribus estimate of attending a Catholic high school?
(iii) Let CathRel be a binary variable equal to one if the student is Catholic. Discuss the two requirements needed for this to be a valid IV for CathHS in the preceding equation. Which of these can be tested?
(iv) Not surprisingly, being Catholic has a significant positive effect on attending a Catholic high school. Do you think CathRel is a convincing instrument for CathHS?

11  Consider a simple time series model where the explanatory variable has classical measurement error:

yt = β0 + β1xt* + ut   [15.58]
xt = xt* + et,

where ut has zero mean and is uncorrelated with xt* and et. We observe yt and xt only. Assume that et has zero mean and is uncorrelated with xt*, and that xt* also has a zero mean (this last assumption is only to simplify the algebra).
(i) Write xt* = xt − et and plug this into (15.58). Show that the error term in the new equation, say, vt, is negatively correlated with xt if β1 > 0. What does this imply about the OLS estimator of β1 from the regression of yt on xt?
(ii) In addition to the previous assumptions, assume that ut and et are uncorrelated with all past values of xt* and et; in particular, with xt−1* and et−1. Show that E(xt−1·vt) = 0, where vt is the error term in the model from part (i).
(iii) Are xt and xt−1 likely to be correlated? Explain.
(iv) What do parts (ii) and (iii) suggest as a useful strategy for consistently estimating β0 and β1?

Computer Exercises

C1  Use the data in WAGE2 for this exercise.
(i) In Example 15.2, if sibs is used as an instrument for educ, the IV estimate of the return to education is .122. To convince yourself that using sibs as an IV for educ is not the same as just plugging sibs in for educ and running an OLS regression, run the regression of log(wage) on sibs and explain your findings.
(ii) The variable brthord is birth order (brthord is one for a first-born child, two for a second-born child, and so on). Explain why educ and brthord might be negatively correlated. Regress educ on brthord to determine whether there is a statistically significant negative correlation.
(iii) Use brthord as an IV for educ in equation (15.1). Report and interpret the results.
(iv) Now, suppose that we include number of siblings as an explanatory variable in the wage equation; this controls for family background, to some extent:

log(wage) = β0 + β1educ + β2sibs + u.

Suppose that we want to use brthord as an IV for educ, assuming that sibs is exogenous. The reduced form for educ is

educ = π0 + π1sibs + π2brthord + v.

State and test the identification assumption.
(v) Estimate the equation from part (iv) using brthord as an IV for educ (and sibs as its own IV). Comment on the standard errors for β̂educ and β̂sibs.
(vi) Using the fitted values from part (iv), educ^, compute the correlation between educ^ and sibs. Use this result to explain your findings from part (v).

C2  The data in FERTIL2 include, for women in Botswana during 1988, information on number of children, years of education, age, and religious and economic status variables.
(i) Estimate the model

children = β0 + β1educ + β2age + β3age² + u

by OLS and interpret the estimates. In particular, holding age fixed, what is the estimated effect of another year of education on fertility? If 100 women receive another year of education, how many fewer children are they expected to have?
(ii) The variable frsthalf is a dummy variable equal to one if the woman was born during the first six months of the year. Assuming that frsthalf is uncorrelated with the error term from part (i), show that frsthalf is a reasonable IV candidate for educ. (Hint: You need to do a regression.)
(iii) Estimate the model from part (i) by using frsthalf as an IV for educ. Compare the estimated effect of education with the OLS estimate from part (i).
(iv) Add the binary variables electric, tv, and bicycle to the model and assume these are exogenous. Estimate the equation by OLS and 2SLS and compare the estimated coefficients on educ. Interpret the coefficient on tv and explain why television ownership has a negative effect on fertility.

C3  Use the data in CARD for this exercise.
(i) The equation we estimated in Example 15.4 can be written as

log(wage) = β0 + β1educ + β2exper + … + u,

where the other explanatory variables are listed in Table 15.1. In order for IV to be consistent, the IV for educ, nearc4, must be uncorrelated with u. Could nearc4 be correlated with things in the error term, such as unobserved ability? Explain.
(ii) For a subsample of the men in the data set, an IQ score is available. Regress IQ on nearc4 to check whether average IQ scores vary by whether the man grew up near a four-year college. What do you conclude?
(iii) Now, regress IQ on nearc4, smsa66, and the 1966 regional dummy variables reg662 through reg669. Are IQ and nearc4 related after the geographic dummy variables have been partialled out? Reconcile this with your findings from part (ii).
(iv) From parts (ii) and (iii), what do you conclude about the importance of controlling for smsa66 and the 1966 regional dummies in the log(wage) equation?

C4  Use the data in INTDEF for this exercise. A simple equation relating the three-month T-bill rate to the inflation rate (constructed from the Consumer Price Index) is

i3t = β0 + β1inft + ut.

(i) Estimate this equation by OLS, omitting the first time period for later comparisons. Report the results in the usual form.
C3 Use the data in CARD for this exercise.
(i) The equation we estimated in Example 15.4 can be written as

$$\log(wage) = \beta_0 + \beta_1 educ + \beta_2 exper + \cdots + u,$$

where the other explanatory variables are listed in Table 15.1. In order for IV to be consistent, the IV for educ, nearc4, must be uncorrelated with $u$. Could nearc4 be correlated with things in the error term, such as unobserved ability? Explain.
(ii) For a subsample of the men in the data set, an IQ score is available. Regress IQ on nearc4 to check whether average IQ scores vary by whether the man grew up near a four-year college. What do you conclude?
(iii) Now, regress IQ on nearc4, smsa66, and the 1966 regional dummy variables reg662 through reg669. Are IQ and nearc4 related after the geographic dummy variables have been partialled out? Reconcile this with your findings from part (ii).
(iv) From parts (ii) and (iii), what do you conclude about the importance of controlling for smsa66 and the 1966 regional dummies in the log(wage) equation?

C4 Use the data in INTDEF for this exercise. A simple equation relating the three-month T-bill rate to the inflation rate (constructed from the Consumer Price Index) is

$$i3_t = \beta_0 + \beta_1 inf_t + u_t.$$

(i) Estimate this equation by OLS, omitting the first time period for later comparisons. Report the results in the usual form.
(ii) Some economists feel that the Consumer Price Index mismeasures the true rate of inflation, so that the OLS from part (i) suffers from measurement error bias. Reestimate the equation from part (i), using $inf_{t-1}$ as an IV for $inf_t$. How does the IV estimate of $\beta_1$ compare with the OLS estimate?
(iii) Now, first difference the equation:

$$\Delta i3_t = \beta_0 + \beta_1 \Delta inf_t + \Delta u_t.$$

Estimate this by OLS and compare the estimate of $\beta_1$ with the previous estimates.
(iv) Can you use $\Delta inf_{t-1}$ as an IV for $\Delta inf_t$ in the differenced equation in part (iii)? Explain. (Hint: Are $\Delta inf_t$ and $\Delta inf_{t-1}$ sufficiently correlated?)

C5 Use the data in CARD for this exercise.
(i) In Table 15.1, the difference between the IV and OLS estimates of the return to education is economically important. Obtain the reduced form residuals, $\hat{v}_2$, from the reduced form regression of educ on nearc4, exper, exper², black, smsa, south, smsa66, and reg662 through reg669 (see Table 15.1). Use these to test whether educ is exogenous; that is, determine if the difference between OLS and IV is statistically significant.
(ii) Estimate the equation by 2SLS, adding nearc2 as an instrument. Does the coefficient on educ change much?
(iii) Test the single overidentifying restriction from part (ii).

C6 Use the data in MURDER for this exercise. The variable mrdrte is the murder rate, that is, the number of murders per 100,000 people. The variable exec is the total number of prisoners executed for the current and prior two years; unem is the state unemployment rate.
(i) How many states executed at least one prisoner in 1991, 1992, or 1993? Which state had the most executions?
(ii) Using the two years 1990 and 1993, do a pooled regression of mrdrte on d93, exec, and unem. What do you make of the coefficient on exec?
(iii) Using the changes from 1990 to 1993 only (for a total of 51 observations), estimate the equation

$$\Delta mrdrte = \delta_0 + \beta_1 \Delta exec + \beta_2 \Delta unem + \Delta u$$

by OLS and report the results in the usual form. Now, does capital punishment appear to have a deterrent effect?
(iv) The change in executions may be at least partly related to changes in the expected murder rate, so that $\Delta exec$ is correlated with $\Delta u$ in part (iii). It might be reasonable to assume that $\Delta exec_{-1}$ is uncorrelated with $\Delta u$. (After all, $\Delta exec_{-1}$ depends on executions that occurred three or more years ago.) Regress $\Delta exec$ on $\Delta exec_{-1}$ to see if they are sufficiently correlated; interpret the coefficient on $\Delta exec_{-1}$.
(v) Reestimate the equation from part (iii), using $\Delta exec_{-1}$ as an IV for $\Delta exec$. (Assume that $\Delta unem$ is exogenous.) How do your conclusions change from part (iii)?
C7 Use the data in PHILLIPS for this exercise.
(i) In Example 11.5, we estimated an expectations augmented Phillips curve of the form

$$\Delta inf_t = \beta_0 + \beta_1 unem_t + e_t,$$

where $\Delta inf_t = inf_t - inf_{t-1}$. In estimating this equation by OLS, we assumed that the supply shock, $e_t$, was uncorrelated with $unem_t$. If this is false, what can be said about the OLS estimator of $\beta_1$?
(ii) Suppose that $e_t$ is unpredictable given all past information: $E(e_t \mid inf_{t-1}, unem_{t-1}, \ldots) = 0$. Explain why this makes $unem_{t-1}$ a good IV candidate for $unem_t$.
(iii) Regress $unem_t$ on $unem_{t-1}$. Are $unem_t$ and $unem_{t-1}$ significantly correlated?
(iv) Estimate the expectations augmented Phillips curve by IV. Report the results in the usual form and compare them with the OLS estimates from Example 11.5.

C8 Use the data in 401KSUBS for this exercise. The equation of interest is a linear probability model:

$$pira = \beta_0 + \beta_1 p401k + \beta_2 inc + \beta_3 inc^2 + \beta_4 age + \beta_5 age^2 + u.$$

The goal is to test whether there is a tradeoff between participating in a 401(k) plan and having an individual retirement account (IRA). Therefore, we want to estimate $\beta_1$.
(i) Estimate the equation by OLS and discuss the estimated effect of p401k.
(ii) For the purposes of estimating the ceteris paribus tradeoff between participation in two different types of retirement savings plans, what might be a problem with ordinary least squares?
(iii) The variable e401k is a binary variable equal to one if a worker is eligible to participate in a 401(k) plan. Explain what is required for e401k to be a valid IV for p401k. Do these assumptions seem reasonable?
(iv) Estimate the reduced form for p401k and verify that e401k has significant partial correlation with p401k. Since the reduced form is also a linear probability model, use a heteroskedasticity-robust standard error.
(v) Now, estimate the structural equation by IV and compare the estimate of $\beta_1$ with the OLS estimate. Again, you should obtain heteroskedasticity-robust standard errors.
(vi) Test the null hypothesis that p401k is in fact exogenous, using a heteroskedasticity-robust test.

C9 The purpose of this exercise is to compare the estimates and standard errors obtained by correctly using 2SLS with those obtained using inappropriate procedures. Use the data file WAGE2.
(i) Use a 2SLS routine to estimate the equation

$$\log(wage) = \beta_0 + \beta_1 educ + \beta_2 exper + \beta_3 tenure + \beta_4 black + u,$$

where sibs is the IV for educ. Report the results in the usual form.
(ii) Now, manually carry out 2SLS. That is, first regress $educ_i$ on $sibs_i$, $exper_i$, $tenure_i$, and $black_i$ and obtain the fitted values, $\widehat{educ}_i$, $i = 1, \ldots, n$. Then, run the second stage regression of $\log(wage_i)$ on $\widehat{educ}_i$, $exper_i$, $tenure_i$, and $black_i$, $i = 1, \ldots, n$. Verify that the $\hat{\beta}_j$ are identical to those obtained from part (i), but that the standard errors are somewhat different. The standard errors obtained from the second stage regression when manually carrying out 2SLS are generally inappropriate.
(iii) Now, use the following two-step procedure, which generally yields inconsistent parameter estimates of the $\beta_j$, and not just inconsistent standard errors. In step one, regress $educ_i$ on $sibs_i$ only and obtain the fitted values, say $\widetilde{educ}_i$. (Note that this is an incorrect first stage regression.) Then, in the second step, run the regression of $\log(wage_i)$ on $\widetilde{educ}_i$, $exper_i$, $tenure_i$, and $black_i$, $i = 1, \ldots, n$. How does the estimate from this incorrect, two-step procedure compare with the correct 2SLS estimate of the return to education?
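A compact simulation of the comparison C9 asks for, under an invented data-generating process (the variable roles mimic educ, exper, and sibs, but nothing below is estimated from WAGE2):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 20000
z  = rng.normal(size=n)                           # instrument (plays the role of sibs)
x2 = 0.6 * z + rng.normal(size=n)                 # exogenous regressor correlated with z
c  = rng.normal(size=n)                           # unobserved "ability"
x1 = 0.5 * z + 0.5 * x2 + c + rng.normal(size=n)  # endogenous regressor (educ)
y  = 1.0 + 0.1 * x1 + 0.2 * x2 + 0.8 * c + rng.normal(size=n)

def ols(X, y):
    return np.linalg.lstsq(X, y, rcond=None)[0]

one = np.ones(n)
Z_full = np.column_stack([one, z, x2])            # ALL exogenous variables

# Part (ii): correct manual 2SLS; the coefficients match a canned 2SLS routine,
# but second-stage standard errors are wrong (they use residuals built from x1hat)
x1hat = Z_full @ ols(Z_full, x1)
b_2sls = ols(np.column_stack([one, x1hat, x2]), y)

# Part (iii): incorrect two-step; the first stage omits x2, so the estimator
# is inconsistent whenever the instrument is correlated with x2
x1bad = np.column_stack([one, z]) @ ols(np.column_stack([one, z]), x1)
b_bad = ols(np.column_stack([one, x1bad, x2]), y)

print("correct 2SLS coef on x1:", round(b_2sls[1], 3))  # near the true 0.1
print("incorrect two-step coef:", round(b_bad[1], 3))   # biased (about 0.06 here)
```

The design deliberately makes the instrument correlated with the included exogenous regressor; that is exactly the situation in which omitting the exogenous regressors from the first stage produces an inconsistent second step.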
C10 Use the data in HTV for this exercise.
(i) Run a simple OLS regression of log(wage) on educ. Without controlling for other factors, what is the 95% confidence interval for the return to another year of education?
(ii) The variable ctuit, in thousands of dollars, is the change in college tuition facing students from age 17 to age 18. Show that educ and ctuit are essentially uncorrelated. What does this say about ctuit as a possible IV for educ in a simple regression analysis?
(iii) Now, add to the simple regression model in part (i) a quadratic in experience and a full set of regional dummy variables for current residence and residence at age 18. Also include the urban indicators for current and age 18 residences. What is the estimated return to a year of education?
(iv) Again using ctuit as a potential IV for educ, estimate the reduced form for educ. (Naturally, the reduced form for educ now includes the explanatory variables in part (iii).) Show that ctuit is now statistically significant in the reduced form for educ.
(v) Estimate the model from part (iii) by IV, using ctuit as an IV for educ. How does the confidence interval for the return to education compare with the OLS CI from part (iii)?
(vi) Do you think the IV procedure from part (v) is convincing?

C11 The data set in VOUCHER, which is a subset of the data used in Rouse (1998), can be used to estimate the effect of school choice on academic achievement. Attendance at a choice school was paid for by a voucher, which was determined by a lottery among those who applied. The data subset was chosen so that any student in the sample has a valid 1994 math test score (the last year available in Rouse's sample). Unfortunately, as pointed out by Rouse, many students have missing test scores, possibly due to attrition (that is, leaving the Milwaukee public school district). These data include students who applied to the voucher program and were accepted, students who applied and were not accepted, and students who did not apply. Therefore, even though the vouchers were chosen by lottery among those who applied, we do not necessarily have a random sample from a population where being selected for a voucher has been randomly determined. An important consideration is that students who never applied to the program may be systematically different from those who did, and in ways that we cannot know based on the data.

Rouse (1998) uses panel data methods of the kind we discussed in Chapter 14 to allow student fixed effects; she also uses instrumental variables methods. This problem asks you to do a cross-sectional analysis where winning the lottery for a voucher acts as an instrumental variable for attending a choice school. Actually, because we have multiple years of data on each student, we construct two variables. The first, choiceyrs, is the number of years from 1991 to 1994 that a student attended a choice school; this variable ranges from zero to four. The variable selectyrs indicates the number of years a student was selected for a voucher. If the student applied for the program in 1990 and received a voucher, then selectyrs = 4; if he or she applied in 1991 and received a voucher, then selectyrs = 3; and so on. The outcome of interest is mnce, the student's percentile score on a math test administered in 1994.
(i) Of the 990 students in the sample, how many were never awarded a voucher? How many had a voucher available for four years? How many students actually attended a choice school for four years?
(ii) Run a simple regression of choiceyrs on selectyrs. Are these variables related in the direction you expected? How strong is the relationship? Is selectyrs a sensible IV candidate for choiceyrs?
(iii) Run a simple regression of mnce on choiceyrs. What do you find? Is this what you expected? What happens if you add the variables black, hispanic, and female?
(iv) Why might choiceyrs be endogenous in an equation such as

$$mnce = \beta_0 + \beta_1 choiceyrs + \beta_2 black + \beta_3 hispanic + \beta_4 female + u_1?$$
(v) Estimate the equation in part (iv) by instrumental variables, using selectyrs as the IV for choiceyrs. Does using IV produce a positive effect of attending a choice school? What do you make of the coefficients on the other explanatory variables?
(vi) To control for the possibility that prior achievement affects participating in the lottery (as well as predicting attrition), add mnce90 (the math score in 1990) to the equation in part (iv). Estimate the equation by OLS and IV, and compare the results for $\beta_1$. For the IV estimate, how much is each year in a choice school worth on the math percentile score? Is this a practically large effect?
(vii) Why is the analysis from part (vi) not entirely convincing? (Hint: Compared with part (v), what happens to the number of observations, and why?)
(viii) The variables choiceyrs1, choiceyrs2, and so on are dummy variables indicating the different number of years a student could have been in a choice school (from 1991 to 1994). The dummy variables selectyrs1, selectyrs2, and so on have a similar definition, but for being selected from the lottery. Estimate the equation

$$mnce = \beta_0 + \beta_1 choiceyrs1 + \beta_2 choiceyrs2 + \beta_3 choiceyrs3 + \beta_4 choiceyrs4 + \beta_5 black + \beta_6 hispanic + \beta_7 female + \beta_8 mnce90 + u_1$$

by IV, using as instruments the four selectyrs dummy variables. (As before, the variables black, hispanic, and female act as their own IVs.) Describe your findings. Do they make sense?

C12 Use the data in CATHOLIC to answer this question. The model of interest is

$$math12 = \beta_0 + \beta_1 cathhs + \beta_2 lfaminc + \beta_3 motheduc + \beta_4 fatheduc + u,$$

where cathhs is a binary indicator for whether a student attends a Catholic high school.
(i) How many students are in the sample? What percentage of these students attend a Catholic high school?
(ii) Estimate the above equation by OLS. What is the estimate of $\beta_1$? What is its 95% confidence interval?
(iii) Using parcath as an instrument for cathhs, estimate the reduced form for cathhs. What is the $t$ statistic for parcath? Is there evidence of a weak instrument problem?
(iv) Estimate the above equation by IV, using parcath as an IV for cathhs. How do the estimate and 95% CI compare with the OLS quantities?
(v) Test the null hypothesis that cathhs is exogenous. What is the p-value of the test?
(vi) Suppose you add the interaction between cathhs and motheduc to the above model. Why is it generally endogenous? Why is $parcath \cdot motheduc$ a good IV candidate for $cathhs \cdot motheduc$?
(vii) Before you create the interactions in part (vi), first find the sample average of motheduc and create $cathhs \cdot (motheduc - \overline{motheduc})$ and $parcath \cdot (motheduc - \overline{motheduc})$. Add the first interaction to the model and use the second as an IV. (Of course, cathhs is also instrumented.) Is the interaction term statistically significant?
(viii) Compare the coefficient on cathhs in part (vii) to that in part (iv). Is including the interaction important for estimating the average partial effect?
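For parts (vi) through (viii) of C12, the only new mechanical step is building the centered interaction and its instrument. A minimal sketch follows; the data frame below is a simulated placeholder that borrows the exercise's column names, not the actual CATHOLIC data:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the CATHOLIC data, just so the sketch runs
rng = np.random.default_rng(3)
n = 1000
df = pd.DataFrame({
    "cathhs": rng.integers(0, 2, n),
    "parcath": rng.integers(0, 2, n),
    "motheduc": rng.normal(13.0, 2.0, n),
})

mbar = df["motheduc"].mean()                                # sample average, as in part (vii)
df["cathhs_dm"] = df["cathhs"] * (df["motheduc"] - mbar)    # endogenous interaction
df["parcath_dm"] = df["parcath"] * (df["motheduc"] - mbar)  # its IV candidate

# A 2SLS routine would then instrument (cathhs, cathhs_dm) with
# (parcath, parcath_dm). Centering motheduc makes the coefficient on cathhs
# the estimated effect at the average mother's education, which is why
# part (viii) compares it to the no-interaction estimate.
print(df[["cathhs_dm", "parcath_dm"]].head())
```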
Appendix 15A

15A.1 Assumptions for Two Stage Least Squares

This appendix covers the assumptions under which 2SLS has desirable large sample properties. We first state the assumptions for cross-sectional applications under random sampling. Then we discuss what needs to be added for them to apply to time series and panel data.

15A.2 Assumption 2SLS.1 (Linear in Parameters)

The model in the population can be written as

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k + u,$$

where $\beta_0, \beta_1, \ldots, \beta_k$ are the unknown parameters (constants) of interest and $u$ is an unobserved random error or random disturbance term. The instrumental variables are denoted as $z_j$.

It is worth emphasizing that Assumption 2SLS.1 is virtually identical to MLR.1 (with the minor exception that 2SLS.1 mentions the notation for the instrumental variables, $z_j$). In other words, the model we are interested in is the same as that for OLS estimation of the $\beta_j$. Sometimes it is easy to lose sight of the fact that we can apply different estimation methods to the same model. Unfortunately, it is not uncommon to hear researchers say "I estimated an OLS model" or "I used a 2SLS model." Such statements are meaningless. OLS and 2SLS are different estimation methods that are applied to the same model. It is true that they have desirable statistical properties under different sets of assumptions on the model, but the relationship they are estimating is given by the equation in 2SLS.1 (or MLR.1). The point is similar to that made for the unobserved effects panel data model covered in Chapters 13 and 14: pooled OLS, first differencing, fixed effects, and random effects are different estimation methods for the same model.

15A.3 Assumption 2SLS.2 (Random Sampling)

We have a random sample on $y$, the $x_j$, and the $z_j$.

15A.4 Assumption 2SLS.3 (Rank Condition)

(i) There are no perfect linear relationships among the instrumental variables. (ii) The rank condition for identification holds.

With a single endogenous explanatory variable, as in equation (15.42), the rank condition is easily described. Let $z_1, \ldots, z_m$ denote the exogenous variables, where $z_k, \ldots, z_m$ do not appear in the structural model (15.42). The reduced form of $y_2$ is

$$y_2 = \pi_0 + \pi_1 z_1 + \pi_2 z_2 + \cdots + \pi_{k-1} z_{k-1} + \pi_k z_k + \cdots + \pi_m z_m + v_2.$$

Then, we need at least one of $\pi_k, \ldots, \pi_m$ to be nonzero. This requires at least one exogenous variable that does not appear in (15.42) (the order condition). Stating the rank condition with two or more endogenous explanatory variables requires matrix algebra. [See Wooldridge (2010, Chapter 5).]

15A.5 Assumption 2SLS.4 (Exogenous Instrumental Variables)

The error term $u$ has zero mean, and each IV is uncorrelated with $u$. (Remember that any $x_j$ that is uncorrelated with $u$ also acts as an IV.)

15A.6 Theorem 15A.1

Under Assumptions 2SLS.1 through 2SLS.4, the 2SLS estimator is consistent.

15A.7 Assumption 2SLS.5 (Homoskedasticity)

Let $\mathbf{z}$ denote the collection of all instrumental variables. Then $E(u^2 \mid \mathbf{z}) = \sigma^2$.
15A.8 Theorem 15A.2

Under Assumptions 2SLS.1 through 2SLS.5, the 2SLS estimators are asymptotically normally distributed. Consistent estimators of the asymptotic variance are given as in equation (15.43), where $\sigma^2$ is replaced with $\hat{\sigma}^2 = (n - k - 1)^{-1}\sum_{i=1}^{n}\hat{u}_i^2$, and the $\hat{u}_i$ are the 2SLS residuals.

The 2SLS estimator is also the best IV estimator under the five assumptions given. We state the result here; a proof can be found in Wooldridge (2010, Chapter 5).

15A.9 Theorem 15A.3

Under Assumptions 2SLS.1 through 2SLS.5, the 2SLS estimator is asymptotically efficient in the class of IV estimators that uses linear combinations of the exogenous variables as instruments.

If the homoskedasticity assumption does not hold, the 2SLS estimators are still asymptotically normal, but the standard errors (and $t$ and $F$ statistics) need to be adjusted; many econometrics packages do this routinely. Moreover, the 2SLS estimator is no longer the asymptotically efficient IV estimator, in general. We will not study more efficient estimators here [see Wooldridge (2010, Chapter 8)].

For time series applications, we must add some assumptions. First, as with OLS, we must assume that all series, including the IVs, are weakly dependent: this ensures that the law of large numbers and the central limit theorem hold. For the usual standard errors and test statistics to be valid, as well as for asymptotic efficiency, we must add a no serial correlation assumption.

15A.10 Assumption 2SLS.6 (No Serial Correlation)

Equation (15.54) holds.

A similar no serial correlation assumption is needed in panel data applications. Tests and corrections for serial correlation were discussed in Section 15.7.
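A small Monte Carlo experiment illustrates Theorems 15A.1 and 15A.2: across replications, the 2SLS estimator centers on the true parameter, and its sampling spread shrinks roughly like $1/\sqrt{n}$. The design below (one endogenous regressor, one instrument, a normal confounder) is an invented illustration, not a result from the text:

```python
import numpy as np

rng = np.random.default_rng(4)
beta1 = 0.5

def iv_estimate(n):
    z = rng.normal(size=n)
    c = rng.normal(size=n)                 # confounder: makes x endogenous
    x = z + c + rng.normal(size=n)
    y = beta1 * x + c + rng.normal(size=n)
    return np.cov(z, y)[0, 1] / np.cov(z, x)[0, 1]

for n in (100, 1000, 10000):
    draws = np.array([iv_estimate(n) for _ in range(2000)])
    # mean approaches 0.5 (consistency); sd falls about like 1/sqrt(n) (normality)
    print(f"n={n:6d}  mean={draws.mean():.3f}  sd={draws.std():.3f}")
```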
Chapter 16
Simultaneous Equations Models

In the previous chapter, we showed how the method of instrumental variables can solve two kinds of endogeneity problems: omitted variables and measurement error. Conceptually, these problems are straightforward. In the omitted variables case, there is a variable (or more than one) that we would like to hold fixed when estimating the ceteris paribus effect of one or more of the observed explanatory variables. In the measurement error case, we would like to estimate the effect of certain explanatory variables on $y$, but we have mismeasured one or more variables. In both cases, we could estimate the parameters of interest by OLS if we could collect better data.

Another important form of endogeneity of explanatory variables is simultaneity. This arises when one or more of the explanatory variables is jointly determined with the dependent variable, typically through an equilibrium mechanism (as we will see later). In this chapter, we study methods for estimating simple simultaneous equations models (SEMs). Although a complete treatment of SEMs is beyond the scope of this text, we are able to cover models that are widely used.

The leading method for estimating simultaneous equations models is the method of instrumental variables. Therefore, the solution to the simultaneity problem is essentially the same as the IV solutions to the omitted variables and measurement error problems. However, crafting and interpreting SEMs is challenging. Therefore, we begin by discussing the nature and scope of simultaneous equations models in Section 16.1. In Section 16.2, we confirm that OLS applied to an equation in a simultaneous system is generally biased and inconsistent. Section 16.3 provides a general description of identification and estimation in a two-equation system, while Section 16.4 briefly covers models with more than two equations. Simultaneous equations models are used to model aggregate time series, and in Section 16.5 we include a discussion of some special issues that arise in such models. Section 16.6 touches on simultaneous equations models with panel data.

16.1 The Nature of Simultaneous Equations Models

The most important point to remember in using simultaneous equations models is that each equation in the system should have a ceteris paribus, causal interpretation. Because we only observe the outcomes in equilibrium, we are required to use counterfactual reasoning in constructing the equations of a simultaneous equations model. We must think in terms of potential as well as actual outcomes.

The classic example of an SEM is a supply and demand equation for some commodity or input to production (such as labor). For concreteness, let $h_s$ denote the annual labor hours supplied by workers in agriculture, measured at the county level, and let $w$ denote the average hourly wage offered to such workers. A simple labor supply function is

$$h_s = \alpha_1 w + \beta_1 z_1 + u_1, \quad (16.1)$$

where $z_1$ is some observed variable affecting labor supply, say, the average manufacturing wage in the county. The error term, $u_1$, contains other factors that affect labor supply. (Many of these factors are observed and could be included in equation (16.1); to illustrate the basic concepts, we include only one such factor, $z_1$.) Equation (16.1) is an example of a structural equation. This name comes from the fact that the labor supply function is derivable from economic theory and has a causal interpretation. The coefficient $\alpha_1$ measures how labor supply changes when the wage changes; if $h_s$ and $w$ are in logarithmic form, $\alpha_1$ is the labor supply elasticity. Typically, we expect $\alpha_1$ to be positive (although economic theory does not rule out $\alpha_1 \le 0$). Labor supply elasticities are important for determining how workers will change the number of hours they desire to work when tax rates on wage income change. If $z_1$ is the manufacturing wage, we expect $\beta_1 < 0$: other factors equal, if the manufacturing wage increases, more workers will go into manufacturing than into agriculture.

When we graph labor supply, we sketch hours as a function of wage, with $z_1$ and $u_1$ held fixed. A change in $z_1$ shifts the labor supply function, as does a change in $u_1$. The difference is that $z_1$ is observed while $u_1$ is not. Sometimes $z_1$ is called an observed supply shifter, and $u_1$ is called an unobserved supply shifter.

How does equation (16.1) differ from those we have studied previously? The difference is subtle. Although equation (16.1) is supposed to hold for all possible values of wage, we cannot generally view wage as varying exogenously for a cross section of counties.
If we could run an experiment where we vary the level of agricultural and manufacturing wages across a sample of counties and survey workers to obtain the labor supply $h_s$ for each county, then we could estimate (16.1) by OLS. Unfortunately, this is not a manageable experiment. Instead, we must collect data on average wages in these two sectors along with how many person hours were spent in agricultural production. In deciding how to analyze these data, we must understand that they are best described by the interaction of labor supply and demand. Under the assumption that labor markets clear, we actually observe equilibrium values of wages and hours worked.

To describe how equilibrium wages and hours are determined, we need to bring in the demand for labor, which we suppose is given by

$$h_d = \alpha_2 w + \beta_2 z_2 + u_2, \quad (16.2)$$

where $h_d$ is hours demanded. As with the supply function, we graph hours demanded as a function of wage, $w$, keeping $z_2$ and $u_2$ fixed. The variable $z_2$ (say, agricultural land area) is an observable demand shifter, while $u_2$ is an unobservable demand shifter.

Just as with the labor supply equation, the labor demand equation is a structural equation: it can be obtained from the profit maximization considerations of farmers. If $h_d$ and $w$ are in logarithmic form, $\alpha_2$ is the labor demand elasticity. Economic theory tells us that $\alpha_2 < 0$. Because labor and land are complements in production, we expect $\beta_2 > 0$.

Notice how equations (16.1) and (16.2) describe entirely different relationships. Labor supply is a behavioral equation for workers, and labor demand is a behavioral relationship for farmers. Each equation has a ceteris paribus interpretation and stands on its own. They become linked in an econometric analysis only because observed wage and hours are determined by the intersection of supply and demand. In other words, for each county $i$, observed hours $h_i$ and observed wage $w_i$ are determined by the equilibrium condition

$$h_i^s = h_i^d. \quad (16.3)$$

Because we observe only equilibrium hours for each county $i$, we denote observed hours by $h_i$.

When we combine the equilibrium condition in (16.3) with the labor supply and demand equations, we get

$$h_i = \alpha_1 w_i + \beta_1 z_{i1} + u_{i1} \quad (16.4)$$

and

$$h_i = \alpha_2 w_i + \beta_2 z_{i2} + u_{i2}, \quad (16.5)$$

where we explicitly include the $i$ subscript to emphasize that $h_i$ and $w_i$ are the equilibrium observed values for county $i$. These two equations constitute a simultaneous equations model (SEM), which has several important features. First, given $z_{i1}$, $z_{i2}$, $u_{i1}$, and $u_{i2}$, these two equations determine $h_i$ and $w_i$. (Actually, we must assume that $\alpha_1 \ne \alpha_2$, which means that the slopes of the supply and demand functions differ; see Problem 1.) For this reason, $h_i$ and $w_i$ are the endogenous variables in this SEM. What about $z_{i1}$ and $z_{i2}$? Because they are determined outside of the model, we view them as exogenous variables. From a statistical standpoint, the key assumption concerning $z_{i1}$ and $z_{i2}$ is that they are both uncorrelated with the supply and demand errors, $u_{i1}$ and $u_{i2}$, respectively. These are examples of structural errors because they appear in the structural equations.
A second important point is that, without including $z_1$ and $z_2$ in the model, there is no way to tell which equation is the supply function and which is the demand function. When $z_1$ represents the manufacturing wage, economic reasoning tells us that it is a factor in agricultural labor supply, because it is a measure of the opportunity cost of working in agriculture; when $z_2$ stands for agricultural land area, production theory implies that it appears in the labor demand function. Therefore, we know that (16.4) represents labor supply and (16.5) represents labor demand. If $z_1$ and $z_2$ are the same (for example, the average education level of adults in the county, which can affect both supply and demand), then the equations look identical, and there is no hope of estimating either one. In a nutshell, this illustrates the identification problem in simultaneous equations models, which we will discuss more generally in Section 16.3.

The most convincing examples of SEMs have the same flavor as supply and demand examples. Each equation should have a behavioral, ceteris paribus interpretation on its own. Because we only observe equilibrium outcomes, specifying an SEM requires us to ask such counterfactual questions as: How much labor would workers provide if the wage were different from its equilibrium value? Example 16.1 provides another illustration of an SEM where each equation has a ceteris paribus interpretation.

Example 16.1: Murder Rates and Size of the Police Force

Cities often want to determine how much additional law enforcement will decrease their murder rates. A simple cross-sectional model to address this question is

$$murdpc = \alpha_1 polpc + \beta_{10} + \beta_{11} incpc + u_1, \quad (16.6)$$

where murdpc is murders per capita, polpc is number of police officers per capita, and incpc is income per capita. (Henceforth, we do not include an $i$ subscript.) We take income per capita as exogenous in this equation. In practice, we would include other factors, such as age and gender distributions, education levels, perhaps geographic variables, and variables that measure severity of punishment. To fix ideas, we consider equation (16.6).

The question we hope to answer is: If a city exogenously increases its police force, will that increase, on average, lower the murder rate? If we could exogenously choose police force sizes for a random sample of cities, we could estimate (16.6) by OLS. Certainly we cannot run such an experiment. But can we think of police force size as being exogenously determined, anyway? Probably not. A city's spending on law enforcement is at least partly determined by its expected murder rate. To reflect this, we postulate a second relationship:

$$polpc = \alpha_2 murdpc + \beta_{20} + \text{other factors}. \quad (16.7)$$

We expect that $\alpha_2 > 0$: other factors being equal, cities with higher (expected) murder rates will have more police officers per capita. Once we specify the other factors in (16.7), we have a two-equation simultaneous equations model. We are really only interested in equation (16.6), but, as we will see in Section 16.3, we need to know precisely how the second equation is specified in order to estimate the first. An important point is that (16.7) describes behavior by city officials, while (16.6) describes the actions of potential murderers. This gives each equation a clear ceteris paribus interpretation, which makes equations (16.6) and (16.7) an appropriate simultaneous equations model.
We next give an example of an inappropriate use of SEMs.

Example 16.2: Housing Expenditures and Saving

Suppose that, for a random household in the population, we assume that annual housing expenditures and saving are jointly determined by

$$housing = \alpha_1 saving + \beta_{10} + \beta_{11} inc + \beta_{12} educ + \beta_{13} age + u_1 \quad (16.8)$$

and

$$saving = \alpha_2 housing + \beta_{20} + \beta_{21} inc + \beta_{22} educ + \beta_{23} age + u_2, \quad (16.9)$$

where inc is annual income and educ and age are measured in years. Initially, it may seem that these equations are a sensible way to view how housing and saving expenditures are determined. But we have to ask: What value would one of these equations be without the other? Neither has a ceteris paribus interpretation because housing and saving are chosen by the same household. For example, it makes no sense to ask this question: If annual income increases by $10,000, how would housing expenditures change, holding saving fixed? If family income increases, a household will generally change the optimal mix of housing expenditures and saving. But equation (16.8) makes it seem as if we want to know the effect of changing inc, educ, or age while keeping saving fixed. Such a thought experiment is not interesting. Any model based on economic principles, particularly utility maximization, would have households optimally choosing housing and saving as functions of inc and the relative prices of housing and saving. The variables educ and age would affect preferences for consumption, saving, and risk. Therefore, housing and saving would each be functions of income, education, age, and other variables that affect the utility maximization problem (such as different rates of return on housing and other saving).

Even if we decided that the SEM in (16.8) and (16.9) made sense, there is no way to estimate the parameters. (We discuss this problem more generally in Section 16.3.) The two equations are indistinguishable, unless we assume that income, education, or age appears in one equation but not the other, which would make no sense.

Though this makes a poor SEM example, we might be interested in testing whether, other factors being fixed, there is a tradeoff between housing expenditures and saving. But then we would just estimate, say, (16.8) by OLS, unless there is an omitted variable or measurement error problem.

Example 16.2 has the characteristics of all too many SEM applications. The problem is that the two endogenous variables are chosen by the same economic agent. Therefore, neither equation can stand on its own. Another example of an inappropriate use of an SEM would be to model weekly hours spent studying and weekly hours working. Each student will choose these variables simultaneously, presumably as a function of the wage that can be earned working, ability as a student, enthusiasm for college, and so on. Just as in Example 16.2, it makes no sense to specify two equations where each is a function of the other.
The important lesson is this: just because two variables are determined simultaneously does not mean that a simultaneous equations model is suitable. For an SEM to make sense, each equation in the SEM should have a ceteris paribus interpretation in isolation from the other equation. As we discussed earlier, supply and demand examples, and Example 16.1, have this feature. Usually, basic economic reasoning, supported in some cases by simple economic models, can help us use SEMs intelligently (including knowing when not to use an SEM).

16.2 Simultaneity Bias in OLS

It is useful to see, in a simple model, that an explanatory variable that is determined simultaneously with the dependent variable is generally correlated with the error term, which leads to bias and inconsistency in OLS. We consider the two-equation structural model

$$y_1 = \alpha_1 y_2 + \beta_1 z_1 + u_1 \quad (16.10)$$
$$y_2 = \alpha_2 y_1 + \beta_2 z_2 + u_2 \quad (16.11)$$

and focus on estimating the first equation. The variables $z_1$ and $z_2$ are exogenous, so that each is uncorrelated with $u_1$ and $u_2$. For simplicity, we suppress the intercept in each equation.

To show that $y_2$ is generally correlated with $u_1$, we solve the two equations for $y_2$ in terms of the exogenous variables and the error terms. If we plug the right-hand side of (16.10) in for $y_1$ in (16.11), we get

$$y_2 = \alpha_2(\alpha_1 y_2 + \beta_1 z_1 + u_1) + \beta_2 z_2 + u_2,$$

or

$$(1 - \alpha_2\alpha_1)y_2 = \alpha_2\beta_1 z_1 + \beta_2 z_2 + \alpha_2 u_1 + u_2. \quad (16.12)$$

Now, we must make an assumption about the parameters in order to solve for $y_2$:

$$\alpha_2\alpha_1 \ne 1. \quad (16.13)$$

Whether this assumption is restrictive depends on the application. In Example 16.1, we think that $\alpha_1 \le 0$ and $\alpha_2 \ge 0$, which implies $\alpha_1\alpha_2 \le 0$; therefore, (16.13) is very reasonable for Example 16.1.

Provided condition (16.13) holds, we can divide (16.12) by $(1 - \alpha_2\alpha_1)$ and write $y_2$ as

$$y_2 = \pi_{21} z_1 + \pi_{22} z_2 + v_2, \quad (16.14)$$

where $\pi_{21} = \alpha_2\beta_1/(1 - \alpha_2\alpha_1)$, $\pi_{22} = \beta_2/(1 - \alpha_2\alpha_1)$, and $v_2 = (\alpha_2 u_1 + u_2)/(1 - \alpha_2\alpha_1)$. Equation (16.14), which expresses $y_2$ in terms of the exogenous variables and the error terms, is the reduced form equation for $y_2$, a concept we introduced in Chapter 15 in the context of instrumental variables estimation. The parameters $\pi_{21}$ and $\pi_{22}$ are called reduced form parameters; notice how they are nonlinear functions of the structural parameters, which appear in the structural equations (16.10) and (16.11).

Exploring Further 16.1: Pindyck and Rubinfeld (1992, Section 11.6) describe a model of advertising where monopolistic firms choose profit-maximizing levels of price and advertising expenditures. Does this mean we should use an SEM to model these variables at the firm level?

The reduced form error, $v_2$, is a linear function of the structural error terms, $u_1$ and $u_2$. Because $u_1$ and $u_2$ are each uncorrelated with $z_1$ and $z_2$, $v_2$ is also uncorrelated with $z_1$ and $z_2$. Therefore, we can consistently estimate $\pi_{21}$ and $\pi_{22}$ by OLS, something that is used for two stage least squares estimation (which we return to in the next section). In addition, the reduced form parameters are sometimes of direct interest, although we are focusing here on estimating equation (16.10).

A reduced form also exists for $y_1$ under assumption (16.13); the algebra is similar to that used to obtain (16.14). It has the same properties as the reduced form equation for $y_2$.
We can use equation (16.14) to show that, except under special assumptions, OLS estimation of equation (16.10) will produce biased and inconsistent estimators of $\alpha_1$ and $\beta_1$. Because $z_1$ and $u_1$ are uncorrelated by assumption, the issue is whether $y_2$ and $u_1$ are uncorrelated. From the reduced form in (16.14), we see that $y_2$ and $u_1$ are correlated if and only if $v_2$ and $u_1$ are correlated (because $z_1$ and $z_2$ are assumed exogenous). But $v_2$ is a linear function of $u_1$ and $u_2$, so it is generally correlated with $u_1$. In fact, if we assume that $u_1$ and $u_2$ are uncorrelated, then $v_2$ and $u_1$ must be correlated whenever $\alpha_2 \ne 0$. Even if $\alpha_2$ equals zero, which means that $y_1$ does not appear in equation (16.11), $v_2$ and $u_1$ will be correlated if $u_1$ and $u_2$ are correlated.

When $\alpha_2 = 0$ and $u_1$ and $u_2$ are uncorrelated, $y_2$ and $u_1$ are also uncorrelated. These are fairly strong requirements: if $\alpha_2 = 0$, $y_2$ is not simultaneously determined with $y_1$; if we add zero correlation between $u_1$ and $u_2$, this rules out omitted variables or measurement errors in $u_1$ that are correlated with $y_2$. We should not be surprised that OLS estimation of equation (16.10) works in this case.

When $y_2$ is correlated with $u_1$ because of simultaneity, we say that OLS suffers from simultaneity bias. Obtaining the direction of the bias in the coefficients is generally complicated, as we saw with omitted variables bias in Chapters 3 and 5. But in simple models, we can determine the direction of the bias. For example, suppose that we simplify equation (16.10) by dropping $z_1$ from the equation, and we assume that $u_1$ and $u_2$ are uncorrelated. Then, the covariance between $y_2$ and $u_1$ is

$$\text{Cov}(y_2, u_1) = \text{Cov}(v_2, u_1) = [\alpha_2/(1 - \alpha_2\alpha_1)]E(u_1^2) = [\alpha_2/(1 - \alpha_2\alpha_1)]\sigma_1^2,$$

where $\sigma_1^2 = \text{Var}(u_1) > 0$. Therefore, the asymptotic bias (or inconsistency) in the OLS estimator of $\alpha_1$ has the same sign as $\alpha_2/(1 - \alpha_2\alpha_1)$. If $\alpha_2 > 0$ and $\alpha_2\alpha_1 < 1$, the asymptotic bias is positive. (Unfortunately, just as in our calculation of omitted variables bias from Section 3.3, the conclusions do not carry over to more general models. But they do serve as a useful guide.) For example, in Example 16.1, we think $\alpha_2 > 0$ and $\alpha_2\alpha_1 \le 0$, which means that the OLS estimator of $\alpha_1$ would have a positive bias. If $\alpha_1 = 0$, OLS would, on average, estimate a positive impact of more police on the murder rate; generally, the estimator of $\alpha_1$ is biased upward. Because we expect an increase in the size of the police force to reduce murder rates (ceteris paribus), the upward bias means that OLS will underestimate the effectiveness of a larger police force.
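The algebra above is easy to check numerically. The sketch below simulates the system with $z_1$ dropped from the first equation (the simplified case just discussed), compares OLS with the analytic inconsistency $\text{Cov}(y_2, u_1)/\text{Var}(y_2)$, and previews the IV solution developed in the next section. All parameter values are illustrative inventions:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 200000
a1, a2, b2 = 0.5, -0.3, 1.0          # illustrative structural parameters
z2 = rng.normal(size=n)
u1 = rng.normal(size=n)              # sigma_1^2 = 1; u1 and u2 uncorrelated
u2 = rng.normal(size=n)

# Reduced form for y2, as in (16.14), with z1 dropped from the first equation
y2 = (b2 * z2 + a2 * u1 + u2) / (1 - a2 * a1)
y1 = a1 * y2 + u1

b_ols = np.cov(y2, y1)[0, 1] / np.var(y2, ddof=1)   # biased: y2 correlated with u1
b_iv = np.cov(z2, y1)[0, 1] / np.cov(z2, y2)[0, 1]  # IV using z2 as instrument

bias = (a2 / (1 - a2 * a1)) / np.var(y2, ddof=1)    # Cov(y2,u1)/Var(y2), sigma_1^2 = 1
print(f"true a1 = {a1}, OLS = {b_ols:.3f} (analytic plim {a1 + bias:.3f}), IV = {b_iv:.3f}")
```

Here $\alpha_2 < 0$, so the simulated OLS estimate settles below $\alpha_1$, matching the sign rule just derived; the IV estimate is centered on the truth.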
16.3 Identifying and Estimating a Structural Equation

As we saw in the previous section, OLS is biased and inconsistent when applied to a structural equation in a simultaneous equations system. In Chapter 15, we learned that the method of two stage least squares can be used to solve the problem of endogenous explanatory variables. We now show how 2SLS can be applied to SEMs.

The mechanics of 2SLS are similar to those in Chapter 15. The difference is that, because we specify a structural equation for each endogenous variable, we can immediately see whether sufficient IVs are available to estimate either equation. We begin by discussing the identification problem.

16.3a Identification in a Two-Equation System

We mentioned the notion of identification in Chapter 15. When we estimate a model by OLS, the key identification condition is that each explanatory variable is uncorrelated with the error term. As we demonstrated in Section 16.2, this fundamental condition no longer holds, in general, for SEMs. However, if we have some instrumental variables, we can still identify (or consistently estimate) the parameters in an SEM equation, just as with omitted variables or measurement error.

Before we consider a general two-equation SEM, it is useful to gain intuition by considering a simple supply and demand example. Write the system in equilibrium form (that is, with $q_s = q_d = q$ imposed) as

$$q = \alpha_1 p + \beta_1 z_1 + u_1 \quad (16.15)$$

and

$$q = \alpha_2 p + u_2. \quad (16.16)$$

For concreteness, let $q$ be per capita milk consumption at the county level, let $p$ be the average price per gallon of milk in the county, and let $z_1$ be the price of cattle feed, which we assume is exogenous to the supply and demand equations for milk. This means that (16.15) must be the supply function, as the price of cattle feed would shift supply ($\beta_1 < 0$) but not demand. The demand function contains no observed demand shifters.

Given a random sample on $(q, p, z_1)$, which of these equations can be estimated? That is, which is an identified equation? It turns out that the demand equation, (16.16), is identified, but the supply equation is not. This is easy to see by using our rules for IV estimation from Chapter 15: we can use $z_1$ as an IV for price in equation (16.16). However, because $z_1$ appears in equation (16.15), we have no IV for price in the supply equation.

Intuitively, the fact that the demand equation is identified follows because we have an observed variable, $z_1$, that shifts the supply equation while not affecting the demand equation. Given variation in $z_1$ and no errors, we can trace out the demand curve, as shown in Figure 16.1.

[Figure 16.1: Shifting supply equations trace out the demand equation. Each supply equation is drawn for a different value of the exogenous variable $z_1$; the axes are price and quantity.]

The presence of the unobserved demand shifter $u_2$ causes us to estimate the demand equation with error, but the estimators will be consistent, provided $z_1$ is uncorrelated with $u_2$.

The supply equation cannot be traced out because there are no exogenous observed factors shifting the demand curve. It does not help that there are unobserved factors shifting the demand function; we need something observed. If, as in the labor demand function (16.2), we have an observed exogenous demand shifter (such as income in the milk demand function), then the supply function would also be identified.

To summarize: In the system of (16.15) and (16.16), it is the presence of an exogenous variable in the supply equation that allows us to estimate the demand equation.
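The geometry of Figure 16.1 can be mimicked in a few lines of code: when the supply shifter $z_1$ moves and the structural errors are kept small, the equilibrium price-quantity pairs line up along the demand curve. The parameter values below are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(6)
n = 200
a1, a2, b1 = 0.8, -0.5, -1.0      # supply slope, demand slope, feed-price effect (b1 < 0)
z1 = rng.normal(size=n)           # exogenous supply shifter (price of cattle feed)
u1 = 0.1 * rng.normal(size=n)     # small errors so the geometry is easy to see
u2 = 0.1 * rng.normal(size=n)

# Equilibrium: a1*p + b1*z1 + u1 = a2*p + u2, solved for price, then quantity
p = (u2 - u1 - b1 * z1) / (a1 - a2)
q = a2 * p + u2                   # demand equation evaluated at equilibrium

# As z1 shifts supply, the equilibrium (p, q) pairs sweep out the demand curve
slope = np.cov(p, q)[0, 1] / np.var(p, ddof=1)
iv = np.cov(z1, q)[0, 1] / np.cov(z1, p)[0, 1]   # IV with z1 removes the small residual bias
print(f"demand slope a2 = {a2}, scatter slope = {slope:.3f}, IV = {iv:.3f}")
```

With sizable $u_2$ the raw scatter slope would drift away from $\alpha_2$, but the IV estimate using $z_1$ would not; that is the consistency claim in the paragraph above.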
Extending the identification discussion to a general two-equation model is not difficult. Write the two equations as

$$y_1 = \beta_{10} + \alpha_1 y_2 + \mathbf{z}_1\boldsymbol{\beta}_1 + u_1 \quad (16.17)$$

and

$$y_2 = \beta_{20} + \alpha_2 y_1 + \mathbf{z}_2\boldsymbol{\beta}_2 + u_2, \quad (16.18)$$

where $y_1$ and $y_2$ are the endogenous variables and $u_1$ and $u_2$ are the structural error terms. The intercept in the first equation is $\beta_{10}$, and the intercept in the second equation is $\beta_{20}$. The variable $\mathbf{z}_1$ denotes a set of $k_1$ exogenous variables appearing in the first equation: $\mathbf{z}_1 = (z_{11}, z_{12}, \ldots, z_{1k_1})$. Similarly, $\mathbf{z}_2$ is the set of $k_2$ exogenous variables in the second equation: $\mathbf{z}_2 = (z_{21}, z_{22}, \ldots, z_{2k_2})$. In many cases, $\mathbf{z}_1$ and $\mathbf{z}_2$ will overlap. As a shorthand form, we use the notation

$$\mathbf{z}_1\boldsymbol{\beta}_1 = \beta_{11} z_{11} + \beta_{12} z_{12} + \cdots + \beta_{1k_1} z_{1k_1}$$

and

$$\mathbf{z}_2\boldsymbol{\beta}_2 = \beta_{21} z_{21} + \beta_{22} z_{22} + \cdots + \beta_{2k_2} z_{2k_2};$$

that is, $\mathbf{z}_1\boldsymbol{\beta}_1$ stands for all exogenous variables in the first equation, with each multiplied by a coefficient, and similarly for $\mathbf{z}_2\boldsymbol{\beta}_2$. (Some authors use the notation $\mathbf{z}_1'\boldsymbol{\beta}_1$ and $\mathbf{z}_2'\boldsymbol{\beta}_2$ instead. If you have an interest in the matrix algebra approach to econometrics, see Appendix E.)

The fact that $\mathbf{z}_1$ and $\mathbf{z}_2$ generally contain different exogenous variables means that we have imposed exclusion restrictions on the model. In other words, we assume that certain exogenous variables do not appear in the first equation and others are absent from the second equation. As we saw with the previous supply and demand examples, this allows us to distinguish between the two structural equations.

When can we solve equations (16.17) and (16.18) for $y_1$ and $y_2$ (as linear functions of all exogenous variables and the structural errors, $u_1$ and $u_2$)? The condition is the same as that in (16.13), namely, $\alpha_2\alpha_1 \ne 1$. The proof is virtually identical to the simple model in Section 16.2. Under this assumption, reduced forms exist for $y_1$ and $y_2$.

The key question is: Under what assumptions can we estimate the parameters in, say, (16.17)? This is the identification issue. The rank condition for identification of equation (16.17) is easy to state.

Rank Condition for Identification of a Structural Equation: The first equation in a two-equation simultaneous equations model is identified if, and only if, the second equation contains at least one exogenous variable (with a nonzero coefficient) that is excluded from the first equation.

This is the necessary and sufficient condition for equation (16.17) to be identified. The order condition, which we discussed in Chapter 15, is necessary for the rank condition. The order condition for identifying the first equation states that at least one exogenous variable is excluded from this equation. The order condition is trivial to check once both equations have been specified. The rank condition requires more: at least one of the exogenous variables excluded from the first equation must have a nonzero population coefficient in the second equation. This ensures that at least one of the exogenous variables omitted from the first equation actually appears in the reduced form of $y_2$, so that we can use these variables as instruments for $y_2$. We can test this using a $t$ or an $F$ test, as in Chapter 15; some examples follow.

Identification of the second equation is, naturally, just the mirror image of the statement for the first equation. Also, if we write the equations as in the labor supply and demand example in Section 16.1 (so that $y_1$ appears on the left-hand side in both equations, with $y_2$ on the right-hand side), the identification condition is identical.
Example 16.3: Labor Supply of Married, Working Women

To illustrate the identification issue, consider labor supply for married women already in the workforce. In place of the demand function, we write the wage offer as a function of hours and the usual productivity variables. With the equilibrium condition imposed, the two structural equations are

$$hours = \alpha_1 \log(wage) + \beta_{10} + \beta_{11} educ + \beta_{12} age + \beta_{13} kidslt6 + \beta_{14} nwifeinc + u_1 \quad (16.19)$$

and

$$\log(wage) = \alpha_2 hours + \beta_{20} + \beta_{21} educ + \beta_{22} exper + \beta_{23} exper^2 + u_2. \quad (16.20)$$

The variable age is the woman's age, in years; kidslt6 is the number of children less than six years old; nwifeinc is the woman's nonwage income (which includes husband's earnings); and educ and exper are years of education and prior experience, respectively. All variables except hours and log(wage) are assumed to be exogenous. (This is a tenuous assumption, as educ might be correlated with omitted ability in either equation. But for illustration purposes, we ignore the omitted ability problem.) The functional form in this system, where hours appears in level form but wage is in logarithmic form, is popular in labor economics. We can write this system as in equations (16.17) and (16.18) by defining $y_1 = hours$ and $y_2 = \log(wage)$.

The first equation is the supply function. It satisfies the order condition because two exogenous variables, exper and exper², are omitted from the labor supply equation. These exclusion restrictions are crucial assumptions: we are assuming that, once wage, education, age, number of small children, and other income are controlled for, past experience has no effect on current labor supply. (One could certainly question this assumption, but we use it for illustration.)

Given equations (16.19) and (16.20), the rank condition for identifying the first equation is that at least one of exper and exper² has a nonzero coefficient in equation (16.20). If $\beta_{22} = 0$ and $\beta_{23} = 0$, there are no exogenous variables appearing in the second equation that do not also appear in the first (educ appears in both). We can state the rank condition for identification of (16.19) equivalently in terms of the reduced form for log(wage), which is

$$\log(wage) = \pi_{20} + \pi_{21} educ + \pi_{22} age + \pi_{23} kidslt6 + \pi_{24} nwifeinc + \pi_{25} exper + \pi_{26} exper^2 + v_2. \quad (16.21)$$

For identification, we need $\pi_{25} \ne 0$ or $\pi_{26} \ne 0$, something we can test using a standard $F$ statistic, as we discussed in Chapter 15.

The wage offer equation, (16.20), is identified if at least one of age, kidslt6, or nwifeinc has a nonzero coefficient in (16.19). This is identical to assuming that the reduced form for hours, which has the same form as the right-hand side of (16.21), depends on at least one of age, kidslt6, or nwifeinc. In specifying the wage offer equation, we are assuming that age, kidslt6, and nwifeinc have no effect on the offered wage, once hours, education, and experience are accounted for. These would be poor assumptions if these variables somehow have direct effects on productivity, or if women are discriminated against based on their age or number of small children.
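The $F$ test for $\pi_{25} = \pi_{26} = 0$ in (16.21) is a standard exclusion test. The sketch below runs it on simulated stand-ins for the variables in this example; only the names follow the example, while the values and effect sizes are invented, so the printed statistic is not the one you would get from real data:

```python
import numpy as np

rng = np.random.default_rng(7)
n = 428
# Simulated stand-ins (names only; distributions and effects are invented)
educ = rng.normal(12, 2, n)
age = rng.normal(42, 8, n)
kidslt6 = rng.poisson(0.2, n).astype(float)
nwifeinc = rng.normal(20, 10, n)
exper = rng.normal(10, 7, n)
lwage = 0.1 * educ + 0.03 * exper - 0.0005 * exper**2 + rng.normal(size=n)

def ssr(X, y):
    b = np.linalg.lstsq(X, y, rcond=None)[0]
    e = y - X @ b
    return e @ e

one = np.ones(n)
X_r = np.column_stack([one, educ, age, kidslt6, nwifeinc])   # restricted reduced form
X_u = np.column_stack([X_r, exper, exper**2])                # adds exper, exper^2

q, dfree = 2, n - X_u.shape[1]
F = ((ssr(X_r, lwage) - ssr(X_u, lwage)) / q) / (ssr(X_u, lwage) / dfree)
print(f"F statistic for H0: pi25 = pi26 = 0: {F:.2f}")
```

A large $F$ supports the rank condition for (16.19); a small one would signal that exper and exper² are useless as instruments for log(wage).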
In Example 16.3, we take the population of interest to be married women who are in the workforce, so that equilibrium hours are positive. This excludes the group of married women who choose not to work outside the home. Including such women in the model raises some difficult problems. For instance, if a woman does not work, we cannot observe her wage offer. We touch on these issues in Chapter 17; but for now, we must think of equations (16.19) and (16.20) as holding only for women who have hours > 0.

Example 16.4: Inflation and Openness

Romer (1993) proposes theoretical models of inflation that imply that more "open" countries should have lower inflation rates. His empirical analysis explains average annual inflation rates (since 1973) in terms of the average share of imports in gross domestic (or national) product since 1973, which is his measure of openness. In addition to estimating the key equation by OLS, he uses instrumental variables. While Romer does not specify both equations in a simultaneous system, he has in mind a two-equation system:

$$inf = \beta_{10} + \alpha_1 open + \beta_{11} \log(pcinc) + u_1 \quad (16.22)$$
$$open = \beta_{20} + \alpha_2 inf + \beta_{21} \log(pcinc) + \beta_{22} \log(land) + u_2, \quad (16.23)$$

where pcinc is 1980 per capita income, in U.S. dollars (assumed to be exogenous), and land is the land area of the country, in square miles (also assumed to be exogenous). Equation (16.22) is the one of interest, with the hypothesis that $\alpha_1 < 0$. (More open economies have lower inflation rates.) The second equation reflects the fact that the degree of openness might depend on the average inflation rate, as well as other factors. The variable log(pcinc) appears in both equations, but log(land) is assumed to appear only in the second equation. The idea is that, ceteris paribus, a smaller country is likely to be more open, so $\beta_{22} < 0$.

Using the identification rule that was stated earlier, equation (16.22) is identified, provided $\beta_{22} \ne 0$. Equation (16.23) is not identified because it contains both exogenous variables. But we are interested in (16.22).

Exploring Further 16.2: If we have money supply growth since 1973 for each country, which we assume is exogenous, does this help identify equation (16.23)?

16.3b Estimation by 2SLS

Once we have determined that an equation is identified, we can estimate it by two stage least squares. The instrumental variables consist of the exogenous variables appearing in either equation.

Example 16.5: Labor Supply of Married, Working Women

We use the data on working, married women in MROZ to estimate the labor supply equation (16.19) by 2SLS. The full set of instruments includes educ, age, kidslt6, nwifeinc, exper, and exper². The estimated labor supply curve is

$$\widehat{hours} = 2{,}225.66 + 1{,}639.56 \log(wage) - 183.75\,educ - 7.81\,age - 198.15\,kidslt6 - 10.17\,nwifeinc, \quad (16.24)$$

with standard errors (574.56), (470.58), (59.10), (9.38), (182.93), and (6.61), respectively, and $n = 428$; the reported standard errors are computed using a degrees-of-freedom adjustment. This equation shows that the labor supply curve slopes upward. The estimated coefficient on log(wage) has the following interpretation: holding other factors fixed,

$$\Delta\widehat{hours} \approx 16.4\,(\%\Delta wage).$$

We can calculate labor supply elasticities by multiplying both sides of this last equation by 100/hours:

$$100 \cdot (\Delta\widehat{hours}/hours) \approx (1{,}640/hours)(\%\Delta wage),$$

or

$$\%\Delta\widehat{hours} \approx (1{,}640/hours)(\%\Delta wage),$$

which implies that the labor supply elasticity (with respect to wage) is simply 1,640/hours. (The elasticity is not constant in this model because hours, not log(hours), is the dependent variable in (16.24).) At the average hours worked, 1,303, the estimated elasticity is 1,640/1,303 ≈ 1.26, which implies a greater than 1% increase in hours worked given a 1% increase in wage. This is a large estimated elasticity. At higher hours, the elasticity will be smaller; at lower hours, such as hours = 800, the elasticity is over two.

For comparison, when (16.19) is estimated by OLS, the coefficient on log(wage) is −2.05 (se = 54.88), which implies no wage effect on hours worked. To confirm that log(wage) is in fact endogenous in (16.19), we can carry out the test from Section 15.5. When we add the reduced form residuals $\hat{v}_2$ to the equation and estimate by OLS, the $t$ statistic on $\hat{v}_2$ is −6.61, which is very significant, and so log(wage) appears to be endogenous.
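The regression-based endogeneity test just used (from Section 15.5) is easy to reproduce: regress the suspected endogenous variable on all exogenous variables, save the residuals $\hat{v}_2$, add them to the structural equation estimated by OLS, and examine their $t$ statistic. A generic simulated illustration (not the MROZ data; the design and parameter values are invented):

```python
import numpy as np

rng = np.random.default_rng(8)
n = 2000
z = rng.normal(size=n)                   # excluded instrument
c = rng.normal(size=n)                   # unobservable driving the endogeneity
x = 0.7 * z + c + rng.normal(size=n)     # endogenous explanatory variable
y = 0.5 * x + c + rng.normal(size=n)

one = np.ones(n)
Z = np.column_stack([one, z])
v2 = x - Z @ np.linalg.lstsq(Z, x, rcond=None)[0]   # reduced form residuals

# OLS of y on (x, v2): the t statistic on v2 tests the null that x is exogenous
X = np.column_stack([one, x, v2])
b = np.linalg.lstsq(X, y, rcond=None)[0]
e = y - X @ b
s2 = (e @ e) / (n - X.shape[1])
se = np.sqrt(s2 * np.linalg.inv(X.T @ X).diagonal())
print(f"coef on v2 = {b[2]:.3f}, t = {b[2] / se[2]:.2f}")
```

Because the simulated $x$ is endogenous by construction, the $t$ statistic on $\hat{v}_2$ comes out strongly significant, just as the −6.61 did in Example 16.5.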
The wage offer equation (16.20) can also be estimated by 2SLS. The result is

$$\widehat{\log(wage)} = -.656 + .00013\,hours + .110\,educ + .035\,exper - .00071\,exper^2, \quad (16.25)$$

with standard errors (.338), (.00025), (.016), (.019), and (.00045), respectively, and $n = 428$. This differs from previous wage equations in that hours is included as an explanatory variable and 2SLS is used to account for endogeneity of hours (and we assume that educ and exper are exogenous). The coefficient on hours is statistically insignificant, which means that there is no evidence that the wage offer increases with hours worked. The other coefficients are similar to what we get by dropping hours and estimating the equation by OLS.

Estimating the effect of openness on inflation by instrumental variables is also straightforward.

Example 16.6: Inflation and Openness

Before we estimate (16.22) using the data in OPENNESS, we check to see whether open has sufficient partial correlation with the proposed IV, log(land). The reduced form regression is

$$\widehat{open} = 117.08 + .546 \log(pcinc) - 7.57 \log(land),$$

with standard errors (15.85), (1.493), and (.81), respectively, $n = 114$, and $R^2 = .449$. The $t$ statistic on log(land) is over nine in absolute value, which verifies Romer's assertion that smaller countries are more open. (The fact that log(pcinc) is so insignificant in this regression is irrelevant.)

Estimating (16.22), using log(land) as an IV for open, gives

$$\widehat{inf} = 26.90 - .337\,open + .376 \log(pcinc), \quad (16.26)$$

with standard errors (15.40), (.144), and (2.015), respectively, and $n = 114$. The coefficient on open is statistically significant at about the 1% level against a one-sided alternative ($\alpha_1 < 0$). The effect is economically important as well: for every percentage point increase in the import share of GDP, annual inflation is about one-third of a percentage point lower. For comparison, the OLS estimate is −.215 (se = .095).
[Exploring Further 16.3: How would you test whether the difference between the OLS and IV estimates on open is statistically significant?]

16.4 Systems with More Than Two Equations

Simultaneous equations models can consist of more than two equations. Studying general identification of these models is difficult and requires matrix algebra. Once an equation in a general system has been shown to be identified, it can be estimated by 2SLS.

16.4a Identification in Systems with Three or More Equations

We will use a three-equation system to illustrate the issues that arise in the identification of complicated SEMs. With intercepts suppressed, write the model as

$y_1 = \alpha_{12} y_2 + \alpha_{13} y_3 + \beta_{11} z_1 + u_1$   (16.27)
$y_2 = \alpha_{21} y_1 + \beta_{21} z_1 + \beta_{22} z_2 + \beta_{23} z_3 + u_2$   (16.28)
$y_3 = \alpha_{32} y_2 + \beta_{31} z_1 + \beta_{32} z_2 + \beta_{33} z_3 + \beta_{34} z_4 + u_3$   (16.29)

where the $y_g$ are the endogenous variables and the $z_j$ are exogenous. The first subscript on the parameters indicates the equation number, and the second indicates the variable number; we use $\alpha$ for parameters on endogenous variables and $\beta$ for parameters on exogenous variables.

Which of these equations can be estimated? It is generally difficult to show that an equation in an SEM with more than two equations is identified, but it is easy to see when certain equations are not identified. In system (16.27) through (16.29), we can easily see that (16.29) falls into this category. Because every exogenous variable appears in this equation, we have no IVs for $y_2$. Therefore, we cannot consistently estimate the parameters of this equation. For the reasons we discussed in Section 16.2, OLS estimation will not usually be consistent.

What about equation (16.27)? Things look promising because $z_2$, $z_3$, and $z_4$ are all excluded from the equation; this is another example of exclusion restrictions. Although there are two endogenous variables in this equation, we have three potential IVs for $y_2$ and $y_3$. Therefore, equation (16.27) passes the order condition. For completeness, we state the order condition for general SEMs.

ORDER CONDITION FOR IDENTIFICATION: An equation in any SEM satisfies the order condition for identification if the number of excluded exogenous variables from the equation is at least as large as the number of right-hand side endogenous variables.

The second equation, (16.28), also passes the order condition because there is one excluded exogenous variable, $z_4$, and one right-hand side endogenous variable, $y_1$.

As we discussed in Chapter 15 and in the previous section, the order condition is only necessary, not sufficient, for identification. For example, if $\beta_{34} = 0$, $z_4$ appears nowhere in the system, which means it is not correlated with $y_1$, $y_2$, or $y_3$. If $\beta_{34} = 0$, then the second equation is not identified, because $z_4$ is useless as an IV for $y_1$. This again illustrates that identification of an equation depends on the values of the parameters (which we can never know for sure) in the other equations.

There are many subtle ways that identification can fail in complicated SEMs. To obtain sufficient conditions, we need to extend the rank condition for identification in two-equation systems. This is possible, but it requires matrix algebra [see, for example, Wooldridge (2010, Chapter 9)]. In many applications, one assumes that, unless there is obviously failure of identification, an equation that satisfies the order condition is identified.
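The order condition lends itself to a mechanical check. Here is a small, purely illustrative helper that applies it to the system (16.27) through (16.29); the equation labels and variable lists are just the ones from the text:

def order_condition(rhs_endog, included_exog, all_exog):
    """True if the number of excluded exogenous variables is at least
    the number of right-hand side endogenous variables."""
    excluded = set(all_exog) - set(included_exog)
    return len(excluded) >= len(rhs_endog)

all_z = ["z1", "z2", "z3", "z4"]
system = {
    "eq_16_27": (["y2", "y3"], ["z1"]),
    "eq_16_28": (["y1"], ["z1", "z2", "z3"]),
    "eq_16_29": (["y2"], ["z1", "z2", "z3", "z4"]),
}
for name, (endog, exog) in system.items():
    print(name, order_condition(endog, exog, all_z))
# eq_16_27: True (3 excluded >= 2 endogenous); eq_16_28: True; eq_16_29: False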
The nomenclature on "overidentified" and "just identified" equations from Chapter 15 originated with SEMs. In terms of the order condition, (16.27) is an overidentified equation because we need only two IVs (for $y_2$ and $y_3$) but we have three available ($z_2$, $z_3$, and $z_4$); there is one overidentifying restriction in this equation. In general, the number of overidentifying restrictions equals the total number of exogenous variables in the system minus the total number of explanatory variables in the equation. These can be tested using the overidentification test from Section 15.5. Equation (16.28) is a just identified equation, and the third equation is an unidentified equation.

16.4b Estimation

Regardless of the number of equations in an SEM, each identified equation can be estimated by 2SLS. The instruments for a particular equation consist of the exogenous variables appearing anywhere in the system. Tests for endogeneity, heteroskedasticity, serial correlation, and overidentifying restrictions can be obtained, just as in Chapter 15.

It turns out that, when any system with two or more equations is correctly specified and certain additional assumptions hold, system estimation methods are generally more efficient than estimating each equation by 2SLS. The most common system estimation method in the context of SEMs is three stage least squares. These methods, with or without endogenous explanatory variables, are beyond the scope of this text. [See, for example, Wooldridge (2010, Chapters 7 and 8).]

16.5 Simultaneous Equations Models with Time Series

Among the earliest applications of SEMs was estimation of large systems of simultaneous equations that were used to describe a country's economy. A simple Keynesian model of aggregate demand (that ignores exports and imports) is

$C_t = \beta_0 + \beta_1(Y_t - T_t) + \beta_2 r_t + u_{t1}$   (16.30)
$I_t = \gamma_0 + \gamma_1 r_t + u_{t2}$   (16.31)
$Y_t \equiv C_t + I_t + G_t$   (16.32)

where $C_t$ = consumption, $Y_t$ = income, $T_t$ = tax receipts, $r_t$ = the interest rate, $I_t$ = investment, and $G_t$ = government spending. [See, for example, Mankiw (1994, Chapter 9).] For concreteness, assume t represents year.

The first equation is an aggregate consumption function, where consumption depends on disposable income, the interest rate, and the unobserved structural error $u_{t1}$. The second equation is a very simple investment function. Equation (16.32) is an identity that is a result of national income accounting: it holds by definition, without error. Thus, there is no sense in which we estimate (16.32), but we need this equation to round out the model.

Because there are three equations in the system, there must also be three endogenous variables. Given the first two equations, it is clear that we intend for $C_t$ and $I_t$ to be endogenous. In addition, because of the accounting identity, $Y_t$ is endogenous. We would assume, at least in this model, that $T_t$, $r_t$, and $G_t$ are exogenous, so that they are uncorrelated with $u_{t1}$ and $u_{t2}$. (We will discuss problems with this kind of assumption later.)

If $r_t$ is exogenous, then OLS estimation of equation (16.31) is natural. The consumption function, however, depends on disposable income, which is endogenous because $Y_t$ is. We have two instruments available under the maintained exogeneity assumptions: $T_t$ and $G_t$. Therefore, if we follow our prescription for estimating cross-sectional equations, we would estimate (16.30) by 2SLS using instruments $(T_t, G_t, r_t)$.
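To see the endogeneity of $Y_t$ explicitly, substitute (16.30) and (16.31) into the identity (16.32) and solve for $Y_t$; this short derivation is implicit in the text and assumes $\beta_1 \neq 1$:

$(1 - \beta_1) Y_t = \beta_0 + \gamma_0 - \beta_1 T_t + (\beta_2 + \gamma_1) r_t + G_t + u_{t1} + u_{t2}$

$Y_t = \dfrac{\beta_0 + \gamma_0 - \beta_1 T_t + (\beta_2 + \gamma_1) r_t + G_t + u_{t1} + u_{t2}}{1 - \beta_1}.$

Because $u_{t1}$ appears in this reduced form, $Y_t$, and therefore disposable income $Y_t - T_t$, is correlated with $u_{t1}$, which is exactly why (16.30) calls for 2SLS rather than OLS.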
Models such as (16.30) through (16.32) are seldom estimated now, for several good reasons. First, it is very difficult to justify, at an aggregate level, the assumption that taxes, interest rates, and government spending are exogenous. Taxes clearly depend directly on income; for example, with a single marginal income tax rate, $\tau_t$, in year t, $T_t = \tau_t Y_t$. We can easily allow this by replacing $(Y_t - T_t)$ with $(1 - \tau_t)Y_t$ in (16.30), and we can still estimate the equation by 2SLS if we assume that government spending is exogenous. We could also add the tax rate to the instrument list, if it is exogenous. But are government spending and tax rates really exogenous? They certainly could be, in principle, if the government sets spending and tax rates independently of what is happening in the economy. But it is a difficult case to make: in reality, government spending generally depends on the level of income, and at high levels of income, the same tax receipts are collected for lower marginal tax rates. In addition, assuming that interest rates are exogenous is extremely questionable. We could specify a more realistic model that includes money demand and supply, and then interest rates could be jointly determined with $C_t$, $I_t$, and $Y_t$. But then finding enough exogenous variables to identify the equations becomes quite difficult (and the following problems with these models still pertain).

Some have argued that certain components of government spending, such as defense spending [see, for example, Hall (1988) and Ramey (1991)], are exogenous in a variety of simultaneous equations applications. But this is not universally agreed upon, and, in any case, defense spending is not always appropriately correlated with the endogenous explanatory variables. [See Shea (1993) for discussion and Computer Exercise C6 for an example.]

A second problem with a model such as (16.30) through (16.32) is that it is completely static. Especially with monthly or quarterly data, but even with annual data, we often expect adjustment lags. (One argument in favor of static Keynesian-type models is that they are intended to describe the long run without worrying about short-run dynamics.) Allowing dynamics is not very difficult. For example, we could add lagged income to equation (16.31):

$I_t = \gamma_0 + \gamma_1 r_t + \gamma_2 Y_{t-1} + u_{t2}.$   (16.33)

In other words, we add a lagged endogenous variable (but not $I_{t-1}$) to the investment equation. Can we treat $Y_{t-1}$ as exogenous in this equation? Under certain assumptions on $u_{t2}$, the answer is yes. But we typically call a lagged endogenous variable in an SEM a predetermined variable. (Lags of exogenous variables are also predetermined.) If we assume that $u_{t2}$ is uncorrelated with current exogenous variables (which is standard) and all past endogenous and exogenous variables, then $Y_{t-1}$ is uncorrelated with $u_{t2}$. Given exogeneity of $r_t$, we can estimate (16.33) by OLS.

If we add lagged consumption to (16.30), we can treat $C_{t-1}$ as exogenous in this equation under the same assumptions on $u_{t1}$ that we made for $u_{t2}$ in the previous paragraph. Current disposable income is still endogenous in

$C_t = \beta_0 + \beta_1(Y_t - T_t) + \beta_2 r_t + \beta_3 C_{t-1} + u_{t1},$   (16.34)
so we could estimate this equation by 2SLS using instruments $(T_t, G_t, r_t, C_{t-1})$; if investment is determined by (16.33), $Y_{t-1}$ should be added to the instrument list. [To see why, use (16.32), (16.33), and (16.34) to find the reduced form for $Y_t$ in terms of the exogenous and predetermined variables: $T_t$, $r_t$, $G_t$, $C_{t-1}$, and $Y_{t-1}$. Because $Y_{t-1}$ shows up in this reduced form, it should be used as an IV.]

The presence of dynamics in aggregate SEMs is, at least for the purposes of forecasting, a clear improvement over static SEMs. But there are still some important problems with estimating SEMs using aggregate time series data, some of which we discussed in Chapters 11 and 15. Recall that the validity of the usual OLS or 2SLS inference procedures in time series applications hinges on the notion of weak dependence. Unfortunately, series such as aggregate consumption, income, investment, and even interest rates seem to violate the weak dependence requirements. (In the terminology of Chapter 11, they have unit roots.) These series also tend to have exponential trends, although this can be partly overcome by using the logarithmic transformation and assuming different functional forms. Generally, even the large sample, let alone the small sample, properties of OLS and 2SLS are complicated and dependent on various assumptions when they are applied to equations with I(1) variables. We will briefly touch on these issues in Chapter 18; an advanced, general treatment is given by Hamilton (1994).

Does the previous discussion mean that SEMs are not usefully applied to time series data? Not at all. The problems with trends and high persistence can be avoided by specifying systems in first differences or growth rates. But one should recognize that this is a different SEM than one specified in levels. [For example, if we specify consumption growth as a function of disposable income growth and interest rate changes, this is different from (16.30).] Also, as we discussed earlier, incorporating dynamics is not especially difficult. Finally, the problem of finding truly exogenous variables to include in SEMs is often easier with disaggregated data. For example, for manufacturing industries, Shea (1993) describes how output (or, more precisely, growth in output) in other industries can be used as an instrument in estimating supply functions. Ramey (1991) also has a convincing analysis of estimating industry cost functions by instrumental variables using time series data.

The next example shows how aggregate data can be used to test an important economic theory: the permanent income theory of consumption, usually called the permanent income hypothesis (PIH). The approach used in this example is not, strictly speaking, based on a simultaneous equations model, but we can think of consumption and income growth (as well as interest rates) as being jointly determined.

EXAMPLE 16.7: Testing the Permanent Income Hypothesis

Campbell and Mankiw (1990) used instrumental variables methods to test various versions of the PIH. We will use the annual data from 1959 through 1995 in CONSUMP to mimic one of their analyses. (Campbell and Mankiw used quarterly data running through 1985.) One equation estimated by Campbell and Mankiw (using our notation) is

$gc_t = \beta_0 + \beta_1 gy_t + \beta_2 r3_t + u_t,$   (16.35)

where $gc_t = \Delta\log(c_t)$ = annual growth in real per capita consumption (excluding durables), $gy_t$ = growth in real disposable income, and $r3_t$ = the (ex post) real interest rate as measured by the return on three-month T-bill rates: $r3_t = i3_t - inf_t$, where the inflation rate is based on the Consumer Price Index.
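As a quick illustration of these variable definitions, the growth rates and the ex post real rate can be built from level series as follows; the numbers here are made up, and in practice one would use the CONSUMP series:

import numpy as np

c = np.array([100.0, 103.0, 105.5, 109.2])   # real per capita consumption
y = np.array([150.0, 154.0, 157.0, 163.0])   # real disposable income
i3 = np.array([4.0, 4.5, 5.0, 4.8])          # 3-month T-bill rate, percent
inf = np.array([2.0, 2.5, 3.0, 2.7])         # CPI inflation, percent

gc = np.diff(np.log(c))   # gc_t = log(c_t) - log(c_{t-1})
gy = np.diff(np.log(y))
r3 = (i3 - inf)[1:]       # ex post real rate, aligned with the growth rates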
The growth rates of consumption and disposable income are not trending, and they are weakly dependent; we will assume this is the case for $r3_t$ as well, so that we can apply standard asymptotic theory.

The key feature of equation (16.35) is that the PIH implies that the error term $u_t$ has a zero mean conditional on all information observed at time t − 1 or earlier: $E(u_t|I_{t-1}) = 0$. However, $u_t$ is not necessarily uncorrelated with $gy_t$ or $r3_t$; a traditional way to think about this is that these variables are jointly determined, but we are not writing down a full three-equation system. Because $u_t$ is uncorrelated with all variables dated t − 1 or earlier, valid instruments for estimating (16.35) are lagged values of gc, gy, and r3 (and lags of other observable variables, but we will not use those here).

What are the hypotheses of interest? The pure form of the PIH has $\beta_1 = \beta_2 = 0$. Campbell and Mankiw argue that $\beta_1$ is positive if some fraction of the population consumes current income, rather than permanent income. The PIH with a nonconstant real interest rate implies that $\beta_2 > 0$. When we estimate (16.35) by 2SLS, using instruments $gc_{t-1}$, $gy_{t-1}$, and $r3_{t-1}$ for the endogenous variables $gy_t$ and $r3_t$, we obtain

$\widehat{gc}_t = .0081 + .586\,gy_t - .00027\,r3_t$   (16.36)
(se)   (.0032)   (.135)   (.00076)
$n = 35$, $R^2 = .678$.

Therefore, the pure form of the PIH is strongly rejected because the coefficient on gy is economically large (a 1% increase in disposable income increases consumption by over .5%) and statistically significant (t = 4.34). By contrast, the real interest rate coefficient is very small and statistically insignificant. These findings are qualitatively the same as Campbell and Mankiw's.

The PIH also implies that the errors $\{u_t\}$ are serially uncorrelated. After 2SLS estimation, we obtain the residuals $\hat{u}_t$ and include $\hat{u}_{t-1}$ as an additional explanatory variable in (16.36); we still use instruments $gc_{t-1}$, $gy_{t-1}$, $r3_{t-1}$, and $\hat{u}_{t-1}$ acts as its own instrument (see Section 15.7). The coefficient on $\hat{u}_{t-1}$ is $\hat{\rho} = .187$ (se = .133), so there is some evidence of positive serial correlation, although not at the 5% significance level. Campbell and Mankiw discuss why, with the available quarterly data, positive serial correlation might be found in the errors even if the PIH holds; some of those concerns carry over to annual data.

Using growth rates of trending or I(1) variables in SEMs is fairly common in time series applications. For example, Shea (1993) estimates industry supply curves specified in terms of growth rates. If a structural model contains a time trend, which may capture exogenous, trending factors that are not directly modeled, then the trend acts as its own IV.

[Exploring Further 16.4: Suppose that, for a particular city, you have monthly data on per capita consumption of fish, per capita income, the price of fish, and the prices of chicken and beef; income and chicken and beef prices are exogenous. Assume that there is no seasonality in the demand function for fish, but there is in the supply of fish. How can you use this information to estimate a constant elasticity demand-for-fish equation? Specify an equation and discuss identification. (Hint: You should have 11 instrumental variables for the price of fish.)]
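The estimation strategy of Example 16.7, lagged variables as instruments followed by a serial correlation check, can be sketched as follows. The data are simulated, and the serial correlation check here is a simple residual autoregression as a quick diagnostic; the text's version instead re-estimates the equation by 2SLS with $\hat{u}_{t-1}$ added, using $\hat{u}_{t-1}$ as its own instrument.

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
T = 200
gy = np.zeros(T)
v = rng.normal(size=T)
for t in range(1, T):
    gy[t] = 0.01 + 0.5 * gy[t - 1] + v[t]      # persistent income growth
u = 0.6 * v + rng.normal(size=T)               # u_t correlated with gy_t
gc = 0.008 + 0.5 * gy + u                      # structural equation

# 2SLS with the first lag of gy as the instrument for gy_t.
y, x, z = gc[1:], gy[1:], gy[:-1]
stage1 = sm.OLS(x, sm.add_constant(z)).fit()
stage2 = sm.OLS(y, sm.add_constant(stage1.fittedvalues)).fit()
b0, b1 = stage2.params
print(b1)                                       # near 0.5

# Serial-correlation diagnostic: residuals must use the ORIGINAL x.
resid = y - (b0 + b1 * x)
ar1 = sm.OLS(resid[1:], sm.add_constant(resid[:-1])).fit()
print(ar1.params[1], ar1.tvalues[1])            # rho-hat and its t statistic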
16.6 Simultaneous Equations Models with Panel Data

Simultaneous equations models also arise in panel data contexts. For example, we can imagine estimating labor supply and wage offer equations, as in Example 16.3, for a group of people working over a given period of time. In addition to allowing for simultaneous determination of variables within each time period, we can allow for unobserved effects in each equation. In a labor supply function, it would be useful to allow an unobserved taste for leisure that does not change over time.

The basic approach to estimating SEMs with panel data involves two steps: (1) eliminate the unobserved effects from the equations of interest using the fixed effects transformation or first differencing and (2) find instrumental variables for the endogenous variables in the transformed equation. This can be very challenging because, for a convincing analysis, we need to find instruments that change over time. To see why, write an SEM for panel data as

$y_{it1} = \alpha_1 y_{it2} + z_{it1}\beta_1 + a_{i1} + u_{it1}$   (16.37)
$y_{it2} = \alpha_2 y_{it1} + z_{it2}\beta_2 + a_{i2} + u_{it2}$   (16.38)

where i denotes cross section, t denotes time period, and $z_{it1}\beta_1$ or $z_{it2}\beta_2$ denotes linear functions of a set of exogenous explanatory variables in each equation. The most general analysis allows the unobserved effects, $a_{i1}$ and $a_{i2}$, to be correlated with all explanatory variables, even the elements in z. However, we assume that the idiosyncratic structural errors, $u_{it1}$ and $u_{it2}$, are uncorrelated with the z in both equations and across all time periods; this is the sense in which the z are exogenous. Except under special circumstances, $y_{it2}$ is correlated with $u_{it1}$, and $y_{it1}$ is correlated with $u_{it2}$.

Suppose we are interested in equation (16.37). We cannot estimate it by OLS, as the composite error $a_{i1} + u_{it1}$ is potentially correlated with all explanatory variables. Suppose we difference over time to remove the unobserved effect, $a_{i1}$:

$\Delta y_{it1} = \alpha_1 \Delta y_{it2} + \Delta z_{it1}\beta_1 + \Delta u_{it1}.$   (16.39)

(As usual with differencing or time-demeaning, we can only estimate the effects of variables that change over time for at least some cross-sectional units.) Now, the error term in this equation is uncorrelated with $\Delta z_{it1}$, by assumption. But $\Delta y_{it2}$ and $\Delta u_{it1}$ are possibly correlated. Therefore, we need an IV for $\Delta y_{it2}$. As with the case of pure cross-sectional or pure time series data, possible IVs come from the other equation: elements in $z_{it2}$ that are not also in $z_{it1}$. In practice, we need time-varying elements in $z_{it2}$ that are not also in $z_{it1}$. This is because we need an instrument for $\Delta y_{it2}$, and a change in a variable from one period to the next is unlikely to be highly correlated with the level of exogenous variables. In fact, if we difference (16.38), we see that the natural IVs for $\Delta y_{it2}$ are those elements in $\Delta z_{it2}$ that are not also in $\Delta z_{it1}$.
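A compact sketch of the two-step approach, first differencing followed by pooled 2SLS, on a simulated panel; all names and parameter values are invented for illustration, and the printed second-stage coefficients are consistent even though their OLS standard errors would not be valid:

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
N, T = 500, 4

# z2 is a time-varying exogenous variable excluded from equation (16.37);
# a1 is an unobserved effect correlated with the endogenous y2.
a1 = rng.normal(size=(N, 1))
z1 = rng.normal(size=(N, T))
z2 = rng.normal(size=(N, T))
u1 = rng.normal(size=(N, T))
y2 = 1.5 * z2 + a1 + 0.7 * u1 + rng.normal(size=(N, T))  # endogenous
y1 = 2.0 * y2 + 1.0 * z1 + a1 + u1

# Step 1: first-difference to remove a1. Step 2: 2SLS on the differences,
# using (dz1, dz2) as instruments for the regressors (dy2, dz1).
dy1, dy2 = np.diff(y1).ravel(), np.diff(y2).ravel()
dz1, dz2 = np.diff(z1).ravel(), np.diff(z2).ravel()

Z = sm.add_constant(np.column_stack([dz1, dz2]))
dy2_hat = sm.OLS(dy2, Z).fit().fittedvalues          # first stage
X = sm.add_constant(np.column_stack([dy2_hat, dz1]))
print(sm.OLS(dy1, X).fit().params)                   # alpha1 near 2, beta1 near 1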
As an example of the problems that can arise, consider a panel data version of the labor supply function in Example 16.3. After differencing, suppose we have the equation

$\Delta hours_{it} = \beta_0 + \alpha_1 \Delta\log(wage_{it}) + \Delta(\text{other factors}_{it}),$

and we wish to use $\Delta exper_{it}$ as an instrument for $\Delta\log(wage_{it})$. The problem is that, because we are looking at people who work in every time period, $\Delta exper_{it} = 1$ for all i and t. (Each person gets another year of experience after a year passes.) We cannot use an IV that is the same value for all i and t, and so we must look elsewhere.

Often, participation in an experimental program can be used to obtain IVs in panel data contexts. In Example 15.10, we used receipt of job training grants as an IV for the change in hours of training in determining the effects of job training on worker productivity. In fact, we could view that in an SEM context: job training and worker productivity are jointly determined, but receiving a job training grant is exogenous in equation (15.57).

We can sometimes come up with clever, convincing instrumental variables in panel data applications, as the following example illustrates.

EXAMPLE 16.8: Effect of Prison Population on Violent Crime Rates

In order to estimate the causal effect of prison population increases on crime rates at the state level, Levitt (1996) used instances of prison overcrowding litigation as instruments for the growth in prison population. The equation Levitt estimated is in first differences; we can write an underlying fixed effects model as

$\log(crime_{it}) = \theta_t + \alpha_1 \log(prison_{it}) + z_{it1}\beta_1 + a_{i1} + u_{it1},$   (16.40)

where $\theta_t$ denotes different time intercepts and crime and prison are measured per 100,000 people. (The prison population variable is measured on the last day of the previous year.) The vector $z_{it1}$ contains the log of police per capita, the log of income per capita, the unemployment rate, the proportions of the population that are black and that live in metropolitan areas, and age distribution proportions. Differencing (16.40) gives the equation estimated by Levitt:

$\Delta\log(crime_{it}) = \xi_t + \alpha_1 \Delta\log(prison_{it}) + \Delta z_{it1}\beta_1 + \Delta u_{it1}.$   (16.41)

Simultaneity between crime rates and prison population, or, more precisely, in the growth rates, makes OLS estimation of (16.41) generally inconsistent. Using the violent crime rate and a subset of the data from Levitt (in PRISON, for the years 1980 through 1993, for 51 × 14 = 714 total observations), we obtain the pooled OLS estimate of $\alpha_1$, which is −.181 (se = .048). We also estimate (16.41) by pooled 2SLS, where the instruments for $\Delta\log(prison)$ are two binary variables, one each for whether a final decision was reached on overcrowding litigation in the current year or in the previous two years. The pooled 2SLS estimate of $\alpha_1$ is −1.032 (se = .370). Therefore, the 2SLS estimated effect is much larger; not surprisingly, it is much less precise, too. Levitt found similar results when using a longer time period (but with early observations missing for some states) and more instruments.
Testing for AR(1) serial correlation in $r_{it1} = \Delta u_{it1}$ is easy. After the pooled 2SLS estimation, obtain the residuals, $\hat{r}_{it1}$. Then, include one lag of these residuals in the original equation, and estimate the equation by 2SLS, where $\hat{r}_{i,t-1,1}$ acts as its own instrument. (The first year is lost because of the lagging.) Then, the usual 2SLS t statistic on the lagged residual is a valid test for serial correlation. In Example 16.8, the coefficient on $\hat{r}_{i,t-1,1}$ is only about .076, with t = 1.67. With such a small coefficient and modest t statistic, we can safely assume serial independence.

An alternative approach to estimating SEMs with panel data is to use the fixed effects transformation and then to apply an IV technique such as pooled 2SLS. A simple procedure is to estimate the time-demeaned equation by pooled 2SLS, which would look like

$\ddot{y}_{it1} = \alpha_1 \ddot{y}_{it2} + \ddot{z}_{it1}\beta_1 + \ddot{u}_{it1}, \quad t = 1, 2, \dots, T,$   (16.42)

where $\ddot{z}_{it1}$ and $\ddot{z}_{it2}$ are the IVs. This is equivalent to using 2SLS in the dummy variable formulation, where the unit-specific dummy variables act as their own instruments. Ayres and Levitt (1998) applied 2SLS to a time-demeaned equation to estimate the effect of LoJack electronic theft prevention devices on car theft rates in cities. If (16.42) is estimated directly, then the df needs to be corrected to N(T − 1) − k₁, where k₁ is the total number of elements in $\alpha_1$ and $\beta_1$. Including unit-specific dummy variables and applying pooled 2SLS to the original data produces the correct df. A detailed treatment of 2SLS with panel data is given in Wooldridge (2010, Chapter 11).
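The fixed effects alternative is equally mechanical: time-demean every variable, including the instruments, apply pooled 2SLS, and remember the degrees-of-freedom correction. A sketch under the same kind of simulated design as above:

import numpy as np

rng = np.random.default_rng(4)
N, T = 300, 5

a1 = rng.normal(size=(N, 1))
z1 = rng.normal(size=(N, T))
z2 = rng.normal(size=(N, T))                     # excluded exogenous IV
u1 = rng.normal(size=(N, T))
y2 = 1.2 * z2 + a1 + 0.5 * u1 + rng.normal(size=(N, T))
y1 = 1.5 * y2 + 0.8 * z1 + a1 + u1

def within(M):
    """Remove unit means (the fixed effects transformation)."""
    return (M - M.mean(axis=1, keepdims=True)).ravel()

yd1, yd2, zd1, zd2 = map(within, (y1, y2, z1, z2))

Z = np.column_stack([zd1, zd2])                  # demeaned IVs (no constant:
X = np.column_stack([yd2, zd1])                  # demeaning removes it)
Xhat = Z @ np.linalg.lstsq(Z, X, rcond=None)[0]  # first stage
beta = np.linalg.lstsq(Xhat, yd1, rcond=None)[0] # pooled 2SLS, demeaned data
print(beta)                                      # near [1.5, 0.8]

# If estimated this way, the df for standard errors should be
# N*(T - 1) - k1, not N*T - k1.
df_correct = N * (T - 1) - X.shape[1]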
SUMMARY

Simultaneous equations models are appropriate when each equation in the system has a ceteris paribus interpretation. Good examples are when separate equations describe different sides of a market or the behavioral relationships of different economic agents. Supply and demand examples are leading cases, but there are many other applications of SEMs in economics and the social sciences.

An important feature of SEMs is that, by fully specifying the system, it is clear which variables are assumed to be exogenous and which ones appear in each equation. Given a full system, we are able to determine which equations can be identified (that is, can be estimated). In the important case of a two-equation system, identification of, say, the first equation is easy to state: at least one exogenous variable must be excluded from the first equation that appears with a nonzero coefficient in the second equation.

As we know from previous chapters, OLS estimation of an equation that contains an endogenous explanatory variable generally produces biased and inconsistent estimators. Instead, 2SLS can be used to estimate any identified equation in a system. More advanced system methods are available, but they are beyond the scope of our treatment.

The distinction between omitted variables and simultaneity in applications is not always sharp. Both problems, not to mention measurement error, can appear in the same equation. A good example is the labor supply of married women. Years of education (educ) appears in both the labor supply and the wage offer functions [see equations (16.19) and (16.20)]. If omitted ability is in the error term of the labor supply function, then wage and education are both endogenous. The important thing is that an equation estimated by 2SLS can stand on its own.

SEMs can be applied to time series data as well. As with OLS estimation, we must be aware of trending, integrated processes in applying 2SLS. Problems such as serial correlation can be handled as in Section 15.7. We also gave an example of how to estimate an SEM using panel data, where the equation is first differenced to remove the unobserved effect. Then, we can estimate the differenced equation by pooled 2SLS, just as in Chapter 15. Alternatively, in some cases, we can use time-demeaning of all variables, including the IVs, and then apply pooled 2SLS; this is identical to putting in dummies for each cross-sectional observation and using 2SLS, where the dummies act as their own instruments. SEM applications with panel data are very powerful, as they allow us to control for unobserved heterogeneity while dealing with simultaneity. They are becoming more and more common and are not especially difficult to estimate.

KEY TERMS

Endogenous Variables; Exclusion Restrictions; Exogenous Variables; Identified Equation; Just Identified Equation; Lagged Endogenous Variable; Order Condition; Overidentified Equation; Predetermined Variable; Rank Condition; Reduced Form Equation; Reduced Form Error; Reduced Form Parameters; Simultaneity; Simultaneity Bias; Simultaneous Equations Model (SEM); Structural Equation; Structural Errors; Structural Parameters; Unidentified Equation

PROBLEMS

1. Write a two-equation system in "supply and demand form," that is, with the same variable $y_1$ (typically, "quantity") appearing on the left-hand side:

$y_1 = \alpha_1 y_2 + \beta_1 z_1 + u_1$
$y_1 = \alpha_2 y_2 + \beta_2 z_2 + u_2.$

(i) If $\alpha_1 = 0$ or $\alpha_2 = 0$, explain why a reduced form exists for $y_1$. (Remember, a reduced form expresses $y_1$ as a linear function of the exogenous variables and the structural errors.) If $\alpha_1 \neq 0$ and $\alpha_2 = 0$, find the reduced form for $y_2$.
(ii) If $\alpha_1 \neq 0$, $\alpha_2 \neq 0$, and $\alpha_1 \neq \alpha_2$, find the reduced form for $y_1$. Does $y_2$ have a reduced form in this case?
(iii) Is the condition $\alpha_1 \neq \alpha_2$ likely to be met in supply and demand examples? Explain.

2. Let corn denote per capita consumption of corn in bushels, at the county level, let price be the price per bushel of corn, let income denote per capita county income, and let rainfall be inches of rainfall during the last corn-growing season. The following simultaneous equations model imposes the equilibrium condition that supply equals demand:

$corn = \alpha_1 price + \beta_1 income + u_1$
$corn = \alpha_2 price + \beta_2 rainfall + \gamma_2 rainfall^2 + u_2.$

Which is the supply equation, and which is the demand equation? Explain.

3. In Problem 3 of Chapter 3, we estimated an equation to test for a tradeoff between minutes per week spent sleeping (sleep) and minutes per week spent working (totwrk) for a random sample of individuals. We also included education and age in the equation. Because sleep and totwrk are jointly chosen by each individual, is the estimated tradeoff between sleeping and working subject to a "simultaneity bias" criticism? Explain.

4. Suppose that annual earnings and alcohol consumption are determined by the SEM

$\log(earnings) = \beta_0 + \beta_1 alcohol + \beta_2 educ + u_1$
$alcohol = \gamma_0 + \gamma_1 \log(earnings) + \gamma_2 educ + \gamma_3 \log(price) + u_2,$

where price is a local price index for alcohol, which includes state and local taxes. Assume that educ and price are exogenous. If $\beta_1$, $\beta_2$, $\gamma_1$, $\gamma_2$, and $\gamma_3$ are all different from zero, which equation is identified? How would you estimate that equation?
5. A simple model to determine the effectiveness of condom usage on reducing sexually transmitted diseases among sexually active high school students is

$infrate = \beta_0 + \beta_1 conuse + \beta_2 percmale + \beta_3 avginc + \beta_4 city + u_1,$

where infrate = the percentage of sexually active students who have contracted venereal disease, conuse = the percentage of boys who claim to use condoms regularly, avginc = average family income, and city = a dummy variable indicating whether a school is in a city. The model is at the school level.
(i) Interpreting the preceding equation in a causal, ceteris paribus fashion, what should be the sign of $\beta_1$?
(ii) Why might infrate and conuse be jointly determined?
(iii) If condom usage increases with the rate of venereal disease, so that $\gamma_1 > 0$ in the equation

$conuse = \gamma_0 + \gamma_1 infrate + \text{other factors},$

what is the likely bias in estimating $\beta_1$ by OLS?
(iv) Let condis be a binary variable equal to unity if a school has a program to distribute condoms. Explain how this can be used to estimate $\beta_1$ (and the other betas) by IV. What do we have to assume about condis in each equation?

6. Consider a linear probability model for whether employers offer a pension plan based on the percentage of workers belonging to a union, as well as other factors:

$pension = \beta_0 + \beta_1 percunion + \beta_2 avgage + \beta_3 avgeduc + \beta_4 percmale + \beta_5 percmarr + u_1.$

(i) Why might percunion be jointly determined with pension?
(ii) Suppose that you can survey workers at firms and collect information on workers' families. Can you think of information that can be used to construct an IV for percunion?
(iii) How would you test whether your variable is at least a reasonable IV candidate for percunion?

7. For a large university, you are asked to estimate the demand for tickets to women's basketball games. You can collect time series data over 10 seasons, for a total of about 150 observations. One possible model is

$lATTEND_t = \beta_0 + \beta_1 lPRICE_t + \beta_2 WINPERC_t + \beta_3 RIVAL_t + \beta_4 WEEKEND_t + \beta_5 t + u_t,$

where PRICE_t = the price of admission, probably measured in real terms (say, deflating by a regional consumer price index), WINPERC_t = the team's current winning percentage, RIVAL_t = a dummy variable indicating a game against a rival, and WEEKEND_t = a dummy variable indicating whether the game is on a weekend. The l denotes natural logarithm, so that the demand function has a constant price elasticity.
(i) Why is it a good idea to have a time trend in the equation?
(ii) The supply of tickets is fixed by the stadium capacity; assume this has not changed over the 10 years. This means that quantity supplied does not vary with price. Does this mean that price is necessarily exogenous in the demand equation? (Hint: The answer is no.)
(iii) Suppose that the nominal price of admission changes slowly, say, at the beginning of each season. The athletic office chooses price based partly on last season's average attendance, as well as last season's team success. Under what assumptions is last season's winning percentage (SEASPERC_{t−1}) a valid instrumental variable for lPRICE_t?
(iv) Does it seem reasonable to include the log of the real price of men's basketball games in the equation? Explain. What sign does economic theory predict for its coefficient? Can you think of another variable related to men's basketball that might belong in the women's attendance equation?
(v) If you are worried that some of the series, particularly lATTEND and lPRICE, have unit roots, how might you change the estimated equation?
(vi) If some games are sold out, what problems does this cause for estimating the demand function? (Hint: If a game is sold out, do you necessarily observe the true demand?)

8. How big is the effect of per-student school expenditures on local housing values? Let HPRICE be the median housing price in a school district and let EXPEND be per-student expenditures. Using panel data for the years 1992, 1994, and 1996, we postulate the model

$lHPRICE_{it} = \theta_t + \beta_1 lEXPEND_{it} + \beta_2 lPOLICE_{it} + \beta_3 lMEDINC_{it} + \beta_4 PROPTAX_{it} + a_{i1} + u_{it1},$

where POLICE_{it} is per capita police expenditures, MEDINC_{it} is median income, and PROPTAX_{it} is the property tax rate; l denotes natural logarithm. Expenditures and housing price are simultaneously determined because the value of homes directly affects the revenues available for funding schools.

Suppose that, in 1994, the way schools were funded was drastically changed: rather than being raised by local property taxes, school funding was largely determined at the state level. Let lSTATEALL_{it} denote the log of the state allocation for district i in year t, which is exogenous in the preceding equation, once we control for expenditures and a district fixed effect. How would you estimate the $\beta_j$?

COMPUTER EXERCISES

C1. Use SMOKE for this exercise.
(i) A model to estimate the effects of smoking on annual income (perhaps through lost work days due to illness, or productivity effects) is

$\log(income) = \beta_0 + \beta_1 cigs + \beta_2 educ + \beta_3 age + \beta_4 age^2 + u_1,$

where cigs is number of cigarettes smoked per day, on average. How do you interpret $\beta_1$?
(ii) To reflect the fact that cigarette consumption might be jointly determined with income, a demand for cigarettes equation is

$cigs = \gamma_0 + \gamma_1 \log(income) + \gamma_2 educ + \gamma_3 age + \gamma_4 age^2 + \gamma_5 \log(cigpric) + \gamma_6 restaurn + u_2,$

where cigpric is the price of a pack of cigarettes (in cents) and restaurn is a binary variable equal to unity if the person lives in a state with restaurant smoking restrictions. Assuming these are exogenous to the individual, what signs would you expect for $\gamma_5$ and $\gamma_6$?
(iii) Under what assumption is the income equation from part (i) identified?
(iv) Estimate the income equation by OLS and discuss the estimate of $\beta_1$.
(v) Estimate the reduced form for cigs. (Recall that this entails regressing cigs on all exogenous variables.) Are log(cigpric) and restaurn significant in the reduced form?
(vi) Now, estimate the income equation by 2SLS. Discuss how the estimate of $\beta_1$ compares with the OLS estimate.
(vii) Do you think that cigarette prices and restaurant smoking restrictions are exogenous in the income equation?

C2. Use MROZ for this exercise.
(i) Reestimate the labor supply function in Example 16.5, using log(hours) as the dependent variable. Compare the estimated elasticity (which is now constant) to the estimate obtained from equation (16.24) at the average hours worked.
(ii) In the labor supply equation from part (i), allow educ to be endogenous because of omitted ability. Use motheduc and fatheduc as IVs for educ. Remember, you now have two endogenous variables in the equation.
(iii) Test the overidentifying restrictions in the 2SLS estimation from part (ii). Do the IVs pass the test?

C3. Use the data in OPENNESS for this exercise.
(i) Because log(pcinc) is insignificant in both (16.22) and the reduced form for open, drop it from the analysis. Estimate (16.22) by OLS and IV without log(pcinc). Do any important conclusions change?
(ii) Still leaving log(pcinc) out of the analysis, is land or log(land) a better instrument for open? (Hint: Regress open on each of these separately and jointly.)
(iii) Now, return to (16.22). Add the dummy variable oil to the equation and treat it as exogenous. Estimate the equation by IV. Does being an oil producer have a ceteris paribus effect on inflation?

C4. Use the data in CONSUMP for this exercise.
(i) In Example 16.7, use the method from Section 15.5 to test the single overidentifying restriction in estimating (16.35). What do you conclude?
(ii) Campbell and Mankiw (1990) use second lags of all variables as IVs because of potential data measurement problems and informational lags. Reestimate (16.35), using only $gc_{t-2}$, $gy_{t-2}$, and $r3_{t-2}$ as IVs. How do the estimates compare with those in (16.36)?
(iii) Regress $gy_t$ on the IVs from part (ii) and test whether $gy_t$ is sufficiently correlated with them. Why is this important?

C5. Use the Economic Report of the President (2005 or later) to update the data in CONSUMP, at least through 2003. Reestimate equation (16.35). Do any important conclusions change?

C6. Use the data in CEMENT for this exercise.
(i) A static (inverse) supply function for the monthly growth in cement price (gprc) as a function of growth in quantity (gcem) is

$gprc_t = \alpha_1 gcem_t + \beta_0 + \beta_1 gprcpet_t + \beta_2 feb_t + \dots + \beta_{12} dec_t + u_t^s,$

where gprcpet (growth in the price of petroleum) is assumed to be exogenous and feb, ..., dec are monthly dummy variables. What signs do you expect for $\alpha_1$ and $\beta_1$? Estimate the equation by OLS. Does the supply function slope upward?
(ii) The variable gdefs is the monthly growth in real defense spending in the United States. What do you need to assume about gdefs for it to be a good IV for gcem? Test whether gcem is partially correlated with gdefs. (Do not worry about possible serial correlation in the reduced form.) Can you use gdefs as an IV in estimating the supply function?
(iii) Shea (1993) argues that the growth in output of residential (gres) and nonresidential (gnon) construction are valid instruments for gcem. The idea is that these are demand shifters that should be roughly uncorrelated with the supply error $u_t^s$. Test whether gcem is partially correlated with gres and gnon; again, do not worry about serial correlation in the reduced form.
(iv) Estimate the supply function, using gres and gnon as IVs for gcem. What do you conclude about the static supply function for cement? (The dynamic supply function is apparently upward sloping; see Shea 1993.)

C7. Refer to Example 13.9 and the data in CRIME4.
(i) Suppose that, after differencing to remove the unobserved effect, you think $\Delta\log(polpc)$ is simultaneously determined with $\Delta\log(crmrte)$; in particular, increases in crime are associated with increases in police officers. How does this help to explain the positive coefficient on $\Delta\log(polpc)$ in equation (13.33)?
(ii) The variable taxpc is the taxes collected per person in the county. Does it seem reasonable to exclude this from the crime equation?
(iii) Estimate the reduced form for $\Delta\log(polpc)$ using pooled OLS, including the potential IV, $\Delta\log(taxpc)$. Does it look like $\Delta\log(taxpc)$ is a good IV candidate? Explain.
(iv) Suppose that, in several of the years, the state of North Carolina awarded grants to some counties to increase the size of their county police force. How could you use this information to estimate the effect of additional police officers on the crime rate?

C8. Use the data set in FISH, which comes from Graddy (1995), to do this exercise. (The data set is also used in Computer Exercise C9 in Chapter 12.) Now, we will use it to estimate a demand function for fish.
(i) Assume that the demand equation can be written, in equilibrium for each time period, as

$\log(totqty_t) = \alpha_1 \log(avgprc_t) + \beta_{10} + \beta_{11} mon_t + \beta_{12} tues_t + \beta_{13} wed_t + \beta_{14} thurs_t + u_{t1},$

so that demand is allowed to differ across days of the week. Treating the price variable as endogenous, what additional information do we need to estimate the demand-equation parameters consistently?
(ii) The variables $wave2_t$ and $wave3_t$ are measures of ocean wave heights over the past several days. What two assumptions do we need to make in order to use $wave2_t$ and $wave3_t$ as IVs for $\log(avgprc_t)$ in estimating the demand equation?
(iii) Regress $\log(avgprc_t)$ on the day-of-the-week dummies and the two wave measures. Are $wave2_t$ and $wave3_t$ jointly significant? What is the p-value of the test?
(iv) Now, estimate the demand equation by 2SLS. What is the 95% confidence interval for the price elasticity of demand? Is the estimated elasticity reasonable?
(v) Obtain the 2SLS residuals, $\hat{u}_{t1}$. Add a single lag, $\hat{u}_{t-1,1}$, in estimating the demand equation by 2SLS. (Remember, use $\hat{u}_{t-1,1}$ as its own instrument.) Is there evidence of AR(1) serial correlation in the demand equation errors?
(vi) Given that the supply equation evidently depends on the wave variables, what two assumptions would we need to make in order to estimate the price elasticity of supply?
(vii) In the reduced form equation for $\log(avgprc_t)$, are the day-of-the-week dummies jointly significant? What do you conclude about being able to estimate the supply elasticity?

C9. For this exercise, use the data in AIRFARE, but only for the year 1997.
(i) A simple demand function for airline seats on routes in the United States is

$\log(passen) = \beta_{10} + \alpha_1 \log(fare) + \beta_{11} \log(dist) + \beta_{12} [\log(dist)]^2 + u_1,$

where passen = average passengers per day, fare = average airfare, and dist = the route distance (in miles). If this is truly a demand function, what should be the sign of $\alpha_1$?
(ii) Estimate the equation from part (i) by OLS. What is the estimated price elasticity?
(iii) Consider the variable concen, which is a measure of market concentration. (Specifically, it is the share of business accounted for by the largest carrier.) Explain in words what we must assume to treat concen as exogenous in the demand equation.
(iv) Now assume concen is exogenous to the demand equation. Estimate the reduced form for log(fare) and confirm that concen has a positive (partial) effect on log(fare).
(v) Estimate the demand function using IV. Now what is the estimated price elasticity of demand? How does it compare with the OLS estimate?
(vi) Using the IV estimates, describe how demand for seats depends on route distance.

C10. Use the entire panel data set in AIRFARE for this exercise. The demand equation in a simultaneous equations unobserved effects model is

$\log(passen_{it}) = \theta_{t1} + \alpha_1 \log(fare_{it}) + a_{i1} + u_{it1},$

where we absorb the distance variables into $a_{i1}$.
(i) Estimate the demand function using fixed effects, being sure to include year dummies to account for the different intercepts. What is the estimated elasticity?
(ii) Use fixed effects to estimate the reduced form

$\log(fare_{it}) = \theta_{t2} + \pi_{21} concen_{it} + a_{i2} + v_{it2}.$

Perform the appropriate test to ensure that $concen_{it}$ can be used as an IV for $\log(fare_{it})$.
(iii) Now, estimate the demand function using the fixed effects transformation along with IV, as in equation (16.42). What is the estimated elasticity? Is it statistically significant?

C11. A common method for estimating Engel curves is to model expenditure shares as a function of total expenditure, and possibly demographic variables. A common specification has the form

$sgood = \beta_0 + \beta_1 ltotexpend + \text{demographics} + u,$

where sgood is the fraction of spending on a particular good out of total expenditure and ltotexpend is the log of total expenditure. The sign and magnitude of $\beta_1$ are of interest across various expenditure categories. To account for the potential endogeneity of ltotexpend, which can be viewed as an omitted variables or simultaneous equations problem (or both), the log of family income is often used as an instrumental variable. Let lincome denote the log of family income. For the remainder of this question, use the data in EXPENDSHARES, which comes from Blundell, Duncan, and Pendakur (1998).
(i) Use sfood, the share of spending on food, as the dependent variable. What is the range of values of sfood? Are you surprised there are no zeros?
(ii) Estimate the equation

$sfood = \beta_0 + \beta_1 ltotexpend + \beta_2 age + \beta_3 kids + u$   (16.43)

by OLS and report the coefficient on ltotexpend, $\hat{\beta}_{1,OLS}$, along with its heteroskedasticity-robust standard error. Interpret the result.
(iii) Using lincome as an IV for ltotexpend, estimate the reduced form equation for ltotexpend. (Be sure to include age and kids.) Assuming lincome is exogenous in (16.43), is lincome a valid IV for ltotexpend?
(iv) Now estimate (16.43) by instrumental variables. How does $\hat{\beta}_{1,IV}$ compare with $\hat{\beta}_{1,OLS}$? What about the robust 95% confidence intervals?
(v) Use the test in Section 15.5 to test the null hypothesis that ltotexpend is exogenous in (16.43). Be sure to report and interpret the p-value. Are there any overidentifying restrictions to test?
(vi) Substitute salcohol for sfood in (16.43) and estimate the equation by OLS and 2SLS. Now what do you find for the coefficients on ltotexpend?

CHAPTER 17
Limited Dependent Variable Models and Sample Selection Corrections

In Chapter 7, we studied the linear probability model, which is simply an application of the multiple regression model to a binary dependent variable. A binary dependent variable is an example of a limited dependent variable (LDV). An LDV is broadly defined as a dependent variable whose range of values is substantively restricted. A binary variable takes on only two values, zero and one. In Section 7.7, we discussed the interpretation of multiple regression estimates for generally discrete response variables, focusing on the case where y takes on a small number of integer values; for example, the number of times a young man is arrested during a year or the number of children born to a woman. Elsewhere, we have encountered several other limited dependent variables, including the percentage of people participating in a pension plan (which must be between zero and 100) and college grade point average (which is between zero and 4.0 at most colleges).

Most economic variables we would like to explain are limited in some way, often because they must be positive. For example, hourly wage, housing price, and nominal interest rates must be greater than zero. But not all such variables need special treatment. If a strictly positive variable takes on many different values, a special econometric model is rarely necessary. When y is discrete and takes on a small number of values, it makes no sense to treat it as an approximately continuous variable. Discreteness of y does not, in itself, mean that linear models are inappropriate. However, as we saw in Chapter 7 for binary response, the linear probability model has certain drawbacks. In Section 17.1, we discuss logit and probit models, which overcome the shortcomings of the LPM; the disadvantage is that they are more difficult to interpret.

Other kinds of limited dependent variables arise in econometric analysis, especially when the behavior of individuals, families, or firms is being modeled. Optimizing behavior often leads to a corner solution response for some nontrivial fraction of the population. That is, it is optimal to choose a zero quantity or dollar value, for example. During any given year, a significant number of families will make zero charitable contributions. Therefore, annual family charitable contributions has a population distribution that is spread out over a large range of positive values, but with a pileup at the value zero.
Although a linear model could be appropriate for capturing the expected value of charitable contributions, a linear model will likely lead to negative predictions for some families. Taking the natural log is not possible because many observations are zero. The Tobit model, which we cover in Section 17.2, is explicitly designed to model corner solution dependent variables.

Another important kind of LDV is a count variable, which takes on nonnegative integer values. Section 17.3 illustrates how Poisson regression models are well suited for modeling count variables. In some cases, we encounter limited dependent variables due to data censoring, a topic we introduce in Section 17.4. The general problem of sample selection, where we observe a nonrandom sample from the underlying population, is treated in Section 17.5.

Limited dependent variable models can be used for time series and panel data, but they are most often applied to cross-sectional data. Sample selection problems are usually confined to cross-sectional or panel data. We focus on cross-sectional applications in this chapter. Wooldridge (2010) analyzes these problems in the context of panel data models and provides many more details for cross-sectional and panel data applications.

17.1 Logit and Probit Models for Binary Response

The linear probability model is simple to estimate and use, but it has some drawbacks that we discussed in Section 7.5. The two most important disadvantages are that the fitted probabilities can be less than zero or greater than one and that the partial effect of any explanatory variable (appearing in level form) is constant. These limitations of the LPM can be overcome by using more sophisticated binary response models.

In a binary response model, interest lies primarily in the response probability

$P(y = 1|\mathbf{x}) = P(y = 1|x_1, x_2, \dots, x_k),$   (17.1)

where we use x to denote the full set of explanatory variables. For example, when y is an employment indicator, x might contain various individual characteristics such as education, age, marital status, and other factors that affect employment status, including a binary indicator variable for participation in a recent job training program.

17.1a Specifying Logit and Probit Models

In the LPM, we assume that the response probability is linear in a set of parameters, $\beta_j$; see equation (7.27). To avoid the LPM limitations, consider a class of binary response models of the form

$P(y = 1|\mathbf{x}) = G(\beta_0 + \beta_1 x_1 + \dots + \beta_k x_k) = G(\beta_0 + \mathbf{x}\boldsymbol{\beta}),$   (17.2)

where G is a function taking on values strictly between zero and one: 0 < G(z) < 1, for all real numbers z. This ensures that the estimated response probabilities are strictly between zero and one. As in earlier chapters, we write $\mathbf{x}\boldsymbol{\beta} = \beta_1 x_1 + \dots + \beta_k x_k$.

Various nonlinear functions have been suggested for the function G in order to make sure that the probabilities are between zero and one. The two we will cover here are used in the vast majority of applications (along with the LPM). In the logit model, G is the logistic function:

$G(z) = \exp(z)/[1 + \exp(z)] = \Lambda(z),$   (17.3)

which is between zero and one for all real numbers z. This is the cumulative distribution function (cdf) for a standard logistic random variable.
In the probit model, G is the standard normal cdf, which is expressed as an integral:

$G(z) = \Phi(z) \equiv \int_{-\infty}^{z} \phi(v)\,dv,$   (17.4)

where $\phi(z)$ is the standard normal density:

$\phi(z) = (2\pi)^{-1/2}\exp(-z^2/2).$   (17.5)

This choice of G again ensures that (17.2) is strictly between zero and one for all values of the parameters and the $x_j$.

The G functions in (17.3) and (17.4) are both increasing functions. Each increases most quickly at z = 0, $G(z) \to 0$ as $z \to -\infty$, and $G(z) \to 1$ as $z \to \infty$. The logistic function is plotted in Figure 17.1. The standard normal cdf has a shape very similar to that of the logistic cdf.

[Figure 17.1: Graph of the logistic function $G(z) = \exp(z)/[1 + \exp(z)]$. The function rises from 0 toward 1 as z increases, passing through .5 at z = 0.]

Logit and probit models can be derived from an underlying latent variable model. Let $y^*$ be an unobserved, or latent, variable, and suppose that

$y^* = \beta_0 + \mathbf{x}\boldsymbol{\beta} + e, \quad y = 1[y^* > 0],$   (17.6)

where we introduce the notation $1[\cdot]$ to define a binary outcome. The function $1[\cdot]$ is called the indicator function, which takes on the value one if the event in brackets is true, and zero otherwise. Therefore, y is one if $y^* > 0$, and y is zero if $y^* \le 0$. We assume that e is independent of x and that e either has the standard logistic distribution or the standard normal distribution.
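The two G functions, and the latent-variable story in (17.6), are easy to verify numerically. A sketch using numpy and scipy; the coefficient values are arbitrary:

import numpy as np
from scipy.stats import norm

def logistic_cdf(z):
    """Logistic cdf Lambda(z) from equation (17.3)."""
    return np.exp(z) / (1.0 + np.exp(z))

z = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
print(logistic_cdf(z))   # rises from ~.12 to ~.88, equal to .5 at z = 0
print(norm.cdf(z))       # probit G: similar S-shape, steeper near the middle

# Latent-variable view of (17.6): simulate y* = b0 + b1*x + e with standard
# logistic errors and verify that P(y = 1 | x) matches Lambda(b0 + b1*x).
rng = np.random.default_rng(5)
n = 200_000
x = rng.normal(size=n)
e = rng.logistic(size=n)            # standard logistic errors
ystar = 0.5 + 1.0 * x + e
y = (ystar > 0).astype(float)

mask = np.abs(x - 1.0) < 0.05       # observations with x near 1
print(y[mask].mean(), logistic_cdf(0.5 + 1.0))   # both near .82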
Equation (17.7) shows that the relative effects of any two continuous explanatory variables do not depend on x: the ratio of the partial effects for $x_j$ and $x_h$ is $\beta_j/\beta_h$. In the typical case that g is a symmetric density about zero, with a unique mode at zero, the largest effect occurs when $\beta_0 + \mathbf{x}\boldsymbol{\beta} = 0$. For example, in the probit case with $g(z) = \phi(z)$, $g(0) = \phi(0) = 1/\sqrt{2\pi} \approx .40$. In the logit case, $g(z) = \exp(z)/[1 + \exp(z)]^2$, and so $g(0) = .25$.

If, say, $x_1$ is a binary explanatory variable, then the partial effect from changing $x_1$ from zero to one, holding all other variables fixed, is simply

$$G(\beta_0 + \beta_1 + \beta_2 x_2 + \dots + \beta_k x_k) - G(\beta_0 + \beta_2 x_2 + \dots + \beta_k x_k). \qquad (17.8)$$

Again, this depends on all the values of the other $x_j$. For example, if y is an employment indicator and $x_1$ is a dummy variable indicating participation in a job training program, then (17.8) is the change in the probability of employment due to the job training program; this depends on other characteristics that affect employability, such as education and experience. Note that knowing the sign of $\beta_1$ is sufficient for determining whether the program had a positive or negative effect. But to find the magnitude of the effect, we have to estimate the quantity in (17.8).

We can also use the difference in (17.8) for other kinds of discrete variables (such as number of children). If $x_k$ denotes this variable, then the effect on the probability of $x_k$ going from $c_k$ to $c_k + 1$ is simply

$$G[\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_k(c_k + 1)] - G(\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_k c_k). \qquad (17.9)$$

It is straightforward to include standard functional forms among the explanatory variables. For example, in the model

$$P(y = 1|\mathbf{z}) = G(\beta_0 + \beta_1 z_1 + \beta_2 z_1^2 + \beta_3\log(z_2) + \beta_4 z_3),$$

the partial effect of $z_1$ on $P(y = 1|\mathbf{z})$ is $\partial P(y = 1|\mathbf{z})/\partial z_1 = g(\beta_0 + \mathbf{x}\boldsymbol{\beta})(\beta_1 + 2\beta_2 z_1)$, and the partial effect of $z_2$ on the response probability is $\partial P(y = 1|\mathbf{z})/\partial z_2 = g(\beta_0 + \mathbf{x}\boldsymbol{\beta})(\beta_3/z_2)$, where $\mathbf{x}\boldsymbol{\beta} = \beta_1 z_1 + \beta_2 z_1^2 + \beta_3\log(z_2) + \beta_4 z_3$. Therefore, $g(\beta_0 + \mathbf{x}\boldsymbol{\beta})(\beta_3/100)$ is the approximate change in the response probability when $z_2$ increases by 1%.

Sometimes we want to compute the elasticity of the response probability with respect to an explanatory variable, although we must be careful in interpreting percentage changes in probabilities. For example, a change in a probability from .04 to .06 represents a 2-percentage-point increase in the probability, but a 50% increase relative to the initial value. Using calculus, in the preceding model the elasticity of $P(y = 1|\mathbf{z})$ with respect to $z_2$ can be shown to be $\beta_3[g(\beta_0 + \mathbf{x}\boldsymbol{\beta})/G(\beta_0 + \mathbf{x}\boldsymbol{\beta})]$. The elasticity with respect to $z_3$ is $(\beta_4 z_3)[g(\beta_0 + \mathbf{x}\boldsymbol{\beta})/G(\beta_0 + \mathbf{x}\boldsymbol{\beta})]$. In the first case, the elasticity always has the same sign as $\beta_3$, but it generally depends on all parameters and all values of the explanatory variables. If $z_3 > 0$, the second elasticity always has the same sign as the parameter $\beta_4$.

Models with interactions among the explanatory variables can be a bit tricky, but one should compute the partial derivatives and then evaluate the resulting partial effects at interesting values.
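As a quick numerical illustration of (17.8) and (17.9), here is a sketch with made-up probit coefficients (these are not estimates from the text): the discrete effect is just the difference of two evaluations of G.

```python
import numpy as np
from scipy.stats import norm

# Hypothetical probit coefficients: intercept, x1 (binary), x2 (discrete)
b0, b1, b2 = -1.0, 0.8, 0.05

# Equation (17.8): effect of switching x1 from 0 to 1, with x2 fixed at 12
x2 = 12.0
effect_x1 = norm.cdf(b0 + b1 + b2 * x2) - norm.cdf(b0 + b2 * x2)

# Equation (17.9): effect of x2 going from c_k to c_k + 1, with x1 fixed at 1
ck = 2.0
effect_x2 = norm.cdf(b0 + b1 + b2 * (ck + 1)) - norm.cdf(b0 + b1 + b2 * ck)
print(effect_x1, effect_x2)
```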
When measuring the effects of discrete variables, no matter how complicated the model, we should use (17.9). We discuss this further in Section 17-1d on interpreting the estimates.

17-1b Maximum Likelihood Estimation of Logit and Probit Models

How should we estimate nonlinear binary response models? To estimate the LPM, we can use ordinary least squares (see Section 7-5) or, in some cases, weighted least squares (see Section 8-5). Because of the nonlinear nature of $E(y|\mathbf{x})$, OLS and WLS are not applicable. We could use nonlinear versions of these methods, but it is no more difficult to use maximum likelihood estimation (MLE) (see Appendix 17A for a brief discussion). Up until now, we have had little need for MLE, although we did note that, under the classical linear model assumptions, the OLS estimator is the maximum likelihood estimator (conditional on the explanatory variables). For estimating limited dependent variable models, maximum likelihood methods are indispensable. Because MLE is based on the distribution of y given x, the heteroskedasticity in Var(y|x) is automatically accounted for.

Assume that we have a random sample of size n. To obtain the maximum likelihood estimator, conditional on the explanatory variables, we need the density of $y_i$ given $\mathbf{x}_i$. We can write this as

$$f(y|\mathbf{x}_i;\boldsymbol{\beta}) = [G(\mathbf{x}_i\boldsymbol{\beta})]^y[1 - G(\mathbf{x}_i\boldsymbol{\beta})]^{1-y}, \qquad y = 0, 1, \qquad (17.10)$$

where, for simplicity, we absorb the intercept into the vector $\mathbf{x}_i$. We can easily see that when y = 1, we get $G(\mathbf{x}_i\boldsymbol{\beta})$, and when y = 0, we get $1 - G(\mathbf{x}_i\boldsymbol{\beta})$. The log-likelihood function for observation i is a function of the parameters and the data $(\mathbf{x}_i, y_i)$ and is obtained by taking the log of (17.10):

$$\ell_i(\boldsymbol{\beta}) = y_i\log[G(\mathbf{x}_i\boldsymbol{\beta})] + (1 - y_i)\log[1 - G(\mathbf{x}_i\boldsymbol{\beta})]. \qquad (17.11)$$

Because $G(\cdot)$ is strictly between zero and one for logit and probit, $\ell_i(\boldsymbol{\beta})$ is well defined for all values of $\boldsymbol{\beta}$.

The log-likelihood for a sample size of n is obtained by summing (17.11) across all observations: $\mathcal{L}(\boldsymbol{\beta}) = \sum_{i=1}^{n}\ell_i(\boldsymbol{\beta})$. The MLE of $\boldsymbol{\beta}$, denoted by $\hat{\boldsymbol{\beta}}$, maximizes this log-likelihood. If $G(\cdot)$ is the standard logit cdf, then $\hat{\boldsymbol{\beta}}$ is the logit estimator; if $G(\cdot)$ is the standard normal cdf, then $\hat{\boldsymbol{\beta}}$ is the probit estimator.

Because of the nonlinear nature of the maximization problem, we cannot write formulas for the logit or probit maximum likelihood estimates. In addition to raising computational issues, this makes the statistical theory for logit and probit much more difficult than OLS or even 2SLS. Nevertheless, the general theory of MLE for random samples implies that, under very general conditions, the MLE is consistent, asymptotically normal, and asymptotically efficient. [See Wooldridge (2010, Chapter 13) for a general discussion.] We will just use the results here; applying logit and probit models is fairly easy, provided we understand what the statistics mean.

Each $\hat{\beta}_j$ comes with an asymptotic standard error, the formula for which is complicated and presented in the chapter appendix.
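In practice, the maximization of the sum of (17.11) is handled by packaged routines. A minimal sketch using statsmodels on simulated data (the variable names and data-generating values are our own, not from the text):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 1000
x = rng.normal(size=(n, 2))                  # two explanatory variables
X = sm.add_constant(x)                       # absorb the intercept into x_i
beta_true = np.array([-0.5, 1.0, -0.8])

# Latent variable model (17.6) with standard normal e generates probit data
ystar = X @ beta_true + rng.standard_normal(n)
y = (ystar > 0).astype(int)

logit_res = sm.Logit(y, X).fit(disp=False)   # maximizes (17.11), logistic G
probit_res = sm.Probit(y, X).fit(disp=False) # maximizes (17.11), G = Phi

print(probit_res.params)                     # close to beta_true here
print(probit_res.llf)                        # maximized log-likelihood value
```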
Once we have the standard errors (and these are reported along with the coefficient estimates by any package that supports logit and probit), we can construct asymptotic t tests and confidence intervals, just as with OLS, 2SLS, and the other estimators we have encountered. In particular, to test $H_0: \beta_j = 0$, we form the t statistic $\hat{\beta}_j/\text{se}(\hat{\beta}_j)$ and carry out the test in the usual way, once we have decided on a one- or two-sided alternative.

17-1c Testing Multiple Hypotheses

We can also test multiple restrictions in logit and probit models. In most cases, these are tests of multiple exclusion restrictions, as in Section 4-5. We will focus on exclusion restrictions here.

There are three ways to test exclusion restrictions for logit and probit models. The Lagrange multiplier or score test only requires estimating the model under the null hypothesis, just as in the linear case in Section 5-2; we will not cover the score test here, since it is rarely needed to test exclusion restrictions. [See Wooldridge (2010, Chapter 15) for other uses of the score test in binary response models.]

The Wald test requires estimation of only the unrestricted model. In the linear model case, the Wald statistic, after a simple transformation, is essentially the F statistic, so there is no need to cover the Wald statistic separately. [The formula for the Wald statistic is given in Wooldridge (2010, Chapter 15).] This statistic is computed by econometrics packages that allow exclusion restrictions to be tested after the unrestricted model has been estimated. It has an asymptotic chi-square distribution, with df equal to the number of restrictions being tested.

If both the restricted and unrestricted models are easy to estimate (as is usually the case with exclusion restrictions), then the likelihood ratio (LR) test becomes very attractive. The LR test is based on the same concept as the F test in a linear model. The F test measures the increase in the sum of squared residuals when variables are dropped from the model. The LR test is based on the difference in the log-likelihood functions for the unrestricted and restricted models. The idea is this. Because the MLE maximizes the log-likelihood function, dropping variables generally leads to a smaller (or at least no larger) log-likelihood. (This is similar to the fact that the R-squared never increases when variables are dropped from a regression.) The question is whether the fall in the log-likelihood is large enough to conclude that the dropped variables are important. We can make this decision once we have a test statistic and a set of critical values. The likelihood ratio statistic is twice the difference in the log-likelihoods:

$$LR = 2(\mathcal{L}_{ur} - \mathcal{L}_r), \qquad (17.12)$$

where $\mathcal{L}_{ur}$ is the log-likelihood value for the unrestricted model and $\mathcal{L}_r$ is the log-likelihood value for the restricted model.

[Exploring Further 17.1] A probit model to explain whether a firm is taken over by another firm during a given year is
$$P(takeover = 1|\mathbf{x}) = \Phi(\beta_0 + \beta_1 avgprof + \beta_2 mktval + \beta_3 debtearn + \beta_4 ceoten + \beta_5 ceosal + \beta_6 ceoage),$$
where takeover is a binary response variable, avgprof is the firm's average profit margin over several prior years, mktval is market value of the firm, debtearn is the debt-to-earnings ratio, and ceoten, ceosal, and ceoage are the tenure, annual salary, and age of the chief executive officer, respectively. State the null hypothesis that, other factors being equal, variables related to the CEO have no effect on the probability of takeover. How many df are in the chi-square distribution for the LR or Wald test?
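A short sketch of the LR statistic in (17.12), continuing the simulated logit example above (the restricted model drops the last regressor, so q = 1):

```python
from scipy.stats import chi2

# Restricted model: drop the last column of X (one exclusion restriction)
restr_res = sm.Logit(y, X[:, :-1]).fit(disp=False)

LR = 2 * (logit_res.llf - restr_res.llf)     # equation (17.12)
q = 1                                        # number of restrictions
pval = chi2.sf(LR, df=q)                     # upper-tail chi-square p-value
print(f"LR = {LR:.2f}, p-value = {pval:.4f}")
```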
Because $\mathcal{L}_{ur} \ge \mathcal{L}_r$, LR is nonnegative and usually strictly positive. In computing the LR statistic for binary response models, it is important to know that the log-likelihood function is always a negative number. This fact follows from equation (17.11), because $y_i$ is either zero or one and both variables inside the log function are strictly between zero and one, which means their natural logs are negative. That the log-likelihood functions are both negative does not change the way we compute the LR statistic; we simply preserve the negative signs in equation (17.12).

The multiplication by two in (17.12) is needed so that LR has an approximate chi-square distribution under $H_0$. If we are testing q exclusion restrictions, $LR \overset{a}{\sim} \chi^2_q$. This means that, to test $H_0$ at the 5% level, we use as our critical value the 95th percentile in the $\chi^2_q$ distribution. Computing p-values is easy with most software packages.

17-1d Interpreting the Logit and Probit Estimates

Given modern computers, from a practical perspective, the most difficult aspect of logit or probit models is presenting and interpreting the results. The coefficient estimates, their standard errors, and the value of the log-likelihood function are reported by all software packages that do logit and probit, and these should be reported in any application. The coefficients give the signs of the partial effects of each $x_j$ on the response probability, and the statistical significance of $x_j$ is determined by whether we can reject $H_0: \beta_j = 0$ at a sufficiently small significance level.

As we briefly discussed in Section 7-5 for the linear probability model, we can compute a goodness-of-fit measure called the percent correctly predicted. As before, we define a binary predictor of $y_i$ to be one if the predicted probability is at least .5, and zero otherwise. Mathematically, $\tilde{y}_i = 1$ if $G(\hat{\beta}_0 + \mathbf{x}_i\hat{\boldsymbol{\beta}}) \ge .5$ and $\tilde{y}_i = 0$ if $G(\hat{\beta}_0 + \mathbf{x}_i\hat{\boldsymbol{\beta}}) < .5$. Given $\{\tilde{y}_i: i = 1, 2, \dots, n\}$, we can see how well $\tilde{y}_i$ predicts $y_i$ across all observations. There are four possible outcomes on each pair $(y_i, \tilde{y}_i)$; when both are zero or both are one, we make the correct prediction. In the two cases where one of the pair is zero and the other is one, we make the incorrect prediction. The percentage correctly predicted is the percentage of times that $\tilde{y}_i = y_i$.

Although the percentage correctly predicted is useful as a goodness-of-fit measure, it can be misleading. In particular, it is possible to get rather high percentages correctly predicted even when the least likely outcome is very poorly predicted. For example, suppose that n = 200, 160 observations have $y_i = 0$, and, out of these 160 observations, 140 of the $\tilde{y}_i$ are also zero (so we correctly predict 87.5% of the zero outcomes). Even if none of the predictions is correct when $y_i = 1$, we still correctly predict 70% of all outcomes (140/200 = .70). Often, we hope to have some ability to predict the least likely outcome (such as whether someone is arrested for committing a crime), and so we should be up front about how well we do in predicting each outcome. Therefore, it makes sense to also compute the percentage correctly predicted for each of the outcomes. Problem 1 asks you to show that the overall percentage correctly predicted is a weighted average of $\hat{q}_0$ (the percentage correctly predicted for $y_i = 0$) and $\hat{q}_1$ (the percentage correctly predicted for $y_i = 1$), where the weights are the fractions of zeros and ones in the sample, respectively.
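A minimal sketch of these fit measures, again continuing the fitted logit from the simulated example above (the weighted-average identity is checked directly):

```python
import numpy as np

phat = logit_res.predict(X)          # G(b0_hat + x_i*b_hat) for each i
ytilde = (phat >= 0.5).astype(int)   # binary predictor with threshold .5

overall = np.mean(ytilde == y)       # overall percent correctly predicted
q0 = np.mean(ytilde[y == 0] == 0)    # correct among observations with y_i = 0
q1 = np.mean(ytilde[y == 1] == 1)    # correct among observations with y_i = 1
frac0, frac1 = np.mean(y == 0), np.mean(y == 1)
assert np.isclose(overall, frac0 * q0 + frac1 * q1)  # the weighted average
```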
Some have criticized the prediction rule just described for using a threshold value of .5, especially when one of the outcomes is unlikely. For example, if $\bar{y} = .08$ (only 8% "successes" in the sample), it could be that we never predict $\tilde{y}_i = 1$ because the estimated probability of success is never greater than .5. One alternative is to use the fraction of successes in the sample as the threshold (.08 in the previous example). In other words, define $\tilde{y}_i = 1$ when $G(\hat{\beta}_0 + \mathbf{x}_i\hat{\boldsymbol{\beta}}) \ge .08$ and zero otherwise. Using this rule will certainly increase the number of predicted successes, but not without cost: we will necessarily make more mistakes, perhaps many more, in predicting zeros ("failures"). In terms of the overall percentage correctly predicted, we may do worse than using the .5 threshold.

A third possibility is to choose the threshold such that the fraction of $\tilde{y}_i = 1$ in the sample is the same as (or very close to) $\bar{y}$. In other words, search over threshold values $\tau$, $0 < \tau < 1$, such that if we define $\tilde{y}_i = 1$ when $G(\hat{\beta}_0 + \mathbf{x}_i\hat{\boldsymbol{\beta}}) \ge \tau$, then $\sum_{i=1}^{n}\tilde{y}_i \approx \sum_{i=1}^{n}y_i$. (The trial and error required to find the desired value of $\tau$ can be tedious, but it is feasible. In some cases, it will not be possible to make the number of predicted successes exactly the same as the number of successes in the sample.) Now, given this set of $\tilde{y}_i$, we can compute the percentage correctly predicted for each of the two outcomes as well as the overall percentage correctly predicted.

There are also various pseudo R-squared measures for binary response. McFadden (1974) suggests the measure $1 - \mathcal{L}_{ur}/\mathcal{L}_o$, where $\mathcal{L}_{ur}$ is the log-likelihood function for the estimated model and $\mathcal{L}_o$ is the log-likelihood function in the model with only an intercept. Why does this measure make sense? Recall that the log-likelihoods are negative, and so $\mathcal{L}_{ur}/\mathcal{L}_o = |\mathcal{L}_{ur}|/|\mathcal{L}_o|$. Further, $|\mathcal{L}_{ur}| \le |\mathcal{L}_o|$. If the covariates have no explanatory power, then $\mathcal{L}_{ur}/\mathcal{L}_o = 1$, and the pseudo R-squared is zero, just as the usual R-squared is zero in a linear regression when the covariates have no explanatory power. Usually, $|\mathcal{L}_{ur}| < |\mathcal{L}_o|$, in which case $1 - \mathcal{L}_{ur}/\mathcal{L}_o > 0$. If $\mathcal{L}_{ur}$ were zero, the pseudo R-squared would equal unity. (In fact, $\mathcal{L}_{ur}$ cannot reach zero in a probit or logit model, as that would require the estimated probabilities when $y_i = 1$ all to be unity and the estimated probabilities when $y_i = 0$ all to be zero.)

Alternative pseudo R-squareds for probit and logit are more directly related to the usual R-squared from OLS estimation of a linear probability model. For either probit or logit, let $\hat{y}_i = G(\hat{\beta}_0 + \mathbf{x}_i\hat{\boldsymbol{\beta}})$ be the fitted probabilities. Since these probabilities are also estimates of $E(y_i|\mathbf{x}_i)$, we can base an R-squared on how close the $\hat{y}_i$ are to the $y_i$. One possibility that suggests itself from standard regression analysis is to compute the squared correlation between $y_i$ and $\hat{y}_i$. (Remember, in a linear regression framework, this is an algebraically equivalent way to obtain the usual R-squared; see equation (3.29).) Therefore, we can compute a pseudo R-squared for probit and logit that is directly comparable to the usual R-squared from estimation of a linear probability model.
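Both pseudo R-squareds are a few lines of code; this sketch continues the simulated logit example (statsmodels reports the McFadden measure as prsquared):

```python
# McFadden pseudo R-squared: 1 - L_ur / L_o
pseudo_r2 = 1 - logit_res.llf / logit_res.llnull
assert np.isclose(pseudo_r2, logit_res.prsquared)

# Squared-correlation pseudo R-squared, comparable to the LPM R-squared
corr = np.corrcoef(y, phat)[0, 1]
print(pseudo_r2, corr ** 2)
```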
In any case, goodness-of-fit is usually less important than trying to obtain convincing estimates of the ceteris paribus effects of the explanatory variables.

Often, we want to estimate the effects of the $x_j$ on the response probabilities, $P(y = 1|\mathbf{x})$. If $x_j$ is (roughly) continuous, then

$$\Delta\hat{P}(y = 1|\mathbf{x}) \approx [g(\hat{\beta}_0 + \mathbf{x}\hat{\boldsymbol{\beta}})\hat{\beta}_j]\Delta x_j \qquad (17.13)$$

for "small" changes in $x_j$. So, for $\Delta x_j = 1$, the change in the estimated success probability is roughly $g(\hat{\beta}_0 + \mathbf{x}\hat{\boldsymbol{\beta}})\hat{\beta}_j$. Compared with the linear probability model, the cost of using probit and logit models is that the partial effects in equation (17.13) are harder to summarize because the scale factor, $g(\hat{\beta}_0 + \mathbf{x}\hat{\boldsymbol{\beta}})$, depends on x (that is, on all of the explanatory variables). One possibility is to plug in interesting values for the $x_j$, such as means, medians, minimums, maximums, and lower and upper quartiles, and then see how $g(\hat{\beta}_0 + \mathbf{x}\hat{\boldsymbol{\beta}})$ changes. Although attractive, this can be tedious and result in too much information, even if the number of explanatory variables is moderate.

As a quick summary for getting at the magnitudes of the partial effects, it is handy to have a single scale factor that can be used to multiply each $\hat{\beta}_j$ (or at least those coefficients on roughly continuous variables). One method commonly used in econometrics packages that routinely estimate probit and logit models is to replace each explanatory variable with its sample average. In other words, the adjustment factor is

$$g(\hat{\beta}_0 + \bar{\mathbf{x}}\hat{\boldsymbol{\beta}}) = g(\hat{\beta}_0 + \hat{\beta}_1\bar{x}_1 + \hat{\beta}_2\bar{x}_2 + \dots + \hat{\beta}_k\bar{x}_k), \qquad (17.14)$$

where $g(\cdot)$ is the standard normal density in the probit case and $g(z) = \exp(z)/[1 + \exp(z)]^2$ in the logit case. The idea behind (17.14) is that, when it is multiplied by $\hat{\beta}_j$, we obtain the partial effect of $x_j$ for the "average" person in the sample. Thus, if we multiply a coefficient by (17.14), we generally obtain the partial effect at the average (PEA).

There are at least two potential problems with using PEAs to summarize the partial effects of the explanatory variables. First, if some of the explanatory variables are discrete, the averages of them represent no one in the sample (or population, for that matter). For example, if $x_1 = female$ and 47.5% of the sample is female, what sense does it make to plug in $x_1 = .475$ to represent the average person? Second, if a continuous explanatory variable appears as a nonlinear function, say, as a natural log or in a quadratic, it is not clear whether we want to average the nonlinear function or plug the average into the nonlinear function. For example, should we use $\overline{\log(sales)}$ or $\log(\overline{sales})$ to represent average firm size? Econometrics packages that compute the scale factor in (17.14) default to the former: the software is written to compute the averages of the regressors included in the probit or logit estimation.

A different approach to computing a scale factor circumvents the issue of which values to plug in for the explanatory variables. Instead, the second scale factor results from averaging the individual partial effects across the sample, leading to what is called the average partial effect (APE) or, sometimes, the average marginal effect (AME).
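Computing the PEA scale factor in (17.14) is direct; a sketch, continuing the simulated logit example (with the caveat just discussed that averaging a binary or squared regressor may not be meaningful):

```python
from scipy.stats import logistic

# PEA scale factor (17.14): logistic density at the index of averaged regressors
pea_factor = logistic.pdf(X.mean(axis=0) @ logit_res.params)
pea = pea_factor * logit_res.params   # partial effects "at the average"
print(pea_factor, pea)
```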
For a continuous explanatory variable $x_j$, the average partial effect is $n^{-1}\sum_{i=1}^{n}[g(\hat{\beta}_0 + \mathbf{x}_i\hat{\boldsymbol{\beta}})\hat{\beta}_j] = [n^{-1}\sum_{i=1}^{n}g(\hat{\beta}_0 + \mathbf{x}_i\hat{\boldsymbol{\beta}})]\hat{\beta}_j$. The term multiplying $\hat{\beta}_j$ acts as a scale factor:

$$n^{-1}\sum_{i=1}^{n}g(\hat{\beta}_0 + \mathbf{x}_i\hat{\boldsymbol{\beta}}). \qquad (17.15)$$

Equation (17.15) is easily computed after probit or logit estimation, where $g(\hat{\beta}_0 + \mathbf{x}_i\hat{\boldsymbol{\beta}}) = \phi(\hat{\beta}_0 + \mathbf{x}_i\hat{\boldsymbol{\beta}})$ in the probit case and $g(\hat{\beta}_0 + \mathbf{x}_i\hat{\boldsymbol{\beta}}) = \exp(\hat{\beta}_0 + \mathbf{x}_i\hat{\boldsymbol{\beta}})/[1 + \exp(\hat{\beta}_0 + \mathbf{x}_i\hat{\boldsymbol{\beta}})]^2$ in the logit case. The two scale factors differ, and are possibly quite different, because in (17.15) we are using the average of the nonlinear function rather than the nonlinear function of the average, as in (17.14).

Because both of the scale factors just described depend on the calculus approximation in (17.13), neither makes much sense for discrete explanatory variables. Instead, it is better to use equation (17.9) to directly estimate the change in the probability. For a change in $x_k$ from $c_k$ to $c_k + 1$, the discrete analog of the partial effect based on (17.14) is

$$G[\hat{\beta}_0 + \hat{\beta}_1\bar{x}_1 + \dots + \hat{\beta}_{k-1}\bar{x}_{k-1} + \hat{\beta}_k(c_k + 1)] - G(\hat{\beta}_0 + \hat{\beta}_1\bar{x}_1 + \dots + \hat{\beta}_{k-1}\bar{x}_{k-1} + \hat{\beta}_k c_k), \qquad (17.16)$$

where G is the standard normal cdf in the probit case and $G(z) = \exp(z)/[1 + \exp(z)]$ in the logit case. The average partial effect, which usually is more comparable to LPM estimates, is

$$n^{-1}\sum_{i=1}^{n}\{G[\hat{\beta}_0 + \hat{\beta}_1 x_{i1} + \dots + \hat{\beta}_{k-1}x_{i,k-1} + \hat{\beta}_k(c_k + 1)] - G(\hat{\beta}_0 + \hat{\beta}_1 x_{i1} + \dots + \hat{\beta}_{k-1}x_{i,k-1} + \hat{\beta}_k c_k)\}. \qquad (17.17)$$

The quantity in equation (17.17) is a partial effect because all explanatory variables other than $x_k$ are being held fixed at their observed values. It is not necessarily a marginal effect because the change in $x_k$ from $c_k$ to $c_k + 1$ may not be a "marginal" (or small) increase; whether it is depends on the definition of $x_k$. Obtaining expression (17.17), for either probit or logit, is actually rather simple. First, for each observation, we estimate the probability of success for the two chosen values of $x_k$, plugging in the actual outcomes for the other explanatory variables. (So we would have n estimated differences.) Then, we average the differences in estimated probabilities across all observations. For binary $x_k$, both (17.16) and (17.17) are easily computed using certain econometrics packages, such as Stata.

The expression in (17.17) has a particularly useful interpretation when $x_k$ is a binary variable. For each unit i, we estimate the predicted difference in the probability that $y_i = 1$ when $x_k = 1$ and $x_k = 0$, namely,

$$G(\hat{\beta}_0 + \hat{\beta}_1 x_{i1} + \dots + \hat{\beta}_{k-1}x_{i,k-1} + \hat{\beta}_k) - G(\hat{\beta}_0 + \hat{\beta}_1 x_{i1} + \dots + \hat{\beta}_{k-1}x_{i,k-1}).$$

For each i, this difference is the estimated effect of switching $x_k$ from zero to one, whether unit i had $x_{ik} = 1$ or $x_{ik} = 0$. For example, if y is an employment indicator (equal to one if the person is employed) after participation in a job training program, indicated by $x_k$, then we can estimate the difference in employment probabilities for each person in both states of the world.
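As a sketch of both scale factors (probit case; the helper names and the regressor layout are our own, with b the estimated coefficient vector and column k of X holding the binary variable):

```python
import numpy as np
from scipy.stats import norm

def ape_scale(b, X):
    """APE scale factor (17.15) for continuous variables, probit case."""
    return np.mean(norm.pdf(X @ b))

def ape_binary(b, X, k):
    """Discrete APE (17.17) for a binary x_k: average over i of the
    difference in predicted probabilities with x_k set to 1 and to 0."""
    X1, X0 = X.copy(), X.copy()
    X1[:, k], X0[:, k] = 1.0, 0.0
    return np.mean(norm.cdf(X1 @ b) - norm.cdf(X0 @ b))
```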
This counterfactual reasoning is similar to that in Chapter 16, which we used to motivate simultaneous equations models. The estimated effect of the job training program on the employment probability is the average of the estimated differences in probabilities.

As another example, suppose that y indicates whether a family was approved for a mortgage, and $x_k$ is a binary race indicator (say, equal to one for nonwhites). Then, for each family, we can estimate the predicted difference in having the mortgage approved, as a function of income, wealth, credit rating, and so on, which would be elements of $(x_{i1}, x_{i2}, \dots, x_{i,k-1})$, under the two scenarios that the household head is nonwhite versus white. Hopefully, we have controlled for enough factors so that averaging the differences in probabilities results in a convincing estimate of the race effect.

In applications where one applies probit, logit, and the LPM, it makes sense to compute the scale factors described above for probit and logit in making comparisons of partial effects. Still, sometimes one wants a quicker way to compare magnitudes of the different estimates. As mentioned earlier, for probit $g(0) \approx .4$, and for logit $g(0) = .25$. Thus, to make the magnitudes of probit and logit roughly comparable, we can multiply the probit coefficients by .4/.25 = 1.6, or we can multiply the logit estimates by .625. In the LPM, g(0) is effectively one, so the logit slope estimates can be divided by four to make them comparable to the LPM estimates, and the probit slope estimates can be divided by 2.5 to make them comparable to the LPM estimates. Still, in most cases, we want the more accurate comparisons obtained by using the scale factors in (17.15) for logit and probit.

Example 17.1 Married Women's Labor Force Participation

We now use the data on 753 married women in MROZ to estimate the labor force participation model from Example 8.8 (see also Section 7-5) by logit and probit. We also report the linear probability model estimates from Example 8.8, using the heteroskedasticity-robust standard errors. The results, with standard errors in parentheses, are given in Table 17.1.

Table 17.1 LPM, Logit, and Probit Estimates of Labor Force Participation (Dependent Variable: inlf)

| Independent Variables | LPM (OLS) | Logit (MLE) | Probit (MLE) |
|---|---|---|---|
| nwifeinc | −.0034 (.0015) | −.021 (.008) | −.012 (.005) |
| educ | .038 (.007) | .221 (.043) | .131 (.025) |
| exper | .039 (.006) | .206 (.032) | .123 (.019) |
| exper² | −.00060 (.00019) | −.0032 (.0010) | −.0019 (.0006) |
| age | −.016 (.002) | −.088 (.015) | −.053 (.008) |
| kidslt6 | −.262 (.032) | −1.443 (.204) | −.868 (.119) |
| kidsge6 | .013 (.014) | .060 (.075) | .036 (.043) |
| constant | .586 (.152) | .425 (.860) | .270 (.509) |
| Percentage correctly predicted | 73.4 | 73.6 | 73.4 |
| Log-likelihood value | — | −401.77 | −401.30 |
| Pseudo R-squared | .264 | .220 | .221 |

The estimates from the three models tell a consistent story. The signs of the coefficients are the same across models, and the same variables are statistically significant in each model. The pseudo R-squared for the LPM is just the usual R-squared reported for OLS; for logit and probit, the pseudo R-squared is the measure based on the log-likelihoods described earlier.

As we have already emphasized, the magnitudes of the coefficient estimates across models are not directly comparable. Instead, we compute the scale factors in equations (17.14) and (17.15).
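A sketch of how Table 17.1's three models could be estimated with statsmodels, assuming a CSV of the MROZ data is available; the file name "mroz.csv" and the column names are assumptions, not from the text:

```python
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("mroz.csv")   # hypothetical export of the MROZ data set
formula = ("inlf ~ nwifeinc + educ + exper + I(exper**2) + age + "
           "kidslt6 + kidsge6")

lpm = smf.ols(formula, data=df).fit(cov_type="HC0")   # robust SEs for the LPM
logit = smf.logit(formula, data=df).fit(disp=False)
probit = smf.probit(formula, data=df).fit(disp=False)
print(logit.llf, probit.llf)   # log-likelihood values as in Table 17.1
```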
If we evaluate the standard normal pdf, $\phi(\hat{\beta}_0 + \hat{\beta}_1\bar{x}_1 + \hat{\beta}_2\bar{x}_2 + \dots + \hat{\beta}_k\bar{x}_k)$, at the sample averages of the explanatory variables (including the averages of exper², kidslt6, and kidsge6), the result is approximately .391. When we compute (17.14) for the logit case, we obtain about .243. The ratio of these, .391/.243 ≈ 1.61, is very close to the simple rule of thumb for scaling up the probit estimates to make them comparable to the logit estimates: multiply the probit estimates by 1.6. Nevertheless, for comparing probit and logit to the LPM estimates, it is better to use (17.15). These scale factors are about .301 (probit) and .179 (logit). For example, the scaled logit coefficient on educ is about .179(.221) ≈ .040, and the scaled probit coefficient on educ is about .301(.131) ≈ .039; both are remarkably close to the LPM estimate of .038. Even on the discrete variable kidslt6, the scaled logit and probit coefficients are similar to the LPM coefficient of −.262: these are .179(−1.443) ≈ −.258 (logit) and .301(−.868) ≈ −.261 (probit).

Table 17.2 reports the average partial effects for all explanatory variables and for each of the three estimated models. We obtained the estimates and standard errors from the statistical package Stata 13. These APEs treat all explanatory variables as continuous, even the variables for the number of children. Obtaining the APE for exper requires some care, as it must account for the quadratic functional form in exper. Even for the linear model, we must compute the derivative and then find the average. In the LPM column, the APE of exper is the average of the derivative with respect to exper, that is, .039 − .0012 experᵢ averaged across all i. The remaining APE entries for the LPM column are simply the OLS coefficients in Table 17.1. The APEs for exper for the logit and probit models also account for the quadratic in exper. As is clear from the table, the APEs and their statistical significance are very similar for all explanatory variables across all three models.

Table 17.2 Average Partial Effects for the Labor Force Participation Models

| Independent Variables | LPM | Logit | Probit |
|---|---|---|---|
| nwifeinc | −.0034 (.0015) | −.0038 (.0015) | −.0036 (.0014) |
| educ | .038 (.007) | .039 (.007) | .039 (.007) |
| exper | .027 (.002) | .025 (.002) | .026 (.002) |
| age | −.016 (.002) | −.016 (.002) | −.016 (.002) |
| kidslt6 | −.262 (.032) | −.258 (.032) | −.261 (.032) |
| kidsge6 | .013 (.014) | .011 (.013) | .011 (.013) |

[Exploring Further 17.2] Using the probit estimates and the calculus approximation, what is the approximate change in the response probability when exper increases from 10 to 11?

The biggest difference between the LPM and the logit and probit models is that the LPM assumes constant marginal effects for educ, kidslt6, and so on, while the logit and probit models imply diminishing magnitudes of the partial effects.
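Continuing the MROZ sketch above, packaged marginal effects give APEs like Table 17.2 directly; note the caveat in the comment, which is why the table's exper entry requires a by-hand calculation:

```python
# at='overall' averages individual effects as in (17.15); at='mean' gives the
# PEA based on (17.14). Caveat: the package treats exper and I(exper**2) as
# separate regressors, so its exper entry does not apply the chain-rule
# adjustment used for Table 17.2.
print(logit.get_margeff(at="overall").summary())
print(probit.get_margeff(at="mean").summary())
```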
child We evaluate the standard normal cdf F1b 0 1 b 1x1 1 p 1 b kxk2 with kidslt6 5 1 and kidslt6 5 0 and the other independent variables set at the preceding values We get roughly 373 2 707 5 2334 which means that the labor force participation probability is about 334 lower when a woman has one young child If the woman goes from one to two young chil dren the probability falls even more but the marginal effect is not as large 117 2 373 5 2256 Interestingly the estimate from the linear probability model which is supposed to estimate the effect near the average is in fact between these two estimates Note that the calculations provided here which use coefficients mostly rounded to the third decimal place will differ somewhat from calcula tions obtained within a statistical packagewhich would be subject to less rounding error Figure 172 illustrates how the estimated response probabilities from nonlinear binary response models can differ from the linear probability model The estimated probability of labor force par ticipation is graphed against years of education for the linear probability model and the probit model The graph for the logit model is very similar to that for the probit model In both cases the explanatory variables other than educ are set at their sample averages In particular the two equa tions graphed are inlf 5 102 1 038 educ for the linear model and inlf 5 F121403 1 131 educ2 At lower levels of education the linear probability model estimates higher labor force participation probabilities than the probit model For example at eight years of education the linear probability model estimates a 406 labor force participation probability while the probit model estimates about 361 estimated probability of labor force participation 20 1 0 years of education 9 75 5 25 1 12 16 0 4 8 inlf 5 F 21403 1 131 educ inlf 5 102 1 038 educ FiguRE 172 Estimated response probabilities with respect to education for the linear probability and probit models Copyright 2016 Cengage Learning All Rights Reserved May not be copied scanned or duplicated in whole or in part Due to electronic rights some third party content may be suppressed from the eBook andor eChapters Editorial review has deemed that any suppressed content does not materially affect the overall learning experience Cengage Learning reserves the right to remove additional content at any time if subsequent rights restrictions require it PART 3 Advanced Topics 536 The estimates are the same at around 11 3 years of education At higher levels of education the probit model gives higher labor force participation probabilities In this sample the smallest years of educa tion is 5 and the largest is 17 so we really should not make comparisons outside this range The same issues concerning endogenous explanatory variables in linear models also arise in logit and probit models We do not have the space to cover them but it is possible to test and cor rect for endogenous explanatory variables using methods related to two stage least squares Evans and Schwab 1995 estimated a probit model for whether a student attends college where the key explanatory variable is a dummy variable for whether the student attends a Catholic school Evans and Schwab estimated a model by maximum likelihood that allows attending a Catholic school to be considered endogenous See Wooldridge 2010 Chapter 15 for an explanation of these methods Two other issues have received attention in the context of probit models The first is nonnormal ity of e in the latent variable model 176 
Two other issues have received attention in the context of probit models. The first is nonnormality of e in the latent variable model (17.6). Naturally, if e does not have a standard normal distribution, the response probability will not have the probit form. Some authors tend to emphasize the inconsistency in estimating the $\beta_j$, but this is the wrong focus unless we are only interested in the direction of the effects. Because the response probability is unknown, we could not estimate the magnitude of partial effects even if we had consistent estimates of the $\beta_j$.

A second specification problem, also defined in terms of the latent variable model, is heteroskedasticity in e. If Var(e|x) depends on x, the response probability no longer has the form $G(\beta_0 + \mathbf{x}\boldsymbol{\beta})$; instead, it depends on the form of the variance and requires more general estimation. Such models are not often used in practice, since logit and probit with flexible functional forms in the independent variables tend to work well.

Binary response models apply with little modification to independently pooled cross sections or to other data sets where the observations are independent but not necessarily identically distributed. Often, year or other time period dummy variables are included to account for aggregate time effects. Just as with linear models, logit and probit can be used to evaluate the impact of certain policies in the context of a natural experiment.

The linear probability model can be applied with panel data; typically, it would be estimated by fixed effects (see Chapter 14). Logit and probit models with unobserved effects have recently become popular. These models are complicated by the nonlinear nature of the response probabilities, and they are difficult to estimate and interpret. [See Wooldridge (2010, Chapter 15).]

17-2 The Tobit Model for Corner Solution Responses

As mentioned in the chapter introduction, another important kind of limited dependent variable is a corner solution response. Such a variable is zero for a nontrivial fraction of the population but is roughly continuously distributed over positive values. An example is the amount an individual spends on alcohol in a given month. In the population of people over age 21 in the United States, this variable takes on a wide range of values; for some significant fraction, the amount spent on alcohol is zero. The following treatment omits verification of some details concerning the Tobit model. [These are given in Wooldridge (2010, Chapter 17).]

Let y be a variable that is essentially continuous over strictly positive values but that takes on a value of zero with positive probability. Nothing prevents us from using a linear model for y. In fact, a linear model might be a good approximation to $E(y|x_1, x_2, \dots, x_k)$, especially for $x_j$ near the mean values. But we would possibly obtain negative fitted values, which leads to negative predictions for y; this is analogous to the problems with the LPM for binary outcomes. Also, the assumption that an explanatory variable appearing in level form has a constant partial effect on $E(y|\mathbf{x})$ can be misleading. Probably, Var(y|x) would be heteroskedastic, although we can easily deal with general heteroskedasticity by computing robust standard errors and test statistics. Because the distribution of y piles up at zero, y clearly cannot have a conditional normal distribution. So all inference would have only asymptotic justification, as with the linear probability model.
In some cases, it is important to have a model that implies nonnegative predicted values for y and which has sensible partial effects over a wide range of the explanatory variables. Plus, we sometimes want to estimate features of the distribution of y given $x_1, \dots, x_k$ other than the conditional expectation. The Tobit model is quite convenient for these purposes. Typically, the Tobit model expresses the observed response, y, in terms of an underlying latent variable:

$$y^* = \beta_0 + \mathbf{x}\boldsymbol{\beta} + u, \qquad u|\mathbf{x} \sim \text{Normal}(0, \sigma^2) \qquad (17.18)$$
$$y = \max(0, y^*). \qquad (17.19)$$

The latent variable $y^*$ satisfies the classical linear model assumptions; in particular, it has a normal, homoskedastic distribution with a linear conditional mean. Equation (17.19) implies that the observed variable, y, equals $y^*$ when $y^* \ge 0$, but y = 0 when $y^* < 0$. Because $y^*$ is normally distributed, y has a continuous distribution over strictly positive values. In particular, the density of y given x is the same as the density of $y^*$ given x for positive values. Further,

$$P(y = 0|\mathbf{x}) = P(y^* < 0|\mathbf{x}) = P(u < -\mathbf{x}\boldsymbol{\beta}|\mathbf{x}) = P(u/\sigma < -\mathbf{x}\boldsymbol{\beta}/\sigma|\mathbf{x}) = \Phi(-\mathbf{x}\boldsymbol{\beta}/\sigma) = 1 - \Phi(\mathbf{x}\boldsymbol{\beta}/\sigma),$$

because $u/\sigma$ has a standard normal distribution and is independent of x; we have absorbed the intercept into x for notational simplicity. Therefore, if $(\mathbf{x}_i, y_i)$ is a random draw from the population, the density of $y_i$ given $\mathbf{x}_i$ is

$$(2\pi\sigma^2)^{-1/2}\exp[-(y - \mathbf{x}_i\boldsymbol{\beta})^2/(2\sigma^2)] = (1/\sigma)\phi[(y - \mathbf{x}_i\boldsymbol{\beta})/\sigma], \qquad y > 0, \qquad (17.20)$$
$$P(y_i = 0|\mathbf{x}_i) = 1 - \Phi(\mathbf{x}_i\boldsymbol{\beta}/\sigma), \qquad (17.21)$$

where $\phi$ is the standard normal density function.

From (17.20) and (17.21), we can obtain the log-likelihood function for each observation i:

$$\ell_i(\boldsymbol{\beta}, \sigma) = 1(y_i = 0)\log[1 - \Phi(\mathbf{x}_i\boldsymbol{\beta}/\sigma)] + 1(y_i > 0)\log\{(1/\sigma)\phi[(y_i - \mathbf{x}_i\boldsymbol{\beta})/\sigma]\}; \qquad (17.22)$$

notice how this depends on $\sigma$, the standard deviation of u, as well as on the $\beta_j$. The log-likelihood for a random sample of size n is obtained by summing (17.22) across all i. The maximum likelihood estimates of $\boldsymbol{\beta}$ and $\sigma$ are obtained by maximizing the log-likelihood; this requires numerical methods, although in most cases this is easily done using a packaged routine.

As in the case of logit and probit, each Tobit estimate comes with a standard error, and these can be used to construct t statistics for each $\hat{\beta}_j$; the matrix formula used to find the standard errors is complicated and will not be presented here. [See, for example, Wooldridge (2010, Chapter 17).]

Testing multiple exclusion restrictions is easily done using the Wald test or the likelihood ratio test. The Wald test has a form similar to that of the logit or probit case; the LR test is always given by (17.12), where, of course, we use the Tobit log-likelihood functions for the restricted and unrestricted models.

17-2a Interpreting the Tobit Estimates

Using modern computers, it is usually not much more difficult to obtain the maximum likelihood estimates for Tobit models than the OLS estimates of a linear model. Further, the outputs from Tobit and OLS are often similar. This makes it tempting to interpret the $\hat{\beta}_j$ from Tobit as if these were estimates from a linear regression. Unfortunately, things are not so easy.

[Exploring Further 17.3] Let y be the number of extramarital affairs for a married woman from the U.S. population; we would like to explain this variable in terms of other characteristics of the woman (in particular, whether she works outside of the home), her husband, and her family. Is this a good candidate for a Tobit model?
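statsmodels has no built-in Tobit estimator, so here is a minimal sketch of maximizing the sum of (17.22) directly with scipy on simulated data (all names and the data-generating values are ours, for illustration):

```python
import numpy as np
from scipy.stats import norm
from scipy.optimize import minimize

rng = np.random.default_rng(0)
n = 2000
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta_true, sigma_true = np.array([0.5, 1.0]), 1.0
ystar = X @ beta_true + sigma_true * rng.standard_normal(n)
y = np.maximum(0.0, ystar)                       # corner solution, eq. (17.19)

def tobit_negll(theta, y, X):
    beta, sigma = theta[:-1], np.exp(theta[-1])  # log(sigma) keeps sigma > 0
    xb = X @ beta
    ll0 = norm.logcdf(-xb / sigma)               # log P(y=0|x), eq. (17.21)
    ll1 = norm.logpdf((y - xb) / sigma) - np.log(sigma)  # log density, (17.20)
    return -np.sum(np.where(y > 0, ll1, ll0))

theta0 = np.append(np.linalg.lstsq(X, y, rcond=None)[0], 0.0)  # OLS start
res = minimize(tobit_negll, theta0, args=(y, X), method="BFGS")
beta_hat, sigma_hat = res.x[:-1], np.exp(res.x[-1])
print(beta_hat, sigma_hat)
```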
From equation (17.18), we see that the $\beta_j$ measure the partial effects of the $x_j$ on $E(y^*|\mathbf{x})$, where $y^*$ is the latent variable. Sometimes, $y^*$ has an interesting economic meaning, but more often it does not. The variable we want to explain is y, as this is the observed outcome (such as hours worked or amount of charitable contributions). For example, as a policy matter, we are interested in the sensitivity of hours worked to changes in marginal tax rates.

We can estimate $P(y = 0|\mathbf{x})$ from (17.21), which, of course, allows us to estimate $P(y > 0|\mathbf{x})$. What happens if we want to estimate the expected value of y as a function of x? In Tobit models, two expectations are of particular interest: $E(y|y > 0, \mathbf{x})$, which is sometimes called the "conditional expectation" because it is conditional on y > 0, and $E(y|\mathbf{x})$, which is, unfortunately, called the "unconditional expectation." (Both expectations are conditional on the explanatory variables.) The expectation $E(y|y > 0, \mathbf{x})$ tells us, for given values of x, the expected value of y for the subpopulation where y is positive. Given $E(y|y > 0, \mathbf{x})$, we can easily find $E(y|\mathbf{x})$:

$$E(y|\mathbf{x}) = P(y > 0|\mathbf{x})\cdot E(y|y > 0, \mathbf{x}) = \Phi(\mathbf{x}\boldsymbol{\beta}/\sigma)\cdot E(y|y > 0, \mathbf{x}). \qquad (17.23)$$

To obtain $E(y|y > 0, \mathbf{x})$, we use a result for normally distributed random variables: if z ~ Normal(0, 1), then $E(z|z > c) = \phi(c)/[1 - \Phi(c)]$ for any constant c. But

$$E(y|y > 0, \mathbf{x}) = \mathbf{x}\boldsymbol{\beta} + E(u|u > -\mathbf{x}\boldsymbol{\beta}) = \mathbf{x}\boldsymbol{\beta} + \sigma E[(u/\sigma)|(u/\sigma) > -\mathbf{x}\boldsymbol{\beta}/\sigma] = \mathbf{x}\boldsymbol{\beta} + \sigma\phi(\mathbf{x}\boldsymbol{\beta}/\sigma)/\Phi(\mathbf{x}\boldsymbol{\beta}/\sigma),$$

because $\phi(-c) = \phi(c)$, $1 - \Phi(-c) = \Phi(c)$, and $u/\sigma$ has a standard normal distribution independent of x. We can summarize this as

$$E(y|y > 0, \mathbf{x}) = \mathbf{x}\boldsymbol{\beta} + \sigma\lambda(\mathbf{x}\boldsymbol{\beta}/\sigma), \qquad (17.24)$$

where $\lambda(c) = \phi(c)/\Phi(c)$ is called the inverse Mills ratio; it is the ratio between the standard normal pdf and the standard normal cdf, each evaluated at c.

Equation (17.24) is important. It shows that the expected value of y, conditional on y > 0, is equal to $\mathbf{x}\boldsymbol{\beta}$ plus a strictly positive term, which is $\sigma$ times the inverse Mills ratio evaluated at $\mathbf{x}\boldsymbol{\beta}/\sigma$. This equation also shows why using OLS only for observations where $y_i > 0$ will not always consistently estimate $\boldsymbol{\beta}$: essentially, the inverse Mills ratio is an omitted variable, and it is generally correlated with the elements of x.

Combining (17.23) and (17.24) gives

$$E(y|\mathbf{x}) = \Phi(\mathbf{x}\boldsymbol{\beta}/\sigma)[\mathbf{x}\boldsymbol{\beta} + \sigma\lambda(\mathbf{x}\boldsymbol{\beta}/\sigma)] = \Phi(\mathbf{x}\boldsymbol{\beta}/\sigma)\mathbf{x}\boldsymbol{\beta} + \sigma\phi(\mathbf{x}\boldsymbol{\beta}/\sigma), \qquad (17.25)$$

where the second equality follows because $\Phi(\mathbf{x}\boldsymbol{\beta}/\sigma)\lambda(\mathbf{x}\boldsymbol{\beta}/\sigma) = \phi(\mathbf{x}\boldsymbol{\beta}/\sigma)$. This equation shows that when y follows a Tobit model, $E(y|\mathbf{x})$ is a nonlinear function of x and $\boldsymbol{\beta}$. Although it is not obvious, the right-hand side of equation (17.25) can be shown to be positive for any values of x and $\boldsymbol{\beta}$. Therefore, once we have estimates of $\boldsymbol{\beta}$, we can be sure that predicted values for y, that is, estimates of $E(y|\mathbf{x})$, are positive. The cost of ensuring positive predictions for y is that equation (17.25) is more complicated than a linear model for $E(y|\mathbf{x})$.

Even more importantly, the partial effects from (17.25) are more complicated than for a linear model. As we will see, the partial effects of $x_j$ on $E(y|y > 0, \mathbf{x})$ and $E(y|\mathbf{x})$ have the same sign as the coefficient, $\beta_j$, but the magnitude of the effects depends on the values of all explanatory variables and parameters. Because $\sigma$ appears in (17.25), it is not surprising that the partial effects depend on $\sigma$, too.

If $x_j$ is a continuous variable, we can find the partial effects using calculus. First,

$$\frac{\partial E(y|y > 0, \mathbf{x})}{\partial x_j} = \beta_j + \beta_j\frac{d\lambda}{dc}(\mathbf{x}\boldsymbol{\beta}/\sigma),$$

assuming that $x_j$ is not functionally related to other regressors. By differentiating $\lambda(c) = \phi(c)/\Phi(c)$ and using $d\Phi/dc = \phi(c)$ and $d\phi/dc = -c\phi(c)$, it can be shown that $d\lambda/dc = -\lambda(c)[c + \lambda(c)]$. Therefore,

$$\frac{\partial E(y|y > 0, \mathbf{x})}{\partial x_j} = \beta_j\{1 - \lambda(\mathbf{x}\boldsymbol{\beta}/\sigma)[\mathbf{x}\boldsymbol{\beta}/\sigma + \lambda(\mathbf{x}\boldsymbol{\beta}/\sigma)]\}. \qquad (17.26)$$
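The inverse Mills ratio and the two Tobit expectations translate into a few lines of code; a sketch (the helper names are ours):

```python
from scipy.stats import norm

def inv_mills(c):
    """Inverse Mills ratio: lambda(c) = phi(c)/Phi(c)."""
    return norm.pdf(c) / norm.cdf(c)

def tobit_means(xb, sigma):
    """E(y|y>0,x) from (17.24) and E(y|x) from (17.25), given xb = x*beta."""
    c = xb / sigma
    cond_mean = xb + sigma * inv_mills(c)
    uncond_mean = norm.cdf(c) * xb + sigma * norm.pdf(c)
    return cond_mean, uncond_mean
```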
E1y0y 0x2xj 5 bj 1 bj dl dc 1xbs2 assuming that xj is not functionally related to other regressors By differentiating l1c2 5 f1c2F1c2 and using dFdc 5 f1c2 and dfdc 5 2cf1c2 it can be shown that dldc 5 2l1c2 3c 1 l1c2 4 Therefore E1y0y 0x2xj 5 bj51 2 l1xbs2 3xbs 1 l1xbs2 46 1726 Copyright 2016 Cengage Learning All Rights Reserved May not be copied scanned or duplicated in whole or in part Due to electronic rights some third party content may be suppressed from the eBook andor eChapters Editorial review has deemed that any suppressed content does not materially affect the overall learning experience Cengage Learning reserves the right to remove additional content at any time if subsequent rights restrictions require it CHAPTER 17 Limited Dependent Variable Models and Sample Selection Corrections 539 This shows that the partial effect of xj on E1y0y 0x2 is not determined just by bj The adjust ment factor is given by the term in brackets 5 6 and depends on a linear function of x xbs 5 1b0 1 b1x1 1 p 1 bkxk2s It can be shown that the adjustment factor is strictly between zero and one In practice we can estimate 1726 by plugging in the MLEs of the bj and s As with logit and probit models we must plug in values for the xj usually the mean values or other interesting values Equation 1726 reveals a subtle point that is sometimes lost in applying the Tobit model to cor ner solution responses the parameter s appears directly in the partial effects so having an estimate of s is crucial for estimating the partial effects Sometimes s is called an ancillary parameter which means it is auxiliary or unimportant Although it is true that the value of s does not affect the sign of the partial effects it does affect the magnitudes and we are often interested in the economic impor tance of the explanatory variables Therefore characterizing s as ancillary is misleading and comes from a confusion between the Tobit model for corner solution applications and applications to true data censoring For the latter see Section 174 All of the usual economic quantities such as elasticities can be computed For example the elas ticity of y with respect to x1 conditional on y 0 is E1y0y 0 x2 x1 x1 E1y0y 0 x2 1727 This can be computed when x1 appears in various functional forms including level logarithmic and quadratic forms If x1 is a binary variable the effect of interest is obtained as the difference between E1y0y 0 x2 with x1 5 1 and x1 5 0 Partial effects involving other discrete variables such as number of children can be handled similarly We can use 1725 to find the partial derivative of E1y0x2 with respect to continuous xj This derivative accounts for the fact that people starting at y 5 0 might choose y 0 when xj changes E1y0x2 xj 5 P1y 00x2 xj E1y0y 0x2 1 P1y 00x2 E1y0y 0x2 xj 1728 Because P1y 00x2 5 F1xbs2 P1y 00x2 xj 5 1bjs2f1xbs2 1729 so we can estimate each term in 1728 once we plug in the MLEs of the bj and s and particular values of the xj Remarkably when we plug 1726 and 1729 into 1728 and use the fact that F1c2l1c2 5 f1c2 for any c we obtain E1y0x2 xj 5 bjF1xbs2 1730 Equation 1730 allows us to roughly compare OLS and Tobit estimates Equation 1730 also can be derived directly from equation 1725 using the fact that df1z2dz 5 2zf1z2 The OLS slope coefficients say g j from the regression of yi on xi1 xi2 xik i 5 1 nthat is using all of the dataare direct estimates of E1y0x2xj To make the Tobit coefficient b j comparable to g j we must multiply b j by an adjustment factor As in the probit and logit cases there are 
two common approaches for computing an adjustment factor for obtaining partial effectsat least for continuous explanatory variables Both are based on equation 1730 First the PEA is obtained by evaluating F1xb s 2 which we denote F1xb s 2 We Copyright 2016 Cengage Learning All Rights Reserved May not be copied scanned or duplicated in whole or in part Due to electronic rights some third party content may be suppressed from the eBook andor eChapters Editorial review has deemed that any suppressed content does not materially affect the overall learning experience Cengage Learning reserves the right to remove additional content at any time if subsequent rights restrictions require it PART 3 Advanced Topics 540 can then use this single factor to multiply the coefficients on the continuous explanatory variables The PEA has the same drawbacks here as in the probit and logit cases we may not be interested in the partial effect for the average because the average is either uninteresting or meaningless Plus we must decide whether to use averages of nonlinear functions or plug the averages into the nonlinear functions The average partial effect APE is preferred in most cases Here we compute the scale factor as n21g n i51F1xib s 2 Unlike the PEA the APE does not require us to plug in a fictitious or non existent unit from the population and there are no decisions to make about plugging averages into nonlinear functions Like the PEA the APE scale factor is always between zero and one because 0 F1xb s 2 1 for any values of the explanatory variables In fact P 1yi 00xi2 5 F1xib s 2 and so the APE scale factor and the PEA scale factor tend to be closer to one when there are few observa tions with yi 5 0 In the case that yi 0 for all i the Tobit and OLS estimates of the parameters are identical Of course if yi 0 for all i we cannot justify the use of a Tobit model anyway Using log 1yi2 in a linear regression model makes much more sense Unfortunately for discrete explanatory variables comparing OLS and Tobit estimates is not so easy although using the scale factor for continuous explanatory variables often is a useful approxi mation For Tobit the partial effect of a discrete explanatory variable for example a binary variable should really be obtained by estimating E1y0x2 from equation 1725 For example if x1 is a binary we should first plug in x1 5 1 and then x1 5 0 If we set the other explanatory variables at their sam ple averages we obtain a measure analogous to 1716 for the logit and probit cases If we compute the difference in expected values for each individual and then average the difference we get an APE analogous to 1717 Fortunately many modern statistical packages routinely compute the APES for fairly complicated models including the Tobit model and allow both continuous and discrete explanatory variables ExamplE 172 married Womens annual labor Supply The file MROZ includes data on hours worked for 753 married women 428 of whom worked for a wage outside the home during the year 325 of the women worked zero hours For the women who worked positive hours the range is fairly broad extending from 12 to 4950 Thus annual hours worked is a good candidate for a Tobit model We also estimate a linear model using all 753 obser vations by OLS and compute the heteroskedasticityrobust standard errors The results are given in Table 173 This table has several noteworthy features First the Tobit coefficient estimates have the same sign as the corresponding OLS estimates and the statistical significance of the 
(Possible exceptions are the coefficients on nwifeinc and kidsge6, but the t statistics have similar magnitudes.) Second, though it is tempting to compare the magnitudes of the OLS and Tobit estimates, this is not very informative. We must be careful not to think that, because the Tobit coefficient on kidslt6 is roughly twice that of the OLS coefficient, the Tobit model implies a much greater response of hours worked to young children.

We can multiply the Tobit estimates by appropriate adjustment factors to make them roughly comparable to the OLS estimates. The APE scale factor, $n^{-1}\sum_{i=1}^{n}\Phi(\mathbf{x}_i\hat{\boldsymbol{\beta}}/\hat{\sigma})$, turns out to be about .589, which we can use to obtain the average partial effects for the Tobit estimation. If, for example, we multiply the educ coefficient by .589, we get .589(80.65) ≈ 47.50, that is, 47.5 hours more, which is quite a bit larger than the OLS partial effect, about 28.8 hours.

Table 17.4 contains the APEs for all variables, where the APEs for the linear model are simply the OLS coefficients (except for the variable exper, which appears as a quadratic). The APEs and their standard errors (obtained from Stata 13) are rounded to two decimal places and, because of rounding, can differ slightly from what is obtained by multiplying .589 by the reported Tobit coefficient. The Tobit APEs for nwifeinc, educ, and kidslt6 are all substantially larger in magnitude than the corresponding OLS coefficients. The APEs for exper and age are similar, and for kidsge6 (which is nowhere close to being statistically significant) the Tobit APE is smaller in magnitude.

If instead we want the estimated effect of another year of education starting at the average values of all explanatory variables, then we compute the PEA scale factor, $\Phi(\bar{\mathbf{x}}\hat{\boldsymbol{\beta}}/\hat{\sigma})$. This turns out to be about .645 [when we use the squared average of experience, $(\overline{exper})^2$, rather than the average of $exper^2$]. This partial effect, which is about 52 hours, is almost twice as large as the OLS estimate.

Table 17.3 OLS and Tobit Estimation of Annual Hours Worked (Dependent Variable: hours)

| Independent Variables | Linear (OLS) | Tobit (MLE) |
|---|---|---|
| nwifeinc | −3.45 (2.24) | −8.81 (4.46) |
| educ | 28.76 (13.04) | 80.65 (21.58) |
| exper | 65.67 (10.79) | 131.56 (17.28) |
| exper² | −.700 (.372) | −1.86 (0.54) |
| age | −30.51 (4.24) | −54.41 (7.42) |
| kidslt6 | −442.09 (57.46) | −894.02 (111.88) |
| kidsge6 | −32.78 (22.80) | −16.22 (38.64) |
| constant | 1,330.48 (274.88) | 965.31 (446.44) |
| Log-likelihood value | — | −3,819.09 |
| R-squared | .266 | .274 |
| $\hat{\sigma}$ | 750.18 | 1,122.02 |

Table 17.4 Average Partial Effects for the Hours Worked Models

| Independent Variables | Linear | Tobit |
|---|---|---|
| nwifeinc | −3.45 (2.24) | −5.19 (2.62) |
| educ | 28.76 (13.04) | 47.47 (12.62) |
| exper | 50.78 (4.45) | 48.79 (3.59) |
| age | −30.51 (4.24) | −32.03 (4.29) |
| kidslt6 | −442.09 (57.46) | −526.28 (64.71) |
| kidsge6 | −32.78 (22.80) | −9.55 (22.75) |
We have reported an R-squared for both the linear regression and the Tobit models. The R-squared for OLS is the usual one. For Tobit, the R-squared is the square of the correlation coefficient between $y_i$ and $\hat{y}_i$, where $\hat{y}_i = \Phi(\mathbf{x}_i\hat{\boldsymbol{\beta}}/\hat{\sigma})\mathbf{x}_i\hat{\boldsymbol{\beta}} + \hat{\sigma}\phi(\mathbf{x}_i\hat{\boldsymbol{\beta}}/\hat{\sigma})$ is the estimate of $E(y|\mathbf{x} = \mathbf{x}_i)$. This is motivated by the fact that the usual R-squared for OLS is equal to the squared correlation between the $y_i$ and the fitted values [see equation (3.29)]. In nonlinear models, such as the Tobit model, the squared correlation coefficient is not identical to an R-squared based on a sum of squared residuals, as in (3.28). This is because the fitted values, as defined earlier, and the residuals, $y_i - \hat{y}_i$, are not uncorrelated in the sample. An R-squared defined as the squared correlation coefficient between $y_i$ and $\hat{y}_i$ has the advantage of always being between zero and one; an R-squared based on a sum of squared residuals need not have this feature.

We can see that, based on the R-squared measures, the Tobit conditional mean function fits the hours data somewhat, but not substantially, better. However, we should remember that the Tobit estimates are not chosen to maximize an R-squared (they maximize the log-likelihood function), whereas the OLS estimates are the values that do produce the highest R-squared, given the linear functional form.

By construction, all of the Tobit fitted values for hours are positive. By contrast, 39 of the OLS fitted values are negative. Although negative predictions are of some concern, 39 out of 753 is just over 5% of the observations. It is not entirely clear how negative fitted values for OLS translate into differences in estimated partial effects.

Figure 17.3 plots estimates of $E(hours|\mathbf{x})$ as a function of education for the Tobit model; the other explanatory variables are set at their average values. For the linear model, the equation graphed is $\widehat{hours} = 387.19 + 28.76\,educ$. For the Tobit model, the equation graphed is

$$\widehat{hours} = \Phi[(-694.12 + 80.65\,educ)/1{,}122.02]\,(-694.12 + 80.65\,educ) + 1{,}122.02\cdot\phi[(-694.12 + 80.65\,educ)/1{,}122.02].$$

[Figure 17.3: Estimated expected values of hours with respect to education for the linear and Tobit models.]

As can be seen from the figure, the linear model gives notably higher estimates of the expected hours worked at even fairly high levels of education. For example, at eight years of education, the OLS predicted value of hours is about 617.5, while the Tobit estimate is about 423.9. At 12 years of education, the predicted hours are about 732.7 and 598.3, respectively. The two prediction lines cross after 17 years of education, but no woman in the sample has more than 17 years of education. The increasing slope of the Tobit line clearly indicates the increasing marginal effect of education on expected hours worked.
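Both equations graphed in Figure 17.3 are given in the text, so the 617.5/423.9 and 732.7/598.3 comparisons can be reproduced directly:

```python
import numpy as np
from scipy.stats import norm

def hours_ols(educ):
    return 387.19 + 28.76 * educ

def hours_tobit(educ, sigma=1122.02):
    xb = -694.12 + 80.65 * educ          # other variables at their averages
    return norm.cdf(xb / sigma) * xb + sigma * norm.pdf(xb / sigma)

for e in (8, 12):
    print(e, round(hours_ols(e), 1), round(hours_tobit(e), 1))
```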
17.2b Specification Issues in Tobit Models

The Tobit model, and in particular the formulas for the expectations in (17.24) and (17.25), rely crucially on normality and homoskedasticity in the underlying latent variable model. When $E(y \mid \mathbf{x}) = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k$, we know from Chapter 5 that conditional normality of y does not play a role in unbiasedness, consistency, or large sample inference. Heteroskedasticity does not affect unbiasedness or consistency of OLS, although we must compute robust standard errors and test statistics to perform approximate inference. In a Tobit model, if any of the assumptions in (17.18) fail, then it is hard to know what the Tobit MLE is estimating. Nevertheless, for moderate departures from the assumptions, the Tobit model is likely to provide good estimates of the partial effects on the conditional means. It is possible to allow for more general assumptions in (17.18), but such models are much more complicated to estimate and interpret.

One potentially important limitation of the Tobit model, at least in certain applications, is that the expected value conditional on y > 0 is closely linked to the probability that y > 0. This is clear from equations (17.26) and (17.29). In particular, the effect of $x_j$ on $P(y > 0 \mid \mathbf{x})$ is proportional to $\beta_j$, as is the effect on $E(y \mid y > 0, \mathbf{x})$, where both functions multiplying $\beta_j$ are positive and depend on x only through $\mathbf{x}\boldsymbol\beta/\sigma$. This rules out some interesting possibilities. For example, consider the relationship between amount of life insurance coverage and a person's age. Young people may be less likely to have life insurance at all, so the probability that y > 0 increases with age (at least up to a point). Conditional on having life insurance, the value of policies might decrease with age, since life insurance becomes less important as people near the end of their lives. This possibility is not allowed for in the Tobit model.

One way to informally evaluate whether the Tobit model is appropriate is to estimate a probit model where the binary outcome, say w, equals one if y > 0, and w = 0 if y = 0. Then, from (17.21), w follows a probit model where the coefficient on $x_j$ is $\gamma_j = \beta_j/\sigma$. This means we can estimate the ratio of $\beta_j$ to $\sigma$ by probit, for each j. If the Tobit model holds, the probit estimate, $\hat{\gamma}_j$, should be "close" to $\hat{\beta}_j/\hat{\sigma}$, where $\hat{\beta}_j$ and $\hat{\sigma}$ are the Tobit estimates. These will never be identical because of sampling error. But we can look for certain problematic signs. For example, if $\hat{\gamma}_j$ is significant and negative but $\hat{\beta}_j$ is positive, the Tobit model might not be appropriate. Or, if $\hat{\gamma}_j$ and $\hat{\beta}_j$ are the same sign but $|\hat{\beta}_j/\hat{\sigma}|$ is much larger or smaller than $|\hat{\gamma}_j|$, this could also indicate problems. We should not worry too much about sign changes or magnitude differences on explanatory variables that are insignificant in both models.

In the annual hours worked example, $\hat{\sigma} = 1{,}122.02$. When we divide the Tobit coefficient on nwifeinc by $\hat{\sigma}$, we obtain $-8.81/1{,}122.02 \approx -.0079$; the probit coefficient on nwifeinc is about $-.012$, which is different but not dramatically so. On kidslt6, the Tobit coefficient over $\hat{\sigma}$ is about $-.797$, compared with the probit estimate of $-.868$. Again, this is not a huge difference, but it indicates that having small children has a larger effect on the initial labor force participation decision than on how many hours a woman chooses to work once she is in the labor force. (Tobit effectively averages these two effects together.) We do not know whether the effects are statistically different, but they are of the same order of magnitude.

What happens if we conclude that the Tobit model is inappropriate? There are models, usually called hurdle or two-part models, that can be used when Tobit seems unsuitable. These all have the property that $P(y > 0 \mid \mathbf{x})$ and $E(y \mid y > 0, \mathbf{x})$ depend on different parameters, so $x_j$ can have dissimilar effects on these two functions. [See Wooldridge (2010, Chapter 17) for a description of these models.]
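The informal probit comparison described above amounts to lining up two columns of numbers. As a small illustration, the figures below are the ones reported in the text for nwifeinc and kidslt6 (the probit values come from Table 17.1, which is not shown in this excerpt):

```python
import numpy as np

# Implied probit coefficients under the Tobit model: beta_j / sigma
tobit_beta = np.array([-8.81, -894.02])    # nwifeinc, kidslt6 (Table 17.3)
tobit_sigma = 1122.02
probit_gamma = np.array([-0.012, -0.868])  # probit estimates quoted in text

implied = tobit_beta / tobit_sigma         # about -0.0079 and -0.797
print(np.column_stack([probit_gamma, implied]))
# Same signs and broadly similar magnitudes here; a sign conflict or a large
# magnitude gap on a significant variable would signal misspecification.
```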
17.3 The Poisson Regression Model

Another kind of nonnegative dependent variable is a count variable, which can take on nonnegative integer values: {0, 1, 2, ...}. We are especially interested in cases where y takes on relatively few values, including zero. Examples include the number of children ever born to a woman, the number of times someone is arrested in a year, or the number of patents applied for by a firm in a year. For the same reasons discussed for binary and Tobit responses, a linear model for $E(y \mid x_1, \ldots, x_k)$ might not provide the best fit over all values of the explanatory variables. (Nevertheless, it is always informative to start with a linear model, as we did in Example 3.5.)

As with a Tobit outcome, we cannot take the logarithm of a count variable because it takes on the value zero. A profitable approach is to model the expected value as an exponential function:

$E(y \mid x_1, x_2, \ldots, x_k) = \exp(\beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k).$    (17.31)

Because exp(·) is always positive, (17.31) ensures that predicted values for y will also be positive. (The exponential function is graphed in Figure A.5 of Appendix A.)

Although (17.31) is more complicated than a linear model, we basically already know how to interpret the coefficients. Taking the log of equation (17.31) shows that the log of the expected value is linear:

$\log[E(y \mid x_1, x_2, \ldots, x_k)] = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k.$    (17.32)

Therefore, using the approximation properties of the log function that we have used often in previous chapters,

$\%\Delta E(y \mid \mathbf{x}) \approx (100\,\beta_j)\Delta x_j.$

In other words, $100\,\beta_j$ is roughly the percentage change in $E(y \mid \mathbf{x})$, given a one-unit increase in $x_j$. Sometimes, a more accurate estimate is needed, and we can easily find one by looking at discrete changes in the expected value. Keep all explanatory variables except $x_k$ fixed and let $x_k^0$ be the initial value and $x_k^1$ the subsequent value. Then, the proportionate change in the expected value is

$[\exp(\beta_0 + \mathbf{x}_{k-1}\boldsymbol\beta_{k-1} + \beta_k x_k^1)/\exp(\beta_0 + \mathbf{x}_{k-1}\boldsymbol\beta_{k-1} + \beta_k x_k^0)] - 1 = \exp(\beta_k \Delta x_k) - 1,$

where $\mathbf{x}_{k-1}\boldsymbol\beta_{k-1}$ is shorthand for $\beta_1 x_1 + \cdots + \beta_{k-1} x_{k-1}$ and $\Delta x_k = x_k^1 - x_k^0$. When $\Delta x_k = 1$ (for example, if $x_k$ is a dummy variable that we change from zero to one), the change is $\exp(\beta_k) - 1$. Given $\hat{\beta}_k$, we obtain $\exp(\hat{\beta}_k) - 1$ and multiply this by 100 to turn the proportionate change into a percentage change.

If, say, $x_j = \log(z_j)$ for some variable $z_j > 0$, then its coefficient, $\beta_j$, is interpreted as an elasticity with respect to $z_j$. (Technically, it is an elasticity of the expected value of y with respect to $z_j$, because we cannot compute the percentage change in cases where y = 0.) For our purposes, the distinction is unimportant. The bottom line is that, for practical purposes, we can interpret the coefficients in equation (17.31) as if we have a linear model with log(y) as the dependent variable. There are some subtle differences that we need not study here.
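The exact percentage change, $100[\exp(\beta_k \Delta x_k) - 1]$, is a one-line computation. A short sketch, with a hypothetical coefficient chosen only for illustration:

```python
import numpy as np

def pct_change(beta_k, delta_xk=1.0):
    # Exact percentage change in E(y|x): 100 * [exp(beta_k * delta_xk) - 1]
    return 100 * (np.exp(beta_k * delta_xk) - 1)

# Hypothetical dummy-variable coefficient of .661:
print(pct_change(0.661))   # about 93.7%, versus the rough 100*beta = 66.1%
```

The gap between the exact and approximate effects grows with $|\beta_k|$, which is why the exact formula matters for large coefficients.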
Because (17.31) is nonlinear in its parameters (remember, exp(·) is a nonlinear function), we cannot use linear regression methods. We could use nonlinear least squares, which, just as with OLS, minimizes the sum of squared residuals. It turns out, however, that all standard count data distributions exhibit heteroskedasticity, and nonlinear least squares does not exploit this [see Wooldridge (2010, Chapter 12)]. Instead, we will rely on maximum likelihood and the important related method of quasi-maximum likelihood estimation.

In Chapter 4, we introduced normality as the standard distributional assumption for linear regression. The normality assumption is reasonable for (roughly) continuous dependent variables that can take on a large range of values. A count variable cannot have a normal distribution (the normal distribution is for continuous variables that can take on all values), and if it takes on very few values, the distribution can be very different from normal. Instead, the nominal distribution for count data is the Poisson distribution.

Because we are interested in the effect of explanatory variables on y, we must look at the Poisson distribution conditional on x. The Poisson distribution is entirely determined by its mean, so we only need to specify $E(y \mid \mathbf{x})$. We assume this has the same form as (17.31), which we write in shorthand as $\exp(\mathbf{x}\boldsymbol\beta)$. Then, the probability that y equals the value h, conditional on x, is

$P(y = h \mid \mathbf{x}) = \exp[-\exp(\mathbf{x}\boldsymbol\beta)][\exp(\mathbf{x}\boldsymbol\beta)]^h/h!,\quad h = 0, 1, 2, \ldots,$

where h! denotes factorial (see Appendix B). This distribution, which is the basis for the Poisson regression model, allows us to find conditional probabilities for any values of the explanatory variables. For example, $P(y = 0 \mid \mathbf{x}) = \exp[-\exp(\mathbf{x}\boldsymbol\beta)]$. Once we have estimates of the $\beta_j$, we can plug them into the probabilities for various values of x.

Given a random sample $\{(\mathbf{x}_i, y_i)\colon i = 1, 2, \ldots, n\}$, we can construct the log-likelihood function:

$\mathcal{L}(\boldsymbol\beta) = \sum_{i=1}^{n} \ell_i(\boldsymbol\beta) = \sum_{i=1}^{n} \{y_i \mathbf{x}_i\boldsymbol\beta - \exp(\mathbf{x}_i\boldsymbol\beta)\},$    (17.33)

where we drop the term $-\log(y_i!)$ because it does not depend on β. This log-likelihood function is simple to maximize, although the Poisson MLEs are not obtained in closed form.

The standard errors of the Poisson estimates, $\hat{\beta}_j$, are easy to obtain after the log-likelihood function has been maximized; the formula is in Appendix 17B. These are reported along with the $\hat{\beta}_j$ by any software package.

As with the probit, logit, and Tobit models, we cannot directly compare the magnitudes of the Poisson estimates of an exponential function with the OLS estimates of a linear function. Nevertheless, a rough comparison is possible, at least for continuous explanatory variables. If (17.31) holds, then the partial effect of $x_j$ with respect to $E(y \mid x_1, x_2, \ldots, x_k)$ is

$\partial E(y \mid x_1, x_2, \ldots, x_k)/\partial x_j = \exp(\beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k)\,\beta_j.$

(This expression follows from the chain rule in calculus, because the derivative of the exponential function is just the exponential function.) If we let $\hat{\gamma}_j$ denote an OLS slope coefficient from the regression of y on $x_1, x_2, \ldots, x_k$, then we can roughly compare the magnitude of the $\hat{\gamma}_j$ and the average partial effect for an exponential regression function. Interestingly, the APE scale factor in this case,

$n^{-1}\sum_{i=1}^{n} \exp(\hat{\beta}_0 + \hat{\beta}_1 x_{i1} + \cdots + \hat{\beta}_k x_{ik}) = n^{-1}\sum_{i=1}^{n} \hat{y}_i,$

is simply the sample average, $\bar{y}$, of the $y_i$, where we define the fitted values as $\hat{y}_i = \exp(\hat{\beta}_0 + \mathbf{x}_i\hat{\boldsymbol\beta})$. In other words, for Poisson regression with an exponential mean function, the average of the fitted values is the same as the average of the original outcomes on $y_i$ (just as in the linear regression case). This makes it simple to scale the Poisson estimates, $\hat{\beta}_j$, to make them comparable to the corresponding OLS estimates, $\hat{\gamma}_j$: for a continuous explanatory variable, we can compare $\hat{\gamma}_j$ to $\bar{y}\,\hat{\beta}_j$.
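Poisson MLE is available in standard software. A minimal sketch with simulated data (a stand-in for a real sample) that also verifies the average-of-fitted-values property just described:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 1000
x = rng.normal(size=n)
X = sm.add_constant(x)
y = rng.poisson(np.exp(0.5 + 0.3 * x))     # simulated count outcome

pois = sm.Poisson(y, X).fit(disp=0)
ols = sm.OLS(y, X).fit()

# Average of Poisson fitted values equals ybar (intercept first-order condition)
print(pois.predict(X).mean(), y.mean())
# Rough comparability: ybar * beta_j versus the OLS slope gamma_j
print(y.mean() * pois.params[1], ols.params[1])
```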
Although Poisson MLE analysis is a natural first step for count data, it is often much too restrictive. All of the probabilities and higher moments of the Poisson distribution are determined entirely by the mean. In particular, the variance is equal to the mean:

$\mathrm{Var}(y \mid \mathbf{x}) = E(y \mid \mathbf{x}).$    (17.34)

This is restrictive and has been shown to be violated in many applications. Fortunately, the Poisson distribution has a very nice robustness property: whether or not the Poisson distribution holds, we still get consistent, asymptotically normal estimators of the $\beta_j$. [See Wooldridge (2010, Chapter 18) for details.] This is analogous to the OLS estimator, which is consistent and asymptotically normal whether or not the normality assumption holds; yet OLS is the MLE under normality.

When we use Poisson MLE, but we do not assume that the Poisson distribution is entirely correct, we call the analysis quasi-maximum likelihood estimation (QMLE). The Poisson QMLE is very handy because it is programmed in many econometrics packages. However, unless the Poisson variance assumption (17.34) holds, the standard errors need to be adjusted.

A simple adjustment to the standard errors is available when we assume that the variance is proportional to the mean:

$\mathrm{Var}(y \mid \mathbf{x}) = \sigma^2 E(y \mid \mathbf{x}),$    (17.35)

where $\sigma^2 > 0$ is an unknown parameter. When $\sigma^2 = 1$, we obtain the Poisson variance assumption. When $\sigma^2 > 1$, the variance is greater than the mean for all x; this is called overdispersion because the variance is larger than in the Poisson case, and it is observed in many applications of count regressions. The case $\sigma^2 < 1$, called underdispersion, is less common but is allowed in (17.35).

Under (17.35), it is easy to adjust the usual Poisson MLE standard errors. Let $\hat{\beta}_j$ denote the Poisson QMLEs and define the residuals as $\hat{u}_i = y_i - \hat{y}_i$, where $\hat{y}_i = \exp(\hat{\beta}_0 + \hat{\beta}_1 x_{i1} + \cdots + \hat{\beta}_k x_{ik})$ is the fitted value. As usual, the residual for observation i is the difference between $y_i$ and its fitted value. A consistent estimator of $\sigma^2$ is

$(n - k - 1)^{-1}\sum_{i=1}^{n} \hat{u}_i^2/\hat{y}_i,$

where the division by $\hat{y}_i$ is the proper heteroskedasticity adjustment, and $n - k - 1$ is the df given n observations and $k + 1$ estimates, $\hat{\beta}_0, \hat{\beta}_1, \ldots, \hat{\beta}_k$. Letting $\hat{\sigma}$ be the positive square root of $\hat{\sigma}^2$, we multiply the usual Poisson standard errors by $\hat{\sigma}$. If $\hat{\sigma}$ is notably greater than one, the corrected standard errors can be much bigger than the nominal, generally incorrect, Poisson MLE standard errors.

Even (17.35) is not entirely general. Just as in the linear model, we can obtain standard errors for the Poisson QMLE that do not restrict the variance at all. [See Wooldridge (2010, Chapter 18) for further explanation.]
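The adjustment is a few lines of arithmetic. Continuing the simulated Poisson sketch above (so y, X, and the fitted model pois are assumed to exist):

```python
import numpy as np

# sigma2_hat = (n - k - 1)^{-1} * sum_i u_i^2 / yhat_i
yhat = pois.predict(X)
u = y - yhat
k = X.shape[1] - 1                       # number of slope coefficients
sigma2_hat = (u**2 / yhat).sum() / (len(y) - k - 1)
sigma_hat = np.sqrt(sigma2_hat)

adj_se = sigma_hat * pois.bse            # corrected standard errors
print(sigma_hat, adj_se)                 # sigma_hat > 1 indicates overdispersion
```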
Under the Poisson distributional assumption, we can use the likelihood ratio statistic to test exclusion restrictions, which, as always, has the form in (17.12). If we have q exclusion restrictions, the statistic is distributed approximately as $\chi^2_q$ under the null. Under the less restrictive assumption (17.35), a simple adjustment is available (and then we call the statistic the quasi-likelihood ratio statistic): we divide (17.12) by $\hat{\sigma}^2$, where $\hat{\sigma}^2$ is obtained from the unrestricted model.

Example 17.3 Poisson Regression for Number of Arrests

We now apply the Poisson regression model to the arrest data in CRIME1, used, among other places, in Example 9.1. The dependent variable, narr86, is the number of times a man is arrested during 1986. This variable is zero for 1,970 of the 2,725 men in the sample, and only eight values of narr86 are greater than five. Thus, a Poisson regression model is more appropriate than a linear regression model. Table 17.5 also presents the results of OLS estimation of a linear regression model.

The standard errors for OLS are the usual ones; we could certainly have made these robust to heteroskedasticity. The standard errors for Poisson regression are the usual maximum likelihood standard errors. Because $\hat{\sigma} = 1.232$, the standard errors for Poisson regression should be inflated by this factor, so each corrected standard error is about 23% higher. For example, a more reliable standard error for tottime is 1.232(.015) ≈ .0185, which gives a t statistic of about 1.3. The adjustment to the standard errors reduces the significance of all variables, but several of them are still very statistically significant.

The OLS and Poisson coefficients are not directly comparable, and they have very different meanings. For example, the coefficient on pcnv implies that, if $\Delta pcnv = .10$, the expected number of arrests falls by .013 (pcnv is the proportion of prior arrests that led to conviction). The Poisson coefficient implies that $\Delta pcnv = .10$ reduces expected arrests by about 4% [.402(.10) = .0402, and we multiply this by 100 to get the percentage effect]. As a policy matter, this suggests we can reduce overall arrests by about 4% if we can increase the probability of conviction by .1.

The Poisson coefficient on black implies that, other factors being equal, the expected number of arrests for a black man is estimated to be about 100·[exp(.661) − 1] ≈ 93.7% higher than for a white man with the same values for the other explanatory variables.

As with the Tobit application in Table 17.3, we report an R-squared for Poisson regression: the squared correlation coefficient between $y_i$ and $\hat{y}_i = \exp(\hat{\beta}_0 + \hat{\beta}_1 x_{i1} + \cdots + \hat{\beta}_k x_{ik})$. The motivation for this goodness-of-fit measure is the same as for the Tobit model. We see that the exponential regression model, estimated by Poisson QMLE, fits slightly better. Remember that the OLS estimates are chosen to maximize the R-squared, but the Poisson estimates are not. (They are selected to maximize the log-likelihood function.)

Exploring Further 17.4
Suppose that we obtain $\hat{\sigma}^2 = 2$. How will the adjusted standard errors compare with the usual Poisson MLE standard errors? How will the quasi-LR statistic compare with the usual LR statistic?
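The quasi-likelihood ratio adjustment described before Example 17.3 is mechanical once the restricted and unrestricted models have been fit. Continuing the simulated sketch from Section 17.3 (y, X, pois, and sigma2_hat are assumed to exist; the single exclusion restriction here is hypothetical):

```python
import statsmodels.api as sm

X_r = X[:, :1]                           # drop x: intercept-only model (q = 1)
pois_r = sm.Poisson(y, X_r).fit(disp=0)

lr = 2 * (pois.llf - pois_r.llf)         # usual LR statistic, approx. chi^2_q
qlr = lr / sigma2_hat                    # quasi-LR under Var(y|x) = sigma^2 E(y|x)
print(lr, qlr)
```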
Table 17.5: Determinants of Number of Arrests for Young Men
Dependent variable: narr86 (standard errors in parentheses)

Independent variable    Linear (OLS)          Exponential (Poisson QMLE)
pcnv                      -.132    (.040)       -.402    (.085)
avgsen                    -.011    (.012)       -.024    (.020)
tottime                    .012    (.009)        .024    (.015)
ptime86                   -.041    (.009)       -.099    (.021)
qemp86                    -.051    (.014)       -.038    (.029)
inc86                     -.0015   (.0003)      -.0081   (.0010)
black                      .327    (.045)        .661    (.074)
hispan                     .194    (.040)        .500    (.074)
born60                    -.022    (.033)       -.051    (.064)
constant                   .577    (.038)       -.600    (.067)
Log-likelihood value          –              -2,248.76
R-squared                    .073                 .077
σ̂                            .829                1.232

Other count data regression models have been proposed and used in applications; these generalize the Poisson distribution in a variety of ways. If we are interested in the effects of the $x_j$ on the mean response, there is little reason to go beyond Poisson regression: it is simple, often gives good results, and has the robustness property discussed earlier. [In fact, we could apply Poisson regression to a y that is a Tobit-like outcome, provided (17.31) holds. This might give good estimates of the mean effects.] Extensions of Poisson regression are more useful when we are interested in estimating probabilities, such as $P(y > 1 \mid \mathbf{x})$. [See, for example, Cameron and Trivedi (1998).]

17.4 Censored and Truncated Regression Models

The models in Sections 17.1, 17.2, and 17.3 apply to various kinds of limited dependent variables that arise frequently in applied econometric work. In using these methods, it is important to remember that we use a probit or logit model for a binary response, a Tobit model for a corner solution outcome, or a Poisson regression model for a count response because we want models that account for important features of the distribution of y. There is no issue of data observability. For example, in the Tobit application to women's labor supply in Example 17.2, there is no problem with observing hours worked: it is simply the case that a nontrivial fraction of married women in the population choose not to work for a wage. In the Poisson regression application to annual arrests, we observe the dependent variable for every young man in a random sample from the population, but the dependent variable can be zero as well as other small integer values.

Unfortunately, the distinction between lumpiness in an outcome variable (such as taking on the value zero for a nontrivial fraction of the population) and problems of data censoring can be confusing. This is particularly true when applying the Tobit model. In this book, the standard Tobit model described in Section 17.2 is only for corner solution outcomes. But the literature on Tobit models usually treats another situation within the same framework: the response variable has been censored above or below some threshold. Typically, the censoring is due to survey design and, in some cases, institutional constraints. Rather than treat data censoring problems along with corner solution outcomes, we solve data censoring by applying a censored regression model. Essentially, the problem solved by a censored regression model is one of missing data on the response variable, y. Although we are able to randomly draw units from the population and obtain information on the explanatory variables for all units, the outcome on $y_i$ is missing for some i.
Still, we know whether the missing values are above or below a given threshold, and this knowledge provides useful information for estimating the parameters.

A truncated regression model arises when we exclude, on the basis of y, a subset of the population in our sampling scheme. In other words, we do not have a random sample from the underlying population, but we know the rule that was used to include units in the sample. This rule is determined by whether y is above or below a certain threshold. We explain more fully the difference between censored and truncated regression models later.

17.4a Censored Regression Models

While censored regression models can be defined without distributional assumptions, in this subsection we study the censored normal regression model. The variable we would like to explain, y, follows the classical linear model. For emphasis, we put an i subscript on a random draw from the population:

$y_i = \beta_0 + \mathbf{x}_i\boldsymbol\beta + u_i,\quad u_i \mid \mathbf{x}_i, c_i \sim \mathrm{Normal}(0, \sigma^2),$    (17.36)
$w_i = \min(y_i, c_i).$    (17.37)

Rather than observing $y_i$, we observe it only if it is less than a censoring value, $c_i$. Notice that (17.36) includes the assumption that $u_i$ is independent of $c_i$. (For concreteness, we explicitly consider censoring from above, or right censoring; the problem of censoring from below, or left censoring, is handled similarly.)

One example of right data censoring is top coding. When a variable is top coded, we know its value only up to a certain threshold. For responses greater than the threshold, we only know that the variable is at least as large as the threshold. For example, in some surveys, family wealth is top coded. Suppose that respondents are asked their wealth, but people are allowed to respond with "more than $500,000." Then, we observe actual wealth for those respondents whose wealth is less than $500,000, but not for those whose wealth is greater than $500,000. In this case, the censoring threshold, $c_i$, is the same for all i. In many situations, the censoring threshold changes with individual or family characteristics.

If we observed a random sample for (x, y), we would simply estimate β by OLS, and statistical inference would be standard. (We again absorb the intercept into x for simplicity.) The censoring causes problems. Using arguments similar to the Tobit model, an OLS regression using only the uncensored observations (that is, those with $y_i < c_i$) produces inconsistent estimators of the $\beta_j$.

Exploring Further 17.5
Let $mvp_i$ be the marginal value product for worker i; this is the price of a firm's good multiplied by the marginal product of the worker. Assume $mvp_i$ is a linear function of exogenous variables, such as education, experience, and so on, and an unobservable error. Under perfect competition and without institutional constraints, each worker is paid his or her marginal value product. Let $minwage_i$ denote the minimum wage for worker i, which varies by state. We observe $wage_i$, which is the larger of $mvp_i$ and $minwage_i$. Write the appropriate model for the observed wage.
An OLS regression of $w_i$ on $\mathbf{x}_i$ using all observations does not consistently estimate the $\beta_j$, unless there is no censoring. This is similar to the Tobit case, but the problem is much different. In the Tobit model, we are modeling economic behavior, which often yields zero outcomes; the Tobit model is supposed to reflect this. With censored regression, we have a data collection problem because, for some reason, the data are censored.

Under the assumptions in (17.36) and (17.37), we can estimate β and $\sigma^2$ by maximum likelihood, given a random sample on $(\mathbf{x}_i, w_i)$. For this, we need the density of $w_i$, given $(\mathbf{x}_i, c_i)$. For uncensored observations, $w_i = y_i$, and the density of $w_i$ is the same as that for $y_i$: Normal$(\mathbf{x}_i\boldsymbol\beta, \sigma^2)$. For censored observations, we need the probability that $w_i$ equals the censoring value, $c_i$, given $\mathbf{x}_i$:

$P(w_i = c_i \mid \mathbf{x}_i) = P(y_i \geq c_i \mid \mathbf{x}_i) = P(u_i \geq c_i - \mathbf{x}_i\boldsymbol\beta) = 1 - \Phi[(c_i - \mathbf{x}_i\boldsymbol\beta)/\sigma].$

We can combine these two parts to obtain the density of $w_i$, given $\mathbf{x}_i$ and $c_i$:

$f(w \mid \mathbf{x}_i, c_i) = 1 - \Phi[(c_i - \mathbf{x}_i\boldsymbol\beta)/\sigma],\quad w = c_i,$    (17.38)
$f(w \mid \mathbf{x}_i, c_i) = (1/\sigma)\phi[(w - \mathbf{x}_i\boldsymbol\beta)/\sigma],\quad w < c_i.$    (17.39)

The log-likelihood for observation i is obtained by taking the natural log of the density for each i. We can maximize the sum of these across i, with respect to the $\beta_j$ and σ, to obtain the MLEs.

It is important to know that we can interpret the $\beta_j$ just as in a linear regression model under random sampling. This is much different than Tobit applications to corner solution responses, where the expectations of interest are nonlinear functions of the $\beta_j$.

An important application of censored regression models is duration analysis. A duration is a variable that measures the time before a certain event occurs. For example, we might wish to explain the number of days before a felon released from prison is arrested. For some felons, this may never happen, or it may happen after such a long time that we must censor the duration in order to analyze the data.

In duration applications of censored normal regression, as well as in top coding, we often use the natural log as the dependent variable, which means we also take the log of the censoring threshold in (17.37). As we have seen throughout this text, using the log transformation for the dependent variable causes the parameters to be interpreted as percentage changes. Further, as with many positive variables, the log of a duration typically has a distribution closer to (conditional) normal than the duration itself.
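The MLE that combines (17.38) and (17.39) is straightforward to code with a numerical optimizer. A minimal sketch with simulated right-censored data (all names and the common censoring point are hypothetical):

```python
import numpy as np
from scipy.stats import norm
from scipy.optimize import minimize

rng = np.random.default_rng(0)
n = 800
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y_star = X @ np.array([1.0, 0.5]) + rng.normal(size=n)
c = np.full(n, 1.5)                      # common censoring point
w = np.minimum(y_star, c)                # observed outcome, as in (17.37)

def neg_loglik(theta, w, X, c):
    beta, sigma = theta[:-1], np.exp(theta[-1])   # log-sigma keeps sigma > 0
    xb = X @ beta
    cens = w >= c
    ll_c = norm.logsf((c - xb) / sigma)           # log[1 - Phi((c - xb)/sigma)]
    ll_u = norm.logpdf((w - xb) / sigma) - np.log(sigma)
    return -np.where(cens, ll_c, ll_u).sum()

theta0 = np.zeros(X.shape[1] + 1)
res = minimize(neg_loglik, theta0, args=(w, X, c), method="BFGS")
print(res.x[:-1], np.exp(res.x[-1]))     # beta_hat and sigma_hat
```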
Example 17.4 Duration of Recidivism

The file RECID contains data on the time in months until an inmate in a North Carolina prison is arrested after being released from prison; call this durat. Some inmates participated in a work program while in prison. We also control for a variety of demographic variables, as well as for measures of prison and criminal history. Of 1,445 inmates, 893 had not been arrested during the period they were followed; therefore, these observations are censored. The censoring times differed among inmates, ranging from 70 to 81 months.

Table 17.6 gives the results of censored normal regression for log(durat). Each of the coefficients, when multiplied by 100, gives the estimated percentage change in expected duration, given a ceteris paribus increase of one unit in the corresponding explanatory variable.

Several of the coefficients in Table 17.6 are interesting. The variables priors (number of prior convictions) and tserved (total months spent in prison) have negative effects on the time until the next arrest occurs. This suggests that these variables measure proclivity for criminal activity rather than representing a deterrent effect. For example, an inmate with one more prior conviction has a duration until next arrest that is almost 14% less. A year of time served reduces duration by about 100·[12(.019)] = 22.8%. A somewhat surprising finding is that a man serving time for a felony has an estimated expected duration that is almost 56% longer [exp(.444) − 1 ≈ .56] than a man serving time for a nonfelony.

Those with a history of drug or alcohol abuse have substantially shorter expected durations until the next arrest. (The variables alcohol and drugs are binary variables.) Older men, and men who were married at the time of incarceration, are expected to have significantly longer durations until their next arrest. Black men have substantially shorter durations, on the order of 42% [exp(−.543) − 1 ≈ −.42].

The key policy variable, workprg, does not have the desired effect. The point estimate is that, other things being equal, men who participated in the work program have estimated recidivism durations that are about 6.3% shorter than men who did not participate. The coefficient has a small t statistic, so we would probably conclude that the work program has no effect. This could be due to a self-selection problem, or it could be a product of the way men were assigned to the program. Of course, it may simply be that the program was ineffective.

In this example, it is crucial to account for the censoring, especially because almost 62% of the durations are censored. If we apply straight OLS to the entire sample and treat the censored durations as if they were uncensored, the coefficient estimates are markedly different. In fact, they are all shrunk toward zero. For example, the coefficient on priors becomes −.059 (se = .009), and that on alcohol becomes −.262 (se = .060). Although the directions of the effects are the same, the importance of these variables is greatly diminished. The censored regression estimates are much more reliable.

Table 17.6: Censored Regression Estimation of Criminal Recidivism
Dependent variable: log(durat)

Independent variable    Coefficient    Standard Error
workprg                    -.063            .120
priors                     -.137            .021
tserved                    -.019            .003
felon                       .444            .145
alcohol                    -.635            .144
drugs                      -.298            .133
black                      -.543            .117
married                     .341            .140
educ                        .023            .025
age                         .0039           .0006
constant                   4.099            .348
Log-likelihood value    -1,597.06
σ̂                           1.810

There are other ways of measuring the effects of each of the explanatory variables in Table 17.6 on the duration, rather than focusing only on the expected duration. A treatment of modern duration analysis is beyond the scope of this text. [For an introduction, see Wooldridge (2010, Chapter 22).]

If any of the assumptions of the censored normal regression model are violated (in particular, if there is heteroskedasticity or nonnormality in the $u_i$), the MLEs are generally inconsistent.
This shows that the censoring is potentially very costly, as OLS using an uncensored sample requires neither normality nor homoskedasticity for consistency. There are methods that do not require us to assume a distribution, but they are more advanced. [See Wooldridge (2010, Chapter 19).]

17.4b Truncated Regression Models

The truncated regression model differs in an important respect from the censored regression model. In the case of data censoring, we do randomly sample units from the population. The censoring problem is that, while we always observe the explanatory variables for each randomly drawn unit, we observe the outcome on y only when it is not censored above or below a given threshold. With data truncation, we restrict attention to a subset of the population prior to sampling, so there is a part of the population for which we observe no information. In particular, we have no information on explanatory variables. The truncated sampling scenario typically arises when a survey targets a particular subset of the population and, perhaps due to cost considerations, entirely ignores the other part of the population. Subsequently, researchers might want to use the truncated sample to answer questions about the entire population, but one must recognize that the sampling scheme did not generate a random sample from the whole population.

As an example, Hausman and Wise (1977) used data from a negative income tax experiment to study various determinants of earnings. To be included in the study, a family had to have income less than 1.5 times the 1967 poverty line, where the poverty line depended on family size. Hausman and Wise wanted to use the data to estimate an earnings equation for the entire population.

The truncated normal regression model begins with an underlying population model that satisfies the classical linear model assumptions:

$y = \beta_0 + \mathbf{x}\boldsymbol\beta + u,\quad u \mid \mathbf{x} \sim \mathrm{Normal}(0, \sigma^2).$    (17.40)

Recall that this is a strong set of assumptions, because u must not only be independent of x, but also normally distributed. We focus on this model because relaxing the assumptions is difficult.

Under (17.40), we know that, given a random sample from the population, OLS is the most efficient estimation procedure. The problem arises because we do not observe a random sample from the population: Assumption MLR.2 is violated. In particular, a random draw $(\mathbf{x}_i, y_i)$ is observed only if $y_i \leq c_i$, where $c_i$ is the truncation threshold that can depend on exogenous variables, in particular, the $\mathbf{x}_i$. (In the Hausman and Wise example, $c_i$ depends on family size.) This means that, if $\{(\mathbf{x}_i, y_i)\colon i = 1, \ldots, n\}$ is our observed sample, then $y_i$ is necessarily less than or equal to $c_i$. This differs from the censored regression model: in a censored regression model, we observe $\mathbf{x}_i$ for any randomly drawn observation from the population; in the truncated model, we only observe $\mathbf{x}_i$ if $y_i \leq c_i$.

To estimate the $\beta_j$ (along with σ), we need the distribution of $y_i$, given that $y_i \leq c_i$ and given $\mathbf{x}_i$. This is written as

$g(y \mid \mathbf{x}_i, c_i) = f(y \mid \mathbf{x}_i\boldsymbol\beta, \sigma^2)/F(c_i \mid \mathbf{x}_i\boldsymbol\beta, \sigma^2),\quad y \leq c_i,$    (17.41)

where $f(y \mid \mathbf{x}_i\boldsymbol\beta, \sigma^2)$ denotes the normal density with mean $\beta_0 + \mathbf{x}_i\boldsymbol\beta$ and variance $\sigma^2$, and $F(c_i \mid \mathbf{x}_i\boldsymbol\beta, \sigma^2)$ is the normal cdf with the same mean and variance, evaluated at $c_i$. This expression for the density, conditional on $y_i \leq c_i$, makes intuitive sense: it is the population density for y, given x, divided by the probability that $y_i$ is less than or equal to $c_i$ (given $\mathbf{x}_i$), $P(y_i \leq c_i \mid \mathbf{x}_i)$. In effect, we renormalize the density by dividing by the area under $f(\cdot \mid \mathbf{x}_i\boldsymbol\beta, \sigma^2)$ that is to the left of $c_i$.
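The log of (17.41) gives the truncated normal log-likelihood maximized in the next paragraph. A minimal sketch with simulated data (the common truncation point and all names are hypothetical; note that only draws with $y_i \leq c_i$ enter the sample):

```python
import numpy as np
from scipy.stats import norm
from scipy.optimize import minimize

rng = np.random.default_rng(0)
N = 3000
x_all = rng.normal(size=N)
y_all = 1.0 + 0.5 * x_all + rng.normal(size=N)
keep = y_all <= 1.5                       # truncation from above at c = 1.5
X = np.column_stack([np.ones(keep.sum()), x_all[keep]])
y, c = y_all[keep], np.full(keep.sum(), 1.5)

def neg_loglik(theta, y, X, c):
    beta, sigma = theta[:-1], np.exp(theta[-1])
    xb = X @ beta
    # log g(y|x,c) = log f(y|xb, sigma^2) - log F(c|xb, sigma^2), as in (17.41)
    ll = (norm.logpdf((y - xb) / sigma) - np.log(sigma)
          - norm.logcdf((c - xb) / sigma))
    return -ll.sum()

res = minimize(neg_loglik, np.zeros(X.shape[1] + 1), args=(y, X, c),
               method="BFGS")
print(res.x[:-1], np.exp(res.x[-1]))      # recovers roughly (1.0, 0.5) and 1.0
```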
If we take the log of (17.41), sum across all i, and maximize the result with respect to the $\beta_j$ and $\sigma^2$, we obtain the maximum likelihood estimators. This leads to consistent, approximately normal estimators. The inference, including standard errors and log-likelihood statistics, is standard and treated in Wooldridge (2010, Chapter 19).

We could analyze the data from Example 17.4 as a truncated sample if we dropped all data on an observation whenever it is censored. This would give us 552 observations from a truncated normal distribution, where the truncation point differs across i. However, we would never analyze duration data (or top-coded data) in this way, as it eliminates useful information. The fact that we know a lower bound for 893 durations, along with the explanatory variables, is useful information; censored regression uses this information, while truncated regression does not.

A better example of truncated regression is given in Hausman and Wise (1977), where they emphasize that OLS applied to a sample truncated from above generally produces estimators biased toward zero. Intuitively, this makes sense. Suppose that the relationship of interest is between income and education levels. If we only observe people whose income is below a certain threshold, we are lopping off the upper end. This tends to flatten the estimated line relative to the true regression line in the whole population. Figure 17.4 illustrates the problem when income is truncated from above at $50,000. Although we observe the data points represented by the open circles, we do not observe the data points represented by the darkened circles. A regression analysis using the truncated sample does not lead to consistent estimators.

Incidentally, if the sample in Figure 17.4 were censored, rather than truncated (that is, if we had top-coded data), we would observe education levels for all points in Figure 17.4, but for individuals with incomes above $50,000, we would not know the exact income amount. We would only know that income was at least $50,000. In effect, all observations represented by the darkened circles would be brought down to the horizontal line at income = 50.

[Figure 17.4: A true (or population) regression line and the incorrect regression line for the truncated population, with observed incomes below $50,000. Horizontal axis: education in years; vertical axis: income in thousands of dollars.]

As with censored regression, if the underlying homoskedastic normal assumption in (17.40) is violated, the truncated normal MLE is biased and inconsistent. Methods that do not require these assumptions are available. [See Wooldridge (2010, Chapter 19) for discussion and references.]
17.5 Sample Selection Corrections

Truncated regression is a special case of a general problem known as nonrandom sample selection. But survey design is not the only cause of nonrandom sample selection. Often, respondents fail to provide answers to certain questions, which leads to missing data for the dependent or independent variables. Because we cannot use these observations in our estimation, we should wonder whether dropping them leads to bias in our estimators.

Another general example is usually called incidental truncation. Here, we do not observe y because of the outcome of another variable. The leading example is estimating the so-called wage offer function from labor economics. Interest lies in how various factors, such as education, affect the wage an individual could earn in the labor force. For people who are in the workforce, we observe the wage offer as the current wage. But for those currently out of the workforce, we do not observe the wage offer. Because working may be systematically correlated with unobservables that affect the wage offer, using only working people (as we have in all wage examples so far) might produce biased estimators of the parameters in the wage offer equation.

Nonrandom sample selection can also arise when we have panel data. In the simplest case, we have two years of data but, due to attrition, some people leave the sample. This is particularly a problem in policy analysis, where attrition may be related to the effectiveness of a program.

17.5a When Is OLS on the Selected Sample Consistent?

In Section 9.4, we provided a brief discussion of the kinds of sample selection that can be ignored. The key distinction is between exogenous and endogenous sample selection. In the truncated Tobit case, we clearly have endogenous sample selection, and OLS is biased and inconsistent. On the other hand, if our sample is determined solely by an exogenous explanatory variable, we have exogenous sample selection. Cases between these extremes are less clear, and we now provide careful definitions and assumptions for them. The population model is

$y = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k + u,\quad E(u \mid x_1, x_2, \ldots, x_k) = 0.$    (17.42)

It is useful to write the population model for a random draw as

$y_i = \mathbf{x}_i\boldsymbol\beta + u_i,$    (17.43)

where we use $\mathbf{x}_i\boldsymbol\beta$ as shorthand for $\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_k x_{ik}$. Now, let n be the size of a random sample from the population. If we could observe $y_i$ and each $x_{ij}$ for all i, we would simply use OLS. Assume that, for some reason, either $y_i$ or some of the independent variables are not observed for certain i. For at least some observations, we observe the full set of variables. Define a selection indicator, $s_i$, for each i, by $s_i = 1$ if we observe all of $(y_i, \mathbf{x}_i)$, and $s_i = 0$ otherwise. Thus, $s_i = 1$ indicates that we will use the observation in our analysis; $s_i = 0$ means the observation will not be used. We are interested in the statistical properties of the OLS estimators using the selected sample, that is, using observations for which $s_i = 1$. Therefore, we use fewer than n observations, say $n_1$.

It turns out to be easy to obtain conditions under which OLS is consistent (and even unbiased). Effectively, rather than estimating (17.43), we can only estimate the equation

$s_i y_i = s_i \mathbf{x}_i\boldsymbol\beta + s_i u_i.$    (17.44)

When $s_i = 1$, we simply have (17.43); when $s_i = 0$, we simply have $0 = 0 + 0$, which clearly tells us nothing about β. Regressing $s_i y_i$ on $s_i \mathbf{x}_i$ for $i = 1, 2, \ldots, n$ is the same as regressing $y_i$ on $\mathbf{x}_i$ using the observations for which $s_i = 1$. Thus, we can learn about the consistency of the $\hat{\beta}_j$ by studying (17.44) on a random sample.
From our analysis in Chapter 5, the OLS estimators from (17.44) are consistent if the error term has zero mean and is uncorrelated with each explanatory variable. In the population, the zero mean assumption is $E(su) = 0$, and the zero correlation assumptions can be stated as

$E[(sx_j)(su)] = E(sx_j u) = 0,$    (17.45)

where s, $x_j$, and u are random variables representing the population; we have used the fact that $s^2 = s$ because s is a binary variable. Condition (17.45) is different from what we need if we observe all variables for a random sample, $E(x_j u) = 0$. Therefore, in the population, we need u to be uncorrelated with $sx_j$.

The key condition for unbiasedness is $E(su \mid sx_1, \ldots, sx_k) = 0$. As usual, this is a stronger assumption than that needed for consistency.

If s is a function only of the explanatory variables, then $sx_j$ is just a function of $x_1, x_2, \ldots, x_k$; by the conditional mean assumption in (17.42), $sx_j$ is also uncorrelated with u. In fact, $E(su \mid sx_1, \ldots, sx_k) = sE(u \mid sx_1, \ldots, sx_k) = 0$, because $E(u \mid x_1, \ldots, x_k) = 0$. This is the case of exogenous sample selection, where $s_i = 1$ is determined entirely by $x_{i1}, \ldots, x_{ik}$. As an example, if we are estimating a wage equation where the explanatory variables are education, experience, tenure, gender, marital status, and so on (which are assumed to be exogenous), we can select the sample on the basis of any or all of the explanatory variables.

If sample selection is entirely random in the sense that $s_i$ is independent of $(\mathbf{x}_i, u_i)$, then $E(sx_j u) = E(s)E(x_j u) = 0$, because $E(x_j u) = 0$ under (17.42). Therefore, if we begin with a random sample and randomly drop observations, OLS is still consistent. In fact, OLS is again unbiased in this case, provided there is no perfect multicollinearity in the selected sample.

If s depends on the explanatory variables and additional random terms that are independent of x and u, OLS is also consistent and unbiased. For example, suppose that IQ score is an explanatory variable in a wage equation, but IQ is missing for some people. Suppose we think that selection can be described by s = 1 if IQ ≥ v, and s = 0 if IQ < v, where v is an unobserved random variable that is independent of IQ, u, and the other explanatory variables. This means that we are more likely to observe an IQ that is high, but there is always some chance of not observing any IQ. Conditional on the explanatory variables, s is independent of u, which means that $E(u \mid x_1, \ldots, x_k, s) = E(u \mid x_1, \ldots, x_k)$, and the last expectation is zero by assumption on the population model. If we add the homoskedasticity assumption $E(u^2 \mid \mathbf{x}, s) = E(u^2) = \sigma^2$, then the usual OLS standard errors and test statistics are valid.

So far, we have shown several situations where OLS on the selected sample is unbiased, or at least consistent. When is OLS on the selected sample inconsistent? We already saw one example: regression using a truncated sample. When the truncation is from above, $s_i = 1$ if $y_i \leq c_i$, where $c_i$ is the truncation threshold. Equivalently, $s_i = 1$ if $u_i \leq c_i - \mathbf{x}_i\boldsymbol\beta$. Because $s_i$ depends directly on $u_i$, $s_i$ and $u_i$ will not be uncorrelated, even conditional on $\mathbf{x}_i$. This is why OLS on the selected sample does not consistently estimate the $\beta_j$.
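A small simulation makes the exogenous versus endogenous distinction concrete. In the sketch below (all names and parameter values are hypothetical), selecting on x leaves OLS consistent, while selecting on y (truncation from above) attenuates the slope:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 5000
x = rng.normal(size=n)
u = rng.normal(size=n)
y = 1.0 + 2.0 * x + u                   # population: beta0 = 1, beta1 = 2
X = sm.add_constant(x)

# Exogenous selection (based on x only): OLS remains consistent
s_ex = x > -0.5
print(sm.OLS(y[s_ex], X[s_ex]).fit().params)    # close to (1, 2)

# Endogenous selection (based on y): OLS is biased, slope shrunk toward zero
s_en = y < 2.0
print(sm.OLS(y[s_en], X[s_en]).fit().params)
```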
There are less obvious ways that s and u can be correlated; we consider this in the next subsection.

The results on consistency of OLS extend to instrumental variables estimation. If the IVs are denoted $z_h$ in the population, the key condition for consistency of 2SLS is $E(sz_h u) = 0$, which holds if $E(u \mid \mathbf{z}, s) = 0$. Therefore, if selection is determined entirely by the exogenous variables z, or if s depends on other factors that are independent of u and z, then 2SLS on the selected sample is generally consistent. We do need to assume that the explanatory and instrumental variables are appropriately correlated in the selected part of the population. [Wooldridge (2010, Chapter 19) contains precise statements of these assumptions.]

It can also be shown that, when selection is entirely a function of the exogenous variables, maximum likelihood estimation of a nonlinear model (such as a logit or probit model) produces consistent, asymptotically normal estimators, and the usual standard errors and test statistics are valid. [Again, see Wooldridge (2010, Chapter 19).]

17.5b Incidental Truncation

As we mentioned earlier, a common form of sample selection is called incidental truncation. We again start with the population model in (17.42). However, we assume that we will always observe the explanatory variables, $x_j$. The problem is, we only observe y for a subset of the population. The rule determining whether we observe y does not depend directly on the outcome of y.
see that correlation between u and v generally causes a sample selection problem To see why assume that 1u v2 is inde pendent of z Then taking the expectation of 1746 conditional on z and v and using the fact that x is a subset of z gives E1y0zv2 5 xb 1 E1u0zv2 5 xb 1 E1u0v2 where E1u0zv2 5 E1u0v2 because 1u v2 is independent of z Now if u and v are jointly normal with zero mean then E1u0v2 5 rv for some parameter r Therefore E1y0zv2 5 xb 1 rv We do not observe v but we can use this equation to compute E1y0zs2 and then specialize this to s 5 1 We now have E1y0zs2 5 xb 1 rE1v0zs2 Because s and v are related by 1747 and v has a standard normal distribution we can show that E1v0z s2 is simply the inverse Mills ratio l1zg2 when s 5 1 This leads to the important equation E1y0zs 5 12 5 xb 1 rl1zg2 1748 Equation 1748 shows that the expected value of y given z and observability of y is equal to xb plus an additional term that depends on the inverse Mills ratio evaluated at xg Remember we hope to estimate b This equation shows that we can do so using only the selected sample provided we include the term l1zg2 as an additional regressor If r 5 0 l1zg2 does not appear and OLS of y on x using the selected sample consistently esti mates b Otherwise we have effectively omitted a variable l1zg2 which is generally correlated with x When does r 5 0 The answer is when u and v are uncorrelated Because g is unknown we cannot evaluate l1zig2 for each i However from the assumptions we have made s given z follows a probit model P1s 5 10z2 5 F1zg2 1749 Copyright 2016 Cengage Learning All Rights Reserved May not be copied scanned or duplicated in whole or in part Due to electronic rights some third party content may be suppressed from the eBook andor eChapters Editorial review has deemed that any suppressed content does not materially affect the overall learning experience Cengage Learning reserves the right to remove additional content at any time if subsequent rights restrictions require it PART 3 Advanced Topics 556 Therefore we can estimate g by probit of si on zi using the entire sample In a second step we can estimate b We summarize the procedure which has recently been dubbed the Heckit method in econometrics literature after the work of Heckman 1976 Sample Selection Correction i Using all n observations estimate a probit model of si on zi and obtain the estimates g h Compute the inverse Mills ratio l i 5 l1zig 2 for each i Actually we need these only for the i with si 5 1 ii Using the selected sample that is the observations for which si 5 1 say n1 of them run the regression of yi on xi l i 1750 The b j are consistent and approximately normally distributed A simple test of selection bias is available from regression 1750 Namely we can use the usual t statistic on l i as a test of H0 r 5 0 Under H0 there is no sample selection problem When r 2 0 the usual OLS standard errors reported from 1750 are not correct This is because they do not account for estimation of g which uses the same observations in regression 1750 and more Some econometrics packages compute corrected standard errors Unfortunately it is not as simple as a heteroskedasticity adjustment See Wooldridge 2010 Chapter 6 for further discussion In many cases the adjustments do not lead to important differences but it is hard to know that before hand unless r is small and insignificant We recently mentioned that x should be a strict subset of z This has two implications First any element that appears as an explanatory variable in 1746 should also be 
We recently mentioned that x should be a strict subset of z. This has two implications. First, any element that appears as an explanatory variable in (17.46) should also be an explanatory variable in the selection equation. Although in rare cases it makes sense to exclude elements from the selection equation, including all elements of x in z is not very costly; excluding them can lead to inconsistency if they are incorrectly excluded.

A second major implication is that we have at least one element of z that is not also in x. This means that we need a variable that affects selection but does not have a partial effect on y. This is not absolutely necessary to apply the procedure (in fact, we can mechanically carry out the two steps when z = x), but the results are usually less than convincing unless we have an exclusion restriction in (17.46). The reason for this is that, while the inverse Mills ratio is a nonlinear function of z, it is often well approximated by a linear function. If z = x, $\hat{\lambda}_i$ can be highly correlated with the elements of $\mathbf{x}_i$. As we know, such multicollinearity can lead to very high standard errors for the $\hat{\beta}_j$. Intuitively, if we do not have a variable that affects selection but not y, it is extremely difficult, if not impossible, to distinguish sample selection from a misspecified functional form in (17.46).

Example 17.5 Wage Offer Equation for Married Women

We apply the sample selection correction to the data on married women in MROZ. Recall that, of the 753 women in the sample, 428 worked for a wage during the year. The wage offer equation is standard, with log(wage) as the dependent variable and educ, exper, and $exper^2$ as the explanatory variables. In order to test and correct for sample selection bias (due to unobservability of the wage offer for nonworking women), we need to estimate a probit model for labor force participation. In addition to the education and experience variables, we include the factors in Table 17.1: other income, age, number of young children, and number of older children. The fact that these four variables are excluded from the wage offer equation is an assumption: we assume that, given the productivity factors, nwifeinc, age, kidslt6, and kidsge6 have no effect on the wage offer. It is clear from the probit results in Table 17.1 that at least age and kidslt6 have a strong effect on labor force participation.

Table 17.7 contains the results from OLS and Heckit. [The standard errors reported for the Heckit results are just the usual OLS standard errors from regression (17.50).] There is no evidence of a sample selection problem in estimating the wage offer equation. The coefficient on $\hat{\lambda}$ has a very small t statistic (.239), so we fail to reject $H_0\colon \rho = 0$. Just as importantly, there are no practically large differences in the estimated slope coefficients in Table 17.7. The estimated returns to education differ by only one-tenth of a percentage point.

Table 17.7: Wage Offer Equation for Married Women
Dependent variable: log(wage) (standard errors in parentheses)

Independent variable    OLS                   Heckit
educ                       .108    (.014)        .109    (.016)
exper                      .042    (.012)        .044    (.016)
exper^2                   -.00081  (.00039)     -.00086  (.00044)
constant                  -.522    (.199)       -.578    (.307)
λ̂                           –                    .032    (.134)
Sample size                428                   428
R-squared                  .157                  .157
An alternative to the preceding two-step estimation method is full maximum likelihood estimation. This is more complicated, as it requires obtaining the joint distribution of y and s. It often makes sense to test for sample selection using the previous procedure: if there is no evidence of sample selection, there is no reason to continue. If we detect sample selection bias, we can either use the two-step estimates or estimate the regression and selection equations jointly by MLE. [See Wooldridge (2010, Chapter 19).]

In Example 17.5, we know more than just whether a woman worked during the year: we know how many hours each woman worked. It turns out that we can use this information in an alternative sample selection procedure. In place of the inverse Mills ratio, $\hat{\lambda}_i$, we use the Tobit residuals, say $\hat{v}_i$, which are computed as $\hat{v}_i = y_i - \mathbf{x}_i\hat{\boldsymbol\beta}$ whenever $y_i > 0$. It can be shown that the regression in (17.50), with $\hat{v}_i$ in place of $\hat{\lambda}_i$, also produces consistent estimates of the $\beta_j$, and the standard t statistic on $\hat{v}_i$ is a valid test for sample selection bias. This approach has the advantage of using more information, but it is less widely applicable. [See Wooldridge (2010, Chapter 19).]

There are many more topics concerning sample selection. One worth mentioning is models with endogenous explanatory variables in addition to possible sample selection bias. Write a model with a single endogenous explanatory variable as

$y_1 = \alpha_1 y_2 + \mathbf{z}_1\boldsymbol\beta_1 + u_1,$    (17.51)

where $y_1$ is only observed when s = 1, and $y_2$ may only be observed along with $y_1$. An example is when $y_1$ is the percentage of votes received by an incumbent and $y_2$ is the percentage of total expenditures accounted for by the incumbent. For incumbents who do not run, we cannot observe $y_1$ or $y_2$. If we have exogenous factors that affect the decision to run and that are correlated with campaign expenditures, we can consistently estimate $\alpha_1$ and the elements of $\boldsymbol\beta_1$ by instrumental variables. To be convincing, we need two exogenous variables that do not appear in (17.51). Effectively, one should affect the selection decision, and one should be correlated with $y_2$ [the usual requirement for estimating (17.51) by 2SLS]. Briefly, the method is to estimate the selection equation by probit, where all exogenous variables appear in the probit equation. Then, we add the inverse Mills ratio to (17.51) and estimate the equation by 2SLS. The inverse Mills ratio acts as its own instrument, as it depends only on exogenous variables. We use all exogenous variables as the other instruments. As before, we can use the t statistic on $\hat{\lambda}_i$ as a test for selection bias. [See Wooldridge (2010, Chapter 19) for further information.]

Summary

In this chapter, we have covered several advanced methods that are often used in applications, especially in microeconomics. Logit and probit models are used for binary response variables. These models have some advantages over the linear probability model: fitted probabilities are between zero and one, and the partial effects diminish. The primary cost to logit and probit is that they are harder to interpret.

The Tobit model is applicable to nonnegative outcomes that pile up at zero but also take on a broad range of positive values. Many individual choice variables, such as labor supply, amount of life insurance, and amount of pension fund invested in stocks, have this feature.
Summary

In this chapter, we have covered several advanced methods that are often used in applications, especially in microeconomics. Logit and probit models are used for binary response variables. These models have some advantages over the linear probability model: fitted probabilities are between zero and one, and the partial effects diminish. The primary cost to logit and probit is that they are harder to interpret.

The Tobit model is applicable to nonnegative outcomes that pile up at zero but also take on a broad range of positive values. Many individual choice variables, such as labor supply, amount of life insurance, and amount of pension fund invested in stocks, have this feature. As with logit and probit, the expected values of $y$ given $\mathbf{x}$, either conditional on $y > 0$ or unconditionally, depend on $\mathbf{x}$ and $\boldsymbol{\beta}$ in nonlinear ways. We gave the expressions for these expectations as well as formulas for the partial effects of each $x_j$ on the expectations. These can be estimated after the Tobit model has been estimated by maximum likelihood.

When the dependent variable is a count variable, that is, one that takes on nonnegative integer values, a Poisson regression model is appropriate. The expected value of $y$ given the $x_j$ has an exponential form. This gives the parameter interpretations as semi-elasticities or elasticities, depending on whether $x_j$ is in level or logarithmic form. In short, we can interpret the parameters as if they are in a linear model with log($y$) as the dependent variable. The parameters can be estimated by MLE. However, because the Poisson distribution imposes equality of the variance and mean, it is often necessary to compute standard errors and test statistics that allow for over- or underdispersion. These are simple adjustments to the usual MLE standard errors and statistics.

Censored and truncated regression models handle specific kinds of missing data problems. In censored regression, the dependent variable is censored above or below a threshold. We can use information on the censored outcomes because we always observe the explanatory variables, as in duration applications or top coding of observations. A truncated regression model arises when a part of the population is excluded entirely: we observe no information on units that are not covered by the sampling scheme. This is a special case of a sample selection problem.

Section 17.5 gave a systematic treatment of nonrandom sample selection. We showed that exogenous sample selection does not affect consistency of OLS when it is applied to the subsample, but endogenous sample selection does. We showed how to test and correct for sample selection bias for the general problem of incidental truncation, where observations are missing on $y$ due to the outcome of another variable (such as labor force participation). Heckman's method is relatively easy to implement in these situations.

Key Terms

Average Marginal Effect (AME); Average Partial Effect (APE); Binary Response Models; Censored Normal Regression Model; Censored Regression Model; Corner Solution Response; Count Variable; Duration Analysis; Exogenous Sample Selection; Heckit Method; Incidental Truncation; Inverse Mills Ratio; Latent Variable Model; Likelihood Ratio Statistic; Limited Dependent Variable (LDV); Logit Model; Log-Likelihood Function; Maximum Likelihood Estimation (MLE); Nonrandom Sample Selection; Overdispersion; Partial Effect at the Average (PEA); Percent Correctly Predicted; Poisson Distribution; Poisson Regression Model; Probit Model; Pseudo R-Squared; Quasi-Likelihood Ratio Statistic; Quasi-Maximum Likelihood Estimation (QMLE); Response Probability; Selected Sample; Tobit Model; Top Coding; Truncated Normal Regression Model; Truncated Regression Model; Wald Statistic
Problems

1 (i) For a binary response $y$, let $\bar{y}$ be the proportion of ones in the sample (which is equal to the sample average of the $y_i$). Let $\hat{q}_0$ be the percent correctly predicted for the outcome $y = 0$ and let $\hat{q}_1$ be the percent correctly predicted for the outcome $y = 1$. If $\hat{p}$ is the overall percent correctly predicted, show that $\hat{p}$ is a weighted average of $\hat{q}_0$ and $\hat{q}_1$:

$$\hat{p} = (1 - \bar{y})\hat{q}_0 + \bar{y}\hat{q}_1.$$

(ii) In a sample of 300, suppose that $\bar{y} = .70$, so that there are 210 outcomes with $y_i = 1$ and 90 with $y_i = 0$. Suppose that the percent correctly predicted when $y = 0$ is 80, and the percent correctly predicted when $y = 1$ is 40. Find the overall percent correctly predicted.

2 Let grad be a dummy variable for whether a student-athlete at a large university graduates in five years. Let hsGPA and SAT be high school grade point average and SAT score, respectively. Let study be the number of hours spent per week in an organized study hall. Suppose that, using data on 420 student-athletes, the following logit model is obtained:

$$\widehat{P}(grad = 1 \mid hsGPA, SAT, study) = \Lambda(-1.17 + .24\, hsGPA + .00058\, SAT + .073\, study),$$

where $\Lambda(z) = \exp(z)/[1 + \exp(z)]$ is the logit function. Holding hsGPA fixed at 3.0 and SAT fixed at 1,200, compute the estimated difference in the graduation probability for someone who spent 10 hours per week in study hall and someone who spent 5 hours per week.

3 (Requires calculus)
(i) Suppose in the Tobit model that $x_1 = \log(z_1)$, and this is the only place $z_1$ appears in $\mathbf{x}$. Show that

$$\partial E(y \mid y > 0, \mathbf{x})/\partial z_1 = (\beta_1/z_1)\{1 - \lambda(\mathbf{x}\boldsymbol{\beta}/\sigma)[\mathbf{x}\boldsymbol{\beta}/\sigma + \lambda(\mathbf{x}\boldsymbol{\beta}/\sigma)]\}, \tag{17.52}$$

where $\beta_1$ is the coefficient on $\log(z_1)$.
(ii) If $x_1 = z_1$ and $x_2 = z_1^2$, show that

$$\partial E(y \mid y > 0, \mathbf{x})/\partial z_1 = (\beta_1 + 2\beta_2 z_1)\{1 - \lambda(\mathbf{x}\boldsymbol{\beta}/\sigma)[\mathbf{x}\boldsymbol{\beta}/\sigma + \lambda(\mathbf{x}\boldsymbol{\beta}/\sigma)]\},$$

where $\beta_1$ is the coefficient on $z_1$ and $\beta_2$ is the coefficient on $z_1^2$.

4 Let $mvp_i$ be the marginal value product for worker $i$, which is the price of a firm's good multiplied by the marginal product of the worker. Assume that

$$\log(mvp_i) = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_k x_{ik} + u_i$$
$$wage_i = \max(mvp_i, minwage_i),$$

where the explanatory variables include education, experience, and so on, and $minwage_i$ is the minimum wage relevant for person $i$. Write $\log(wage_i)$ in terms of $\log(mvp_i)$ and $\log(minwage_i)$.

5 (Requires calculus) Let patents be the number of patents applied for by a firm during a given year. Assume that the conditional expectation of patents given sales and RD is

$$E(patents \mid sales, RD) = \exp[\beta_0 + \beta_1\log(sales) + \beta_2 RD + \beta_3 RD^2],$$

where sales is annual firm sales and RD is total spending on research and development over the past 10 years.
(i) How would you estimate the $\beta_j$? Justify your answer by discussing the nature of patents.
(ii) How do you interpret $\beta_1$?
(iii) Find the partial effect of RD on $E(patents \mid sales, RD)$.
6 Consider a family saving function for the population of all families in the United States:

$$sav = \beta_0 + \beta_1 inc + \beta_2 hhsize + \beta_3 educ + \beta_4 age + u,$$

where hhsize is household size, educ is years of education of the household head, and age is age of the household head. Assume that $E(u \mid inc, hhsize, educ, age) = 0$.
(i) Suppose that the sample includes only families whose head is over 25 years old. If we use OLS on such a sample, do we get unbiased estimators of the $\beta_j$? Explain.
(ii) Now, suppose our sample includes only married couples without children. Can we estimate all of the parameters in the saving equation? Which ones can we estimate?
(iii) Suppose we exclude from our sample families that save more than $25,000 per year. Does OLS produce consistent estimators of the $\beta_j$?

7 Suppose you are hired by a university to study the factors that determine whether students admitted to the university actually come to the university. You are given a large random sample of students who were admitted the previous year. You have information on whether each student chose to attend, high school performance, family income, financial aid offered, race, and geographic variables. Someone says to you, "Any analysis of that data will lead to biased results because it is not a random sample of all college applicants, but only those who apply to this university." What do you think of this criticism?

Computer Exercises

C1 Use the data in PNTSPRD for this exercise.
(i) The variable favwin is a binary variable equal to one if the team favored by the Las Vegas point spread wins. A linear probability model to estimate the probability that the favored team wins is

$$P(favwin = 1 \mid spread) = \beta_0 + \beta_1 spread.$$

Explain why, if the spread incorporates all relevant information, we expect $\beta_0 = .5$.
(ii) Estimate the model from part (i) by OLS. Test $H_0\colon \beta_0 = .5$ against a two-sided alternative. Use both the usual and heteroskedasticity-robust standard errors.
(iii) Is spread statistically significant? What is the estimated probability that the favored team wins when spread = 10?
(iv) Now, estimate a probit model for $P(favwin = 1 \mid spread)$. Interpret and test the null hypothesis that the intercept is zero. (Hint: Remember that $\Phi(0) = .5$.)
(v) Use the probit model to estimate the probability that the favored team wins when spread = 10. Compare this with the LPM estimate from part (iii).
(vi) Add the variables favhome, fav25, and und25 to the probit model and test joint significance of these variables using the likelihood ratio test. (How many df are in the chi-square distribution?) Interpret this result, focusing on the question of whether the spread incorporates all observable information prior to a game.

C2 Use the data in LOANAPP for this exercise; see also Computer Exercise C8 in Chapter 7.
(i) Estimate a probit model of approve on white. Find the estimated probability of loan approval for both whites and nonwhites. How do these compare with the linear probability estimates?
(ii) Now, add the variables hrat, obrat, loanprc, unem, male, married, dep, sch, cosign, chist, pubrec, mortlat1, mortlat2, and vr to the probit model. Is there statistically significant evidence of discrimination against nonwhites?
(iii) Estimate the model from part (ii) by logit. Compare the coefficient on white to the probit estimate.
(iv) Use equation (17.17) to estimate the sizes of the discrimination effects for probit and logit.
C3 Use the data in FRINGE for this exercise.
(i) For what percentage of the workers in the sample is pension equal to zero? What is the range of pension for workers with nonzero pension benefits? Why is a Tobit model appropriate for modeling pension?
(ii) Estimate a Tobit model explaining pension in terms of exper, age, tenure, educ, depends, married, white, and male. Do whites and males have statistically significant higher expected pension benefits?
(iii) Use the results from part (ii) to estimate the difference in expected pension benefits for a white male and a nonwhite female, both of whom are 35 years old, are single with no dependents, have 16 years of education, and have 10 years of experience.
(iv) Add union to the Tobit model and comment on its significance.
(v) Apply the Tobit model from part (iv) but with peratio, the pension-earnings ratio, as the dependent variable. (Notice that this is a fraction between zero and one, but, though it often takes on the value zero, it never gets close to being unity. Thus, a Tobit model is fine as an approximation.) Does gender or race have an effect on the pension-earnings ratio?

C4 In Example 9.1, we added the quadratic terms $pcnv^2$, $ptime86^2$, and $inc86^2$ to a linear model for narr86.
(i) Use the data in CRIME1 to add these same terms to the Poisson regression in Example 17.3.
(ii) Compute the estimate of $\sigma^2$ given by $\hat{\sigma}^2 = (n - k - 1)^{-1}\sum_{i=1}^{n} \hat{u}_i^2/\hat{y}_i$. Is there evidence of overdispersion? How should the Poisson MLE standard errors be adjusted?
(iii) Use the results from parts (i) and (ii) and Table 17.5 to compute the quasi-likelihood ratio statistic for joint significance of the three quadratic terms. What do you conclude?

C5 Refer to Table 13.1 in Chapter 13. There, we used the data in FERTIL1 to estimate a linear model for kids, the number of children ever born to a woman.
(i) Estimate a Poisson regression model for kids, using the same variables in Table 13.1. Interpret the coefficient on y82.
(ii) What is the estimated percentage difference in fertility between a black woman and a nonblack woman, holding other factors fixed?
(iii) Obtain $\hat{\sigma}$. Is there evidence of over- or underdispersion?
(iv) Compute the fitted values from the Poisson regression and obtain the R-squared as the squared correlation between $kids_i$ and $\widehat{kids}_i$. Compare this with the R-squared for the linear regression model.

C6 Use the data in RECID to estimate the model from Example 17.4 by OLS, using only the 552 uncensored durations. Comment generally on how these estimates compare with those in Table 17.6.
C7 Use the MROZ data for this exercise.
(i) Using the 428 women who were in the workforce, estimate the return to education by OLS including exper, $exper^2$, nwifeinc, age, kidslt6, and kidsge6 as explanatory variables. Report your estimate on educ and its standard error.
(ii) Now, estimate the return to education by Heckit, where all exogenous variables show up in the second-stage regression. In other words, the regression is log(wage) on educ, exper, $exper^2$, nwifeinc, age, kidslt6, kidsge6, and $\hat{\lambda}$. Compare the estimated return to education and its standard error to that from part (i).
(iii) Using only the 428 observations for working women, regress $\hat{\lambda}$ on educ, exper, $exper^2$, nwifeinc, age, kidslt6, and kidsge6. How big is the R-squared? How does this help explain your findings from part (ii)? (Hint: Think multicollinearity.)

C8 The file JTRAIN2 contains data on a job training experiment for a group of men. Men could enter the program starting in January 1976 through about mid-1977. The program ended in December 1977. The idea is to test whether participation in the job training program had an effect on unemployment probabilities and earnings in 1978.
(i) The variable train is the job training indicator. How many men in the sample participated in the job training program? What was the highest number of months a man actually participated in the program?
(ii) Run a linear regression of train on several demographic and pretraining variables: unem74, unem75, age, educ, black, hisp, and married. Are these variables jointly significant at the 5% level?
(iii) Estimate a probit version of the linear model in part (ii). Compute the likelihood ratio test for joint significance of all variables. What do you conclude?
(iv) Based on your answers to parts (ii) and (iii), does it appear that participation in job training can be treated as exogenous for explaining 1978 unemployment status? Explain.
(v) Run a simple regression of unem78 on train and report the results in equation form. What is the estimated effect of participating in the job training program on the probability of being unemployed in 1978? Is it statistically significant?
(vi) Run a probit of unem78 on train. Does it make sense to compare the probit coefficient on train with the coefficient obtained from the linear model in part (v)?
(vii) Find the fitted probabilities from parts (v) and (vi). Explain why they are identical. Which approach would you use to measure the effect and statistical significance of the job training program?
(viii) Add all of the variables from part (ii) as additional controls to the models from parts (v) and (vi). Are the fitted probabilities now identical? What is the correlation between them?
(ix) Using the model from part (viii), estimate the average partial effect of train on the 1978 unemployment probability. Use (17.17) with $c_k = 0$. How does the estimate compare with the OLS estimate from part (viii)?
C9 Use the data in APPLE for this exercise. These are telephone survey data attempting to elicit the demand for a (fictional) ecologically friendly apple. Each family was randomly presented with a set of prices for regular apples and the ecolabeled apples. They were asked how many pounds of each kind of apple they would buy.
(i) Of the 660 families in the sample, how many report wanting none of the ecolabeled apples at the set price?
(ii) Does the variable ecolbs seem to have a continuous distribution over strictly positive values? What implications does your answer have for the suitability of a Tobit model for ecolbs?
(iii) Estimate a Tobit model for ecolbs with ecoprc, regprc, faminc, and hhsize as explanatory variables. Which variables are significant at the 1% level?
(iv) Are faminc and hhsize jointly significant?
(v) Are the signs of the coefficients on the price variables from part (iii) what you expect? Explain.
(vi) Let $\beta_1$ be the coefficient on ecoprc and let $\beta_2$ be the coefficient on regprc. Test the hypothesis $H_0\colon -\beta_1 = \beta_2$ against the two-sided alternative. Report the p-value of the test. (You might want to refer to Section 4.4 if your regression package does not easily compute such tests.)
(vii) Obtain the estimates of $E(ecolbs \mid \mathbf{x})$ for all observations in the sample. (See equation (17.25).) Call these $\widehat{ecolbs}_i$. What are the smallest and largest fitted values?
(viii) Compute the squared correlation between $ecolbs_i$ and $\widehat{ecolbs}_i$.
(ix) Now, estimate a linear model for ecolbs, using the same explanatory variables from part (iii). Why are the OLS estimates so much smaller than the Tobit estimates? In terms of goodness-of-fit, is the Tobit model better than the linear model?
(x) Evaluate the following statement: "Because the R-squared from the Tobit model is so small, the estimated price effects are probably inconsistent."

C10 Use the data in SMOKE for this exercise.
(i) The variable cigs is the number of cigarettes smoked per day. How many people in the sample do not smoke at all? What fraction of people claim to smoke 20 cigarettes a day? Why do you think there is a pileup of people at 20 cigarettes?
(ii) Given your answers to part (i), does cigs seem a good candidate for having a conditional Poisson distribution?
(iii) Estimate a Poisson regression model for cigs, including log(cigpric), log(income), white, educ, age, and $age^2$ as explanatory variables. What are the estimated price and income elasticities?
(iv) Using the maximum likelihood standard errors, are the price and income variables statistically significant at the 5% level?
(v) Obtain the estimate of $\sigma^2$ described after equation (17.35). What is $\hat{\sigma}$? How should you adjust the standard errors from part (iv)?
(vi) Using the adjusted standard errors from part (v), are the price and income elasticities now statistically different from zero? Explain.
(vii) Are the education and age variables significant using the more robust standard errors? How do you interpret the coefficient on educ?
(viii) Obtain the fitted values, $\hat{y}_i$, from the Poisson regression model. Find the minimum and maximum values and discuss how well the exponential model predicts heavy cigarette smoking.
(ix) Using the fitted values from part (viii), obtain the squared correlation coefficient between $\hat{y}_i$ and $y_i$.
(x) Estimate a linear model for cigs by OLS, using the explanatory variables (and same functional forms) as in part (iii). Does the linear model or exponential model provide a better fit? Is either R-squared very large?
C11 Use the data in CPS91 for this exercise. These data are for married women, where we also have information on each husband's income and demographics.
(i) What fraction of the women report being in the labor force?
(ii) Using only the data for working women (you have no choice), estimate the wage equation

$$\log(wage) = \beta_0 + \beta_1 educ + \beta_2 exper + \beta_3 exper^2 + \beta_4 black + \beta_5 hispanic + u$$

by ordinary least squares. Report the results in the usual form. Do there appear to be significant wage differences by race and ethnicity?
(iii) Estimate a probit model for inlf that includes the explanatory variables in the wage equation from part (ii) as well as nwifeinc and kidlt6. Do these last two variables have coefficients of the expected sign? Are they statistically significant?
(iv) Explain why, for the purposes of testing and, possibly, correcting the wage equation for selection into the workforce, it is important for nwifeinc and kidlt6 to help explain inlf. What must you assume about nwifeinc and kidlt6 in the wage equation?
(v) Compute the inverse Mills ratio (for each observation) and add it as an additional regressor to the wage equation from part (ii). What is its two-sided p-value? Do you think this is particularly small with 3,286 observations?
(vi) Does adding the inverse Mills ratio change the coefficients in the wage regression in important ways? Explain.

C12 Use the data in CHARITY to answer these questions.
(i) The variable respond is a binary variable equal to one if an individual responded with a donation to the most recent request. The database consists only of people who have responded at least once in the past. What fraction of people responded most recently?
(ii) Estimate a probit model for respond, using resplast, weekslast, propresp, mailsyear, and avggift as explanatory variables. Which of the explanatory variables is statistically significant?
(iii) Find the average partial effect for mailsyear and compare it with the coefficient from a linear probability model.
(iv) Using the same explanatory variables, estimate a Tobit model for gift, the amount of the most recent gift (in Dutch guilders). Now which explanatory variable is statistically significant?
(v) Compare the Tobit APE for mailsyear with that from a linear regression. Are they similar?
(vi) Are the estimates from parts (ii) and (iv) entirely compatible with a Tobit model? Explain.

C13 Use the data in HTV to answer this question.
(i) Using OLS on the full sample, estimate a model for log(wage) using explanatory variables educ, abil, exper, nc, west, south, and urban. Report the estimated return to education and its standard error.
(ii) Now estimate the equation from part (i) using only people with educ < 16. What percentage of the sample is lost? Now what is the estimated return to a year of schooling? How does it compare with part (i)?
(iii) Now drop all observations with wage ≥ 20, so that everyone remaining in the sample earns less than $20 an hour. Run the regression from part (i) and comment on the coefficient on educ. (Because the normal truncated regression model assumes that $y$ is continuous, it does not matter in theory whether we drop observations with wage ≥ 20 or wage > 20. In practice, including in this application, it can matter slightly because there are some people who earn exactly $20 per hour.)
(iv) Using the sample in part (iii), apply truncated regression (with the upper truncation point being log(20)). Does truncated regression appear to recover the return to education in the full population, assuming the estimate from (i) is consistent? Explain.

C14 Use the data in HAPPINESS for this question. See also Computer Exercise C15 in Chapter 13.
(i) Estimate a probit probability model relating vhappy to occattend and regattend, and include a full set of year dummies. Find the average partial effects for occattend and regattend. How do these compare with those from estimating a linear probability model?
(ii) Define a variable, highinc, equal to one if family income is above $25,000. Add highinc, unem10, educ, and teens to the probit estimation in part (i). Is the APE of regattend affected much? What about its statistical significance?
(iii) Discuss the APEs and statistical significance of the four new variables in part (ii). Do the estimates make sense?
(iv) Controlling for the factors in part (ii), do there appear to be differences in happiness by gender or race? Justify your answer.
C15 Use the data set in ALCOHOL, obtained from Terza (2002), to answer this question. The data, on 9,822 men, include labor market information, whether the man abuses alcohol, and demographic and background variables. In this question, you will study the effects of alcohol abuse on employ, which is a binary variable equal to one if the man has a job. If employ = 0, the man is either unemployed or not in the workforce.
(i) What fraction of the sample is employed at the time of the interview? What fraction of the sample has abused alcohol?
(ii) Run the simple regression of employ on abuse and report the results in the usual form, obtaining the heteroskedasticity-robust standard errors. Interpret the estimated equation. Is the relationship as you expected? Is it statistically significant?
(iii) Run a probit of employ on abuse. Do you get the same sign and statistical significance as in part (ii)? How does the average partial effect for the probit compare with that for the linear probability model?
(iv) Obtain the fitted values for the LPM estimated in part (ii) and report what they are when abuse = 0 and when abuse = 1. How do these compare to the probit fitted values, and why?
(v) To the LPM in part (ii), add the variables age, agesq, educ, educsq, married, famsize, white, northeast, midwest, south, centcity, outercity, qrt1, qrt2, and qrt3. What happens to the coefficient on abuse and its statistical significance?
(vi) Estimate a probit model using the variables in part (v). Find the APE of abuse and its t statistic. Is the estimated effect now identical to that for the linear model? Is it close?
(vii) Variables indicating the overall health of each man are also included in the data set. Is it obvious that such variables should be included as controls? Explain.
(viii) Why might abuse be properly thought of as endogenous in the employ equation? Do you think the variables mothalc and fathalc, indicating whether a man's mother or father were alcoholics, are sensible instrumental variables for abuse?
(ix) Estimate the LPM underlying part (v) by 2SLS, where mothalc and fathalc act as IVs for abuse. Is the difference between the 2SLS and OLS coefficients practically large?
(x) Use the test described in Section 15.5 to test whether abuse is endogenous in the LPM.
C16 Use the data in CRIME1 to answer this question.
(i) For the OLS estimates reported in Table 17.5, find the heteroskedasticity-robust standard errors. In terms of statistical significance of the coefficients, are there any notable changes?
(ii) Obtain the fully robust standard errors (that is, those that do not even require assumption (17.35)) for the Poisson regression estimates in the second column. This requires that you have a statistical package that computes the fully robust standard errors. Compare the fully robust 95% confidence interval for $\beta_{pcnv}$ with that obtained using the standard error in Table 17.5.
(iii) Compute the average partial effects for each variable in the Poisson regression model. Use the formula for binary explanatory variables for black, hispan, and born60. Compare the APEs for qemp86 and inc86 with the corresponding OLS coefficients.
(iv) If your statistical package reports the robust standard errors for the APEs in part (iii), compare the robust t statistic for the OLS estimate of $\beta_{pcnv}$ with the robust t statistic for the APE of pcnv in the Poisson regression.

Appendix 17A

17A.1 Maximum Likelihood Estimation with Explanatory Variables

Appendix C provides a review of maximum likelihood estimation (MLE) in the simplest case of estimating the parameters in an unconditional distribution. But most models in econometrics have explanatory variables, whether we estimate those models by OLS or MLE. The latter is indispensable for nonlinear models, and here we provide a very brief description of the general approach.

All of the models covered in this chapter can be put in the following form. Let $f(y \mid \mathbf{x}; \boldsymbol{\beta})$ denote the density function for a random draw $y_i$ from the population, conditional on $\mathbf{x}_i = \mathbf{x}$. The maximum likelihood estimator (MLE) of $\boldsymbol{\beta}$ maximizes the log-likelihood function,

$$\max_{\mathbf{b}} \sum_{i=1}^{n} \log f(y_i \mid \mathbf{x}_i; \mathbf{b}), \tag{17.53}$$

where the vector $\mathbf{b}$ is the dummy argument in the maximization problem. In most cases, the MLE, which we write as $\hat{\boldsymbol{\beta}}$, is consistent and has an approximate normal distribution in large samples. This is true even though we cannot write down a formula for $\hat{\boldsymbol{\beta}}$ except in very special circumstances.

For the binary response case (logit and probit), the conditional density is determined by two values: $f(1 \mid \mathbf{x}; \boldsymbol{\beta}) = P(y_i = 1 \mid \mathbf{x}_i) = G(\mathbf{x}_i\boldsymbol{\beta})$ and $f(0 \mid \mathbf{x}; \boldsymbol{\beta}) = P(y_i = 0 \mid \mathbf{x}_i) = 1 - G(\mathbf{x}_i\boldsymbol{\beta})$. In fact, a succinct way to write the density is $f(y \mid \mathbf{x}; \boldsymbol{\beta}) = [1 - G(\mathbf{x}\boldsymbol{\beta})]^{(1-y)}[G(\mathbf{x}\boldsymbol{\beta})]^{y}$ for $y = 0, 1$. Thus, we can write (17.53) as

$$\max_{\mathbf{b}} \sum_{i=1}^{n} \{(1 - y_i)\log[1 - G(\mathbf{x}_i\mathbf{b})] + y_i\log[G(\mathbf{x}_i\mathbf{b})]\}. \tag{17.54}$$

Generally, the solutions to (17.54) are quickly found by modern computers using iterative methods to maximize a function. The total computation time, even for fairly large data sets, is typically quite low.

The log-likelihood functions for the Tobit model and for censored and truncated regression are only slightly more complicated, depending on an additional variance parameter in addition to $\boldsymbol{\beta}$. They are easily derived from the densities obtained in the text. See Wooldridge (2010) for details.
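To make (17.54) concrete, the following sketch maximizes the probit log-likelihood numerically with scipy. It is an illustration of the general approach, not production code; in practice one would use a canned routine (such as statsmodels' Probit), which also computes standard errors.

```python
# Probit MLE by direct maximization of the log-likelihood (17.54),
# with G = standard normal cdf. X is n x k (including a constant) and
# y is an n-vector of zeros and ones; both are assumed numpy arrays.
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

def neg_loglik(b, y, X):
    # Clip probabilities away from 0 and 1 for numerical stability.
    G = np.clip(norm.cdf(X @ b), 1e-12, 1 - 1e-12)
    return -np.sum((1 - y) * np.log(1 - G) + y * np.log(G))

def probit_mle(y, X):
    b0 = np.zeros(X.shape[1])                    # starting values
    res = minimize(neg_loglik, b0, args=(y, X), method="BFGS")
    return res.x                                 # beta_hat
```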
Appendix 17B

17B.1 Asymptotic Standard Errors in Limited Dependent Variable Models

Derivations of the asymptotic standard errors for the models and methods introduced in this chapter are well beyond the scope of this text. Not only do the derivations require matrix algebra, but they also require advanced asymptotic theory of nonlinear estimation. The background needed for a careful analysis of these methods, and several derivations, are given in Wooldridge (2010).

It is instructive to see the formulas for obtaining the asymptotic standard errors for at least some of the methods. Given the binary response model $P(y = 1 \mid \mathbf{x}) = G(\mathbf{x}\boldsymbol{\beta})$, where $G(\cdot)$ is the logit or probit function and $\boldsymbol{\beta}$ is the $k \times 1$ vector of parameters, the asymptotic variance matrix of $\hat{\boldsymbol{\beta}}$ is estimated as

$$\widehat{\text{Avar}}(\hat{\boldsymbol{\beta}}) \equiv \left( \sum_{i=1}^{n} \frac{[g(\mathbf{x}_i\hat{\boldsymbol{\beta}})]^2\, \mathbf{x}_i'\mathbf{x}_i}{G(\mathbf{x}_i\hat{\boldsymbol{\beta}})[1 - G(\mathbf{x}_i\hat{\boldsymbol{\beta}})]} \right)^{-1}, \tag{17.55}$$

which is a $k \times k$ matrix. (See Appendix D for a summary of matrix algebra.) Without the terms involving $g(\cdot)$ and $G(\cdot)$, this formula looks a lot like the estimated variance matrix for the OLS estimator, minus the term $\hat{\sigma}^2$. The expression in (17.55) accounts for the nonlinear nature of the response probability, that is, the nonlinear nature of $G(\cdot)$, as well as the particular form of heteroskedasticity in a binary response model: $\text{Var}(y \mid \mathbf{x}) = G(\mathbf{x}\boldsymbol{\beta})[1 - G(\mathbf{x}\boldsymbol{\beta})]$.

The square roots of the diagonal elements of (17.55) are the asymptotic standard errors of the $\hat{\beta}_j$, and they are routinely reported by econometrics software that supports logit and probit analysis. Once we have these, (asymptotic) t statistics and confidence intervals are obtained in the usual ways. The matrix in (17.55) is also the basis for Wald tests of multiple restrictions on $\boldsymbol{\beta}$ (see Wooldridge (2010, Chapter 15)).

The asymptotic variance matrix for Tobit is more complicated but has a similar structure. Note that we can obtain a standard error for $\hat{\sigma}$ as well.
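As a sanity check on (17.55), the matrix is easy to form directly. Here is a minimal numpy sketch for the probit case, assuming beta_hat has already been estimated (for example, by the routine above); it is meant only to mirror the formula, not to replace a package's output.

```python
# Estimated asymptotic variance matrix (17.55) for probit, where
# g = standard normal pdf and G = standard normal cdf.
import numpy as np
from scipy.stats import norm

def probit_avar(beta_hat, X):
    xb = X @ beta_hat
    g, G = norm.pdf(xb), norm.cdf(xb)
    w = g**2 / (G * (1 - G))               # observation weights
    A = (X * w[:, None]).T @ X             # sum_i w_i * x_i' x_i
    return np.linalg.inv(A)                # k x k matrix in (17.55)

# Asymptotic standard errors are the square roots of the diagonal:
# se = np.sqrt(np.diag(probit_avar(beta_hat, X)))
```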
The asymptotic variance for Poisson regression, allowing for $\sigma^2 \neq 1$ in (17.35), has a form much like (17.55):

$$\widehat{\text{Avar}}(\hat{\boldsymbol{\beta}}) = \hat{\sigma}^2 \left( \sum_{i=1}^{n} \exp(\mathbf{x}_i\hat{\boldsymbol{\beta}})\, \mathbf{x}_i'\mathbf{x}_i \right)^{-1}. \tag{17.56}$$

The square roots of the diagonal elements of this matrix are the asymptotic standard errors. If the Poisson assumption holds, we can drop $\hat{\sigma}^2$ from the formula (because $\sigma^2 = 1$). The formula for the fully robust variance matrix estimator is obtained in Wooldridge (2010, Chapter 18):

$$\widehat{\text{Avar}}(\hat{\boldsymbol{\beta}}) = \left[ \sum_{i=1}^{n} \exp(\mathbf{x}_i\hat{\boldsymbol{\beta}})\, \mathbf{x}_i'\mathbf{x}_i \right]^{-1} \left( \sum_{i=1}^{n} \hat{u}_i^2\, \mathbf{x}_i'\mathbf{x}_i \right) \left[ \sum_{i=1}^{n} \exp(\mathbf{x}_i\hat{\boldsymbol{\beta}})\, \mathbf{x}_i'\mathbf{x}_i \right]^{-1},$$

where $\hat{u}_i = y_i - \exp(\mathbf{x}_i\hat{\boldsymbol{\beta}})$ are the residuals from the Poisson regression. This expression has a structure similar to the heteroskedasticity-robust variance matrix estimator for OLS, and it is computed routinely by many software packages to obtain the fully robust standard errors.

Asymptotic standard errors for censored regression, truncated regression, and the Heckit sample selection correction are more complicated, although they share features with the previous formulas. See Wooldridge (2010) for details.
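The overdispersion adjustment in (17.56) is equally mechanical. The sketch below, our own illustration assuming a numpy design matrix X, count vector y, and Poisson estimate beta_hat, computes $\hat{\sigma}^2$ as described in the text (the estimator used in, for example, Computer Exercise C4(ii)) and scales the usual Poisson MLE standard errors by $\hat{\sigma}$.

```python
# Overdispersion-adjusted Poisson standard errors, following (17.56):
# sigma2_hat = (n - k - 1)^{-1} * sum_i u_i^2 / yhat_i, and the usual
# MLE standard errors are multiplied by sigma_hat.
import numpy as np

def adjusted_poisson_se(beta_hat, X, y):
    n, ncol = X.shape                      # ncol = k + 1 (incl. constant)
    yhat = np.exp(X @ beta_hat)
    u = y - yhat
    sigma2 = np.sum(u**2 / yhat) / (n - ncol)
    A_inv = np.linalg.inv((X * yhat[:, None]).T @ X)   # bracket in (17.56)
    return np.sqrt(sigma2 * np.diag(A_inv)), np.sqrt(sigma2)
```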
Chapter 18

Advanced Time Series Topics

In this chapter, we cover some more advanced topics in time series econometrics. In Chapters 10, 11, and 12, we emphasized in several places that using time series data in regression analysis requires some care due to the trending, persistent nature of many economic time series. In addition to studying topics such as infinite distributed lag models and forecasting, we also discuss some recent advances in analyzing time series processes with unit roots.

In Section 18.1, we describe infinite distributed lag models, which allow a change in an explanatory variable to affect all future values of the dependent variable. Conceptually, these models are straightforward extensions of the finite distributed lag models in Chapter 10, but estimating these models poses some interesting challenges.

In Section 18.2, we show how to formally test for unit roots in a time series process. Recall from Chapter 11 that we excluded unit root processes to apply the usual asymptotic theory. Because the presence of a unit root implies that a shock today has a long-lasting impact, determining whether a process has a unit root is of interest in its own right.

We cover the notion of spurious regression between two time series processes, each of which has a unit root, in Section 18.3. The main result is that even if two unit root series are independent, it is quite likely that the regression of one on the other will yield a statistically significant t statistic. This emphasizes the potentially serious consequences of using standard inference when the dependent and independent variables are integrated processes.

The notion of cointegration applies when two series are I(1), but a linear combination of them is I(0); in this case, the regression of one on the other is not spurious, but instead tells us something about the long-run relationship between them. Cointegration between two series also implies a particular kind of model, called an error correction model, for the short-term dynamics. We cover these models in Section 18.4.

In Section 18.5, we provide an overview of forecasting and bring together all of the tools in this and previous chapters to show how regression methods can be used to forecast future outcomes of a time series. The forecasting literature is vast, so we focus only on the most common regression-based methods. We also touch on the related topic of Granger causality.

18.1 Infinite Distributed Lag Models

Let $\{(y_t, z_t)\colon t = \ldots, -2, -1, 0, 1, 2, \ldots\}$ be a bivariate time series process (which is only partially observed). An infinite distributed lag (IDL) model relating $y_t$ to current and all past values of $z$ is

$$y_t = \alpha + \delta_0 z_t + \delta_1 z_{t-1} + \delta_2 z_{t-2} + \cdots + u_t, \tag{18.1}$$

where the sum on lagged $z$ extends back to the indefinite past. This model is only an approximation to reality, as no economic process started infinitely far into the past. Compared with a finite distributed lag model, an IDL model does not require that we truncate the lag at a particular value.

In order for model (18.1) to make sense, the lag coefficients, $\delta_j$, must tend to zero as $j \to \infty$. This is not to say that $\delta_2$ is smaller in magnitude than $\delta_1$; it only means that the impact of $z_{t-j}$ on $y_t$ must eventually become small as $j$ gets large. In most applications, this makes economic sense as well: the distant past of $z$ should be less important for explaining $y$ than the recent past of $z$.

Even if we decide that (18.1) is a useful model, we clearly cannot estimate it without some restrictions. For one, we only observe a finite history of data. Equation (18.1) involves an infinite number of parameters, $\delta_0, \delta_1, \delta_2, \ldots$, which cannot be estimated without restrictions. Later, we place restrictions on the $\delta_j$ that allow us to estimate (18.1).

As with finite distributed lag (FDL) models, the impact propensity in (18.1) is simply $\delta_0$ (see Chapter 10). Generally, the $\delta_h$ have the same interpretation as in an FDL. Suppose that $z_s = 0$ for all $s < 0$ and that $z_0 = 1$ and $z_s = 0$ for all $s \geq 1$; in other words, at time $t = 0$, $z$ increases temporarily by one unit and then reverts to its initial level of zero. For any $h \geq 0$, we have $y_h = \alpha + \delta_h + u_h$, and so

$$E(y_h) = \alpha + \delta_h, \tag{18.2}$$

where we use the standard assumption that $u_h$ has zero mean. It follows that $\delta_h$ is the change in $E(y_h)$, given a one-unit, temporary change in $z$ at time zero. We just said that $\delta_h$ must be tending to zero as $h$ gets large for the IDL to make sense. This means that a temporary change in $z$ has no long-run effect on expected $y$: $E(y_h) = \alpha + \delta_h \to \alpha$ as $h \to \infty$.

We assumed that the process $z$ starts at $z_s = 0$ and that the one-unit increase occurred at $t = 0$. These were only for the purpose of illustration. More generally, if $z$ temporarily increases by one unit (from any initial level) at time $t$, then $\delta_h$ measures the change in the expected value of $y$ after $h$ periods. The lag distribution, which is $\delta_h$ plotted as a function of $h$, shows the expected path that future outcomes on $y$ follow given the one-unit, temporary increase in $z$.

The long-run propensity in model (18.1) is the sum of all of the lag coefficients:

$$\text{LRP} = \delta_0 + \delta_1 + \delta_2 + \delta_3 + \cdots, \tag{18.3}$$

where we assume that the infinite sum is well defined. Because the $\delta_j$ must converge to zero, the LRP can often be well approximated by a finite sum of the form $\delta_0 + \delta_1 + \cdots + \delta_p$ for sufficiently large $p$. To interpret the LRP, suppose that the process $z_t$ is steady at $z_s = 0$ for $s < 0$. At $t = 0$, the process permanently increases by one unit. For example, if $z_t$ is the percentage change in the money supply and $y_t$ is the inflation rate, then we are interested in the effects of a permanent increase of one percentage point in money supply growth. Then, by substituting $z_s = 0$ for $s < 0$ and $z_t = 1$ for $t \geq 0$, we have $y_h = \alpha + \delta_0 + \delta_1 + \cdots + \delta_h + u_h$, where $h \geq 0$ is any horizon. Because $u_t$ has a zero mean for all $t$, we have

$$E(y_h) = \alpha + \delta_0 + \delta_1 + \cdots + \delta_h. \tag{18.4}$$

It is useful to compare (18.4) and (18.2). As the horizon increases, that is, as $h \to \infty$, the right-hand side of (18.4) is, by definition, the long-run propensity (plus $\alpha$). Thus, the LRP measures the long-run change in the expected value of $y$ given a one-unit, permanent increase in $z$.
The previous derivation of the LRP, and the interpretation of $\delta_j$, used the fact that the errors have a zero mean; as usual, this is not much of an assumption, provided an intercept is included in the model. A closer examination of our reasoning shows that we assumed that the change in $z$ during any time period had no effect on the expected value of $u_t$. This is the infinite distributed lag version of the strict exogeneity assumption that we introduced in Chapter 10 (in particular, Assumption TS.3). Formally,

$$E(u_t \mid \ldots, z_{t-2}, z_{t-1}, z_t, z_{t+1}, \ldots) = 0, \tag{18.5}$$

so that the expected value of $u_t$ does not depend on the $z$ in any time period. Although (18.5) is natural for some applications, it rules out other important possibilities. In effect, (18.5) does not allow feedback from $y_t$ to future $z$ because $z_{t+h}$ must be uncorrelated with $u_t$ for $h > 0$. In the inflation/money supply growth example, where $y_t$ is inflation and $z_t$ is money supply growth, (18.5) rules out future changes in money supply growth that are tied to changes in today's inflation rate. Given that money supply policy often attempts to keep interest rates and inflation at certain levels, this might be unrealistic.

One approach to estimating the $\delta_j$, which we cover in the next subsection, requires a strict exogeneity assumption in order to produce consistent estimators of the $\delta_j$. A weaker assumption is

$$E(u_t \mid z_t, z_{t-1}, \ldots) = 0. \tag{18.6}$$

Under (18.6), the error is uncorrelated with current and past $z$, but it may be correlated with future $z$; this allows $z_t$ to be a variable that follows policy rules that depend on past $y$. Sometimes, (18.6) is sufficient to estimate the $\delta_j$; we explain this in the next subsection.

One thing to remember is that neither (18.5) nor (18.6) says anything about the serial correlation properties of $\{u_t\}$. (This is just as in finite distributed lag models.) If anything, we might expect the $\{u_t\}$ to be serially correlated because (18.1) is not generally dynamically complete in the sense discussed in Section 11.4. We will study the serial correlation problem later.

How do we interpret the lag coefficients and the LRP if (18.6) holds but (18.5) does not? The answer is: the same way as before. We can still do the previous thought (or counterfactual) experiment, even though the data we observe are generated by some feedback between $y_t$ and future $z$. For example, we can certainly ask about the long-run effect of a permanent increase in money supply growth on inflation, even though the data on money supply growth cannot be characterized as strictly exogenous.

Exploring Further 18.1: Suppose that $z_s = 0$ for $s < 0$ and that $z_0 = 1$, $z_1 = 1$, and $z_s = 0$ for $s > 1$. Find $E(y_{-1})$, $E(y_0)$, and $E(y_h)$ for $h \geq 1$. What happens as $h \to \infty$?
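The counterfactual calculations behind (18.2) and (18.4) reduce to simple convolution arithmetic. The toy sketch below, our own illustration with a made-up lag distribution, traces out $E(y_h) - \alpha$ for an arbitrary path of $z$; setting the path to a one-period blip reproduces (18.2), and setting it to a permanent step reproduces the partial sums in (18.4).

```python
# Expected path of y (relative to alpha) implied by an IDL model:
# E(y_h) - alpha = sum_j delta_j * z_{h-j}. Toy illustration only.
import numpy as np

delta = 0.8 * (0.5 ** np.arange(20))   # hypothetical lag distribution
horizons = np.arange(10)

z_temp = np.array([1.0])               # one-unit blip at t = 0
z_perm = np.ones(10)                   # permanent one-unit increase at t = 0

def path(z, delta, horizons):
    # E(y_h) - alpha = sum_{j <= h} delta_j * z_{h-j}
    return [sum(delta[j] * z[h - j] for j in range(h + 1) if h - j < len(z))
            for h in horizons]

print(path(z_temp, delta, horizons))   # tends to 0, as in (18.2)
print(path(z_perm, delta, horizons))   # tends to the LRP, as in (18.4)
```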
p 1 ut 188 and yt21 5 a 1 gzt21 1 grzt22 1 gr2zt23 1 p 1 ut21 189 If we multiply the second equation by r and subtract it from the first all but a few of the terms cancel yt 2 ryt21 5 11 2 r2a 1 gzt 1 ut 2 rut21 which we can write as yt 5 a0 1 gzt 1 ryt21 1 ut 2 rut21 1810 where a0 5 11 2 r2a This equation looks like a standard model with a lagged dependent variable where zt appears contemporaneously Because g is the coefficient on zt and r is the coefficient on yt21 it appears that we can estimate these parameters If for some reason we are interested in a we can always obtain a 5 a 011 2 r 2 after estimating r and a0 The simplicity of 1810 is somewhat misleading The error term in this equation ut 2 rut21 is generally correlated with yt21 From 189 it is pretty clear that ut21 and yt21 are correlated Therefore if we write 1810 as yt 5 a0 1 gzt 1 ryt21 1 vt 1811 where vt ut 2 rut21 then we generally have correlation between vt and yt21 Without further assumptions OLS estimation of 1811 produces inconsistent estimates of g and r One case where vt must be correlated with yt21 occurs when ut is independent of zt and all past values of z and y Then 188 is dynamically complete so ut is uncorrelated with yt21 From 189 the covariance between vt and yt21 is 2rVar1ut212 5 2rs2 u which is zero only if r 5 0 We can easily see that vt is serially correlated because 5ut6 is serially uncorrelated E1vtvt212 5 E1utut212 2 rE1u2 t212 2 rE1utut222 1 r2E1ut21ut222 5 2rs2 u For j 1 E1vtvt2j2 5 0 Thus 5vt6 is a moving average process of order one see Section 111 This and equation 1811 Copyright 2016 Cengage Learning All Rights Reserved May not be copied scanned or duplicated in whole or in part Due to electronic rights some third party content may be suppressed from the eBook andor eChapters Editorial review has deemed that any suppressed content does not materially affect the overall learning experience Cengage Learning reserves the right to remove additional content at any time if subsequent rights restrictions require it PART 3 Advanced Topics 572 gives an example of a modelwhich is derived from the original model of interestthat has a lagged dependent variable and a particular kind of serial correlation If we make the strict exogeneity assumption 185 then zt is uncorrelated with ut and ut21 and therefore with vt Thus if we can find a suitable instrumental variable for yt21 then we can estimate 1811 by IV What is a good IV candidate for yt21 By assumption ut and ut21 are both uncorrelated with zt21 so vt is uncorrelated with zt21 If g 2 0 zt21 and yt21 are correlated even after partialling out zt Therefore we can use instruments 1zt zt212 to estimate 1811 Generally the standard errors need to be adjusted for serial correlation in the 5vt6 as we discussed in Section 157 An alternative to IV estimation exploits the fact that 5ut6 may contain a specific kind of serial correlation In particular in addition to 186 suppose that 5ut6 follows the AR1 model ut 5 rut21 1 et 1812 E1et0zt yt21 zt21 p2 5 0 1813 It is important to notice that the r appearing in 1812 is the same parameter multiplying yt21 in 1811 If 1812 and 1813 hold we can write equation 1810 as yt 5 a0 1 gzt 1 ryt21 1 et 1814 which is a dynamically complete model under 1813 From Chapter 11 we can obtain consist ent asymptotically normal estimators of the parameters by OLS This is very convenient as there is no need to deal with serial correlation in the errors If et satisfies the homoskedasticity assumption Var1et0zt yt212 5 s2 e the usual inference 
applies Once we have estimated g and r we can easily esti mate the LRP LRP 5 g11 2 r 2 Many econometrics packages have simple commands that allow one to obtain a standard error for the estimated LRP The simplicity of this procedure relies on the potentially strong assumption that 5ut6 follows an AR1 process with the same r appearing in 187 This is usually no worse than assuming the 5ut6 are serially uncorrelated Nevertheless because consistency of the estimators relies heavily on this assumption it is a good idea to test it A simple test begins by specifying 5ut6 as an AR1 process with a different parameter say ut 5 lut21 1 et McClain and Wooldridge 1995 devised a simple Lagrange multiplier test of H0 l 5 r that can be computed after OLS estimation of 1814 The geometric distributed lag model extends to multiple explanatory variablesso that we have an infinite DL in each explanatory variablebut then we must be able to write the coefficient on zt2j h as ghr j In other words though gh is different for each explanatory variable r is the same Thus we can write yt 5 a0 1 g1zt1 1 p 1 gkztk 1 ryt21 1 vt 1815 The same issues that arose in the case with one z arise in the case with many z Under the natu ral extension of 1812 and 1813just replace zt with zt 5 1zt1 p ztk2OLS is consistent and asymptotically normal Or an IV method can be used 181b Rational Distributed Lag Models The geometric DL implies a fairly restrictive lag distribution When g 0 and r 0 the dj are positive and monotonically declining to zero It is possible to have more general infinite distributed lag models The GDL is a special case of what is generally called a rational distributed lag RDL model A general treatment is beyond our scopeHarvey 1990 is a good referencebut we can cover one simple useful extension Such an RDL model is most easily described by adding a lag of z to equation 1811 yt 5 a0 1 g0zt 1 ryt21 1 g1zt21 1 vt 1816 Copyright 2016 Cengage Learning All Rights Reserved May not be copied scanned or duplicated in whole or in part Due to electronic rights some third party content may be suppressed from the eBook andor eChapters Editorial review has deemed that any suppressed content does not materially affect the overall learning experience Cengage Learning reserves the right to remove additional content at any time if subsequent rights restrictions require it CHAPTER 18 Advanced Time Series Topics 573 where vt 5 ut 2 rut21 as before By repeated substitution it can be shown that 1816 is equivalent to the infinite distributed lag model yt 5 a 1 g01zt 1 rzt21 1 r2zt22 1 p2 1 g11zt21 1 rzt22 1 r2zt23 1 p2 1 ut 5 a 1 g0zt 1 1rg0 1 g12zt21 1 r1rg0 1 g12zt22 1 r21rg0 1 g12zt23 1 p 1 ut where we again need the assumption 0r0 1 From this last equation we can read off the lag distri bution In particular the impact propensity is g0 while the coefficient on zt2h is rh211rg0 1 g12 for h 1 Therefore this model allows the impact propensity to differ in sign from the other lag coef ficients even if r 0 However if r 0 the dh have the same sign as 1rg0 1 g12 for all h 1 The lag distribution is plotted in Figure 181 for r 5 5 g0 5 21 and g1 5 1 The easiest way to compute the longrun propensity is to set y and z at their longrun values for all t say yp and zp and then find the change in yp with respect to zp see also Problem 3 in Chapter 10 We have yp 5 a0 1 g0zp 1 ryp 1 g1zp and solving gives yp 5 a011 2 r2 1 1g0 1 g1211 2 r2zp Now we use the fact that LRP 5 DypDzp LRP 5 1g0 1 g1211 2 r2 Because 0r0 1 the LRP has the same sign as g0 1 g1 
18.1b Rational Distributed Lag Models

The geometric DL implies a fairly restrictive lag distribution. When $\gamma > 0$ and $\rho > 0$, the $\delta_j$ are positive and monotonically declining to zero. It is possible to have more general infinite distributed lag models. The GDL is a special case of what is generally called a rational distributed lag (RDL) model. A general treatment is beyond our scope (Harvey (1990) is a good reference), but we can cover one simple, useful extension.

Such an RDL model is most easily described by adding a lag of $z$ to equation (18.11):

$$y_t = \alpha_0 + \gamma_0 z_t + \rho y_{t-1} + \gamma_1 z_{t-1} + v_t, \tag{18.16}$$

where $v_t = u_t - \rho u_{t-1}$, as before. By repeated substitution, it can be shown that (18.16) is equivalent to the infinite distributed lag model

$$y_t = \alpha + \gamma_0(z_t + \rho z_{t-1} + \rho^2 z_{t-2} + \cdots) + \gamma_1(z_{t-1} + \rho z_{t-2} + \rho^2 z_{t-3} + \cdots) + u_t$$
$$= \alpha + \gamma_0 z_t + (\rho\gamma_0 + \gamma_1)z_{t-1} + \rho(\rho\gamma_0 + \gamma_1)z_{t-2} + \rho^2(\rho\gamma_0 + \gamma_1)z_{t-3} + \cdots + u_t,$$

where we again need the assumption $|\rho| < 1$. From this last equation, we can read off the lag distribution. In particular, the impact propensity is $\gamma_0$, while the coefficient on $z_{t-h}$ is $\rho^{h-1}(\rho\gamma_0 + \gamma_1)$ for $h \geq 1$. Therefore, this model allows the impact propensity to differ in sign from the other lag coefficients, even if $\rho > 0$. However, if $\rho > 0$, the $\delta_h$ have the same sign as $(\rho\gamma_0 + \gamma_1)$ for all $h \geq 1$. The lag distribution is plotted in Figure 18.1 for $\rho = .5$, $\gamma_0 = -1$, and $\gamma_1 = 1$.

[Figure 18.1: Lag distribution for the rational distributed lag (18.16) with $\rho = .5$, $\gamma_0 = -1$, and $\gamma_1 = 1$. The horizontal axis shows the lag, $h$; the vertical axis shows the coefficient, $\delta_h$.]

The easiest way to compute the long-run propensity is to set $y$ and $z$ at their long-run values for all $t$, say $y^*$ and $z^*$, and then find the change in $y^*$ with respect to $z^*$ (see also Problem 3 in Chapter 10). We have $y^* = \alpha_0 + \gamma_0 z^* + \rho y^* + \gamma_1 z^*$, and solving gives $y^* = \alpha_0/(1 - \rho) + [(\gamma_0 + \gamma_1)/(1 - \rho)]z^*$. Now, we use the fact that $\text{LRP} = \Delta y^*/\Delta z^*$:

$$\text{LRP} = (\gamma_0 + \gamma_1)/(1 - \rho).$$

Because $|\rho| < 1$, the LRP has the same sign as $\gamma_0 + \gamma_1$, and the LRP is zero if and only if $\gamma_0 + \gamma_1 = 0$, as in Figure 18.1.
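The lag distribution just derived is easy to tabulate. This small sketch (our own illustration) reproduces the coefficients behind Figure 18.1 and confirms that they sum to the LRP formula above.

```python
# Lag distribution of the RDL model (18.16): delta_0 = gamma0 and
# delta_h = rho**(h-1) * (rho*gamma0 + gamma1) for h >= 1.
import numpy as np

rho, gamma0, gamma1 = 0.5, -1.0, 1.0      # parameters from Figure 18.1

delta = np.array([gamma0] +
                 [rho**(h - 1) * (rho * gamma0 + gamma1)
                  for h in range(1, 21)])
print(delta[:5])        # -1.0, 0.5, 0.25, 0.125, ...
print(delta.sum())      # ~0, matching (gamma0 + gamma1)/(1 - rho) = 0
```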
Example 18.1  Housing Investment and Residential Price Inflation

We estimate both the basic geometric and the rational distributed lag models by applying OLS to (18.14) and (18.16), respectively. The dependent variable is log(invpc) after a linear time trend has been removed (that is, we linearly detrend log(invpc)). For $z_t$, we use the growth in the price index. This allows us to estimate how residential price inflation affects movements in housing investment around its trend. The results of the estimation, using the data in HSEINV, are given in Table 18.1.

Table 18.1  Distributed Lag Models for Housing Investment
Dependent variable: log(invpc), detrended. Standard errors in parentheses.

Independent Variables     Geometric DL     Rational DL
gprice                    3.095 (.933)     3.256 (.970)
y_{-1}                    .340 (.132)      .547 (.152)
gprice_{-1}               —                −2.936 (.973)
constant                  −.010 (.018)     .006 (.017)
Long-run propensity       4.689            .706
Sample size               41               40
Adjusted R-squared        .375             .504

The geometric distributed lag model is clearly rejected by the data, as gprice_{-1} is very significant. The adjusted R-squareds also show that the RDL model fits much better.

The two models give very different estimates of the long-run propensity. If we incorrectly use the GDL, the estimated LRP is almost five: a permanent one percentage point increase in residential price inflation increases long-term housing investment by 4.7% (above its trend value). Economically, this seems implausible. The LRP estimated from the rational distributed lag model is below one. In fact, we cannot reject the null hypothesis $H_0\colon \gamma_0 + \gamma_1 = 0$ at any reasonable significance level (p-value = .832), so there is no evidence that the LRP is different from zero. This is a good example of how misspecifying the dynamics of a model (by omitting relevant lags) can lead to erroneous conclusions.

18.2 Testing for Unit Roots

We now turn to the important problem of testing whether a time series follows a unit root process. In Chapter 11, we gave some vague, necessarily informal guidelines to decide whether a series is I(1) or not. In many cases, it is useful to have a formal test for a unit root. As we will see, such tests must be applied with caution.

The simplest approach to testing for a unit root begins with an AR(1) model:

$$y_t = \alpha + \rho y_{t-1} + e_t, \quad t = 1, 2, \ldots, \tag{18.17}$$

where $y_0$ is the observed initial value. Throughout this section, we let $\{e_t\}$ denote a process that has zero mean, given past observed $y$:

$$E(e_t \mid y_{t-1}, y_{t-2}, \ldots, y_0) = 0. \tag{18.18}$$

(Under (18.18), $\{e_t\}$ is said to be a martingale difference sequence with respect to $\{y_{t-1}, y_{t-2}, \ldots\}$. If $\{e_t\}$ is assumed to be i.i.d. with zero mean and is independent of $y_0$, then it also satisfies (18.18).)

If $\{y_t\}$ follows (18.17), it has a unit root if, and only if, $\rho = 1$. If $\alpha = 0$ and $\rho = 1$, $\{y_t\}$ follows a random walk without drift (with the innovations $e_t$ satisfying (18.18)). If $\alpha \neq 0$ and $\rho = 1$, $\{y_t\}$ is a random walk with drift, which means that $E(y_t)$ is a linear function of $t$. A unit root process with drift behaves very differently from one without drift. Nevertheless, it is common to leave $\alpha$ unspecified under the null hypothesis, and this is the approach we take. Therefore, the null hypothesis is that $\{y_t\}$ has a unit root:

$$H_0\colon \rho = 1. \tag{18.19}$$

In almost all cases, we are interested in the one-sided alternative

$$H_1\colon \rho < 1. \tag{18.20}$$

(In practice, this means $0 < \rho < 1$, as $\rho < 0$ for a series that we suspect has a unit root would be very rare.) The alternative $H_1\colon \rho > 1$ is not usually considered, since it implies that $y_t$ is explosive. (In fact, if $\alpha > 0$, $y_t$ has an exponential trend in its mean when $\rho > 1$.)

When $|\rho| < 1$, $\{y_t\}$ is a stable AR(1) process, which means it is weakly dependent or asymptotically uncorrelated. Recall from Chapter 11 that $\text{Corr}(y_t, y_{t+h}) = \rho^h \to 0$ when $|\rho| < 1$. Therefore, testing (18.19) in model (18.17), with the alternative given by (18.20), is really a test of whether $\{y_t\}$ is I(1) against the alternative that $\{y_t\}$ is I(0). (We do not take the null to be I(0) in this setup because $\{y_t\}$ is I(0) for any value of $\rho$ strictly between −1 and 1, something that classical hypothesis testing does not handle easily. There are tests where the null hypothesis is I(0) against the alternative of I(1), but these take a different approach. See, for example, Kwiatkowski, Phillips, Schmidt, and Shin (1992).)

A convenient equation for carrying out the unit root test is to subtract $y_{t-1}$ from both sides of (18.17) and to define $\theta = \rho - 1$:

$$\Delta y_t = \alpha + \theta y_{t-1} + e_t. \tag{18.21}$$

Under (18.18), this is a dynamically complete model, and so it seems straightforward to test $H_0\colon \theta = 0$ against $H_1\colon \theta < 0$. The problem is that, under $H_0$, $y_{t-1}$ is I(1), and so the usual central limit theorem that underlies the asymptotic standard normal distribution for the t statistic does not apply: the t statistic does not have an approximate standard normal distribution even in large sample sizes. The asymptotic distribution of the t statistic under $H_0$ has come to be known as the Dickey-Fuller distribution, after Dickey and Fuller (1979).

Although we cannot use the usual critical values, we can use the usual t statistic for $\hat{\theta}$ in (18.21), at least once the appropriate critical values have been tabulated. The resulting test is known as the Dickey-Fuller (DF) test for a unit root. The theory used to obtain the asymptotic critical values is rather complicated and is covered in advanced texts on time series econometrics. (See, for example, Banerjee, Dolado, Galbraith, and Hendry (1993), or BDGH for short.) By contrast, using these results is very easy. The critical values for the t statistic have been tabulated by several authors, beginning with the original work by Dickey and Fuller (1979). Table 18.2 contains the large sample critical values for various significance levels, taken from BDGH (1993, Table 4.2). (Critical values adjusted for small sample sizes are available in BDGH.)

We reject the null hypothesis $H_0\colon \theta = 0$ against $H_1\colon \theta < 0$ if $t_{\hat{\theta}} < c$, where $c$ is one of the negative values in Table 18.2. For example, to carry out the test at the 5% significance level, we reject if $t_{\hat{\theta}} < -2.86$. This requires a t statistic with a much larger magnitude than if we used the standard normal critical value, which would be −1.65. If we use the standard normal critical value to test for a unit root, we would reject $H_0$ much more often than 5% of the time when $H_0$ is true.
Table 18.2  Asymptotic Critical Values for Unit Root t Test: No Time Trend

Significance level:  1%      2.5%    5%      10%
Critical value:      -3.43   -3.12   -2.86   -2.57

Example 18.2  Unit Root Test for Three-Month T-Bill Rates

We use the quarterly data in INTQRT to test for a unit root in three-month T-bill rates. When we estimate (18.21), we obtain

$\widehat{\Delta r3}_t = .625 - .091\, r3_{t-1}$  (18.22)
            (.261)  (.037)
$n = 123,\ R^2 = .048,$

where we keep with our convention of reporting standard errors in parentheses below the estimates. We must remember that these standard errors cannot be used to construct usual confidence intervals or to carry out traditional t tests, because these do not behave in the usual ways when there is a unit root. The coefficient on $r3_{t-1}$ shows that the estimate of $\rho$ is $\hat{\rho} = 1 + \hat{\theta} = .909$. While this is less than unity, we do not know whether it is statistically less than one. The t statistic on $r3_{t-1}$ is $-.091/.037 = -2.46$. From Table 18.2, the 10% critical value is $-2.57$; therefore, we fail to reject $H_0\colon \rho = 1$ against $H_1\colon \rho < 1$ at the 10% significance level.

As with other hypothesis tests, when we fail to reject $H_0$, we do not say that we accept $H_0$. Why? Suppose we test $H_0\colon \rho = .9$ in the previous example using a standard t test—which is asymptotically valid, because $y_t$ is I(0) under $H_0$. Then, we obtain $t = .009/.037 \approx .24$, which is very small and provides no evidence against $\rho = .9$. Yet it makes no sense to accept both $\rho = 1$ and $\rho = .9$.

When we fail to reject a unit root, as in the previous example, we should only conclude that the data do not provide strong evidence against $H_0$. In this example, the test does provide some evidence against $H_0$ because the t statistic is close to the 10% critical value. (Ideally, we would compute a p-value, but this requires special software because of the nonnormal distribution.) In addition, though $\hat{\rho} = .91$ implies a fair amount of persistence in $\{r3_t\}$, the correlation between observations that are 10 periods apart for an AR(1) model with $\rho = .9$ is about .35, rather than almost one if $\rho = 1$.

What happens if we now want to use $r3_t$ as an explanatory variable in a regression analysis? The outcome of the unit root test implies that we should be extremely cautious: if $r3_t$ does have a unit root, the usual asymptotic approximations need not hold, as we discussed in Chapter 11. One solution is to use the first difference of $r3_t$ in any analysis. As we will see in Section 18.4, that is not the only possibility.

We also need to test for unit roots in models with more complicated dynamics. If $\{y_t\}$ follows (18.17) with $\rho = 1$, then $\Delta y_t$ is serially uncorrelated. We can easily allow $\{\Delta y_t\}$ to follow an AR model by augmenting equation (18.21) with additional lags. For example,

$\Delta y_t = \alpha + \theta y_{t-1} + \gamma_1 \Delta y_{t-1} + e_t,$  (18.23)

where $|\gamma_1| < 1$. This ensures that, under $H_0\colon \theta = 0$, $\{\Delta y_t\}$ follows a stable AR(1) model. Under the alternative $H_1\colon \theta < 0$, it can be shown that $\{y_t\}$ follows a stable AR(2) model. More generally, we can add $p$ lags of $\Delta y_t$ to the equation to account for the dynamics in the process.
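The augmented regression can be built by hand or, assuming the statsmodels library is available, run with its adfuller function, which is one common implementation and tabulates the DF critical values internally. A sketch on simulated I(1) data:

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.tsa.stattools import adfuller

rng = np.random.default_rng(1)
y = np.cumsum(rng.normal(size=300))      # simulated I(1) series

# Augmented DF regression with p = 1 lagged change, built by hand
dy = np.diff(y)
X = np.column_stack([y[1:-1], dy[:-1]])  # y_{t-1} and Delta y_{t-1}
res = sm.OLS(dy[1:], sm.add_constant(X)).fit()
print(f"ADF t statistic (manual): {res.tvalues[1]:.2f}")

# The same test via statsmodels; regression='c' includes an intercept
stat, pval, usedlag, nobs, crit = adfuller(y, maxlag=1, regression="c",
                                           autolag=None)
print(f"ADF t statistic: {stat:.2f}, 5% critical value: {crit['5%']:.2f}")
```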
The way we test the null hypothesis of a unit root is very similar: we run the regression of

$\Delta y_t \text{ on } y_{t-1}, \Delta y_{t-1}, \ldots, \Delta y_{t-p}$  (18.24)

and carry out the t test on $\hat{\theta}$, the coefficient on $y_{t-1}$, just as before. This extended version of the Dickey-Fuller test is usually called the augmented Dickey-Fuller test because the regression has been augmented with the lagged changes, $\Delta y_{t-h}$. The critical values and rejection rule are the same as before. The inclusion of the lagged changes in (18.24) is intended to clean up any serial correlation in $\Delta y_t$. The more lags we include in (18.24), the more initial observations we lose. If we include too many lags, the small-sample power of the test generally suffers. But if we include too few lags, the size of the test will be incorrect, even asymptotically, because the validity of the critical values in Table 18.2 relies on the dynamics being completely modeled. Often, the lag length is dictated by the frequency of the data (as well as the sample size). For annual data, one or two lags usually suffice. For monthly data, we might include 12 lags. But there are no hard rules to follow in any case.

Interestingly, the t statistics on the lagged changes have approximate t distributions. The F statistics for joint significance of any group of terms $\Delta y_{t-h}$ are also asymptotically valid. (These maintain the homoskedasticity assumption discussed in Section 11.5.) Therefore, we can use standard tests to determine whether we have enough lagged changes in (18.24).

Example 18.3  Unit Root Test for Annual U.S. Inflation

We use annual data on U.S. inflation, based on the CPI, to test for a unit root in inflation (see PHILLIPS), restricting ourselves to the years 1948 through 1996. Allowing for one lag of $\Delta inf_t$ in the augmented Dickey-Fuller regression gives

$\widehat{\Delta inf}_t = 1.36 - .310\, inf_{t-1} + .138\, \Delta inf_{t-1}$
            (.517)  (.103)         (.126)
$n = 47,\ R^2 = .172.$

The t statistic for the unit root test is $-.310/.103 = -3.01$. Because the 5% critical value is $-2.86$, we reject the unit root hypothesis at the 5% level. The estimate of $\rho$ is about .690. Together, this is reasonably strong evidence against a unit root in inflation. The lag $\Delta inf_{t-1}$ has a t statistic of about 1.10, so we do not need to include it, but we could not know this ahead of time. (If we drop $\Delta inf_{t-1}$, the evidence against a unit root is slightly stronger: $\hat{\theta} = -.335$ ($\hat{\rho} = .665$) and $t_{\hat{\theta}} = -3.13$.)

For series that have clear time trends, we need to modify the test for unit roots. A trend-stationary process—which has a linear trend in its mean but is I(0) about its trend—can be mistaken for a unit root process if we do not control for a time trend in the Dickey-Fuller regression. In other words, if we carry out the usual DF or augmented DF test on a trending but I(0) series, we will probably have little power for rejecting a unit root.

To allow for series with time trends, we change the basic equation to

$\Delta y_t = \alpha + \delta t + \theta y_{t-1} + e_t,$  (18.25)

where, again, the null hypothesis is $H_0\colon \theta = 0$ and the alternative is $H_1\colon \theta < 0$. Under the alternative, $\{y_t\}$ is a trend-stationary process. If $y_t$ has a unit root, then $\Delta y_t = \alpha + \delta t + e_t$, and so the change in $y_t$ has a mean linear in $t$ unless $\delta = 0$. [It can be shown that $\mathrm{E}(y_t)$ is actually a quadratic in $t$.] It is unusual for the first difference of an economic series to have a linear trend, so a more appropriate null hypothesis is probably $H_0\colon \theta = 0,\ \delta = 0$.
Although it is possible to test this joint hypothesis using an F test—but with modified critical values—it is common to test $H_0\colon \theta = 0$ using only a t test. We follow that approach here. [See BDGH (1993, Section 4.4) for more details on the joint test.]

When we include a time trend in the regression, the critical values of the test change. Intuitively, this occurs because detrending a unit root process tends to make it look more like an I(0) process. Therefore, we require a larger magnitude for the t statistic in order to reject $H_0$. The Dickey-Fuller critical values for the t test that includes a time trend are given in Table 18.3; they are taken from BDGH (1993, Table 4.2).

Table 18.3  Asymptotic Critical Values for Unit Root t Test: Linear Time Trend

Significance level:  1%      2.5%    5%      10%
Critical value:      -3.96   -3.66   -3.41   -3.12
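A sketch of the trend version (18.25), again on simulated data; here the series is trend-stationary by construction, so with enough data the test should tend to reject. All names and parameter values are illustrative only.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(7)
T = 150
# trend-stationary series: linear trend plus stable AR(1) noise
e = rng.normal(size=T)
u = np.zeros(T)
for s in range(1, T):
    u[s] = 0.5 * u[s - 1] + e[s]
y = 0.02 * np.arange(T) + u

# DF regression with trend (18.25): Delta y_t on 1, t, y_{t-1}
dy = np.diff(y)
X = sm.add_constant(np.column_stack([np.arange(1, T), y[:-1]]))
res = sm.OLS(dy, X).fit()
print(f"t statistic on y_(t-1): {res.tvalues[2]:.2f}")
# Use Table 18.3: reject a unit root at the 5% level only if this is below -3.41
```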
distribution 1unless0r0 12 The asymptotic distribution of this t statistic is known but it is rarely used Typically we rely on intuition or plots of the time series to decide whether to include a trend in the DF test There are many other variants on unit root tests In one version that is applicable only to series that are clearly not trending the intercept is omitted from the regression that is a is set to zero in 1821 This variant of the DickeyFuller test is rarely used because of biases induced if a 2 0 Also we can allow for more complicated time trends such as quadratic Again this is seldom used Another class of tests attempts to account for serial correlation in Dyt in a different manner than by including lags in 1821 or 1825 The approach is related to the serial correlationrobust stand ard errors for the OLS estimators that we discussed in Section 125 The idea is to be as agnostic as possible about serial correlation in Dyt In practice the augmented DickeyFuller test has held up pretty well See BDGH 1993 Section 43 for a discussion on other tests 183 Spurious Regression In a crosssectional environment we use the phrase spurious correlation to describe a situation where two variables are related through their correlation with a third variable In particular if we regress y on x we find a significant relationship But when we control for another variable say z the Copyright 2016 Cengage Learning All Rights Reserved May not be copied scanned or duplicated in whole or in part Due to electronic rights some third party content may be suppressed from the eBook andor eChapters Editorial review has deemed that any suppressed content does not materially affect the overall learning experience Cengage Learning reserves the right to remove additional content at any time if subsequent rights restrictions require it CHAPTER 18 Advanced Time Series Topics 579 partial effect of x on y becomes zero Naturally this can also happen in time series contexts with I0 variables As we discussed in Section 105 it is possible to find a spurious relationship between time series that have increasing or decreasing trends Provided the series are weakly dependent about their time trends the problem is effectively solved by including a time trend in the regression model When we are dealing with integrated processes of order one there is an additional complication Even if the two series have means that are not trending a simple regression involving two independ ent I1 series will often result in a significant t statistic To be more precise let 5xt6 and 5yt6 be random walks generated by xt 5 xt21 1 at t 5 1 2 p 1827 and yt 5 yt21 1 et t 5 1 2 p 1828 where 5at6 and 5et6 are independent identically distributed innovations with mean zero and vari ances s2 a and s2 e respectively For concreteness take the initial values to be x0 5 y0 5 0 Assume fur ther that 5at6 and 5et6 are independent processes This implies that 5xt6 and 5yt6 are also independent But what if we run the simple regression yt 5 b 0 1 b 1xt 1829 and obtain the usual t statistic for b 1 and the usual Rsquared Because yt and xt are independent we would hope that plim b 1 5 0 Even more importantly if we test H0 b1 5 0 against H1 b1 2 0 at the 5 level we hope that the t statistic for b 1 is insignificant 95 of the time Through a simula tion Granger and Newbold 1974 showed that this is not the case even though yt and xt are inde pendent the regression of yt on xt yields a statistically significant t statistic a large percentage of the time much larger than the 
nominal significance level Granger and Newbold called this the spurious regression problem there is no sense in which y and x are related but an OLS regression using the usual t statistics will often indicate a relationship Recent simulation results are given by Davidson and MacKinnon 1993 Table 191 where at and et are generated as independent identically dis tributed normal random variables and 10000 dif ferent samples are generated For a sample size of n 5 50 at the 5 significance level the standard t statistic for H0 b1 5 0 against the twosided alter native rejects H0 about 662 of the time under H0 rather than 5 of the time As the sample size increases things get worse with n 5 250 the null is rejected 847 of the time Here is one way to see what is happening when we regress the level of y on the level of x Write the model underlying 1829 as yt 5 b0 1 b1xt 1 ut 1830 For the t statistic of b 1 to have an approximate standard normal distribution in large samples at a min imum 5ut6 should be a mean zero serially uncorrelated process But under H0 b1 5 0 yt 5 b0 1 ut and because 5yt6 is a random walk starting at y0 5 0 equation 1830 holds under H0 only if b0 5 0 and more importantly if ut 5 yt 5 a t j51ej In other words 5ut6 is a random walk under H0 This clearly violates even the asymptotic version of the GaussMarkov assumptions from Chapter 11 Including a time trend does not really change the conclusion If yt or xt is a random walk with drift and a time trend is not included the spurious regression problem is even worse The same quali tative conclusions hold if 5at6 and 5et6 are general I0 processes rather than iid sequences Under the preceding setup where 5xt6 and 5yt6 are generated by 1827 and 1828 and 5et6 and 5at6 are iid sequences what is the plim of the slope coefficient say g 1 from the regression of Dyt on Dxt Describe the behavior of the t statistic of g 1 Exploring FurthEr 182 Copyright 2016 Cengage Learning All Rights Reserved May not be copied scanned or duplicated in whole or in part Due to electronic rights some third party content may be suppressed from the eBook andor eChapters Editorial review has deemed that any suppressed content does not materially affect the overall learning experience Cengage Learning reserves the right to remove additional content at any time if subsequent rights restrictions require it PART 3 Advanced Topics 580 In addition to the usual t statistic not having a limiting standard normal distributionin fact it increases to infinity as n S the behavior of Rsquared is nonstandard In crosssectional con texts or in regressions with I0 time series variables the Rsquared converges in probability to the population Rsquared 1 2 s2 us2 y This is not the case in spurious regressions with I1 processes Rather than the Rsquared having a welldefined plim it actually converges to a random variable Formalizing this notion is well beyond the scope of this text A discussion of the asymptotic proper ties of the t statistic and the Rsquared can be found in BDGH Section 31 The implication is that the Rsquared is large with high probability even though 5yt6 and 5xt6 are independent time series processes The same considerations arise with multiple independent variables each of which may be I1 or some of which may be I0 If 5yt6 is I1 and at least some of the explanatory variables are I1 the regression results may be spurious The possibility of spurious regression with I1 variables is quite important and has led econo mists to reexamine many aggregate time series regressions 
whose t statistics were very significant and whose Rsquareds were extremely high In the next section we show that regressing an I1 depend ent variable on an I1 independent variable can be informative but only if these variables are related in a precise sense 184 Cointegration and Error Correction Models The discussion of spurious regression in the previous section certainly makes one wary of using the levels of I1 variables in regression analysis In earlier chapters we suggested that I1 variables should be differenced before they are used in linear regression models whether they are estimated by OLS or instrumental variables This is certainly a safe course to follow and it is the approach used in many time series regressions after Granger and Newbolds original paper on the spurious regression problem Unfortunately always differencing I1 variables limits the scope of the questions that we can answer 184a Cointegration The notion of cointegration which was given a formal treatment in Engle and Granger 1987 makes regressions involving I1 variables potentially meaningful A full treatment of cointegration is mathematically involved but we can describe the basic issues and methods that are used in many applications If 5yt t 5 0 1 p6 and 5xt t 5 0 1 p6 are two I1 processes then in general yt 2 bxt is an I1 process for any number b Nevertheless it is possible that for some b 2 0 yt 2 bxt is an I0 process which means it has constant mean constant variance and autocorrelations that depend only on the time distance between any two variables in the series and it is asymptotically uncorrelated If such a b exists we say that y and x are cointe grated and we call b the cointegration parameter Alternatively we could look at xt 2 gyt for g 2 0 if yt 2 bxt is I0 then xt 2 11b2yt is I0 Therefore the linear combination of yt and xt is not unique but if we fix the coefficient on yt at unity then b is unique See Problem 3 For concreteness we consider linear combinations of the form yt 2 bxt For the sake of illustration take b 5 1 suppose that y0 5 x0 5 0 and write yt 5 yt21 1 rt xt 5 xt21 1 vt where 5rt6 and 5vt6 are two I0 processes with zero means Then yt and xt have a tendency to wander around and not return to the initial value of zero with any regularity By contrast if yt 2 xt is I0 it has zero mean and does return to zero with some regularity Let 5 1yt xt2 t 5 1 2 p6 be a bivariate time series where each series is I1 without drift Explain why if yt and xt are cointegrated yt and xt21 are also cointegrated Exploring FurthEr 183 Copyright 2016 Cengage Learning All Rights Reserved May not be copied scanned or duplicated in whole or in part Due to electronic rights some third party content may be suppressed from the eBook andor eChapters Editorial review has deemed that any suppressed content does not materially affect the overall learning experience Cengage Learning reserves the right to remove additional content at any time if subsequent rights restrictions require it CHAPTER 18 Advanced Time Series Topics 581 As a specific example let r6t be the annualized interest rate for sixmonth Tbills at the end of quarter t and let r3t be the annualized interest rate for threemonth Tbills These are typically called bond equivalent yields and they are reported in the financial pages In Example 182 using the data in INTQRT we found little evidence against the hypothesis that r3t has a unit root the same is true of r6t Define the spread between six and threemonth Tbill rates as sprt 5 r6t 2 r3t Then using equation 1821 
the DickeyFuller t statistic for sprt is 771 with u 5 267 or r 5 33 Therefore we strongly reject a unit root for sprt in favor of I0 The upshot of this is that though r6t and r3t each appear to be unit root processes the difference between them is an I0 process In other words r6 and r3 are cointegrated Cointegration in this example as in many examples has an economic interpretation If r6 and r3 were not cointegrated the difference between interest rates could become very large with no ten dency for them to come back together Based on a simple arbitrage argument this seems unlikely Suppose that the spread sprt continues to grow for several time periods making sixmonth Tbills a much more desirable investment Then investors would shift away from threemonth and toward sixmonth Tbills driving up the price of sixmonth Tbills while lowering the price of threemonth Tbills Because interest rates are inversely related to price this would lower r6 and increase r3 until the spread is reduced Therefore large deviations between r6 and r3 are not expected to continue the spread has a tendency to return to its mean value The spread actually has a slightly positive mean because longterm investors are more rewarded relative to shortterm investors There is another way to characterize the fact that sprt will not deviate for long periods from its average value r6 and r3 have a longrun relationship To describe what we mean by this let m 5 E1sprt2 denote the expected value of the spread Then we can write r6t 5 r3t 1 m 1 et where 5et6 is a zero mean I0 process The equilibrium or longrun relationship occurs when et 5 0 or r6p 5 r3p 1 m At any time period there can be deviations from equilibrium but they will be tem porary there are economic forces that drive r6 and r3 back toward the equilibrium relationship In the interest rate example we used economic reasoning to tell us the value of b if yt and xt are cointegrated If we have a hypothesized value of b then testing whether two series are cointegrated is easy we simply define a new variable st 5 yt 2 bxt and apply either the usual DF or augmented DF test to 5st6 If we reject a unit root in 5st6 in favor of the I0 alternative then we find that yt and xt are cointegrated In other words the null hypothesis is that yt and xt are not cointegrated Testing for cointegration is more difficult when the potential cointegration parameter b is unknown Rather than test for a unit root in 5st6 we must first estimate b If yt and xt are cointegrated it turns out that the OLS estimator b from the regression yt 5 a 1 b xt 1831 is consistent for b The problem is that the null hypothesis states that the two series are not cointe grated which means that under H0 we are running a spurious regression Fortunately it is possible to tabulate critical values even when b is estimated where we apply the DickeyFuller or augmented DickeyFuller test to the residuals say u t 5 yt 2 a 2 b xt from 1831 The only difference is that the critical values account for estimation of b The resulting test is called the EngleGranger test and the asymptotic critical values are given in Table 184 These are taken from Davidson and MacKinnon 1993 Table 202 TAblE 184 Asymptotic Critical Values for Cointegration Test No Time Trend Significance level 1 25 5 10 Critical value 2390 2359 2334 2304 Copyright 2016 Cengage Learning All Rights Reserved May not be copied scanned or duplicated in whole or in part Due to electronic rights some third party content may be suppressed from the eBook andor eChapters Editorial 
review has deemed that any suppressed content does not materially affect the overall learning experience Cengage Learning reserves the right to remove additional content at any time if subsequent rights restrictions require it PART 3 Advanced Topics 582 In the basic test we run the regression of Du t on u t21 and compare the t statistic on u t21 to the desired critical value in Table 184 If the t statistic is below the critical value we have evidence that yt 2 bxt is I0 for some b that is yt and xt are cointegrated We can add lags of Du t to account for serial correlation If we compare the critical values in Table 184 with those in Table 182 we must get a t statistic much larger in magnitude to find cointegration than if we used the usual DF critical values This happens because OLS which minimizes the sum of squared residuals tends to produce residuals that look like an I0 sequence even if yt and xt are not cointegrated As with the usual DickeyFuller test we can augment the EngleGranger test by including lags of Du t as additional regressors If yt and xt are not cointegrated a regression of yt on xt is spurious and tells us nothing meaning ful there is no longrun relationship between y and x We can still run a regression involving the first differences Dyt and Dxt including lags But we should interpret these regressions for what they are they explain the difference in y in terms of the difference in x and have nothing necessarily to do with a relationship in levels If yt and xt are cointegrated we can use this to specify more general dynamic models as we will see in the next subsection The previous discussion assumes that neither yt nor xt has a drift This is reasonable for interest rates but not for other time series If yt and xt contain drift terms E1yt2 and E1xt2 are linear usually increasing functions of time The strict definition of cointegration requires yt 2 bxt to be I0 without a trend To see what this entails write yt 5 dt 1 gt and xt 5 lt 1 ht where 5gt6 and 5ht6 are I1 processes d is the drift in yt3d 5 E1Dyt2 4 and l is the drift in xt3l 5 E1Dxt2 4 Now if yt and xt are cointegrated there must exist b such that gt 2 bht is I0 But then yt 2 bxt 5 1d 2 bl2t 1 1gt 2 bht2 which is generally a trendstationary process The strict form of cointegration requires that there not be a trend which means d 5 bl For I1 processes with drift it is possible that the stochastic parts that is gt and htare cointegrated but that the parameter b that causes gt 2 bht to be I0 does not eliminate the linear time trend We can test for cointegration between gt and ht without taking a stand on the trend part by run ning the regression yt 5 a 1 h t 1 b xt 1832 and applying the usual DF or augmented DF test to the residuals u t The asymptotic critical values are given in Table 185 from Davidson and MacKinnon 1993 Table 202 A finding of cointegration in this case leaves open the possibility that yt 2 bxt has a linear trend But at least it is not I1 ExamplE 185 Cointegration between Fertility and personal Exemption In Chapters 10 and 11 we studied various models to estimate the relationship between the general fertility rate gfr and the real value of the personal tax exemption pe in the United States The static regression results in levels and first differences are notably different The regression in levels with a time trend included gives an OLS coefficient on pe equal to 187 1se 5 0352 and R2 5 500 In first differences without a trend the coefficient on Dpe is 2043 1se 5 0282 and R2 5 032 Although there are 
other reasons for these differencessuch as misspecified distributed lag dynamicsthe TAblE 185 Asymptotic Critical Values for Cointegration Test Linear Time Trend Significance level 1 25 5 10 Critical value 2432 2403 2378 2350 Copyright 2016 Cengage Learning All Rights Reserved May not be copied scanned or duplicated in whole or in part Due to electronic rights some third party content may be suppressed from the eBook andor eChapters Editorial review has deemed that any suppressed content does not materially affect the overall learning experience Cengage Learning reserves the right to remove additional content at any time if subsequent rights restrictions require it CHAPTER 18 Advanced Time Series Topics 583 discrepancy between the levels and changes regressions suggests that we should test for cointegration Of course this presumes that gfr and pe are I1 processes This appears to be the case the augmented DF tests with a single lagged change and a linear time trend each yield t statistics of about 147 and the estimated AR1 coefficients are close to one When we obtain the residuals from the regression of gfr on t and pe and apply the augmented DF test with one lag we obtain a t statistic on u t21 of 2243 which is nowhere near the 10 criti cal value 2350 Therefore we must conclude that there is little evidence of cointegration between gfr and pe even allowing for separate trends It is very likely that the earlier regression results we obtained in levels suffer from the spurious regression problem The good news is that when we used first differences and allowed for two lagssee equation 1127we found an overall positive and significant longrun effect of Dpe on Dgfr If we think two series are cointegrated we often want to test hypotheses about the cointegrating parameter For example a theory may state that the cointegrating parameter is one Ideally we could use a t statistic to test this hypothesis We explicitly cover the case without time trends although the extension to the linear trend case is immediate When yt and xt are I1 and cointegrated we can write yt 5 a 1 bxt 1 ut 1833 where ut is a zero mean I0 process Generally 5ut6 contains serial correlation but we know from Chapter 11 that this does not affect consistency of OLS As mentioned earlier OLS applied to 1833 consistently estimates b and a Unfortunately because xt is I1 the usual inference procedures do not necessarily apply OLS is not asymptotically normally distributed and the t statistic for b does not necessarily have an approximate t distribution We do know from Chapter 10 that if 5xt6 is strictly exogenoussee Assumption TS3and the errors are homoskedastic serially uncorrelated and nor mally distributed the OLS estimator is also normally distributed conditional on the explanatory vari ables and the t statistic has an exact t distribution Unfortunately these assumptions are too strong to apply to most situations The notion of cointegration implies nothing about the relationship between 5xt6 and 5ut6indeed they can be arbitrarily correlated Further except for requiring that 5ut6 is I0 cointegration between yt and xt does not restrict the serial dependence in 5ut6 Fortunately the feature of 1833 that makes inference the most difficultthe lack of strict exogeneity of 5xt6can be fixed Because xt is I1 the proper notion of strict exogeneity is that ut is uncorrelated with Dxs for all t and s We can always arrange this for a new set of errors at least approximately by writing ut as a function of the Dxs for all s close to t For example ut 
5 h 1 f0Dxt 1 f1Dxt21 1 f2Dxt22 1834 1 g1Dxt11 1 g2Dxt12 1 et where by construction et is uncorrelated with each Dxs appearing in the equation The hope is that et is uncorrelated with further lags and leads of Dxs We know that as 0s 2 t0 gets large the correlation between et and Dxs approaches zero because these are I0 processes Now if we plug 1834 into 1833 we obtain yt 5 a0 1 bxt 1 f0Dxt 1 f1Dxt21 1 f2Dxt22 1835 1 g1Dxt11 1 g2Dxt12 1 et This equation looks a bit strange because future Dxs appear with both current and lagged Dxt The key is that the coefficient on xt is still b and by construction xt is now strictly exogenous in this equation The strict exogeneity assumption is the important condition needed to obtain an approxi mately normal t statistic for b If ut is uncorrelated with all Dxs s 2 t then we can drop the leads and lags of the changes and simply include the contemporaneous change Dxt Then the equation Copyright 2016 Cengage Learning All Rights Reserved May not be copied scanned or duplicated in whole or in part Due to electronic rights some third party content may be suppressed from the eBook andor eChapters Editorial review has deemed that any suppressed content does not materially affect the overall learning experience Cengage Learning reserves the right to remove additional content at any time if subsequent rights restrictions require it PART 3 Advanced Topics 584 we estimate looks more standard but still includes the first difference of xt along with its level yt 5 a0 1 bxt 1 f0Dxt 1 et In effect adding Dxt solves any contemporaneous endogeneity between xt and ut Remember any endogeneity does not cause inconsistency But we are trying to obtain an asymptotically normal t statistic Whether we need to include leads and lags of the changes and how many is really an empirical issue Each time we add an additional lead or lag we lose one observa tion and this can be costly unless we have a large data set The OLS estimator of b from 1835 is called the leads and lags estimator of b because of the way it employs Dx See for example Stock and Watson 1993 The only issue we must worry about in 1835 is the possibility of serial correlation in 5et6 This can be dealt with by computing a serial correlationrobust standard error for b as described in Section 125 or by using a standard AR1 correction such as CochraneOrcutt ExamplE 186 Cointegrating parameter for Interest Rates Earlier we tested for cointegration between r6 and r3six and threemonth Tbill ratesby assum ing that the cointegrating parameter was equal to one This led us to find cointegration and naturally to conclude that the cointegrating parameter is equal to unity Nevertheless let us estimate the cointe grating parameter directly and test H0 b 5 1 We apply the leads and lags estimator with two leads and two lags of Dr3 as well as the contemporaneous change The estimate of b is b 5 1038 and the usual OLS standard error is 0081 Therefore the t statistic for H0 b 5 1 is 11038 2 120081 469 which is a strong statistical rejection of H0 Of course whether 1038 is economically different from 1 is a relevant consideration There is little evidence of serial correlation in the residuals so we can use this t statistic as having an approximate normal distribution For comparison the OLS estimate of b without the leads lags or contemporaneous Dr3 termsand using five more observationsis 1026 1se 5 00772 But the t statistic from 1833 is not necessarily valid There are many other estimators of cointegrating parameters and this continues to be 
a very active area of research The notion of cointegration applies to more than two processes but the inter pretation testing and estimation are much more complicated One issue is that even after we nor malize a coefficient to be one there can be many cointegrating relationships BDGH provide some discussion and several references 184b Error Correction Models In addition to learning about a potential longrun relationship between two series the concept of coin tegration enriches the kinds of dynamic models at our disposal If yt and xt are I1 processes and are not cointegrated we might estimate a dynamic model in first differences As an example consider the equation Dyt 5 a0 1 a1Dyt21 1 g0Dxt 1 g1Dxt21 1 ut 1836 where ut has zero mean given Dxt Dyt21 Dxt21 and further lags This is essentially equation 1816 but in first differences rather than in levels If we view this as a rational distributed lag model we can find the impact propensity longrun propensity and lag distribution for Dy as a distributed lag in Dx If yt and xt are cointegrated with parameter b then we have additional I0 variables that we can include in 1836 Let st 5 yt 2 bxt so that st is I0 and assume for the sake of simplicity that st has zero mean Now we can include lags of st in the equation In the simplest case we include one lag of st Dyt 5 a0 1 a1Dyt21 1 g0Dxt 1 g1Dxt21 1 dst21 1 ut 1837 5 a0 1 a1Dyt21 1 g0Dxt 1 g1Dxt21 1 d1yt21 2 bxt212 1 ut Copyright 2016 Cengage Learning All Rights Reserved May not be copied scanned or duplicated in whole or in part Due to electronic rights some third party content may be suppressed from the eBook andor eChapters Editorial review has deemed that any suppressed content does not materially affect the overall learning experience Cengage Learning reserves the right to remove additional content at any time if subsequent rights restrictions require it CHAPTER 18 Advanced Time Series Topics 585 where E1ut0It212 5 0 and It21 contains information on Dxt and all past values of x and y The term d1yt21 2 bxt212 is called the error correction term and 1837 is an example of an error correc tion model In some error correction models the contemporaneous change in x Dxt is omitted Whether it is included or not depends partly on the purpose of the equation In forecasting Dxt is rarely included for reasons we will see in Section 185 An error correction model allows us to study the shortrun dynamics in the relationship between y and x For simplicity consider the model without lags of Dyt and Dxt Dyt 5 a0 1 g0Dxt 1 d1yt21 2 bxt212 1 ut 1838 where d 0 If yt21 bxt21 then y in the previous period has overshot the equilibrium because d 0 the error correction term works to push y back toward the equilibrium Similarly if yt21 bxt21 the error correction term induces a positive change in y back toward the equilibrium How do we estimate the parameters of an error correction model If we know b this is easy For example in 1838 we simply regress Dyt on Dxt and st21 where st21 5 1yt21 2 bxt212 ExamplE 187 Error Correction model for Holding Yields In Problem 6 in Chapter 11 we regressed hy6t the threemonth holding yield in percent from buy ing a sixmonth Tbill at time t 1 and selling it at time t as a threemonth Tbill on hy3t21 the threemonth holding yield from buying a threemonth Tbill at time t 1 The expectations hypoth esis implies that the slope coefficient should not be statistically different from one It turns out that there is evidence of a unit root in 5hy3t6 which calls into question the standard regression 
analysis We will assume that both holding yields are I1 processes The expectations hypothesis implies at a minimum that hy6t and hy3t21 are cointegrated with b equal to one which appears to be the case see Computer Exercise C5 Under this assumption an error correction model is Dhy6t 5 a0 1 g0Dhy3t21 1 d1hy6t21 2 hy3t222 1 ut where ut has zero mean given all hy3 and hy6 dated at time t 1 and earlier The lags on the variables in the error correction model are dictated by the expectations hypothesis Using the data in INTQRT gives Dhy6t 5 090 1 1218 Dhy3t21 2 8401hy6t21 2 hy3t222 10432 12642 12442 1839 n 5 122 R2 5 790 The error correction coefficient is negative and very significant For example if the holding yield on sixmonth Tbills is above that for threemonth Tbills by one point hy6 falls by 84 points on average in the next quarter Interestingly d 5 284 is not statistically different from 1 as is easily seen by computing the 95 confidence interval In many other examples the cointegrating parameter must be estimated Then we replace st21 with st21 5 yt21 2 b xt21 where b can be various estimators of b We have covered the standard OLS estimator as well as the leads and lags estimator This raises the issue about how sampling variation in b affects inference on the other parameters in the error correction model Fortunately as shown by Engle and Granger 1987 we can ignore the preliminary estimation of b asymptotically This property is very convenient and implies that the asymptotic efficiency of the estimators of the param eters in the error correction model is unaffected by whether we use the OLS estimator or the leads and How would you test H0 g0 5 1 d 5 21 in the holding yield error correction model Exploring FurthEr 184 Copyright 2016 Cengage Learning All Rights Reserved May not be copied scanned or duplicated in whole or in part Due to electronic rights some third party content may be suppressed from the eBook andor eChapters Editorial review has deemed that any suppressed content does not materially affect the overall learning experience Cengage Learning reserves the right to remove additional content at any time if subsequent rights restrictions require it PART 3 Advanced Topics 586 lags estimator for b Of course the choice of b will generally have an effect on the estimated error correction parameters in any particular sample but we have no systematic way of deciding which preliminary estimator of b to use The procedure of replacing b with b is called the EngleGranger twostep procedure 185 Forecasting Forecasting economic time series is very important in some branches of economics and it is an area that continues to be actively studied In this section we focus on regressionbased forecasting methods Diebold 2001 provides a comprehensive introduction to forecasting including recent developments We assume in this section that the primary focus is on forecasting future values of a time series process and not necessarily on estimating causal or structural economic models It is useful to first cover some fundamentals of forecasting that do not depend on a specific model Suppose that at time t we want to forecast the outcome of y at time t 1 1 or yt11 The time period could correspond to a year a quarter a month a week or even a day Let It denote information that we can observe at time t This information set includes yt earlier values of y and often other variables dated at time t or earlier We can combine this information in innumerable ways to forecast yt11 Is there one best way The answer is 
yes provided we specify the loss associated with forecast error Let ft denote the forecast of yt11 made at time t We call ft a onestepahead forecast The forecast error is et11 5 yt11 2 ft which we observe once the outcome on yt11 is observed The most common meas ure of loss is the same one that leads to ordinary least squares estimation of a multiple linear regres sion model the squared error e2 t11 The squared forecast error treats positive and negative prediction errors symmetrically and larger forecast errors receive relatively more weight For example errors of 12 and 22 yield the same loss and the loss is four times as great as forecast errors of 11 or 21 The squared forecast error is an example of a loss function Another popular loss function is the absolute value of the prediction error 0et110 For reasons to be seen shortly we focus now on squared error loss Given the squared error loss function we can determine how to best use the information at time t to forecast yt11 But we must recognize that at time t we do not know et11 it is a random variable because yt11 is a random variable Therefore any useful criterion for choosing ft must be based on what we know at time t It is natural to choose the forecast to minimize the expected squared forecast error given It E1e2 t110It2 5 E3 1yt11 2 ft2 20It4 1840 A basic fact from probability see Property CE6 in Appendix B is that the conditional expectation E1yt110It2 minimizes 1840 In other words if we wish to minimize the expected squared forecast error given information at time t our forecast should be the expected value of yt11 given variables we know at time t For many popular time series processes the conditional expectation is easy to obtain Suppose that 5yt t 5 0 1 p6 is a martingale difference sequence MDS and take It to be 5yt yt21 p y06 the observed past of y By definition E1yt110It2 5 0 for all t the best prediction of yt11 at time t is always zero Recall from Section 182 that an iid sequence with zero mean is a martingale differ ence sequence A martingale difference sequence is one in which the past is not useful for predicting the future Stock returns are widely thought to be well approximated as an MDS or perhaps with a positive mean The key is that E1yt110yt yt21 p2 5 E1yt112 the conditional mean is equal to the uncondi tional mean in which case past outcomes on y do not help to predict future y Copyright 2016 Cengage Learning All Rights Reserved May not be copied scanned or duplicated in whole or in part Due to electronic rights some third party content may be suppressed from the eBook andor eChapters Editorial review has deemed that any suppressed content does not materially affect the overall learning experience Cengage Learning reserves the right to remove additional content at any time if subsequent rights restrictions require it CHAPTER 18 Advanced Time Series Topics 587 A process 5yt6 is a martingale if E1yt110yt yt21 p y02 5 yt for all t 0 If 5yt6 is a martingale then 5Dyt6 is a martingale difference sequence which is where the latter name comes from The pre dicted value of y for the next period is always the value of y for this period A more complicated example is E1yt110It2 5 ayt 1 a11 2 a2yt21 1 p 1 a11 2 a2 ty0 1841 where 0 a 1 is a parameter that we must choose This method of forecasting is called exponen tial smoothing because the weights on the lagged y decline to zero exponentially The reason for writing the expectation as in 1841 is that it leads to a very simple recurrence relation Set f0 5 y0 Then for t 1 the 
forecasts can be obtained as ft 5 ayt 1 11 2 a2ft21 In other words the forecast of yt11 is a weighted average of yt and the forecast of yt made at time t 2 1 Exponential smoothing is suitable only for very specific time series and requires choosing a Regression methods which we turn to next are more flexible The previous discussion has focused on forecasting y only one period ahead The general issues that arise in forecasting yt1h at time t where h is any positive integer are similar In particular if we use expected squared forecast error as our measure of loss the best predictor is E1yt1h0It2 When deal ing with a multiplestepahead forecast we use the notation ft h to indicate the forecast of yt1h made at time t 185a Types of Regression Models Used for Forecasting There are many different regression models that we can use to forecast future values of a time series The first regression model for time series data from Chapter 10 was the static model To see how we can forecast with this model assume that we have a single explanatory variable yt 5 b0 1 b1zt 1 ut 1842 Suppose for the moment that the parameters b0 and b1 are known Write this equation at time t 1 1 as yt11 5 b0 1 b1zt11 1 ut11 Now if zt11 is known at time t so that it is an element of It and E1ut110It2 5 0 then E1yt110It2 5 b0 1 b1zt11 where It contains zt11 yt zt p y1 z1 The righthand side of this equation is the forecast of yt11 at time t This kind of forecast is usually called a conditional forecast because it is conditional on knowing the value of z at time t 1 1 Unfortunately at any time we rarely know the value of the explanatory variables in future time periods Exceptions include time trends and seasonal dummy variables which we cover explicitly below but otherwise knowledge of zt11 at time t is rare Sometimes we wish to generate conditional forecasts for several values of zt11 Another problem with 1842 as a model for forecasting is that E1ut110It2 5 0 means that 5ut6 cannot contain serial correlation something we have seen to be false in most static regression models Problem 8 asks you to derive the forecast in a simple distributed lag model with AR1 errors If zt11 is not known at time t we cannot include it in It Then we have E1yt110It2 5 b0 1 b1E1zt110It2 This means that in order to forecast yt11 we must first forecast zt11 based on the same information set This is usually called an unconditional forecast because we do not assume knowledge of zt11 at time t Unfortunately this is somewhat of a misnomer as our forecast is still conditional on the infor mation in It But the name is entrenched in the forecasting literature Copyright 2016 Cengage Learning All Rights Reserved May not be copied scanned or duplicated in whole or in part Due to electronic rights some third party content may be suppressed from the eBook andor eChapters Editorial review has deemed that any suppressed content does not materially affect the overall learning experience Cengage Learning reserves the right to remove additional content at any time if subsequent rights restrictions require it PART 3 Advanced Topics 588 For forecasting unless we are wedded to the static model in 1842 for other reasons it makes more sense to specify a model that depends only on lagged values of y and z This saves us the extra step of having to forecast a righthand side variable before forecasting y The kind of model we have in mind is yt 5 d0 1 a1yt21 1 g1zt21 1 ut 1843 E1ut0It212 5 0 where It21 contains y and z dated at time t 1 and earlier Now the forecast of yt11 at 
time t is d0 1 a1yt 1 g1zt if we know the parameters we can just plug in the values of yt and zt If we only want to use past y to predict future y then we can drop zt21 from 1843 Naturally we can add more lags of y or z and lags of other variables Especially for forecasting one step ahead such models can be very useful 185b OneStepAhead Forecasting Obtaining a forecast one period after the sample ends is relatively straightforward using models such as 1843 As usual let n be the sample size The forecast of yn11 is fn 5 d 0 1 a 1yn 1 g 1zn 1844 where we assume that the parameters have been estimated by OLS We use a hat on fn to emphasize that we have estimated the parameters in the regression model If we knew the parameters there would be no estimation error in the forecast The forecast errorwhich we will not know until time n 1 1is en11 5 yn11 2 fn 1845 If we add more lags of y or z to the forecasting equation we simply lose more observations at the beginning of the sample The forecast fn of yn11 is usually called a point forecast We can also obtain a forecast interval A forecast interval is essentially the same as a prediction interval which we studied in Section 64 There we showed how under the classical linear model assumptions to obtain an exact 95 predic tion interval A forecast interval is obtained in exactly the same way If the model does not satisfy the classical linear model assumptionsfor example if it contains lagged dependent variables as in 1844the forecast interval is still approximately valid provided ut given It21 is normally distrib uted with zero mean and constant variance This ensures that the OLS estimators are approximately normally distributed with the usual OLS variances and that un11 is independent of the OLS estima tors with mean zero and variance s2 Let se1 fn2 be the standard error of the forecast and let s be the standard error of the regression From Section 64 we can obtain fn and se1 fn2 as the intercept and its standard error from the regression of yt on 1yt21 2 yn2 and 1zt21 2 zn2 t 5 1 2 p n that is we subtract the time n value of y from each lagged y and similarly for z before doing the regression Then se1en112 5 53se1fn2 42 1 s 2612 1846 and the approximate 95 forecast interval is fn 6 196 se1en112 1847 Because se1 fn2 is roughly proportional to 1n se1 fn2 is usually small relative to the uncertainty in the error un11 as measured by s Some econometrics packages compute forecast intervals routinely but others require some simple manipulations to obtain 1847 Copyright 2016 Cengage Learning All Rights Reserved May not be copied scanned or duplicated in whole or in part Due to electronic rights some third party content may be suppressed from the eBook andor eChapters Editorial review has deemed that any suppressed content does not materially affect the overall learning experience Cengage Learning reserves the right to remove additional content at any time if subsequent rights restrictions require it CHAPTER 18 Advanced Time Series Topics 589 ExamplE 188 Forecasting the US Unemployment Rate We use the data in PHILLIPS but only for the years 1948 through 1996 to forecast the US civilian unemployment rate for 1997 We use two models The first is a simple AR1 model for unem unemt 5 1572 1 732 unemt21 15772 10972 1848 n 5 48 R2 5 544 s 5 1049 In a second model we add inflation with a lag of one year unemt 5 1304 1 647 unemt21 1 184 inft21 14902 10842 10412 1849 n 5 48 R2 5 677 s 5 883 The lagged inflation rate is very significant in 1849 1t 452 and the adjusted 
Rsquared from the second equation is much higher than that from the first Nevertheless this does not necessarily mean that the second equation will produce a better forecast for 1997 All we can say so far is that using the data up through 1996 a lag of inflation helps to explain variation in the unemployment rate To obtain the forecasts for 1997 we need to know unem and inf in 1996 These are 54 and 30 respectively Therefore the forecast of unem1997 from equation 1848 is 1572 1 73254 or about 552 The forecast from equation 1849 is 1304 1 64754 1 18430 or about 535 The actual civilian unemployment rate for 1997 was 49 so both equations overpredict the actual rate The second equation does provide a somewhat better forecast We can easily obtain a 95 forecast interval When we regress unemt on 1unemt21 2 542 and 1inft21 2 302 we obtain 535 as the interceptwhich we already computed as the forecastand se1fn2 5 137 Therefore because s 5 883 we have se1en112 5 3 11372 2 1 18832 2412 894 The 95 forecast interval from 1847 is 535 6 19618942 or about 36 71 This is a wide inter val and the realized 1997 value 49 is well within the interval As expected the standard error of un11 which is 883 is a very large fraction of se1en112 A professional forecaster must usually produce a forecast for every time period For example at time n she or he produces a forecast of yn11 Then when yn11 and zn11 become available he or she must forecast yn12 Even if the forecaster has settled on model 1843 there are two choices for forecasting yn12 The first is to use d 0 1 a 1yn11 1 g 1zn11 where the parameters are estimated using the first n observations The second possibility is to reestimate the parameters using all n 1 1 obser vations and then to use the same formula to forecast yn12 To forecast in subsequent time periods we can generally use the parameter estimates obtained from the initial n observations or we can update the regression parameters each time we obtain a new data point Although the latter approach requires more computation the extra burden is relatively minor and it can although it need not work better because the regression coefficients adjust at least somewhat to the new data points As a specific example suppose we wish to forecast the unemployment rate for 1998 using the model with a single lag of unem and inf The first possibility is to just plug the 1997 values of unem ployment and inflation into the righthand side of 1849 With unem 5 49 and inf 5 23 in 1997 we have a forecast for unem1998 of about 49 It is just a coincidence that this is the same as the 1997 unemployment rate The second possibility is to reestimate the equation by adding the 1997 observa tion and then using this new equation see Computer Exercise C6 The model in equation 1843 is one equation in what is known as a vector autoregressive VAR model We know what an autoregressive model is from Chapter 11 we model a single series 5yt6 in terms of its own past In vector autoregressive models we model several serieswhich if you Copyright 2016 Cengage Learning All Rights Reserved May not be copied scanned or duplicated in whole or in part Due to electronic rights some third party content may be suppressed from the eBook andor eChapters Editorial review has deemed that any suppressed content does not materially affect the overall learning experience Cengage Learning reserves the right to remove additional content at any time if subsequent rights restrictions require it PART 3 Advanced Topics 590 are familiar with linear algebra is where the word 
vector comes fromin terms of their own past If we have two series yt and zt a vector autoregression consists of equations that look like yt 5 d0 1 a1yt21 1 g1zt21 1 a2yt22 1 g2zt22 1 p 1850 and zt 5 h0 1 b1yt21 1 r1zt21 1 b2yt22 1 r2zt22 1 p where each equation contains an error that has zero expected value given past information on y and z In equation 1843and in the example estimated in 1849we assumed that one lag of each vari able captured all of the dynamics An F test for joint significance of unemt22 and inft22 confirms that only one lag of each is needed As Example 188 illustrates VAR models can be useful for forecasting In many cases we are interested in forecasting only one variable y in which case we only need to estimate and analyze the equation for y Nothing prevents us from adding other lagged variables say wt21 wt22 p to equa tion 1850 Such equations are efficiently estimated by OLS provided we have included enough lags of all variables and the equation satisfies the homoskedasticity assumption for time series regressions Equations such as 1850 allow us to test whether after controlling for past values of y past val ues of z help to forecast yt Generally we say that z Granger causes y if E1yt0It212 2 E1yt0Jt212 1851 where It21 contains past information on y and z and Jt21 contains only information on past y When 1851 holds past z is useful in addition to past y for predicting yt The term causes in Granger causes should be interpreted with caution The only sense in which z causes y is given in 1851 In particular it has nothing to say about contemporaneous causality between y and z so it does not allow us to determine whether zt is an exogenous or endogenous variable in an equation relating yt to zt This is also why the notion of Granger causality does not apply in pure crosssectional contexts Once we assume a linear model and decide how many lags of y should be included in E1yt0yt21 yt22 c2 we can easily test the null hypothesis that z does not Granger cause y To be more specific suppose that E1yt0yt21 yt22 c2 depends on only three lags yt 5 d0 1 a1yt21 1 a2yt22 1 a3yt23 1 ut E1ut0yt21 yt22 p2 5 0 Now under the null hypothesis that z does not Granger cause y any lags of z that we add to the equa tion should have zero population coefficients If we add zt21 then we can simply do a t test on zt21 If we add two lags of z then we can do an F test for joint significance of zt21 and zt22 in the equation yt 5 d0 1 a1yt21 1 a2yt22 1 a3yt23 1 g1zt21 1 g2zt22 1 ut If there is heteroskedasticity we can use a robust form of the test There cannot be serial correlation under H0 because the model is dynamically complete As a practical matter how do we decide on which lags of y and z to include First we start by estimating an autoregressive model for y and performing t and F tests to determine how many lags of y should appear With annual data the number of lags is typically small say one or two With quar terly or monthly data there are usually many more lags Once an autoregressive model for y has been chosen we can test for lags of z The choice of lags of z is less important because when z does not Granger cause y no set of lagged zs should be significant With annual data 1 or 2 lags are typically used with quarterly data usually 4 or 8 and with monthly data perhaps 6 12 or maybe even 24 given enough data We have already done one example of testing for Granger causality in equation 1849 The autoregressive model that best fits unemployment is an AR1 In equation 1849 we added a single lag of 
There is an extended definition of Granger causality that is often useful. Let {w_t} be a third series (or, it could represent several additional series). Then, z Granger causes y conditional on w if (18.51) holds, but now I_{t−1} contains past information on y, z, and w, while J_{t−1} contains past information on y and w. It is certainly possible that z Granger causes y, but z does not Granger cause y conditional on w. A test of the null that z does not Granger cause y conditional on w is obtained by testing for significance of lagged z in a model for y that also depends on lagged y and lagged w. For example, to test whether growth in the money supply Granger causes growth in real GDP, conditional on the change in interest rates, we would regress gGDP_t on lags of gGDP, Δint, and gM and do significance tests on the lags of gM. [See, for example, Stock and Watson (1989).]

18-5c Comparing One-Step-Ahead Forecasts

In almost any forecasting problem, there are several competing methods for forecasting. Even when we restrict attention to regression models, there are many possibilities. Which variables should be included, and with how many lags? Should we use logs, levels of variables, or first differences?

In order to decide on a forecasting method, we need a way to choose which one is most suitable. Broadly, we can distinguish between in-sample criteria and out-of-sample criteria. In a regression context, in-sample criteria include R-squared and especially adjusted R-squared. There are many other model selection statistics, but we will not cover those here [see, for example, Ramanathan (1995, Chapter 4)].

For forecasting, it is better to use out-of-sample criteria, as forecasting is essentially an out-of-sample problem. A model might provide a good fit to y in the sample used to estimate the parameters. But this need not translate to good forecasting performance. An out-of-sample comparison involves using the first part of a sample to estimate the parameters of the model and saving the latter part of the sample to gauge its forecasting capabilities. This mimics what we would have to do in practice if we did not yet know the future values of the variables.

Suppose that we have n + m observations, where we use the first n observations to estimate the parameters in our model and save the last m observations for forecasting. Let f̂_{n+h} be the one-step-ahead forecast of y_{n+h+1} for h = 0, 1, …, m − 1. The m forecast errors are ê_{n+h+1} = y_{n+h+1} − f̂_{n+h}. How should we measure how well our model forecasts y when it is out of sample? Two measures are most common. The first is the root mean squared error (RMSE):

RMSE = (m^{−1} Σ_{h=0}^{m−1} ê²_{n+h+1})^{1/2}.   (18.52)

This is essentially the sample standard deviation of the forecast errors (without any degrees of freedom adjustment). If we compute RMSE for two or more forecasting methods, then we prefer the method with the smallest out-of-sample RMSE.

A second common measure is the mean absolute error (MAE), which is the average of the absolute forecast errors:

MAE = m^{−1} Σ_{h=0}^{m−1} |ê_{n+h+1}|.   (18.53)

Again, we prefer a smaller MAE. Other possible criteria include minimizing the largest of the absolute values of the forecast errors.
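Equations (18.52) and (18.53) are straightforward to compute once the out-of-sample errors are in hand. Here is a minimal numpy sketch; the AR(1) helper below, which holds the parameters fixed at their first-n estimates, is one illustrative way of generating the errors (both function names are ours):

import numpy as np

def rmse_mae(errors):
    e = np.asarray(errors, dtype=float)
    return np.sqrt(np.mean(e**2)), np.mean(np.abs(e))   # (18.52) and (18.53)

def ar1_oos_errors(y, n):
    # one-step-ahead out-of-sample errors from an AR(1) whose parameters
    # are estimated once, using the first n observations only
    y = np.asarray(y, dtype=float)
    X = np.column_stack([np.ones(n - 1), y[:n-1]])
    (a, r), *_ = np.linalg.lstsq(X, y[1:n], rcond=None)
    f = a + r * y[n-1:-1]      # forecasts of y_{n+1}, ..., y_{n+m}
    return y[n:] - f           # errors e_{n+h+1} = y_{n+h+1} - f_{n+h}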
Example 18.9 Out-of-Sample Comparisons of Unemployment Forecasts

In Example 18.8, we found that equation (18.49) fit notably better over the years 1948 through 1996 than did equation (18.48), and, at least for forecasting unemployment in 1997, the model that included lagged inflation worked better. Now, we use the two models, still estimated using the data only through 1996, to compare one-step-ahead forecasts for 1997 through 2003. This leaves seven out-of-sample observations (n = 48 and m = 7) to use in equations (18.52) and (18.53). For the AR(1) model, RMSE = .962 and MAE = .778. For the model that adds lagged inflation (a VAR model of order one), RMSE = .673 and MAE = .628. Thus, by either measure, the model that includes inf_{t−1} produces better out-of-sample forecasts for 1997 through 2003. In this case, the in-sample and out-of-sample criteria choose the same model.

Rather than using only the first n observations to estimate the parameters of the model, we can reestimate the models each time we add a new observation and use the new model to forecast the next time period.

18-5d Multiple-Step-Ahead Forecasts

Forecasting more than one period ahead is generally more difficult than forecasting one period ahead. We can formalize this as follows. Suppose we consider forecasting y_{t+1} at time t and at an earlier time period s (so that s < t). Then Var[y_{t+1} − E(y_{t+1}|I_t)] ≤ Var[y_{t+1} − E(y_{t+1}|I_s)], where the inequality is usually strict. We will not prove this result generally, but intuitively it makes sense: the forecast error variance in predicting y_{t+1} is larger when we make that forecast based on less information.

If {y_t} follows an AR(1) model (which includes a random walk, possibly with drift), we can easily show that the error variance increases with the forecast horizon. The model is

y_t = α + ρy_{t−1} + u_t
E(u_t|I_{t−1}) = 0, I_{t−1} = {y_{t−1}, y_{t−2}, …},

and {u_t} has constant variance σ² conditional on I_{t−1}. At time t + h − 1, our forecast of y_{t+h} is α + ρy_{t+h−1}, and the forecast error is simply u_{t+h}. Therefore, the one-step-ahead forecast variance is simply σ². To find multiple-step-ahead forecasts, we have, by repeated substitution,

y_{t+h} = (1 + ρ + … + ρ^{h−1})α + ρ^h y_t + ρ^{h−1}u_{t+1} + ρ^{h−2}u_{t+2} + … + u_{t+h}.

At time t, the expected value of u_{t+j}, for all j ≥ 1, is zero. So

E(y_{t+h}|I_t) = (1 + ρ + … + ρ^{h−1})α + ρ^h y_t,   (18.54)

and the forecast error is e_{t,h} = ρ^{h−1}u_{t+1} + ρ^{h−2}u_{t+2} + … + u_{t+h}. This is a sum of uncorrelated random variables, and so the variance of the sum is the sum of the variances: Var(e_{t,h}) = σ²[ρ^{2(h−1)} + ρ^{2(h−2)} + … + ρ² + 1]. Because ρ² > 0, each term multiplying σ² is positive, so the forecast error variance increases with h. When ρ² < 1, as h gets large the forecast variance converges to σ²/(1 − ρ²), which is just the unconditional variance of y_t. In the case of a random walk (ρ = 1), f_{t,h} = αh + y_t and Var(e_{t,h}) = σ²h: the forecast variance grows without bound as the horizon h increases. This demonstrates that it is very difficult to forecast a random walk, with or without drift, far out into the future. For example, forecasts of interest rates farther into the future become dramatically less precise.
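To see how the forecast and its error variance evolve with the horizon, here is a small sketch of (18.54) and the variance formula just derived. The parameter values in the last line are hypothetical, chosen only to illustrate the stable case:

import numpy as np

def ar1_forecast_path(alpha, rho, sigma2, y_t, H):
    """E(y_{t+h}|I_t) from (18.54) and Var(e_{t,h}) for h = 1, ..., H."""
    f = np.empty(H); v = np.empty(H)
    for h in range(1, H + 1):
        f[h-1] = alpha * sum(rho**j for j in range(h)) + rho**h * y_t
        v[h-1] = sigma2 * sum(rho**(2*j) for j in range(h))
    return f, v

# stable case: the variance rises toward sigma2/(1 - rho^2);
# with rho = 1 it would instead equal sigma2*h and grow without bound
f, v = ar1_forecast_path(alpha=0.0, rho=0.9, sigma2=1.0, y_t=2.0, H=25)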
Equation (18.54) shows that using the AR(1) model for multiple-step forecasting is easy once we have estimated ρ by OLS. The forecast of y_{n+h} at time n is

f̂_{n,h} = (1 + ρ̂ + … + ρ̂^{h−1})α̂ + ρ̂^h y_n.   (18.55)

Obtaining forecast intervals is harder, unless h = 1, because obtaining the standard error of f̂_{n,h} is difficult. Nevertheless, the standard error of f̂_{n,h} is usually small compared with the standard deviation of the error term, and the latter can be estimated as σ̂[ρ̂^{2(h−1)} + ρ̂^{2(h−2)} + … + ρ̂² + 1]^{1/2}, where σ̂ is the standard error of the regression from the AR(1) estimation. We can use this to obtain an approximate confidence interval. For example, when h = 2, an approximate 95% confidence interval (for large n) is

f̂_{n,2} ± 1.96σ̂(1 + ρ̂²)^{1/2}.   (18.56)

Because we are underestimating the standard deviation of y_{n+h}, this interval is too narrow, but perhaps not by much, especially if n is large.

A less traditional, but useful, approach is to estimate a different model for each forecast horizon. For example, suppose we wish to forecast y two periods ahead. If I_t depends only on y through time t, we might assume that E(y_{t+2}|I_t) = α_0 + γ_1 y_t [which, as we saw earlier, holds if {y_t} follows an AR(1) model]. We can estimate α_0 and γ_1 by regressing y_t on an intercept and on y_{t−2}. Even though the errors in this equation contain serial correlation (errors in adjacent periods are correlated), we can obtain consistent and approximately normal estimators of α_0 and γ_1. The forecast of y_{n+2} at time n is simply f̂_{n,2} = α̂_0 + γ̂_1 y_n. Further, and very importantly, the standard error of the regression is just what we need for computing a confidence interval for the forecast. Unfortunately, to get the standard error of f̂_{n,2} using the trick for a one-step-ahead forecast requires us to obtain a serial correlation-robust standard error of the kind described in Section 12-5. This standard error goes to zero as n gets large, while the variance of the error is constant. Therefore, we can get an approximate interval by using (18.56) and by putting the SER from the regression of y_t on y_{t−2} in place of σ̂(1 + ρ̂²)^{1/2}. But we should remember that this ignores the estimation error in α̂_0 and γ̂_1.

We can also compute multiple-step-ahead forecasts with more complicated autoregressive models. For example, suppose {y_t} follows an AR(2) model and that, at time n, we wish to forecast y_{n+2}. Now, y_{n+2} = α + ρ_1 y_{n+1} + ρ_2 y_n + u_{n+2}, so E(y_{n+2}|I_n) = α + ρ_1E(y_{n+1}|I_n) + ρ_2 y_n. We can write this as f_{n,2} = α + ρ_1 f_{n,1} + ρ_2 y_n, so that the two-step-ahead forecast at time n can be obtained once we get the one-step-ahead forecast. If the parameters of the AR(2) model have been estimated by OLS, then we operationalize this as

f̂_{n,2} = α̂ + ρ̂_1 f̂_{n,1} + ρ̂_2 y_n.   (18.57)

Now, f̂_{n,1} = α̂ + ρ̂_1 y_n + ρ̂_2 y_{n−1}, which we can compute at time n. Then, we plug this into (18.57), along with y_n, to obtain f̂_{n,2}. For any h ≥ 2, obtaining any h-step-ahead forecast for an AR(2) model is easy to find in a recursive manner: f̂_{n,h} = α̂ + ρ̂_1 f̂_{n,h−1} + ρ̂_2 f̂_{n,h−2}. Similar reasoning can be used to obtain multiple-step-ahead forecasts for VAR models.
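The recursion is easy to implement. A minimal sketch, assuming the AR(2) parameters have already been estimated by OLS (the function name and arguments are ours):

def ar2_forecasts(alpha, rho1, rho2, y_n, y_nm1, H):
    """h-step-ahead AR(2) forecasts built recursively as in (18.57):
    f_{n,h} = alpha + rho1*f_{n,h-1} + rho2*f_{n,h-2},
    seeded with f_{n,0} = y_n and f_{n,-1} = y_{n-1}."""
    prev2, prev1, path = y_nm1, y_n, []
    for _ in range(H):
        f = alpha + rho1 * prev1 + rho2 * prev2
        path.append(f)
        prev2, prev1 = prev1, f
    return path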
To illustrate, suppose we have

y_t = δ_0 + α_1 y_{t−1} + γ_1 z_{t−1} + u_t   (18.58)

and

z_t = η_0 + β_1 y_{t−1} + ρ_1 z_{t−1} + v_t.

Now, if we wish to forecast y_{n+1} at time n, we simply use f̂_{n,1} = δ̂_0 + α̂_1 y_n + γ̂_1 z_n. Likewise, the forecast of z_{n+1} at time n is, say, ĝ_{n,1} = η̂_0 + β̂_1 y_n + ρ̂_1 z_n. Now, suppose we wish to obtain a two-step-ahead forecast of y at time n. From (18.58), we have

E(y_{n+2}|I_n) = δ_0 + α_1E(y_{n+1}|I_n) + γ_1E(z_{n+1}|I_n)

[because E(u_{n+2}|I_n) = 0], so we can write the forecast as

f̂_{n,2} = δ̂_0 + α̂_1 f̂_{n,1} + γ̂_1 ĝ_{n,1}.   (18.59)

This equation shows that the two-step-ahead forecast for y depends on the one-step-ahead forecasts for y and z. Generally, we can build up multiple-step-ahead forecasts of y by using the recursive formula

f̂_{n,h} = δ̂_0 + α̂_1 f̂_{n,h−1} + γ̂_1 ĝ_{n,h−1}, h ≥ 2.

Example 18.10 Two-Year-Ahead Forecast for the Unemployment Rate

To use equation (18.49) to forecast unemployment two years out (say, the 1998 rate, using the data through 1996), we need a model for inflation. The best model for inf in terms of lagged unem and inf appears to be a simple AR(1) model (unem_{t−1} is not significant when added to the regression):

inf̂_t = 1.277 + .665 inf_{t−1}
       (.558)  (.107)
n = 48, R² = .457, R̄² = .445.

If we plug the 1996 value of inf into this equation, we get the forecast of inf for 1997: inf̂_1997 = 3.27. Now, we can plug this, along with unem̂_1997 = 5.35 (which we obtained earlier), into (18.59) to forecast unem_1998:

unem̂_1998 = 1.304 + .647(5.35) + .184(3.27) ≈ 5.37.

Remember, this forecast uses information only through 1996. The one-step-ahead forecast of unem_1998, obtained by plugging the 1997 values of unem and inf into (18.49), was about 4.90. The actual unemployment rate in 1998 was 4.5, which means that, in this case, the one-step-ahead forecast does quite a bit better than the two-step-ahead forecast.

Just as with one-step-ahead forecasting, an out-of-sample root mean squared error or a mean absolute error can be used to choose among multiple-step-ahead forecasting methods.

18-5e Forecasting Trending, Seasonal, and Integrated Processes

We now turn to forecasting series that either exhibit trends, have seasonality, or have unit roots. Recall from Chapters 10 and 11 that one approach to handling trending dependent or independent variables in regression models is to include time trends, the most popular being a linear trend. Trends can be included in forecasting equations as well, although they must be used with caution.

In the simplest case, suppose that {y_t} has a linear trend but is unpredictable around that trend. Then, we can write

y_t = α + βt + u_t, E(u_t|I_{t−1}) = 0, t = 1, 2, …,   (18.60)

where, as usual, I_{t−1} contains information observed through time t − 1 (which includes at least past y). How do we forecast y_{n+h} at time n, for any h ≥ 1? This is simple because E(y_{n+h}|I_n) = α + β(n + h). The forecast error variance is simply σ² = Var(u_t) (assuming a constant variance over time). If we estimate α and β by OLS using the first n observations, then our forecast for y_{n+h} at time n is f̂_{n,h} = α̂ + β̂(n + h). In other words, we simply plug the time period corresponding to y into the estimated trend function.
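This plug-in rule is a two-line computation once α and β are estimated. A minimal numpy sketch (the function name is ours):

import numpy as np

def trend_forecast(y, h):
    # fit y_t = alpha + beta*t (t = 1..n) by OLS and plug in t = n + h
    y = np.asarray(y, dtype=float)
    n = len(y)
    t = np.arange(1, n + 1, dtype=float)
    X = np.column_stack([np.ones(n), t])
    (a, b), *_ = np.linalg.lstsq(X, y, rcond=None)
    return a + b * (n + h)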
For example, if we use the n = 131 observations in BARIUM to forecast monthly imports of Chinese barium chloride to the United States from China, we obtain α̂ = 249.56 and β̂ = 5.15. The sample period ends in December 1988, so the forecast of imports of Chinese barium chloride six months later is 249.56 + 5.15(137) = 955.11, measured as short tons. For comparison, the December 1988 value is 1,087.81, so it is greater than the forecasted value six months later. The series and its estimated trend line are shown in Figure 18.2.

[Figure 18.2: U.S. imports of Chinese barium chloride (in short tons) and its estimated linear trend line, 249.56 + 5.15t.]

As we discussed in Chapter 10, most economic time series are better characterized as having, at least approximately, a constant growth rate, which suggests that log(y_t) follows a linear time trend. Suppose we use n observations to obtain the fitted equation

l̂og(y_t) = α̂ + β̂t, t = 1, 2, …, n.   (18.61)

Then, to forecast log(y) at any future time period n + h, we just plug n + h into the trend equation, as before. But this does not allow us to forecast y, which is usually what we want. It is tempting to simply exponentiate α̂ + β̂(n + h) to obtain the forecast for y_{n+h}, but this is not quite right, for the same reasons we gave in Section 6-4. We must properly account for the error implicit in (18.61). The simplest way to do this is to use the n observations to regress y_t on the exponentiated fitted values, exp(α̂ + β̂t), without an intercept. Let γ̂ be the slope coefficient on the exponentiated fitted values. Then, the forecast of y in period n + h is simply

f̂_{n,h} = γ̂exp[α̂ + β̂(n + h)].   (18.62)

As an example, if we use the first 687 weeks of data on the New York Stock Exchange index in NYSE, we obtain α̂ = 3.782 and β̂ = .0019 by regressing log(price_t) on a linear time trend; this shows that the index grows about .2% per week, on average. When we regress price on the exponentiated fitted values, we obtain γ̂ = 1.018. Now, we forecast price four weeks out, which is the last week in the sample, using (18.62): 1.018·exp[3.782 + .0019(691)] ≈ 166.12. The actual value turned out to be 164.25, so we have somewhat overpredicted. But this result is much better than if we estimate a linear time trend for the first 687 weeks: the forecasted value for week 691 is 152.23, which is a substantial underprediction.

Exploring Further 18.5: Suppose you model {y_t: t = 1, 2, …, 46} as a linear time trend, where data are annual, starting in 1950 and ending in 1995. Define the variable year_t as ranging from 50 (when t = 1) to 95 (when t = 46). If you estimate the equation ŷ_t = γ̂ + δ̂year_t, how do γ̂ and δ̂ compare with α̂ and β̂ in ŷ_t = α̂ + β̂t? How will forecasts from the two equations compare?
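The correction in (18.62) amounts to one extra regression through the origin. A minimal numpy sketch of the whole procedure (a sketch under the assumptions of the text; the helper name is ours):

import numpy as np

def loglinear_trend_forecast(y, h):
    """Forecast y_{n+h} from a log-linear trend with the correction of (18.62):
    fit log(y_t) = a + b*t, then regress y_t on exp(fitted value) through the
    origin to get gamma, and return gamma * exp(a + b*(n + h))."""
    y = np.asarray(y, dtype=float)
    n = len(y)
    t = np.arange(1, n + 1, dtype=float)
    X = np.column_stack([np.ones(n), t])
    (a, b), *_ = np.linalg.lstsq(X, np.log(y), rcond=None)
    m = np.exp(a + b * t)          # exponentiated fitted values
    gamma = (m @ y) / (m @ m)      # slope from regression through the origin
    return gamma * np.exp(a + b * (n + h))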
Although trend models can be useful for prediction, they must be used with caution, especially for forecasting far into the future integrated series that have drift. The potential problem can be seen by considering a random walk with drift. At time t + h, we can write y_{t+h} as

y_{t+h} = βh + y_t + u_{t+1} + … + u_{t+h},

where β is the drift term (usually β > 0), and each u_{t+j} has zero mean given I_t and constant variance σ². As we saw earlier, the forecast of y_{t+h} at time t is E(y_{t+h}|I_t) = βh + y_t, and the forecast error variance is σ²h. What happens if we use a linear trend model? Let y_0 be the initial value of the process at time zero, which we take as nonrandom. Then, we can also write

y_{t+h} = y_0 + β(t + h) + u_1 + u_2 + … + u_{t+h} = y_0 + β(t + h) + v_{t+h}.

This looks like a linear trend model with the intercept α = y_0. But the error, v_{t+h}, while having mean zero, has variance σ²(t + h). Therefore, if we use the linear trend y_0 + β(t + h) to forecast y_{t+h} at time t, the forecast error variance is σ²(t + h), compared with σ²h when we use βh + y_t. The ratio of the forecast variances is (t + h)/h, which can be big for large t. The bottom line is that we should not use a linear trend to forecast a random walk with drift. (Computer Exercise C8 asks you to compare forecasts from a cubic trend line and those from the simple random walk model for the general fertility rate in the United States.)

Deterministic trends can also produce poor forecasts if the trend parameters are estimated using old data and the process has a subsequent shift in the trend line. Sometimes, exogenous shocks (such as the oil crises of the 1970s) can change the trajectory of trending variables. If an old trend line is used to forecast far into the future, the forecasts can be way off. This problem can be mitigated by using the most recent data available to obtain the trend line parameters.

Nothing prevents us from combining trends with other models for forecasting. For example, we can add a linear trend to an AR(1) model, which can work well for forecasting series with linear trends but which are also stable AR processes around the trend.

It is also straightforward to forecast processes with deterministic seasonality (monthly or quarterly series). For example, the file BARIUM contains the monthly production of gasoline in the United States from 1978 through 1988. This series has no obvious trend, but it does have a strong seasonal pattern. (Gasoline production is higher in the summer months and in December.) In the simplest model, we would regress gas (measured in gallons) on 11 month dummies, say, for February through December. Then, the forecast for any future month is simply the intercept plus the coefficient on the appropriate month dummy. (For January, the forecast is just the intercept in the regression.) We can also add lags of variables and time trends to allow for general series with seasonality.

Forecasting processes with unit roots also deserves special attention. Earlier, we obtained the expected value of a random walk conditional on information through time n. To forecast a random walk, with possible drift α, h periods into the future at time n, we use f̂_{n,h} = α̂h + y_n, where α̂ is the sample average of the Δy_t up through t = n. (If there is no drift, we set α̂ = 0.) This approach imposes the unit root. An alternative would be to estimate an AR(1) model for {y_t} and to use the forecast formula (18.55). This approach does not impose a unit root, but, if one is present, ρ̂ converges in probability to one as n gets large. Nevertheless, ρ̂ can be substantially different than one, especially if the sample size is not very large. The matter of which approach produces better out-of-sample forecasts is an empirical issue.
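The unit-root-imposing forecast is simple enough to state in a couple of lines of code. A sketch, with the drift estimated as the sample mean of the first differences as described above (the function name is ours):

import numpy as np

def rw_forecast(y, h, drift=True):
    # f_{n,h} = alpha_hat*h + y_n, with alpha_hat the mean of the differences
    y = np.asarray(y, dtype=float)
    alpha = np.diff(y).mean() if drift else 0.0
    return alpha * h + y[-1]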
If, in the AR(1) model, ρ is less than one, even slightly, the AR(1) model will tend to produce better long-run forecasts.

Generally, there are two approaches to producing forecasts for I(1) processes. The first is to impose a unit root. For a one-step-ahead forecast, we obtain a model to forecast the change in y, Δy_{t+1}, given information through time t. Then, because y_{t+1} = Δy_{t+1} + y_t, E(y_{t+1}|I_t) = E(Δy_{t+1}|I_t) + y_t. Therefore, our forecast of y_{n+1} at time n is just

f̂_n = ĝ_n + y_n,

where ĝ_n is the forecast of Δy_{n+1} at time n. Typically, an AR model (which is necessarily stable) is used for Δy_t, or a vector autoregression.

This can be extended to multiple-step-ahead forecasts by writing y_{n+h} as

y_{n+h} = (y_{n+h} − y_{n+h−1}) + (y_{n+h−1} − y_{n+h−2}) + … + (y_{n+1} − y_n) + y_n,

or

y_{n+h} = Δy_{n+h} + Δy_{n+h−1} + … + Δy_{n+1} + y_n.

Therefore, the forecast of y_{n+h} at time n is

f̂_{n,h} = ĝ_{n,h} + ĝ_{n,h−1} + … + ĝ_{n,1} + y_n,   (18.63)

where ĝ_{n,j} is the forecast of Δy_{n+j} at time n. For example, we might model Δy_t as a stable AR(1), obtain the multiple-step-ahead forecasts from (18.55) (but with α̂ and ρ̂ obtained from Δy_t on Δy_{t−1}, and y_n replaced with Δy_n), and then plug these into (18.63).

The second approach to forecasting I(1) variables is to use a general AR or VAR model for {y_t}. This does not impose the unit root. For example, if we use an AR(2) model,

y_t = α + ρ_1 y_{t−1} + ρ_2 y_{t−2} + u_t,   (18.64)

then a unit root is present when ρ_1 + ρ_2 = 1. If we plug in ρ_1 = 1 − ρ_2 and rearrange, we obtain Δy_t = α − ρ_2Δy_{t−1} + u_t, which is a stable AR(1) model in the difference that takes us back to the first approach described earlier. Nothing prevents us from estimating (18.64) directly by OLS. One nice thing about this regression is that we can use the usual t statistic on ρ̂_2 to determine if y_{t−2} is significant. (This assumes that the homoskedasticity assumption holds; if not, we can use the heteroskedasticity-robust form.) We will not show this formally, but, intuitively, it follows by rewriting the equation as y_t = α + γy_{t−1} − ρ_2Δy_{t−1} + u_t, where γ = ρ_1 + ρ_2. Even if γ = 1, ρ_2 is minus the coefficient on a stationary, weakly dependent process, {Δy_{t−1}}. Because the regression results will be identical to (18.64), we can use (18.64) directly.

As an example, let us estimate an AR(2) model for the general fertility rate in FERTIL3, using the observations through 1979. (In Computer Exercise C8, you are asked to use this model for forecasting, which is why we save some observations at the end of the sample.)

gfr̂_t = 3.22 + 1.272 gfr_{t−1} − .311 gfr_{t−2}
       (2.92)  (.120)          (.121)   (18.65)
n = 65, R² = .949, R̄² = .947.

The t statistic on the second lag is about −2.57, which is statistically different from zero at about the 1% level. (The first lag also has a very significant t statistic, which has an approximate t distribution by the same reasoning used for ρ̂_2.) The R-squared, adjusted or not, is not especially informative as a goodness-of-fit measure because gfr apparently contains a unit root, and it makes little sense to ask how much of the variance in gfr we are explaining. The coefficients on the two lags in (18.65) add up to .961, which is close to, and not statistically different from, one, as can be verified by applying the augmented Dickey–Fuller test to the equation Δgfr_t = α + θgfr_{t−1} + δ_1Δgfr_{t−1} + u_t.
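A compact way to operationalize (18.63) with a stable AR(1) for the changes, as the text suggests (a sketch; all names are ours):

import numpy as np

def i1_forecasts(y, H):
    """Forecasts of an I(1) series via (18.63): model the first difference as
    a stable AR(1), forecast the changes recursively, and cumulate them onto
    the last observed level."""
    y = np.asarray(y, dtype=float)
    dy = np.diff(y)
    X = np.column_stack([np.ones(len(dy) - 1), dy[:-1]])
    (a, r), *_ = np.linalg.lstsq(X, dy[1:], rcond=None)
    g, level, path = dy[-1], y[-1], []
    for _ in range(H):
        g = a + r * g        # g_{n,h}: forecast of the next change
        level += g           # f_{n,h} = g_{n,h} + ... + g_{n,1} + y_n
        path.append(level)
    return path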
Even though we have not imposed the unit root restriction, we can still use (18.65) for forecasting, as we discussed earlier.

Before ending this section, we point out one potential improvement in forecasting in the context of vector autoregressive models with I(1) variables. Suppose {y_t} and {z_t} are each I(1) processes. One approach for obtaining forecasts of y is to estimate a bivariate autoregression in the variables Δy_t and Δz_t and then to use (18.63) to generate one- or multiple-step-ahead forecasts; this is essentially the first approach we described earlier. However, if y_t and z_t are cointegrated, we have more stationary, stable variables in the information set that can be used in forecasting Δy: namely, lags of y_t − βz_t, where β is the cointegrating parameter. A simple error correction model is

Δy_t = α_0 + α_1Δy_{t−1} + γ_1Δz_{t−1} + δ_1(y_{t−1} − βz_{t−1}) + e_t,   (18.66)
E(e_t|I_{t−1}) = 0.

To forecast y_{n+1}, we use observations up through n to estimate the cointegrating parameter, β, and then estimate the parameters of the error correction model by OLS, as described in Section 18-4. Forecasting Δy_{n+1} is easy: we just plug Δy_n, Δz_n, and y_n − β̂z_n into the estimated equation. Having obtained the forecast of Δy_{n+1}, we add it to y_n.

By rearranging the error correction model, we can write

y_t = α_0 + ρ_1 y_{t−1} + ρ_2 y_{t−2} + δ_1 z_{t−1} + δ_2 z_{t−2} + u_t,   (18.67)

where ρ_1 = 1 + α_1 + δ_1, ρ_2 = −α_1, and so on, which is the first equation in a VAR model for y_t and z_t. Notice that this depends on five parameters, just as many as in the error correction model. The point is that, for the purposes of forecasting, the VAR model in the levels and the error correction model are essentially the same. This is not the case in more general error correction models. For example, suppose that α_1 = γ_1 = 0 in (18.66), but we have a second error correction term, δ_2(y_{t−2} − βz_{t−2}). Then, the error correction model involves only four parameters, whereas (18.67), which has the same order of lags for y and z, contains five parameters. Thus, error correction models can economize on parameters; that is, they are generally more parsimonious than VARs in levels.

If y_t and z_t are I(1) but not cointegrated, the appropriate model is (18.66) without the error correction term. This can be used to forecast Δy_{n+1}, and we can add this to y_n to forecast y_{n+1}.
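The two-step procedure for the error correction forecast can be sketched as follows: a static first-stage regression for β, then OLS on (18.66). This is an illustration under the text's assumptions, not a full treatment (for example, it ignores the leads-and-lags refinement of the first stage):

import numpy as np

def ecm_forecast(y, z):
    """One-step-ahead forecast of y from the error correction model (18.66),
    via the two-step procedure: a static regression of y on z gives beta-hat,
    then OLS of dy_t on (1, dy_{t-1}, dz_{t-1}, y_{t-1} - beta*z_{t-1})."""
    y = np.asarray(y, dtype=float); z = np.asarray(z, dtype=float)
    Xs = np.column_stack([np.ones(len(y)), z])
    (_, beta), *_ = np.linalg.lstsq(Xs, y, rcond=None)
    ec = y - beta * z                  # estimated error correction term
    dy, dz = np.diff(y), np.diff(z)
    X = np.column_stack([np.ones(len(dy) - 1), dy[:-1], dz[:-1], ec[1:-1]])
    b, *_ = np.linalg.lstsq(X, dy[1:], rcond=None)
    dy_hat = b @ np.array([1.0, dy[-1], dz[-1], ec[-1]])
    return y[-1] + dy_hat              # add the forecasted change to y_n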
Summary

The time series topics covered in this chapter are used routinely in empirical macroeconomics, empirical finance, and a variety of other applied fields. We began by showing how infinite distributed lag models can be interpreted and estimated. These can provide flexible lag distributions with fewer parameters than a similar finite distributed lag model. The geometric distributed lag and, more generally, rational distributed lag models are the most popular. They can be estimated using standard econometric procedures on simple dynamic equations.

Testing for a unit root has become very common in time series econometrics. If a series has a unit root, then, in many cases, the usual large sample normal approximations are no longer valid. In addition, a unit root process has the property that an innovation has a long-lasting effect, which is of interest in its own right. While there are many tests for unit roots, the Dickey–Fuller t test (and its extension, the augmented Dickey–Fuller test) is probably the most popular and easiest to implement. We can allow for a linear trend when testing for unit roots by adding a trend to the Dickey–Fuller regression.

When an I(1) series, y_t, is regressed on another I(1) series, x_t, there is serious concern about spurious regression, even if the series do not contain obvious trends. This has been studied thoroughly in the case of a random walk: even if the two random walks are independent, the usual t test for significance of the slope coefficient, based on the usual critical values, will reject much more than the nominal size of the test. In addition, the R² tends to a random variable, rather than to zero, as would be the case if we regress the difference in y_t on the difference in x_t.

In one important case, a regression involving I(1) variables is not spurious, and that is when the series are cointegrated. This means that a linear function of the two I(1) variables is I(0). If y_t and x_t are I(1) but y_t − x_t is I(0), y_t and x_t cannot drift arbitrarily far apart. There are simple tests of the null of no cointegration against the alternative of cointegration, one of which is based on applying a Dickey–Fuller unit root test to the residuals from a static regression. There are also simple estimators of the cointegrating parameter that yield t statistics with approximate standard normal distributions (and asymptotically valid confidence intervals). We covered the leads and lags estimator in Section 18-4.

Cointegration between y_t and x_t implies that error correction terms may appear in a model relating Δy_t to Δx_t; the error correction terms are lags in y_t − βx_t, where β is the cointegrating parameter. A simple two-step estimation procedure is available for estimating error correction models. First, β is estimated using a static regression (or the leads and lags regression). Then, OLS is used to estimate a simple dynamic model in first differences that includes the error correction terms.

Section 18-5 contained an introduction to forecasting, with emphasis on regression-based forecasting methods. Static models, or, more generally, models that contain explanatory variables dated contemporaneously with the dependent variable, are limited because then the explanatory variables need to be forecasted. If we plug in hypothesized values of unknown future explanatory variables, we obtain a conditional forecast. Unconditional forecasts are similar to simply modeling y_t as a function of past information we have observed at the time the forecast is needed. Dynamic regression models, including autoregressions and vector autoregressions, are used routinely. In addition to obtaining one-step-ahead point forecasts, we also discussed the construction of forecast intervals, which are very similar to prediction intervals.

Various criteria are used for choosing among forecasting methods. The most common performance measures are the root mean squared error and the mean absolute error. Both estimate the size of the average forecast error.
It is most informative to compute these measures using out-of-sample forecasts. Multiple-step-ahead forecasts present new challenges and are subject to large forecast error variances. Nevertheless, for models such as autoregressions and vector autoregressions, multiple-step-ahead forecasts can be computed, and approximate forecast intervals can be obtained.

Forecasting trending and I(1) series requires special care. Processes with deterministic trends can be forecasted by including time trends in regression models, possibly with lags of variables. A potential drawback is that deterministic trends can provide poor forecasts for long-horizon forecasts: once it is estimated, a linear trend continues to increase or decrease. The typical approach to forecasting an I(1) process is to forecast the difference in the process and to add the level of the variable to that forecasted difference. Alternatively, vector autoregressive models can be used in the levels of the series. If the series are cointegrated, error correction models can be used instead.

Key Terms

Augmented Dickey–Fuller Test; Cointegration; Conditional Forecast; Dickey–Fuller Distribution; Dickey–Fuller (DF) Test; Engle–Granger Test; Engle–Granger Two-Step Procedure; Error Correction Model; Exponential Smoothing; Forecast Error; Forecast Interval; Geometric (or Koyck) Distributed Lag; Granger Causality; Infinite Distributed Lag (IDL) Model; Information Set; In-Sample Criteria; Leads and Lags Estimator; Loss Function; Martingale; Martingale Difference Sequence; Mean Absolute Error (MAE); Multiple-Step-Ahead Forecast; One-Step-Ahead Forecast; Out-of-Sample Criteria; Point Forecast; Rational Distributed Lag (RDL) Model; Root Mean Squared Error (RMSE); Spurious Regression Problem; Unconditional Forecast; Unit Roots; Vector Autoregressive (VAR) Model

Problems

1. Consider equation (18.15) with k = 2. Using the IV approach to estimating the γ_h and ρ, what would you use as instruments for y_{t−1}?

2. An interesting economic model that leads to an econometric model with a lagged dependent variable relates y_t to the expected value of x_t, say x*_t, where the expectation is based on all observed information at time t − 1:

y_t = α_0 + α_1 x*_t + u_t.   (18.68)

A natural assumption on {u_t} is that E(u_t|I_{t−1}) = 0, where I_{t−1} denotes all information on y and x observed at time t − 1; this means that E(y_t|I_{t−1}) = α_0 + α_1 x*_t. To complete this model, we need an assumption about how the expectation x*_t is formed. We saw a simple example of adaptive expectations in Section 11-2, where x*_t = x_{t−1}. A more complicated adaptive expectations scheme is

x*_t − x*_{t−1} = λ(x_{t−1} − x*_{t−1}),   (18.69)

where 0 < λ < 1. This equation implies that the change in expectations reacts to whether last period's realized value was above or below its expectation. The assumption 0 < λ < 1 implies that the change in expectations is a fraction of last period's error.

(i) Show that the two equations imply that

y_t = λα_0 + (1 − λ)y_{t−1} + λα_1 x_{t−1} + u_t − (1 − λ)u_{t−1}.

[Hint: Lag equation (18.68) one period, multiply it by (1 − λ), and subtract this from (18.68). Then, use (18.69).]
(ii) Under E(u_t|I_{t−1}) = 0, {u_t} is serially uncorrelated. What does this imply about the new errors, v_t = u_t − (1 − λ)u_{t−1}?

(iii) If we write the equation from part (i) as

y_t = β_0 + β_1 y_{t−1} + β_2 x_{t−1} + v_t,

how would you consistently estimate the β_j?

(iv) Given consistent estimators of the β_j, how would you consistently estimate λ and α_1?

3. Suppose that {y_t} and {z_t} are I(1) series, but y_t − βz_t is I(0) for some β ≠ 0. Show that, for any δ ≠ β, y_t − δz_t must be I(1).

4. Consider the error correction model in equation (18.37). Show that, if you add another lag of the error correction term, y_{t−2} − βx_{t−2}, the equation suffers from perfect collinearity. [Hint: Show that y_{t−2} − βx_{t−2} is a perfect linear function of y_{t−1} − βx_{t−1}, Δx_{t−1}, and Δy_{t−1}.]

5. Suppose the process {(x_t, y_t): t = 0, 1, 2, …} satisfies the equations

y_t = βx_t + u_t

and

Δx_t = γΔx_{t−1} + v_t,

where E(u_t|I_{t−1}) = E(v_t|I_{t−1}) = 0, I_{t−1} contains information on x and y dated at time t − 1 and earlier, β ≠ 0, and |γ| < 1 [so that x_t, and therefore y_t, is I(1)]. Show that these two equations imply an error correction model of the form

Δy_t = γ_1Δx_{t−1} + δ(y_{t−1} − βx_{t−1}) + e_t,

where γ_1 = βγ, δ = −1, and e_t = u_t + βv_t. [Hint: First, subtract y_{t−1} from both sides of the first equation. Then, add and subtract βx_{t−1} from the right-hand side and rearrange. Finally, use the second equation to get the error correction model that contains Δx_{t−1}.]

6. Using the monthly data in VOLAT, the following model was estimated:

pcîp_t = 1.54 + .344 pcip_{t−1} + .074 pcip_{t−2} + .073 pcip_{t−3} + .031 pcsp_{t−1}
        (.56)  (.042)          (.045)          (.042)          (.013)
n = 554, R² = .174, R̄² = .168,

where pcip is the percentage change in monthly industrial production, at an annualized rate, and pcsp is the percentage change in the Standard & Poor's 500 Index, also at an annualized rate.

(i) If the past three months of pcip are zero and pcsp_{t−1} = 0, what is the predicted growth in industrial production for this month? Is it statistically different from zero?

(ii) If the past three months of pcip are zero but pcsp_{t−1} = 10, what is the predicted growth in industrial production?

(iii) What do you conclude about the effects of the stock market on real economic activity?

7. Let gM_t be the annual growth in the money supply and let unem_t be the unemployment rate. Assuming that unem_t follows a stable AR(1) process, explain in detail how you would test whether gM Granger causes unem.

8. Suppose that y_t follows the model

y_t = α + δ_1 z_{t−1} + u_t
u_t = ρu_{t−1} + e_t
E(e_t|I_{t−1}) = 0,

where I_{t−1} contains y and z dated at t − 1 and earlier.

(i) Show that E(y_{t+1}|I_t) = (1 − ρ)α + ρy_t + δ_1 z_t − ρδ_1 z_{t−1}. [Hint: Write u_{t−1} = y_{t−1} − α − δ_1 z_{t−2} and plug this into the second equation; then, plug the result into the first equation and take the conditional expectation.]

(ii) Suppose that you use n observations to estimate α, δ_1, and ρ. Write the equation for forecasting y_{n+1}.

(iii) Explain why the model with one lag of z and AR(1) serial correlation is a special case of the model

y_t = α_0 + ρy_{t−1} + γ_1 z_{t−1} + γ_2 z_{t−2} + e_t.

(iv) What does part (iii) suggest about using models with AR(1) serial correlation for forecasting?

9. Let {y_t} be an I(1) sequence.
Suppose that ĝ_n is the one-step-ahead forecast of Δy_{n+1} and let f̂_n = ĝ_n + y_n be the one-step-ahead forecast of y_{n+1}. Explain why the forecast errors for forecasting Δy_{n+1} and y_{n+1} are identical.

Computer Exercises

C1 Use the data in WAGEPRC for this exercise. Problem 5 in Chapter 11 gave estimates of a finite distributed lag model of gprice on gwage, where 12 lags of gwage are used.

(i) Estimate a simple geometric DL model of gprice on gwage. In particular, estimate equation (18.11) by OLS. What are the estimated impact propensity and LRP? Sketch the estimated lag distribution.

(ii) Compare the estimated IP and LRP to those obtained in Problem 5 in Chapter 11. How do the estimated lag distributions compare?

(iii) Now, estimate the rational distributed lag model from (18.16). Sketch the lag distribution and compare the estimated IP and LRP to those obtained in part (ii).

C2 Use the data in HSEINV for this exercise.

(i) Test for a unit root in log(invpc), including a linear time trend and two lags of Δlog(invpc_t). Use a 5% significance level.

(ii) Use the approach from part (i) to test for a unit root in log(price).

(iii) Given the outcomes in parts (i) and (ii), does it make sense to test for cointegration between log(invpc) and log(price)?

C3 Use the data in VOLAT for this exercise.

(i) Estimate an AR(3) model for pcip. Now, add a fourth lag and verify that it is very insignificant.

(ii) To the AR(3) model from part (i), add three lags of pcsp to test whether pcsp Granger causes pcip. Carefully state your conclusion.

(iii) To the model in part (ii), add three lags of the change in i3, the three-month T-bill rate. Does pcsp Granger cause pcip conditional on past Δi3?

C4 In testing for cointegration between gfr and pe in Example 18.5, add t² to equation (18.32) to obtain the OLS residuals. Include one lag in the augmented DF test. The 5% critical value for the test is −4.15.

C5 Use INTQRT for this exercise.

(i) In Example 18.7, we estimated an error correction model for the holding yield on six-month T-bills, where one lag of the holding yield on three-month T-bills is the explanatory variable. We assumed that the cointegration parameter was one in the equation hy6_t = α + βhy3_{t−1} + u_t. Now, add the lead change, Δhy3_t, the contemporaneous change, Δhy3_{t−1}, and the lagged change, Δhy3_{t−2}, of hy3_{t−1}. That is, estimate the equation

hy6_t = α + βhy3_{t−1} + φ_0Δhy3_t + φ_1Δhy3_{t−1} + ρ_1Δhy3_{t−2} + e_t

and report the results in equation form. Test H_0: β = 1 against a two-sided alternative. Assume that the lead and lag are sufficient so that {hy3_{t−1}} is strictly exogenous in this equation, and do not worry about serial correlation.

(ii) To the error correction model in (18.39), add Δhy3_{t−2} and (hy6_{t−2} − hy3_{t−3}). Are these terms jointly significant? What do you conclude about the appropriate error correction model?

C6 Use the data in PHILLIPS to answer these questions.

(i) Estimate the models in (18.48) and (18.49) using the data through 1997. Do the parameter estimates change much compared with (18.48) and (18.49)?

(ii) Use the new equations to forecast unem_1998; round to two places after the decimal. Which equation produces a better forecast?

(iii) As we discussed in the text, the forecast for unem_1998 using (18.49) is 4.90.
Compare this with the forecast obtained using the data through 1997. Does using the extra year of data to obtain the parameter estimates produce a better forecast?

(iv) Use the model estimated in (18.48) to obtain a two-step-ahead forecast of unem. That is, forecast unem_1998 using equation (18.55) with α̂ = 1.572, ρ̂ = .732, and h = 2. Is this better or worse than the one-step-ahead forecast obtained by plugging unem_1997 = 4.9 into (18.48)?

C7 Use the data in BARIUM for this exercise.

(i) Estimate the linear trend model chnimp_t = α + βt + u_t, using the first 119 observations (this excludes the last 12 months of observations for 1988). What is the standard error of the regression?

(ii) Now, estimate an AR(1) model for chnimp, again using all data but the last 12 months. Compare the standard error of the regression with that from part (i). Which model provides a better in-sample fit?

(iii) Use the models from parts (i) and (ii) to compute the one-step-ahead forecast errors for the 12 months in 1988. (You should obtain 12 forecast errors for each method.) Compute and compare the RMSEs and the MAEs for the two methods. Which forecasting method works better out-of-sample for one-step-ahead forecasts?

(iv) Add monthly dummy variables to the regression from part (i). Are these jointly significant? (Do not worry about the slight serial correlation in the errors from this regression when doing the joint test.)

C8 Use the data in FERTIL3 for this exercise.

(i) Graph gfr against time. Does it contain a clear upward or downward trend over the entire sample period?

(ii) Using the data through 1979, estimate a cubic time trend model for gfr (that is, regress gfr on t, t², and t³, along with an intercept). Comment on the R-squared of the regression.

(iii) Using the model in part (ii), compute the mean absolute error of the one-step-ahead forecast errors for the years 1980 through 1984.

(iv) Using the data through 1979, regress Δgfr_t on a constant only. Is the constant statistically different from zero? Does it make sense to assume that any drift term is zero, if we assume that gfr_t follows a random walk?

(v) Now, forecast gfr for 1980 through 1984, using a random walk model: the forecast of gfr_{n+1} is simply gfr_n. Find the MAE. How does it compare with the MAE from part (iii)? Which method of forecasting do you prefer?

(vi) Now, estimate an AR(2) model for gfr, again using the data only through 1979. Is the second lag significant?

(vii) Obtain the MAE for 1980 through 1984, using the AR(2) model. Does this more general model work better out-of-sample than the random walk model?

C9 Use CONSUMP for this exercise.

(i) Let y_t be real per capita disposable income. Use the data through 1989 to estimate the model

y_t = α + βt + ρy_{t−1} + u_t

and report the results in the usual form.

(ii) Use the estimated equation from part (i) to forecast y in 1990. What is the forecast error?

(iii) Compute the mean absolute error of the one-step-ahead forecasts for the 1990s, using the parameters estimated in part (i).

(iv) Now, compute the MAE over the same period, but drop y_{t−1} from the equation. Is it better to include y_{t−1} in the model or not?

C10 Use the data in INTQRT for this exercise.
(i) Using the data from all but the last four years (16 quarters), estimate an AR(1) model for Δr6_t. (We use the difference because it appears that r6_t has a unit root.) Find the RMSE of the one-step-ahead forecasts for Δr6, using the last 16 quarters.

(ii) Now, add the error correction term spr_{t−1} = r6_{t−1} − r3_{t−1} to the equation from part (i). (This assumes that the cointegrating parameter is one.) Compute the RMSE for the last 16 quarters. Does the error correction term help with out-of-sample forecasting in this case?

(iii) Now, estimate the cointegrating parameter, rather than setting it to one. Use the last 16 quarters again to produce the out-of-sample RMSE. How does this compare with the forecasts from parts (i) and (ii)?

(iv) Would your conclusions change if you wanted to predict r6 rather than Δr6? Explain.

C11 Use the data in VOLAT for this exercise.

(i) Confirm that lsp500 = log(sp500) and lip = log(ip) appear to contain unit roots. Use Dickey–Fuller tests with four lagged changes, and do the tests with and without a linear time trend.

(ii) Run a simple regression of lsp500 on lip. Comment on the sizes of the t statistic and R-squared.

(iii) Use the residuals from part (ii) to test whether lsp500 and lip are cointegrated. Use the standard Dickey–Fuller test and the ADF test with two lags. What do you conclude?

(iv) Add a linear time trend to the regression from part (ii) and now test for cointegration using the same tests from part (iii).

(v) Does it appear that stock prices and real economic activity have a long-run equilibrium relationship?

C12 This exercise also uses the data from VOLAT. Computer Exercise C11 studies the long-run relationship between stock prices and industrial production. Here, you will study the question of Granger causality using the percentage changes.

(i) Estimate an AR(3) model for pcip_t, the percentage change in industrial production (reported at an annualized rate). Show that the second and third lags are jointly significant at the 2.5% level.

(ii) Add one lag of pcsp_t to the equation estimated in part (i). Is the lag statistically significant? What does this tell you about Granger causality between the growth in industrial production and the growth in stock prices?

(iii) Redo part (ii), but obtain a heteroskedasticity-robust t statistic. Does the robust test change your conclusions from part (ii)?

C13 Use the data in TRAFFIC2 for this exercise. These monthly data on traffic accidents in California over the years 1981 to 1989 were used in Computer Exercise C11 in Chapter 10.

(i) Using the standard Dickey–Fuller regression, test whether ltotacc_t has a unit root. Can you reject a unit root at the 2.5% level?

(ii) Now, add two lagged changes to the test from part (i) and compute the augmented Dickey–Fuller test. What do you conclude?

(iii) Add a linear time trend to the ADF regression from part (ii). Now what happens?

(iv) Given the findings from parts (i) through (iii), what would you say is the best characterization of ltotacc_t: an I(1) process or an I(0) process about a linear time trend?

(v) Test the percentage of fatalities, prcfat_t, for a unit root, using two lags in an ADF regression. In this case, does it matter whether you include a linear time trend?
C14 Use the data in MINWAGE for sector 232 to answer the following questions.

(i) Confirm that lwage232_t and lemp232_t are best characterized as I(1) processes. Use the augmented DF test with one lag of gwage232 and gemp232, respectively, and a linear time trend. Is there any doubt that these series should be assumed to have unit roots?

(ii) Regress lemp232_t on lwage232_t and test for cointegration, both with and without a time trend, allowing for two lags in the augmented Engle–Granger test. What do you conclude?

(iii) Now, regress lemp232_t on the log of the real wage rate, lrwage232_t = lwage232_t − lcpi_t, and a time trend. Do you find cointegration? Are they "closer" to being cointegrated when you use real wages rather than nominal wages?

(iv) What are some factors that might be missing from the cointegrating regression in part (iii)?

C15 This question asks you to study the so-called Beveridge Curve from the perspective of cointegration analysis. The U.S. monthly data from December 2000 through February 2012 are in BEVERIDGE.

(i) Test for a unit root in urate using the usual Dickey–Fuller test (with a constant) and the augmented DF with two lags of curate. What do you conclude? Are the lags of curate in the augmented DF test statistically significant? Does it matter to the outcome of the unit root test?

(ii) Repeat part (i) but with the vacancy rate, vrate.

(iii) Assuming that urate and vrate are both I(1), the Beveridge Curve,

urate_t = α + βvrate_t + u_t,

only makes sense if urate and vrate are cointegrated (with cointegrating parameter β < 0). Test for cointegration using the Engle–Granger test with no lags. Are urate and vrate cointegrated at the 10% significance level? What about at the 5% level?

(iv) Obtain the leads and lags estimator with cvrate_t, cvrate_{t−1}, and cvrate_{t+1} as the I(0) explanatory variables added to the equation in part (iii). Obtain the Newey–West standard error for β̂ using four lags (so g = 4 in the notation of Section 12-5). What is the resulting 95% confidence interval for β? How does it compare with the confidence interval that is not robust to serial correlation or heteroskedasticity?

(v) Redo the Engle–Granger test but with two lags in the augmented DF regression. What happens? What do you conclude about the robustness of the claim that urate and vrate are cointegrated?

Chapter 19 Carrying Out an Empirical Project

In this chapter, we discuss the ingredients of a successful empirical analysis, with emphasis on completing a term project. In addition to reminding you of the important issues that have arisen throughout the text, we emphasize recurring themes that are important for applied research. We also provide suggestions for topics as a way of stimulating your imagination. Several sources of economic research and data are given as references.

19-1 Posing a Question

The importance of posing a very specific question that, in principle, can be answered with data cannot be overstated. Without being explicit about the goal of your analysis, you cannot know where to begin. The widespread availability of rich data sets makes it tempting to launch into data collection based on half-baked ideas, but this is often counterproductive. It is likely that, without carefully formulating your hypotheses and the kind of model you will need to estimate, you will forget to collect information on important variables, obtain a sample from the wrong population, or collect data for the wrong time period.
This does not mean that you should pose your question in a vacuum. Especially for a one-term project, you cannot be too ambitious. Therefore, when choosing a topic, you should be reasonably sure that data sources exist that will allow you to answer your question in the allotted time.

You need to decide what areas of economics or other social sciences interest you when selecting a topic. For example, if you have taken a course in labor economics, you have probably seen theories that can be tested empirically or relationships that have some policy relevance. Labor economists are constantly coming up with new variables that can explain wage differentials. Examples include quality of high school [Card and Krueger (1992) and Betts (1995)], amount of math and science taken in high school [Levine and Zimmerman (1995)], and physical appearance [Hamermesh and Biddle (1994), Averett and Korenman (1996), Biddle and Hamermesh (1998), and Hamermesh and Parker (2005)]. Researchers in state and local public finance study how local economic activity depends on economic policy variables, such as property taxes, sales taxes, level and quality of services (such as schools, fire, and police), and so on. [See, for example, White (1986), Papke (1987), Bartik (1991), Netzer (1992), and Mark, McGuire, and Papke (2000).]

Economists that study education issues are interested in determining how spending affects performance [Hanushek (1986)], whether attending certain kinds of schools improves performance [for example, Evans and Schwab (1995)], and what factors affect where private schools choose to locate [Downes and Greenstein (1996)].

Macroeconomists are interested in relationships between various aggregate time series, such as the link between growth in gross domestic product and growth in fixed investment or machinery [see De Long and Summers (1991)] or the effect of taxes on interest rates [for example, Peek (1982)].

There are certainly reasons for estimating models that are mostly descriptive. For example, property tax assessors use models (called hedonic price models) to estimate housing values for homes that have not been sold recently. This involves a regression model relating the price of a house to its characteristics (size, number of bedrooms, number of bathrooms, and so on). As a topic for a term paper, this is not very exciting: we are unlikely to learn much that is surprising, and such an analysis has no obvious policy implications. Adding the crime rate in the neighborhood as an explanatory variable would allow us to determine how important a factor crime is on housing prices, something that would be useful in estimating the costs of crime.

Several relationships have been estimated using macroeconomic data that are mostly descriptive. For example, an aggregate saving function can be used to estimate the aggregate marginal propensity to save, as well as the response of saving to asset returns, such as interest rates.
Such an analysis could be made more interesting by using time series data on a country that has a history of political upheavals and determining whether savings rates decline during times of political uncertainty.

Once you decide on an area of research, there are a variety of ways to locate specific papers on the topic. The Journal of Economic Literature (JEL) has a detailed classification system in which each paper is given a set of identifying codes that places it within certain subfields of economics. The JEL also contains a list of articles published in a wide variety of journals, organized by topic, and it even contains short abstracts of some articles.

Especially convenient for finding published papers on various topics are Internet services, such as EconLit, which many universities subscribe to. EconLit allows users to do a comprehensive search of almost all economics journals by author, subject, words in the title, and so on. The Social Sciences Citation Index is useful for finding papers on a broad range of topics in the social sciences, including popular papers that have been cited often in other published works. Google Scholar is an Internet search engine that can be very helpful for tracking down research on various topics or research by a particular author. This is especially true of work that has not been published in an academic journal or that has not yet been published.

In thinking about a topic, you should keep some things in mind. First, for a question to be interesting, it does not need to have broad-based policy implications; rather, it can be of local interest. For example, you might be interested in knowing whether living in a fraternity at your university causes students to have lower or higher grade point averages. This may or may not be of interest to people outside your university, but it is probably of concern to at least some people within the university. On the other hand, you might study a problem that starts by being of local interest but turns out to have widespread interest, such as determining which factors affect, and which university policies can stem, alcohol abuse on college campuses.

Second, it is very difficult, especially for a quarter or semester project, to do truly original research using the standard macroeconomic aggregates on the U.S. economy. For example, the question of whether money growth, government spending growth, and so on affect economic growth has been, and continues to be, studied by professional macroeconomists.
standard Phillips curve or an aggregate consumption function for the U.S. economy, or some other large economy, are unlikely to yield additional insights, although they can be instructive for the student. Instead, you might use data on a smaller country to estimate a static or dynamic Phillips curve or a Beveridge curve (possibly allowing the slopes of the curves to depend on information known prior to the current time period), or to test the efficient markets hypothesis, and so on.

At the nonmacroeconomic level, there are also plenty of questions that have been studied extensively. For example, labor economists have published many papers on estimating the return to education. This question is still studied because it is very important, and new data sets, as well as new econometric approaches, continue to be developed. (For example, as we saw in Chapter 9, certain data sets have better proxy variables for unobserved ability than other data sets. Compare WAGE1 and WAGE2.) In other cases, we can obtain panel data or data from a natural experiment (see Chapter 13) that allow us to approach an old question from a different perspective.

As another example, criminologists are interested in studying the effects of various laws on crime. The question of whether capital punishment has a deterrent effect has long been debated. Similarly, economists have been interested in whether taxes on cigarettes and alcohol reduce consumption (as always, in a ceteris paribus sense). As more years of data at the state level become available, a richer panel data set can be created, and this can help us better answer major policy questions. Plus, the effectiveness of fairly recent crime-fighting innovations, such as community policing, can be evaluated empirically.

While you are formulating your question, it is helpful to discuss your ideas with your classmates, instructor, and friends. You should be able to convince people that the answer to your question is of some interest. (Of course, whether you can persuasively answer your question is another issue, but you need to begin with an interesting question.) If someone asks you about your paper and you respond with "I'm doing my paper on crime" or "I'm doing my paper on interest rates," chances are you have only decided on a general area without formulating a true question. You should be able to say something like "I'm studying the effects of community policing on city crime rates in the United States" or "I'm looking at how inflation volatility affects short-term interest rates in Brazil."

19.2 Literature Review

All papers, even if they are relatively short, should contain a review of relevant literature. It is rare that one attempts an empirical project for which no published precedent exists. If you search through journals or use online search services such as EconLit to come up with a topic, you are already well on your way to a literature review. If you select a topic on your own, such as studying the effects of drug usage on college performance at your university, then you will probably have to work a little harder. But online search services make that work a lot easier, as you can search by keywords, by words in the title, by author, and so on. You can then read abstracts of papers to see how relevant they are to your own work.

When doing your literature search, you should think of related topics that might not show up in a search using a handful of keywords. For example, if you are studying the effects of drug usage on wages or grade point average, you should probably look at the literature on how alcohol usage affects such factors. Knowing how to do a
thorough literature search is an acquired skill, but you can get a long way by thinking before searching.

Researchers differ on how a literature review should be incorporated into a paper. Some like to have a separate section called "literature review," while others like to include the literature review as part of the introduction. This is largely a matter of taste, although an extensive literature review probably deserves its own section. If the term paper is the focus of the course (say, in a senior seminar or an advanced econometrics course), your literature review probably will be lengthy. Term papers at the end of a first course are typically shorter, and the literature reviews are briefer.

19.3 Data Collection

19.3a Deciding on the Appropriate Data Set

Collecting data for a term paper can be educational, exciting, and sometimes even frustrating. You must first decide on the kind of data needed to answer your posed question. As we discussed in the introduction and have covered throughout this text, data sets come in a variety of forms. The most common kinds are cross-sectional, time series, pooled cross sections, and panel data sets.

Many questions can be addressed using any of the data structures we have described. For example, to study whether more law enforcement lowers crime, we could use a cross section of cities, a time series for a given city, or a panel data set of cities, which consists of data on the same cities over two or more years.

Deciding on which kind of data to collect often depends on the nature of the analysis. To answer questions at the individual or family level, we often only have access to a single cross section; typically, these are obtained via surveys. Then, we must ask whether we can obtain a rich enough data set to do a convincing ceteris paribus analysis. For example, suppose we want to know whether families who save through individual retirement accounts (IRAs), which have certain tax advantages, have less non-IRA savings. In other words, does IRA saving simply crowd out other forms of saving? There are data sets, such as the Survey of Consumer Finances, that contain information on various kinds of saving for a different sample of families each year. Several issues arise in using such a data set. Perhaps the most important is whether there are enough controls, including income, demographics, and proxies for saving tastes, to do a reasonable ceteris paribus analysis. If these are the only kinds of data available, we must do what we can with them.

The same issues arise with cross-sectional data on firms, cities, states, and so on. In most cases, it is not obvious that we will be able to do a ceteris paribus analysis with a single cross section. For example, any study of the effects of law enforcement on crime must recognize the endogeneity of law enforcement expenditures. When using standard regression methods, it may be very hard to complete a convincing ceteris paribus analysis, no matter how many controls we have. (See Section 19.4 for more discussion.)

If you have read the advanced chapters on panel data methods, you know that having the same cross-sectional units at
two or more different points in time can allow us to control for time-constant unobserved effects that would normally confound regression on a single cross section. Panel data sets are relatively hard to obtain for individuals or families (although some important ones exist, such as the Panel Study of Income Dynamics), but they can be used in very convincing ways. Panel data sets on firms also exist. For example, Compustat and the Center for Research in Security Prices (CRSP) manage very large panel data sets of financial information on firms. Easier to obtain are panel data sets on larger units, such as schools, cities, counties, and states, as these tend not to disappear over time, and government agencies are responsible for collecting information on the same variables each year. For example, the Federal Bureau of Investigation collects and reports detailed information on crime rates at the city level. Sources of data are listed at the end of this chapter.

Data come in a variety of forms. Some data sets, especially historical ones, are available only in printed form. For small data sets, entering the data yourself from the printed source is manageable and convenient. Sometimes, articles are published with small data sets, especially time series applications. These can be used in an empirical study, perhaps by supplementing the data with more recent years.

Many data sets are available in electronic form. Various government agencies provide data on their websites. Private companies sometimes compile data sets to make them user friendly, and then they provide them for a fee. Authors of papers are often willing to provide their data sets in electronic form.

More and more data sets are available on the Internet. The web is a vast resource of online databases. Numerous websites containing economic and related data sets have been created. Several other websites contain links to data sets that are of interest to economists; some of these are listed at the end of this chapter. Generally, searching the Internet for data sources is easy and will become even more convenient in the future.

19.3b Entering and Storing Your Data

Once you have decided on a data type and have located a data source, you must put the data into a usable format. If the data came in electronic form, they are already in some format, hopefully one in widespread use. The most flexible way to obtain data in electronic form is as a standard text (ASCII) file. All statistics and econometrics software packages allow raw data to be stored this way. Typically, it is straightforward to read a text file directly into an econometrics package, provided the file is properly structured. The data files we have used throughout the text provide several examples of how cross-sectional, time series, pooled cross sections, and panel data sets are usually stored. As a rule, the data should have a tabular form, with each observation representing a different row; the columns in the data set represent different variables. Occasionally, you might encounter a data set stored with each column representing an observation and each row a
different variable. This is not ideal, but most software packages allow data to be read in this form and then reshaped. Naturally, it is crucial to know how the data are organized before reading them into your econometrics package.

For time series data sets, there is only one sensible way to enter and store the data: namely, chronologically, with the earliest time period listed as the first observation and the most recent time period as the last observation. It is often useful to include variables indicating year and, if relevant, quarter or month. This facilitates estimation of a variety of models later on, including allowing for seasonality and breaks at different time periods. For cross sections pooled over time, it is usually best to have the cross section for the earliest year fill the first block of observations, followed by the cross section for the second year, and so on. (See FERTIL1 as an example.) This arrangement is not crucial, but it is very important to have a variable stating the year attached to each observation. For panel data, as we discussed in Section 13.5, it is best if all the years for each cross-sectional observation are adjacent and in chronological order. With this ordering, we can use all of the panel data methods from Chapters 13 and 14. With panel data, it is important to include a unique identifier for each cross-sectional unit, along with a year variable.

If you obtain your data in printed form, you have several options for entering them into a computer. First, you can create a text file using a standard text editor. (This is how several of the raw data sets included with the text were initially created.) Typically, it is required that each row starts a new observation, that each row contains the same ordering of the variables (in particular, each row should have the same number of entries), and that the values are separated by at least one space. Sometimes, a different separator, such as a comma, is better, but this depends on the software you are using. If you have missing observations on some variables, you must decide how to denote that; simply leaving a blank does not generally work. Many regression packages accept a period as the missing value symbol. Some people prefer to use a number, presumably an impossible value for the variable of interest, to denote missing values. If you are not careful, this can be dangerous; we discuss this further later. If you have nonnumerical data (for example, you want to include the names in a sample of colleges or the names of cities), then you should check the econometrics package you will use to see the best way to enter such variables (often called strings). Typically, strings are put between double or single quotation marks. Or, the text file can follow a rigid formatting, which usually requires a small program to read in the text file. But you need to check your econometrics package for details.
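To make these conventions concrete, here is a minimal sketch of reading a raw text (ASCII) file with Python's pandas library. The file name, variable names, and ordering are hypothetical examples, not data sets from this text; the substantive points are that each row is an observation, values are separated by whitespace, a period marks a missing entry, and string variables are read as text.

```python
# A sketch of reading a whitespace-delimited ASCII data file; the file name,
# column names, and missing-value symbol are hypothetical examples.
import pandas as pd

df = pd.read_csv(
    "rawdata.txt",
    sep=r"\s+",                            # values separated by one or more spaces
    header=None,                           # the raw file has no header row
    names=["id", "wage", "educ", "city"],  # assumed variable ordering
    na_values=["."],                       # a period denotes a missing observation
)

# If a file arrives transposed (columns are observations), reshape it:
# df = pd.read_csv("transposed.txt", sep=r"\s+", header=None).T

print(df.head())    # inspect the first rows
print(df.dtypes)    # string variables such as city should show as "object"
```

A comma-separated or otherwise delimited file only requires changing the sep argument.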
Another generally available option is to use a spreadsheet to enter your data, such as Excel. This has a couple of advantages over a text file. First, because each observation on each variable is a cell, it is less likely that numbers will be run together, as would happen if you forget to enter a space in a text file. Second, spreadsheets allow manipulation of data, such as sorting or computing averages. This benefit is less important if you use a software package that allows sophisticated data management; many software packages, including EViews and Stata, fall into this category. If you use a spreadsheet for initial data entry, then you must often export the data in a form that can be read by your econometrics package. This is usually straightforward, as spreadsheets export to text files using a variety of formats.

A third alternative is to enter the data directly into your econometrics package. Although this obviates the need for a text editor or a spreadsheet, it can be more awkward if you cannot freely move across different observations to make corrections or additions.

Data downloaded from the Internet may come in a variety of forms. Often data come as text files, but different conventions are used for separating variables; for panel data sets, the conventions on how to order the data may differ. Some Internet data sets come as spreadsheet files, in which case you must use an appropriate spreadsheet to read them.

19.3c Inspecting, Cleaning, and Summarizing Your Data

It is extremely important to become familiar with any data set you will use in an empirical analysis. If you enter the data yourself, you will be forced to know everything about it. But if you obtain data from an outside source, you should still spend some time understanding its structure and conventions. Even data sets that are widely used and heavily documented can contain glitches. If you are using a data set obtained from the author of a paper, you must be aware that rules used for data set construction can be forgotten.

Earlier, we reviewed the standard ways that various data sets are stored. You also need to know how missing values are coded. Preferably, missing values are indicated with a nonnumeric character, such as a period. If a number is used as a missing value code (such as 999 or −1), you must be very careful when using these observations in computing any statistics. Your econometrics package will probably not know that a certain number really represents a missing value: it is likely that such observations will be used as if they are valid, and this can produce rather misleading results. The best approach is to set any numerical codes for missing values to some other character (such as a period) that cannot be mistaken for real data.

You must also know the nature of the variables in the data set. Which are binary variables? Which are ordinal variables (such as a credit rating)? What are the units of measurement of the variables? For example, are monetary values expressed in dollars, thousands of dollars, millions of dollars, or some other units? Are variables representing a rate, such as school dropout rates, inflation rates, unionization rates, or interest rates, measured as a percentage or a proportion?

Especially for time series data, it is crucial to know if monetary values are in nominal (current) or real (constant) dollars. If the values are in real terms, what is the base year or period? If you receive a data set from an author, some variables may already be transformed in certain ways. For example, sometimes only the log of a variable (such as wage or salary) is reported in the data set.

Detecting mistakes in a data set is necessary for preserving the integrity of any data analysis. It is always useful to find minimums, maximums, means, and standard deviations of all, or at least the most important, variables in the analysis.
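A minimal sketch of this kind of inspection and cleaning in pandas appears below. The file name, variable names, and the specific missing-value codes are hypothetical; the codes that actually apply must come from your data set's codebook.

```python
# A sketch of inspecting and cleaning a data set; the file and the
# missing-value codes (999 and -1) are hypothetical examples.
import numpy as np
import pandas as pd

df = pd.read_csv("survey.csv")

# Recode numeric missing-value codes as NaN before computing anything, so
# they cannot contaminate means, minimums, and so on. In practice, recode
# variable by variable: a code like -1 may be a legitimate value elsewhere.
df["educ"] = df["educ"].replace({999: np.nan, -1: np.nan})

# Minimums, maximums, means, and standard deviations for every variable:
print(df.describe().loc[["min", "max", "mean", "std"]])
```

Scanning this one small table is often enough to expose stray missing-value codes, rates entered as percentages instead of proportions, and similar coding errors of the kind discussed next.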
For example, if you find that the minimum value of education in your sample is −99, you know that at least one entry on education needs to be set to a missing value. If, upon further inspection, you find that several observations have −99 as the level of education, you can be confident that you have stumbled onto the missing value code for education. As another example, if you find that an average murder conviction rate across a sample of cities is .632, you know that the conviction rate is measured as a proportion, not a percentage. Then, if the maximum value is above one, this is likely a typographical error. (It is not uncommon to find data sets where most of the entries on a rate variable were entered as a percentage, but where some were entered as a proportion, and vice versa. Such data coding errors can be difficult to detect, but it is important to try.)

We must also be careful in using time series data. If we are using monthly or quarterly data, we must know which variables, if any, have been seasonally adjusted.

Transforming data also requires great care. Suppose we have a monthly data set and we want to create the change in a variable from one month to the next. To do this, we must be sure that the data are ordered chronologically, from earliest period to latest. (If for some reason this is not the case, the differencing will result in garbage.) To be sure the data are properly ordered, it is useful to have a time period indicator. With annual data, it is sufficient to know the year, but then we should know whether the year is entered as four digits or two digits (for example, 1998 versus 98). With monthly or quarterly data, it is also useful to have a variable or variables indicating month or quarter. With monthly data, we may have a set of dummy variables (11 or 12) or one variable indicating the month (1 through 12, or a string variable, such as jan, feb, and so on). With or without yearly, monthly, or quarterly indicators, we can easily construct time trends in all econometrics software packages. Creating seasonal dummy variables is easy if the month or quarter is indicated; at a minimum, we need to know the month or quarter of the first observation.

Manipulating panel data can be even more challenging. In Chapter 13, we discussed pooled OLS on the differenced data as one general approach to controlling for unobserved effects. In constructing the differenced data, we must be careful not to create phantom observations. Suppose we have a balanced panel on cities from 1992 through 1997. Even if the data are ordered chronologically within each cross-sectional unit (something that should be done before proceeding), a mindless differencing will create an observation for 1992 for all cities except the first in the sample. This observation will be the 1992 value for city i, minus the 1997 value for city i − 1; this is clearly nonsense. Thus, we must make sure that 1992 is missing for all differenced variables.
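The phantom-observation problem is easy to avoid if the differencing is done within each cross-sectional unit. A minimal sketch, with hypothetical file and variable names:

```python
# A sketch of first-differencing panel data without creating phantom
# observations; city, year, and crmrte are hypothetical variable names.
import pandas as pd

panel = pd.read_csv("citypanel.csv").sort_values(["city", "year"])

# diff() within groupby("city") subtracts only within the same city, so the
# first year for each city (here, 1992) is set to missing automatically.
panel["d_crmrte"] = panel.groupby("city")["crmrte"].diff()

# By contrast, a mindless diff() over the stacked data would subtract the
# previous city's 1997 value from the next city's 1992 value:
# panel["bad"] = panel["crmrte"].diff()   # do NOT do this with panel data
```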
19.4 Econometric Analysis

This text has focused on econometric analysis, and we are not about to provide a review of econometric methods in this section. Nevertheless, we can give some general guidelines about the sorts of issues that need to be considered in an empirical analysis.

As we discussed earlier, after deciding on a topic, we must collect an appropriate data set. Assuming that this has also been done, we must next decide on the appropriate econometric methods.

If your course has focused on ordinary least squares estimation of a multiple linear regression model, using either cross-sectional or time series data, the econometric approach has pretty much been decided for you. This is not necessarily a weakness, as OLS is still the most widely used econometric method. Of course, you still have to decide whether any of the variants of OLS, such as weighted least squares or correcting for serial correlation in a time series regression, are warranted.

In order to justify OLS, you must also make a convincing case that the key OLS assumptions are satisfied for your model. As we have discussed at some length, the first issue is whether the error term is uncorrelated with the explanatory variables. Ideally, you have been able to control for enough other factors to assume that those that are left in the error are unrelated to the regressors. Especially when dealing with individual-, family-, or firm-level cross-sectional data, the self-selection problem, which we discussed in Chapters 7 and 15, is often relevant. For instance, in the IRA example from Section 19.3, it may be that families with an unobserved taste for saving are also the ones that open IRAs. You should also be able to argue that the other potential sources of endogeneity, namely measurement error and simultaneity, are not a serious problem.

When specifying your model, you must also make functional form decisions. Should some variables appear in logarithmic form? (In econometric applications, the answer is often yes.) Should some variables be included in levels and squares, to possibly capture a diminishing effect? How should qualitative factors appear? Is it enough to just include binary variables for different attributes or groups? Or do these need to be interacted with quantitative variables? (See Chapter 7 for details.)

A common mistake, especially among beginners, is to incorrectly include explanatory variables in a regression model that are listed as numerical values but have no quantitative meaning. For example, in an individual-level data set that contains information on wages, education, experience, and other variables, an "occupation" variable might be included. Typically, these are just arbitrary codes that have been assigned to different occupations; the fact that an elementary school teacher is given, say, the value 453, while a computer technician is, say, 751, is relevant only in that it allows us to distinguish between the two occupations. It makes no sense to include the raw occupational variable in a regression model. (What sense would it make to measure the effect of increasing occupation by one unit when the one-unit increase has no quantitative meaning?) Instead, different dummy variables should be defined for different occupations (or groups of occupations, if there are many occupations); the dummy variables can then be included in the regression model.
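In most regression software, converting such codes into a set of dummy variables takes one step. A minimal sketch using Python's statsmodels, with hypothetical data and variable names:

```python
# A sketch of including occupation dummies instead of the raw occupation
# code; the file and variable names are hypothetical.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("workers.csv")

# C(occ) treats occ as categorical: one dummy per occupation code, with one
# code absorbed as the base group. The raw code never enters the regression.
res = smf.ols("lwage ~ educ + exper + C(occ)", data=df).fit()
print(res.summary())
```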
A less egregious problem occurs when an ordered qualitative variable is included as an explanatory variable. Suppose that in a wage data set a variable is included measuring "job satisfaction," defined on a scale from 1 to 7, with 7 being the most satisfied. Provided we have enough data, we would want to define a set of six dummy variables for, say, job satisfaction levels of 2 through 7, leaving job satisfaction level 1 as the base group. By including the six job satisfaction dummies in the regression, we allow a completely flexible relationship between the response variable and job satisfaction. Putting in the job satisfaction variable in raw form implicitly assumes that a one-unit increase in the ordinal variable has quantitative meaning. While the direction of the effect will often be estimated appropriately, interpreting the coefficient on an ordinal variable is difficult. (If an ordinal variable takes on many values, then we can define a set of dummy variables for ranges of values. See Section 17.3 for an example.)

Sometimes, we want to explain a variable that is an ordinal response. For example, one could think of using a job satisfaction variable of the type described above as the dependent variable in a regression model, with both worker and employer characteristics among the independent variables. Unfortunately, with the job satisfaction variable in its original form, the coefficients in the model are hard to interpret: each measures the change in job satisfaction given a unit increase in the independent variable. Certain models (ordered probit and ordered logit are the most common) are well suited for ordered responses. These models essentially extend the binary probit and logit models we discussed in Chapter 17. [See Wooldridge (2010, Chapter 16) for a treatment of ordered response models.] A simple solution is to turn any ordered response into a binary response. For example, we could define a variable equal to one if job satisfaction is at least four, and zero otherwise. Unfortunately, creating a binary variable throws away information and requires us to use a somewhat arbitrary cutoff.

For cross-sectional analysis, a secondary, but nevertheless important, issue is whether there is heteroskedasticity. In Chapter 8, we explained how this can be dealt with. The simplest way is to compute heteroskedasticity-robust statistics (a sketch of this one-line fix appears at the end of this discussion).

As we emphasized in Chapters 10, 11, and 12, time series applications require additional care. Should the equation be estimated in levels? If levels are used, are time trends needed? Is differencing the data more appropriate? If the data are monthly or quarterly, does seasonality have to be accounted for? If you are allowing for dynamics (for example, distributed lag dynamics), how many lags should be included? You must start with some lags based on intuition or common sense, but eventually it is an empirical matter.
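As promised above, here is a minimal sketch of heteroskedasticity-robust inference in statsmodels; the data set and variable names are hypothetical.

```python
# A sketch of heteroskedasticity-robust standard errors; the data set and
# variable names are hypothetical.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("students.csv")
model = smf.ols("colGPA ~ alcohol + hsGPA + SAT + female", data=df)

usual = model.fit()                  # classical standard errors
robust = model.fit(cov_type="HC1")   # heteroskedasticity-robust errors

# The coefficient estimates are identical; only the standard errors
# (and hence the t statistics) change.
print(usual.bse)
print(robust.bse)
```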
If your model has some potential misspecification, such as omitted variables, and you use OLS, you should attempt some sort of misspecification analysis of the kinds we discussed in Chapters 3 and 5. Can you determine, based on reasonable assumptions, the direction of any bias in the estimators?

If you have studied the method of instrumental variables, you know that it can be used to solve various forms of endogeneity, including omitted variables (Chapter 15), errors-in-variables (Chapter 15), and simultaneity (Chapter 16). Naturally, you need to think hard about whether the instrumental variables you are considering are likely to be valid.

Good papers in the empirical social sciences contain sensitivity analysis. Broadly, this means you estimate your original model and modify it in ways that seem reasonable. Hopefully, the important conclusions do not change. For example, if you use as an explanatory variable a measure of alcohol consumption (say, in a grade point average equation), do you get qualitatively similar results if you replace the quantitative measure with a dummy variable indicating alcohol usage? If the binary usage variable is significant but the alcohol quantity variable is not, it could be that usage reflects some unobserved attribute that affects GPA and is also correlated with alcohol usage. But this needs to be considered on a case-by-case basis.

If some observations are much different from the bulk of the sample (say, you have a few firms in a sample that are much larger than the other firms), do your results change much when those observations are excluded from the estimation? If so, you may have to alter functional forms to allow for these observations or argue that they follow a completely different model. (The issue of outliers was discussed in Chapter 9.)

Using panel data raises some additional econometric issues. Suppose you have collected two periods. There are at least four ways to use two periods of panel data without resorting to instrumental variables. You can pool the two years in a standard OLS analysis, as discussed in Chapter 13. Although this might increase the sample size relative to a single cross section, it does not control for time-constant unobservables. In addition, the errors in such an equation are almost always serially correlated because of an unobserved effect. Random effects estimation corrects the serial correlation problem and produces asymptotically efficient estimators, provided the unobserved effect has zero mean given values of the explanatory variables in all time periods.

Another possibility is to include a lagged dependent variable in the equation for the second year. In Chapter 9, we presented this as a way to at least mitigate the omitted variables problem, as we are, in any event, holding fixed the initial outcome of the dependent variable. This often leads to similar results as differencing the data, as we covered in Chapter 13.

With more years of panel data, we have the same options, plus an additional choice. We can use the fixed effects transformation to eliminate the unobserved effect. (With two years of data, this is the same as differencing.) In Chapter 15, we showed how instrumental variables techniques can be combined with panel data transformations to relax exogeneity assumptions even more. As a rule, it is a good idea to apply several reasonable econometric methods and compare the results. This often allows us to determine which of our assumptions are likely to be false.
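A minimal sketch of this compare-several-methods advice, using the third-party linearmodels package (not part of the base scientific Python stack) and hypothetical panel data indexed by city and year:

```python
# A sketch comparing pooled OLS, random effects, and fixed effects on the
# same panel; install with `pip install linearmodels`. All file and
# variable names are hypothetical.
import pandas as pd
from linearmodels.panel import PanelOLS, PooledOLS, RandomEffects, compare

df = pd.read_csv("citypanel.csv").set_index(["city", "year"])

formula = "crmrte ~ 1 + polpc + unem"
pooled = PooledOLS.from_formula(formula, df).fit()
random = RandomEffects.from_formula(formula, df).fit()
fixed = PanelOLS.from_formula(formula + " + EntityEffects", df).fit()

# Large disagreements across the columns suggest that some assumption,
# such as zero correlation between the unobserved effect and the
# regressors, is likely false.
print(compare({"Pooled": pooled, "RE": random, "FE": fixed}))
```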
Even if you are very careful in devising your topic, postulating your model, collecting your data, and carrying out the econometrics, it is quite possible that you will obtain puzzling results, at least some of the time. When that happens, the natural inclination is to try different models, different estimation techniques, or perhaps different subsets of data until the results correspond more closely to what was expected. Virtually all applied researchers search over various models before finding the "best" model. Unfortunately, this practice of data mining violates the assumptions we have made in our econometric analysis. The results on unbiasedness of OLS and other estimators, as well as the t and F distributions we derived for hypothesis testing, assume that we observe a sample following the population model and we estimate that model once. Estimating models that are variants of our original model violates that assumption, because we are using the same set of data in a specification search. In effect, we use the outcome of tests by using the data to respecify our model. The estimates and tests from different model specifications are not independent of one another.

Some specification searches have been programmed into standard software packages. A popular one is known as stepwise regression, where different combinations of explanatory variables are used in multiple regression analysis in an attempt to come up with the best model. There are various ways that stepwise regression can be used, and we have no intention of reviewing them here. The general idea is either to start with a large model and keep variables whose p-values are below a certain significance level, or to start with a simple model and add variables that have significant p-values. Sometimes, groups of variables are tested with an F test. Unfortunately, the final model often depends on the order in which variables were dropped or added. [For more on stepwise regression, see Draper and Smith (1981).] In addition, this is a severe form of data mining, and it is difficult to interpret t and F statistics in the final model. One might argue that stepwise regression simply automates what researchers do anyway in searching over various models. However, in most applications, one or two explanatory variables are of primary interest, and then the goal is to see how robust the coefficients on those variables are to either adding or dropping other variables, or to changing functional form.

In principle, it is possible to incorporate the effects of data mining into our statistical inference; in practice, this is very difficult and is rarely done, especially in sophisticated empirical work. [See Leamer (1983) for an engaging discussion of this problem.] But we can try to minimize data mining by not searching over numerous models or estimation methods until a significant result is found and then reporting only that result. If a variable is statistically significant in only a small fraction of the models estimated, it is quite likely that the variable has no effect in the population.
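The danger can be seen in a small simulation sketch (illustrative only, not an example from the text): the outcome below is pure noise, yet picking the single best-looking of 20 irrelevant candidate regressors produces a "statistically significant" variable far more often than the nominal 5%.

```python
# A sketch of why specification searches overstate significance: y is
# unrelated to all 20 candidate regressors, yet the best-looking one is
# "significant at 5%" in most samples.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n, k, reps = 100, 20, 500
hits = 0

for _ in range(reps):
    y = rng.standard_normal(n)            # outcome: pure noise
    X = rng.standard_normal((n, k))       # 20 irrelevant candidate variables
    best_p = min(
        sm.OLS(y, sm.add_constant(X[:, j])).fit().pvalues[1]
        for j in range(k)
    )
    hits += best_p < 0.05                 # the "discovered" variable

print(f"Fraction of samples with a 'significant' finding: {hits / reps:.2f}")
```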
19.5 Writing an Empirical Paper

Writing a paper that uses econometric analysis is very challenging, but it can also be rewarding. A successful paper combines a careful, convincing data analysis with good explanations and exposition. Therefore, you must have a good grasp of your topic, a good understanding of econometric methods, and solid writing skills. Do not be discouraged if you find writing an empirical paper difficult; most professional researchers have spent many years learning how to craft an empirical analysis and to write the results in a convincing form.

While writing styles vary, many papers follow the same general outline. The following paragraphs include ideas for section headings and explanations about what each section should contain. These are only suggestions and hardly need to be strictly followed. In the final paper, each section would be given a number, usually starting with one for the introduction.

19.5a Introduction

The introduction states the basic objectives of the study and explains why it is important. It generally entails a review of the literature, indicating what has been done and how previous work can be improved upon. (As discussed in Section 19.2, an extensive literature review can be put in a separate section.) Presenting simple statistics or graphs that reveal a seemingly paradoxical relationship is a useful way to introduce the paper's topic. For example, suppose that you are writing a paper about factors affecting fertility in a developing country, with the focus on education levels of women. An appealing way to introduce the topic would be to produce a table or a graph showing that fertility has been falling (say) over time, and a brief explanation of how you hope to examine the factors contributing to the decline. At this point, you may already know that, ceteris paribus, more highly educated women have fewer children and that average education levels have risen over time.

Most researchers like to summarize the findings of their paper in the introduction. This can be a useful device for grabbing the reader's attention. For example, you might state that your best estimate of the effect of missing 10 hours of lecture during a 30-hour term is about one-half a grade point. But the summary should not be too involved, because neither the methods nor the data used to obtain the estimates have yet been introduced.

19.5b Conceptual (or Theoretical) Framework

In this section, you describe the general approach to answering the question you have posed. It can be formal economic theory, but in many cases, it is an intuitive discussion about what conceptual problems arise in answering your question.

Suppose you are studying the effects of economic opportunities and severity of punishment on criminal behavior. One approach to explaining participation in crime is to specify a utility maximization problem where the individual chooses the amount of time spent in legal and illegal activities, given wage rates in both kinds of activities, as well as variables measuring probability and severity of punishment for criminal activity. The usefulness of such an exercise is that it suggests which variables should be included in the empirical analysis; it gives guidance (but rarely specifics) as to how the variables should appear in the econometric model.

Often, there is no need to write down an economic theory.
For econometric policy analysis, common sense usually suffices for specifying a model. For example, suppose you are interested in estimating the effects of participation in Aid to Families with Dependent Children (AFDC) on child performance in school. AFDC provides supplemental income, but participation also makes it easier to receive Medicaid and other benefits. The hard part of such an analysis is deciding on the set of variables that should be controlled for. In this example, we could control for family income (including AFDC and any other welfare income), mother's education, whether the family lives in an urban area, and other variables. Then, the inclusion of an AFDC participation indicator (hopefully) measures the nonincome benefits of AFDC participation. A discussion of which factors should be controlled for, and the mechanisms through which AFDC participation might improve school performance, substitutes for formal economic theory.

19.5c Econometric Models and Estimation Methods

It is very useful to have a section that contains a few equations of the sort you estimate and present in the results section of the paper. This allows you to fix ideas about what the key explanatory variable is and what other factors you will control for. Writing equations containing error terms allows you to discuss whether OLS is a suitable estimation method.

The distinction between a model and an estimation method should be made in this section. A model represents a population relationship (broadly defined to allow for time series equations). For example, we should write

colGPA = β0 + β1 alcohol + β2 hsGPA + β3 SAT + β4 female + u    (19.1)

to describe the relationship between college GPA and alcohol consumption, with some other controls in the equation. Presumably, this equation represents a population, such as all undergraduates at a particular university. There are no "hats" (ˆ) on the βj or on colGPA because this is a model, not an estimated equation. We do not put in numbers for the βj because we do not know (and never will know) these numbers. Later, we will estimate them. In this section, do not anticipate the presentation of your empirical results. In other words, do not start with a general model and then say that you omitted certain variables because they turned out to be insignificant. Such discussions should be left for the results section.

A time series model to relate city-level car thefts to the unemployment rate and conviction rates could look like

thefts_t = β0 + β1 unem_t + β2 unem_(t−1) + β3 cars_t + β4 convrate_t + β5 convrate_(t−1) + u_t,    (19.2)

where the t subscript is useful for emphasizing any dynamics in the equation (in this case, allowing for unemployment and the automobile theft conviction rate to have lagged effects).
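If you go on to estimate a dynamic equation like (19.2), the lags are easy to construct once the data are in chronological order. A minimal sketch with a hypothetical data file and the variable names of (19.2):

```python
# A sketch of building lags and estimating an equation like (19.2); the
# file name is hypothetical.
import pandas as pd
import statsmodels.formula.api as smf

ts = pd.read_csv("citythefts.csv").sort_values("year")

# shift(1) creates the one-period lag; the first observation becomes
# missing and is dropped automatically in estimation.
ts["unem_1"] = ts["unem"].shift(1)
ts["convrate_1"] = ts["convrate"].shift(1)

res = smf.ols("thefts ~ unem + unem_1 + cars + convrate + convrate_1",
              data=ts).fit()
print(res.summary())
```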
After specifying a model or models, it is appropriate to discuss estimation methods. In most cases, this will be OLS, but, for example, in a time series equation you might use feasible GLS to do a serial correlation correction, as in Chapter 12. However, the method for estimating a model is quite distinct from the model itself. It is not meaningful, for instance, to talk about "an OLS model." Ordinary least squares is a method of estimation, and so are weighted least squares, Cochrane-Orcutt, and so on. There are usually several ways to estimate any model. You should explain why the method you are choosing is warranted.

Any assumptions that are used in obtaining an estimable econometric model from an underlying economic model should be clearly discussed. For example, in the quality of high school example mentioned in Section 19.1, the issue of how to measure school quality is central to the analysis. Should it be based on average SAT scores, percentage of graduates attending college, student-teacher ratios, average education level of teachers, some combination of these, or possibly other measures?

We always have to make assumptions about functional form, whether or not a theoretical model has been presented. As you know, constant elasticity and constant semi-elasticity models are attractive because the coefficients are easy to interpret (as percentage effects). There are no hard rules on how to choose functional form, but the guidelines discussed in Section 6.2 seem to work well in practice. You do not need an extensive discussion of functional form, but it is useful to mention whether you will be estimating elasticities or a semi-elasticity. For example, if you are estimating the effect of some variable on wage or salary, the dependent variable will almost surely be in logarithmic form, and you might as well include this in any equations from the beginning. You do not have to present every one, or even most, of the functional form variations that you will report later in the results section.

Often, the data used in empirical economics are at the city or county level. For example, suppose that, for the population of small to midsize cities, you wish to test the hypothesis that having a minor league baseball team causes a city to have a lower divorce rate. In this case, you must account for the fact that larger cities will have more divorces. One way to account for the size of the city is to scale divorces by the city or adult population. Thus, a reasonable model is

log(div/pop) = β0 + β1 mlb + β2 perCath + β3 log(inc/pop) + other factors,    (19.3)

where mlb is a dummy variable equal to one if the city has a minor league baseball team, and perCath is the percentage of the population that is Catholic (so a number such as 34.6 means 34.6%). Note that div/pop is a divorce rate, which is generally easier to interpret than the absolute number of divorces.

Another way to control for population is to estimate the model

log(div) = γ0 + γ1 mlb + γ2 perCath + γ3 log(inc) + γ4 log(pop) + other factors.    (19.4)

The parameter of interest, γ1, when multiplied by 100, gives the percentage difference between divorce rates, holding population, percent Catholic, income, and whatever else is in "other factors" constant. In equation (19.3), β1 measures the percentage effect of minor league baseball on div/pop, which can change either because the number of divorces or the population changes. Using the fact that log(div/pop) = log(div) − log(pop) and log(inc/pop) = log(inc) − log(pop), we can rewrite (19.3) as

log(div) = β0 + β1 mlb + β2 perCath + β3 log(inc) + (1 − β3) log(pop) + other factors,
which shows that (19.3) is a special case of (19.4), with γ4 = (1 − β3) and γj = βj, j = 0, 1, 2, 3. Alternatively, (19.4) is equivalent to adding log(pop) as an additional explanatory variable to (19.3). This makes it easy to test for a separate population effect on the divorce rate.

If you are using a more advanced estimation method, such as two stage least squares, you need to provide some reasons for doing so. If you use 2SLS, you must provide a careful discussion on why your IV choices for the endogenous explanatory variable (or variables) are valid. As we mentioned in Chapter 15, there are two requirements for a variable to be considered a good IV. First, it must be omitted from, and exogenous to, the equation of interest (the structural equation). This is something we must assume. Second, it must have some partial correlation with the endogenous explanatory variable. This we can test. For example, in equation (19.1), you might use a binary variable for whether a student lives in a dormitory (dorm) as an IV for alcohol consumption. This requires that living situation has no direct impact on colGPA (so that it is omitted from (19.1)) and that it is uncorrelated with unobserved factors in u that have an effect on colGPA. We would also have to verify that dorm is partially correlated with alcohol by regressing alcohol on dorm, hsGPA, SAT, and female. (See Chapter 15 for details; a sketch of both steps appears at the end of this subsection.)

You might account for the omitted variable problem (or omitted heterogeneity) by using panel data. Again, this is easily described by writing an equation or two. In fact, it is useful to show how to difference the equations over time to remove time-constant unobservables; this gives an equation that can be estimated by OLS. Or, if you are using fixed effects estimation instead, you simply state so.

As a simple example, suppose you are testing whether higher county tax rates reduce economic activity, as measured by per capita manufacturing output. Suppose that, for the years 1982, 1987, and 1992, the model is

log(manuf_it) = β0 + δ1 d87_t + δ2 d92_t + β1 tax_it + … + a_i + u_it,

where d87_t and d92_t are year dummy variables and tax_it is the tax rate for county i at time t, in percent form. We would have other variables that change over time in the equation, including measures for costs of doing business (such as average wages), measures of worker productivity (as measured by average education), and so on. The term a_i is the fixed effect, containing all factors that do not vary over time, and u_it is the idiosyncratic error term. To remove a_i, we can either difference across the years or use time-demeaning (the fixed effects transformation).
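As promised above, both steps of the dorm example, checking the first stage and then estimating (19.1) by 2SLS, can be sketched as follows. The data file is hypothetical, and the IV estimator comes from the third-party linearmodels package.

```python
# A sketch of the dorm-as-IV strategy for equation (19.1); the file name is
# hypothetical. Requires `pip install linearmodels`.
import pandas as pd
import statsmodels.formula.api as smf
from linearmodels.iv import IV2SLS

df = pd.read_csv("gpa_alcohol.csv")

# First stage: is dorm partially correlated with alcohol, holding the
# exogenous controls fixed? Look for a healthy t statistic on dorm.
first = smf.ols("alcohol ~ dorm + hsGPA + SAT + female", data=df).fit()
print(first.tvalues["dorm"])

# 2SLS: the bracket notation marks alcohol as endogenous, with dorm as its
# instrument; the remaining regressors are treated as exogenous.
iv = IV2SLS.from_formula(
    "colGPA ~ 1 + hsGPA + SAT + female + [alcohol ~ dorm]", df
).fit()
print(iv.summary)
```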
19.5d The Data

You should always have a section that carefully describes the data used in the empirical analysis. This is particularly important if your data are nonstandard or have not been widely used by other researchers. Enough information should be presented so that a reader could, in principle, obtain the data and redo your analysis. In particular, all applicable public data sources should be included in the references, and short data sets can be listed in an appendix. If you used your own survey to collect the data, a copy of the questionnaire should be presented in an appendix.

Along with a discussion of the data sources, be sure to discuss the units of each of the variables (for example, is income measured in hundreds or thousands of dollars?). Including a table of variable definitions is very useful to the reader. The names in the table should correspond to the names used in describing the econometric results in the following section.

It is also very informative to present a table of summary statistics, such as minimum and maximum values, means, and standard deviations for each variable. Having such a table makes it easier to interpret the coefficient estimates in the next section, and it emphasizes the units of measurement of the variables. (A sketch of producing such a table appears at the end of this subsection.) For binary variables, the only necessary summary statistic is the fraction of ones in the sample (which is the same as the sample mean). For trending variables, things like means are less interesting. It is often useful to compute the average growth rate in a variable over the years in your sample.

You should always clearly state how many observations you have. For time series data sets, identify the years that you are using in the analysis, including a description of any special periods in history (such as World War II). If you use a pooled cross section or a panel data set, be sure to report how many cross-sectional units (people, cities, and so on) you have for each year.
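Producing the summary table takes only a few lines; here is a minimal sketch with a hypothetical data file standing in for something like 401K.

```python
# A sketch of a summary-statistics table (mean, standard deviation,
# minimum, maximum); the file name is hypothetical.
import pandas as pd

df = pd.read_csv("401k.csv")

stats = df[["prate", "mrate", "employ", "age", "sole"]].agg(
    ["mean", "std", "min", "max"]
).T.round(3)
stats.columns = ["Mean", "Std. Dev.", "Minimum", "Maximum"]

print(stats)
print("Number of observations:", len(df))
# For a binary variable such as sole, the mean is the fraction of ones.
```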
19.5e Results

The results section should include your estimates of any models formulated in the models section. You might start with a very simple analysis. For example, suppose that percentage of students attending college from the graduating class (percoll) is used as a measure of the quality of the high school a person attended. Then, an equation to estimate is

log(wage) = β0 + β1 percoll + u.

Of course, this does not control for several other factors that may determine wages and that may be correlated with percoll. But a simple analysis can draw the reader into the more sophisticated analysis and reveal the importance of controlling for other factors.

If only a few equations are estimated, you can present the results in equation form, with standard errors in parentheses below estimated coefficients. If your model has several explanatory variables and you are presenting several variations on the general model, it is better to report the results in tabular rather than equation form. Most of your papers should have at least one table, which should always include at least the R-squared and the number of observations for each equation. Other statistics, such as the adjusted R-squared, can also be listed.

The most important thing is to discuss the interpretation and strength of your empirical results. Do the coefficients have the expected signs? Are they statistically significant? If a coefficient is statistically significant but has a counterintuitive sign, why might this be true? It might be revealing a problem with the data or the econometric method (for example, OLS may be inappropriate due to omitted variables problems).

Be sure to describe the magnitudes of the coefficients on the major explanatory variables. Often, one or two policy variables are central to the study. Their signs, magnitudes, and statistical significance should be treated in detail. Remember to distinguish between economic and statistical significance. If a t statistic is small, is it because the coefficient is practically small or because its standard error is large?

In addition to discussing estimates from the most general model, you can provide interesting special cases, especially those needed to test certain multiple hypotheses. For example, in a study to determine wage differentials across industries, you might present the equation without the industry dummies; this allows the reader to easily test whether the industry differentials are statistically significant (using the R-squared form of the F test). Do not worry too much about dropping various variables to find the "best" combination of explanatory variables. As we mentioned earlier, this is a difficult, and not even very well-defined, task. Only if eliminating a set of variables substantially alters the magnitudes and/or significance of the coefficients of interest is this important. Dropping a group of variables to simplify the model (such as quadratics or interactions) can be justified via an F test.
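Such an F test is one line in most packages. A sketch in statsmodels, with hypothetical variable names, testing the joint significance of a group of industry dummies:

```python
# A sketch of an F test for excluding a group of dummy variables; the file
# and variable names are hypothetical.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("wages.csv")

unrestricted = smf.ols("lwage ~ educ + exper + C(industry)", data=df).fit()
restricted = smf.ols("lwage ~ educ + exper", data=df).fit()

# compare_f_test returns the F statistic, its p-value, and the number of
# restrictions (the degrees-of-freedom difference).
f_stat, p_value, df_diff = unrestricted.compare_f_test(restricted)
print(f_stat, p_value, df_diff)
```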
If you have used at least two different methods (such as OLS and 2SLS, or levels and differencing for a time series, or pooled OLS versus differencing with a panel data set), then you should comment on any critical differences. If OLS gives counterintuitive results, did using 2SLS or panel data methods improve the estimates? Or did the opposite happen?

19.5f Conclusions

This can be a short section that summarizes what you have learned. For example, you might want to present the magnitude of a coefficient that was of particular interest. The conclusion should also discuss caveats to the conclusions drawn, and it might even suggest directions for further research. It is useful to imagine readers turning first to the conclusion to decide whether to read the rest of the paper.

19.5g Style Hints

You should give your paper a title that reflects its topic, but make sure the title is not so long as to be cumbersome. The title should be on a separate title page that also includes your name, affiliation, and, if relevant, the course number. The title page can also include a short abstract, or an abstract can be included on a separate page.

Papers should be typed and double-spaced. All equations should begin on a new line, and they should be centered and numbered consecutively, that is, (1), (2), (3), and so on. Large graphs and tables may be included after the main body. In the text, refer to papers by author and date, for example, White (1980). The reference section at the end of the paper should be done in standard format. (Several examples are given in the references at the back of the text.)

When you introduce an equation in the "econometric models" section, you should describe the important variables: the dependent variable and the key independent variable or variables. To focus on a single independent variable, you can write an equation, such as

GPA = β0 + β1 alcohol + xδ + u

or

log(wage) = β0 + β1 educ + xδ + u,

where the notation xδ is shorthand for several other explanatory variables. At this point, you need only describe them generally; they can be described specifically in the data section in a table. For example, in a study of the factors affecting chief executive officer salaries, you might include a table like Table 19.1.

TABLE 19.1 Variable Descriptions
salary: annual salary (including bonuses) in 1990 (in thousands)
sales: firm sales in 1990 (in millions)
roe: average return on equity, 1988–1990 (in percent)
pcsal: percentage change in salary, 1988–1990
pcroe: percentage change in roe, 1988–1990
indust: = 1 if an industrial company, 0 otherwise
finance: = 1 if a financial company, 0 otherwise
consprod: = 1 if a consumer products company, 0 otherwise
util: = 1 if a utility company, 0 otherwise
ceoten: number of years as CEO of the company

A table of summary statistics, obtained from Table I in Papke and Wooldridge (1996) and similar to the data in 401K, might be set up as shown in Table 19.2.

TABLE 19.2 Summary Statistics
Variable | Mean | Standard Deviation | Minimum | Maximum
prate | .869 | .167 | .023 | 1
mrate | .746 | .844 | .011 | 5
employ | 4,621.01 | 16,299.64 | 53 | 443,040
age | 13.14 | 9.63 | 4 | 76
sole | .415 | .493 | 0 | 1
Number of observations = 3,784

In the results section, you can write the estimates either in equation form, as we often have done, or in a table. Especially when several models have been estimated with different sets of explanatory variables, tables are very useful. If you write out the estimates as an equation, for example,

log(salary) = 2.45 + .236 log(sales) + .008 roe + .061 ceoten
                   (.093)  (.115)              (.003)     (.028)
n = 204, R² = .351,

be sure to state near the first equation that standard errors are in parentheses. It is acceptable to report the t statistics for testing H0: βj = 0, or their absolute values, but it is most important to state what you are doing.

If you report your results in tabular form, make sure the dependent and independent variables are clearly indicated. Again, state whether standard errors or t statistics are below the coefficients (with the former preferred). Some authors like to use asterisks to indicate statistical significance at different significance levels (for example, one star means "significant at the 5% level," two stars mean "significant at the 10% level but not the 5% level," and so on). This is not necessary if you carefully discuss the significance of the explanatory variables in the text. A sample table of results, derived from Table II in Papke and Wooldridge (1996), is shown in Table 19.3.

TABLE 19.3 OLS Results. Dependent Variable: Participation Rate
Independent Variables | (1) | (2) | (3)
mrate | .156 (.012) | .239 (.042) | .218 (.342)
mrate² | — | −.087 (.043) | −.096 (.073)
log(emp) | −.112 (.014) | −.112 (.014) | −.098 (.111)
log(emp)² | .0057 (.0009) | .0057 (.0009) | .0052 (.0007)
age | .0060 (.0010) | .0059 (.0010) | .0050 (.0021)
age² | −.00007 (.00002) | −.00007 (.00002) | −.00006 (.00002)
sole | −.0001 (.0058) | .0008 (.0058) | .0006 (.0061)
constant | 1.213 (.051) | 1.198 (.052) | 1.085 (.041)
industry dummies? | no | no | yes
Observations | 3,784 | 3,784 | 3,784
R-squared | .143 | .152 | .162
Note: The quantities in parentheses below the estimates are the standard errors.

Your results will be easier to read and interpret if you choose the units of both your dependent and independent variables so that coefficients are not too large or too small. You should never report numbers such as 1.051e−007 or 3.524e+006 for your coefficients or standard errors, and you should not use scientific notation. If coefficients are either extremely small or large, rescale the dependent or independent variables, as we discussed in Chapter 6. You should limit the number of digits reported after the decimal point so as not to convey a false sense of precision. For example, if your regression package estimates a coefficient to be .54821059, you should report this as .548, or even .55, in the paper.
As a rule, the commands that your particular econometrics package uses to produce results should not appear in the paper; only the results are important. If some special command was used to carry out a certain estimation method, this can be given in an appendix. An appendix is also a good place to include extra results that support your analysis but are not central to it.

Summary
In this chapter, we have discussed the ingredients of a successful empirical study and have provided hints that can improve the quality of an analysis. Ultimately, the success of any study depends crucially on the care and effort put into it.

Key Terms
Data Mining; Internet; Misspecification Analysis; Online Databases; Online Search Services; Sensitivity Analysis; Spreadsheet; Text Editor; Text (ASCII) File

Sample Empirical Projects
Throughout the text, we have seen examples of econometric analysis that either came from or were motivated by published works. We hope these have given you a good idea about the scope of empirical analysis. We include the following list as additional examples of questions that others have found, or are likely to find, interesting. These are intended to stimulate your imagination; no attempt is made to fill in all the details of specific models, data requirements, or alternative estimation methods. It should be possible to complete these projects in one term.

1. Do your own campus survey to answer a question of interest at your university. For example: What is the effect of working on college GPA? You can ask students about high school GPA, college GPA, ACT or SAT scores, hours worked per week, participation in athletics, major, gender, race, and so on. Then, use these variables to create a model that explains GPA. How much of an effect, if any, does another hour worked per week have on GPA? One issue of concern is that hours worked might be endogenous: it might be correlated with unobserved factors that affect college GPA, or lower GPAs might cause students to work more. A better approach would be to collect cumulative GPA prior to the semester and then to obtain GPA for the most recent semester, along with amount worked during that semester, and the other variables. Now, cumulative GPA could be used as a control (explanatory variable) in the equation.

2. There are many variants on the preceding topic. You can study the effects of drug or alcohol usage, or of living in a fraternity, on grade point average. You would want to control for many family background variables, as well as previous performance variables.

3. Do gun control laws at the city level reduce violent crimes? Such questions can be difficult to answer with a single cross section because city and state laws are often endogenous. [See Kleck and Patterson (1993) for an example. They used cross-sectional data and instrumental variables methods, but their IVs are questionable.] Panel data can be very useful for inferring causality in these contexts. At a minimum, you could control for a previous year's violent crime rate.
4. Low and McPheters (1983) used city cross-sectional data on wage rates and estimates of risk of death for police officers, along with other controls. The idea is to determine whether police officers are compensated for working in cities with a higher risk of on-the-job injury or death.

5. Do parental consent laws increase the teenage birthrate? You can use state-level data for this: either a time series for a given state or, even better, a panel data set of states. Do the same laws reduce abortion rates among teenagers? The Statistical Abstract of the United States contains all kinds of state-level data. Levine, Trainor, and Zimmerman (1996) studied the effects of abortion funding restrictions on similar outcomes. Other factors, such as access to abortions, may affect teen birth and abortion rates. There is also recent interest in the effects of abstinence-only sex education curricula. One can again use state-level panel data, or maybe even panel data at the school district level, to determine the effects of abstinence-only approaches to sex education on various outcomes, including rates of sexually transmitted diseases and teen birthrates.

6. Do changes in traffic laws affect traffic fatalities? McCarthy (1994) contains an analysis of monthly time series data for the state of California. A set of dummy variables can be used to indicate the months in which certain laws were in effect. The file TRAFFIC2 contains the data used by McCarthy. An alternative is to obtain a panel data set on states in the United States, where you can exploit variation in laws across states, as well as across time. Freeman (2007) is a good example of a state-level analysis, using 25 years of data that straddle changes in various state drunk driving, seat belt, and speed limit laws. The data can be found in the file DRIVING. Mullahy and Sindelar (1994) used individual-level data matched with state laws and taxes on alcohol to estimate the effects of laws and taxes on the probability of driving drunk.

7. Are blacks discriminated against in the lending market? Hunter and Walker (1996) looked at this question; in fact, we used their data in Computer Exercises C8 in Chapter 7 and C2 in Chapter 17.

8. Is there a marriage premium for professional athletes? Korenman and Neumark (1991) found a significant wage premium for married men after using a variety of econometric methods, but their analysis is limited because they cannot directly observe productivity. (Plus, Korenman and Neumark used men in a variety of occupations.) Professional athletes provide an interesting group in which to study the marriage premium because we can easily collect data on various productivity measures, in addition to salary. The data set NBASAL, on players in the National Basketball Association (NBA), is one example. For each player, we have information on points scored, rebounds, assists, playing time, and demographics. As in Computer Exercise C9 in Chapter 6, we can use multiple regression analysis to test whether the productivity measures differ by marital status. We can also use this kind of data to test whether married men are paid more after we account for productivity differences. For example, NBA owners may think that married men bring stability to the team, or are better for the team image. For individual sports, such as golf and tennis, annual earnings directly reflect productivity. Such data, along with age and experience, are relatively easy to collect.
9. Answer this question: Are cigarette smokers less productive? A variant on this is: Do workers who smoke take more sick days (everything else being equal)? Mullahy and Portney (1990) use individual-level data to evaluate this question. You could use data at, say, the metropolitan level. Something like average productivity in manufacturing can be related to percentage of manufacturing workers who smoke. Other variables, such as average worker education, capital per worker, and size of the city (you can think of more), should be controlled for.

10. Do minimum wages alleviate poverty? You can use state or county data to answer this question. The idea is that the minimum wage varies across states because some states have higher minimums than the federal minimum. Further, there are changes over time in the nominal minimum within a state, some due to changes at the federal level and some because of changes at the state level. Neumark and Wascher (1995) used a panel data set on states to estimate the effects of the minimum wage on the employment rates of young workers, as well as on school enrollment rates.

11. What factors affect student performance at public schools? It is fairly easy to get school-level or at least district-level data in most states. Does spending per student matter? Do student-teacher ratios have any effects? It is difficult to estimate ceteris paribus effects because spending is related to other factors, such as family incomes or poverty rates. The data set MEAP93, for Michigan high schools, contains a measure of the poverty rates. Another possibility is to use panel data, or at least to control for a previous year's performance measure (such as average test score or percentage of students passing an exam).
You can look at less obvious factors that affect student performance. For example, after controlling for income, does family structure matter? Perhaps families with two parents, but only one working for a wage, have a positive effect on performance. (There could be at least two channels: parents spend more time with the children, and they might also volunteer at school.) What about the effect of single-parent households, controlling for income and other factors? You can merge census data for one or two years with school district data.
Do public schools with more charter or private schools nearby better educate their students because of competition? There is a tricky simultaneity issue here because private schools are probably located in areas where the public schools are already poor. Hoxby (1994) used an instrumental variables approach, where population proportions of various religions were IVs for the number of private schools. Rouse (1998) studied a different question: Did students who were able to attend a private school due to the Milwaukee voucher program perform better than those who did not? She used panel data and was able to control for an unobserved student effect. A subset of Rouse's data is contained in the file VOUCHER.
12. Can excess returns on a stock, or a stock index, be predicted by the lagged price/dividend ratio? Or by lagged interest rates or weekly monetary policy? It would be interesting to pick a foreign stock index, or one of the less well-known U.S. indexes. Cochrane (1997) provides a nice survey of recent theories and empirical results for explaining excess stock returns.

13. Is there racial discrimination in the market for baseball cards? This involves relating the prices of baseball cards to factors that should affect their prices, such as career statistics, whether the player is in the Hall of Fame, and so on. Holding other factors fixed, do cards of black or Hispanic players sell at a discount?

14. You can test whether the market for gambling on sports is efficient. For example, does the spread on football or basketball games contain all usable information for picking against the spread? The data set PNTSPRD contains information on men's college basketball games. The outcome variable is binary: Was the spread covered or not? Then, you can try to find information that was known prior to each game's being played in order to predict whether the spread is covered. (Good luck!) A useful website that contains historical spreads and outcomes for college football and men's basketball games is www.goldsheet.com.

15. What effect, if any, does success in college athletics have on other aspects of the university (applications, quality of students, quality of nonathletic departments)? McCormick and Tinsley (1987) looked at the effects of athletic success at major colleges on changes in SAT scores of entering freshmen. Timing is important here: presumably, it is recent past success that affects current applications and student quality. One must control for many other factors, such as tuition and measures of school quality, to make the analysis convincing because, without controlling for other factors, there is a negative correlation between academics and athletic performance. A more recent examination of the link between academic and athletic performance is provided by Tucker (2004), who also looks at how alumni contributions are affected by athletic success. A variant is to match natural rivals in football or men's basketball and to look at differences across schools as a function of which school won the football game or one or more basketball games. ATHLET1 and ATHLET2 are small data sets that could be expanded and updated.

16. Collect murder rates for a sample of counties (say, from the FBI Uniform Crime Reports) for two years. Make the latter year such that economic and demographic variables are easy to obtain from the County and City Data Book. You can obtain the total number of people on death row, plus executions, for intervening years at the county level. If the years are 1990 and 1985, you might estimate

$mrdrte90 = \beta_0 + \beta_1\, mrdrte85 + \beta_2\, executions + \text{other factors},$

where interest is in the coefficient on executions. The lagged murder rate and other factors serve as controls.
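As a sketch of how this two-year setup might be estimated, suppose the county data have been assembled into a pandas DataFrame using the (hypothetical) variable names from the equation above; the formula interface in statsmodels then delivers the OLS estimates directly:

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical file: one row per county, with murder rates for 1990 and 1985,
# executions over the intervening years, and any other control variables.
counties = pd.read_csv("county_murders.csv")

# The lagged murder rate controls for persistent county-level factors;
# other controls can be appended to the formula as needed.
model = smf.ols("mrdrte90 ~ mrdrte85 + executions", data=counties).fit()
print(model.summary())
```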
If more than two years of data are obtained, then the panel data methods in Chapters 13 and 14 can be applied. Other factors may also act as a deterrent to crime. For example, Cloninger (1991) presented a cross-sectional analysis of the effects of lethal police response on crime rates.
As a different twist, what factors affect crime rates on college campuses? Does the fraction of students living in fraternities or sororities have an effect? Does the size of the police force matter, or the kind of policing used? (Be careful about inferring causality here.) Does having an escort program help reduce crime? What about crime rates in nearby communities? Recently, colleges and universities have been required to report crime statistics; in previous years, reporting was voluntary.

17. What factors affect manufacturing productivity at the state level? In addition to levels of capital and worker education, you could look at degree of unionization. A panel data analysis would be most convincing here, using multiple years of census data, say, 1980, 1990, 2000, and 2010. Clark (1984) provides an analysis of how unionization affects firm performance and productivity. What other variables might explain productivity?
Firm-level data can be obtained from Compustat. For example, other factors being fixed, do changes in unionization affect the stock price of a firm?

18. Use state- or county-level data or, if possible, school district-level data to look at the factors that affect education spending per pupil. An interesting question is: Other things being equal (such as income and education levels of residents), do districts with a larger percentage of elderly people spend less on schools? Census data can be matched with school district spending data to obtain a very large cross section. The U.S. Department of Education compiles such data.

19. What are the effects of state regulations, such as motorcycle helmet laws, on motorcycle fatalities? Or do differences in boating laws, such as minimum operating age, help to explain boating accident rates? The U.S. Department of Transportation compiles such information. This can be merged with data from the Statistical Abstract of the United States. A panel data analysis seems to be warranted here.

20. What factors affect output growth? Two factors of interest are inflation and investment [for example, Blomström, Lipsey, and Zejan (1996)]. You might use time series data on a country you find interesting. Or you could use a cross section of countries, as in De Long and Summers (1991). Friedman and Kuttner (1992) found evidence that, at least in the 1980s, the spread between the commercial paper rate and the Treasury bill rate affects real output.

21. What is the behavior of mergers in the U.S. economy (or some other economy)? Shughart and Tollison (1984) characterize (the log of) annual mergers in the U.S. economy as a random walk by showing that the difference in logs (roughly, the growth rate) is unpredictable given past growth rates. Does this still hold? Does it hold across various industries? What past measures of economic activity can be used to forecast mergers?

22. What factors might explain racial and gender differences in employment and wages? For example, Holzer (1991) reviewed the evidence on the spatial mismatch hypothesis to explain differences in employment rates between blacks and whites. Korenman and Neumark (1992) examined the effects of childbearing on women's wages, while Hersch and Stratton (1997) looked at the effects of household responsibilities on men's and women's wages.
23. Obtain monthly or quarterly data on teenage employment rates, the minimum wage, and factors that affect teen employment to estimate the effects of the minimum wage on teen employment. Solon (1985) used quarterly U.S. data, while Castillo-Freeman and Freeman (1992) used annual data on Puerto Rico. It might be informative to analyze time series data on a low-wage state in the United States, where changes in the minimum wage are likely to have the largest effect.

24. At the city level, estimate a time series model for crime. An example is Cloninger and Sartorius (1979). As a twist, you might estimate the effects of community policing or midnight basketball programs, relatively new innovations in fighting crime. Inferring causality is tricky. Including a lagged dependent variable might be helpful. Because you are using time series data, you should be aware of the spurious regression problem.
Grogger (1990) used data on daily homicide counts to estimate the deterrent effects of capital punishment. Might there be other factors, such as news on lethal response by police, that have an effect on daily crime counts?

25. Are there aggregate productivity effects of computer usage? You would need to obtain time series data, perhaps at the national level, on productivity, percentage of employees using computers, and other factors. What about spending (probably as a fraction of total sales) on research and development? What sociological factors (for example, alcohol usage or divorce rates) might affect productivity?

26. What factors affect chief executive officer salaries? The files CEOSAL1 and CEOSAL2 are data sets that have various firm performance measures, as well as information such as tenure and education. You can certainly update these data files and look for other interesting factors. Rose and Shepard (1997) considered firm diversification as one important determinant of CEO compensation.

27. Do differences in tax codes across states affect the amount of foreign direct investment? Hines (1996) studied the effects of state corporate taxes, along with the ability to apply foreign tax credits, on investment from outside the United States.

28. What factors affect election outcomes? Does spending matter? Do votes on specific issues matter? Does the state of the local economy matter? See, for example, Levitt (1994) and the data sets VOTE1 and VOTE2. Fair (1996) performed a time series analysis of U.S. presidential elections.

29. Test whether stores or restaurants practice price discrimination based on race or ethnicity. Graddy (1997) used data on fast-food restaurants in New Jersey and Pennsylvania, along with ZIP code-level characteristics, to see whether prices vary by characteristics of the local population. She found that prices of standard items, such as sodas, increase when the fraction of black residents increases. Her data are contained in the file DISCRIM. You can collect similar data in your local area by surveying stores or restaurants for prices of common items and matching those with recent census data. See Graddy's paper for details of her analysis.
30. Do your own audit study to test for race or gender discrimination in hiring. (One such study is described in Example C.3 of Appendix C.) Have pairs of equally qualified friends, say, one male and one female, apply for job openings in local bars or restaurants. You can provide them with phony résumés that give each the same experience and background, where the only difference is gender (or race). Then, you can keep track of who gets the interviews and job offers. Neumark (1996) described one such study conducted in Philadelphia. A variant would be to test whether general physical attractiveness or a specific characteristic, such as being obese or having visible tattoos or body piercings, plays a role in hiring decisions. You would want to use the same gender in the matched pairs, and it may not be easy to get volunteers for such a study.

31. Following Hamermesh and Parker (2005), try to establish a link between the physical appearance of college instructors and student evaluations. This can be done on campus via a survey. Somewhat crude data can be obtained from websites that allow students to rank their professors and provide some information about appearance. Ideally, though, any evaluations of attractiveness are not done by current or former students, as those evaluations can be influenced by the grade received.

32. Use panel data to study the effects of various economic policies on regional economic growth. Studying the effects of taxes and spending is natural, but other policies may be of interest. For example, Craig, Jackson, and Thomson (2007) study the effects of Small Business Association Loan Guarantee programs on per capita income growth.

33. Blinder and Watson (2014) have recently studied explanations for systematic differences in economic variables, particularly growth in real GDP, in the United States based on the political party of the sitting president. One might update the data to the most recent quarters and also study variables other than GDP, such as unemployment.

List of Journals
The following is a partial list of popular journals containing empirical research in business, economics, and other social sciences. A complete list of journals can be found on the Internet at http://www.econlit.org.

American Economic Journal: Applied Economics
American Economic Journal: Economic Policy
American Economic Review
American Journal of Agricultural Economics
American Political Science Review
Applied Economics
Brookings Papers on Economic Activity
Canadian Journal of Economics
Demography
Economic Development and Cultural Change
Economic Inquiry
Economica
Economics of Education Review
Education Finance and Policy
Economics Letters
Empirical Economics
Federal Reserve Bulletin
International Economic Review
International Tax and Public Finance
Journal of Applied Econometrics
Journal of Business and Economic Statistics
Journal of Development Economics
Journal of Economic Education
Journal of Empirical Finance
Journal of Environmental Economics and Management
Journal of Finance
Journal of Health Economics
Journal of Human Resources
Journal of Industrial Economics
Journal of International Economics
Journal of Labor Economics
Journal of Monetary Economics
Journal of Money, Credit and Banking
Journal of Political Economy
Journal of Public Economics
Journal of Quantitative Criminology
Journal of Urban Economics
National Bureau of Economic Research Working Papers Series
National Tax Journal
Public Finance Quarterly
Quarterly Journal of Economics
Regional Science and Urban Economics
Review of Economic Studies
Review of Economics and Statistics

Data Sources
Numerous data sources are available throughout the world. Governments of most countries compile a wealth of data; some general and easily accessible data sources for the United States, such as the Economic Report of the President, the Statistical Abstract of the United States, and the County and City Data Book, have already been mentioned. International financial data on many countries are published annually in International Financial Statistics. Various magazines, like BusinessWeek and US News and World Report, often publish statistics, such as CEO salaries and firm performance, or rankings of academic programs, that are novel and can be used in an econometric analysis.
Rather than attempting to provide a list here, we instead give some Internet addresses that are comprehensive sources for economists. A very useful site for economists, called Resources for Economists on the Internet, is maintained by Bill Goffe at Pennsylvania State University. The address is http://www.rfe.org. This site provides links to journals, data sources, and lists of professional and academic economists. It is quite simple to use.
Another very useful site is http://econometriclinks.com, which contains links to lots of data sources, as well as to other sites of interest to empirical economists. In addition, the Journal of Applied Econometrics and the Journal of Business and Economic Statistics have data archives that contain data sets used in most papers published in the journals over the past several years. If you find a data set that interests you, this is a good way to go, as much of the cleaning and formatting of the data have already been done. The downside is that some of these data sets are used in econometric analyses that are more advanced than we have learned about in this text. On the other hand, it is often useful to estimate simpler models using standard econometric methods, for comparison.
Many universities, such as the University of California-Berkeley, the University of Michigan, and the University of Maryland, maintain very extensive data sets, as well as links to a variety of data sets. Your own library possibly contains an extensive set of links to databases in business, economics, and the other social sciences. The regional Federal Reserve banks, such as the one in St. Louis, manage a variety of data. The National Bureau of Economic Research posts data sets used by some of its researchers. State and federal governments now publish a wealth of data that can be accessed via the Internet. Census data are publicly available from the U.S. Census Bureau. (Two useful publications are the Economic Census, published in years ending with two and seven, and the Census of Population and Housing, published at the beginning of each decade.) Other agencies, such as the U.S. Department of Justice, also make data available to the public.
Appendix A
Basic Mathematical Tools

This appendix covers some basic mathematics that are used in econometric analysis. We summarize various properties of the summation operator, study properties of linear and certain nonlinear equations, and review proportions and percentages. We also present some special functions that often arise in applied econometrics, including quadratic functions and the natural logarithm. The first four sections require only basic algebra skills. Section A.5 contains a brief review of differential calculus; although a knowledge of calculus is not necessary to understand most of the text, it is used in some end-of-chapter appendices and in several of the more advanced chapters in Part 3.

A.1 The Summation Operator and Descriptive Statistics
The summation operator is a useful shorthand for manipulating expressions involving the sums of many numbers, and it plays a key role in statistics and econometric analysis. If $\{x_i: i = 1, \ldots, n\}$ denotes a sequence of $n$ numbers, then we write the sum of these numbers as

$\sum_{i=1}^{n} x_i \equiv x_1 + x_2 + \cdots + x_n.$   (A.1)

With this definition, the summation operator is easily shown to have the following properties:

Property Sum.1: For any constant $c$,

$\sum_{i=1}^{n} c = nc.$   (A.2)

Property Sum.2: For any constant $c$,

$\sum_{i=1}^{n} c x_i = c \sum_{i=1}^{n} x_i.$   (A.3)

Property Sum.3: If $\{(x_i, y_i): i = 1, 2, \ldots, n\}$ is a set of $n$ pairs of numbers, and $a$ and $b$ are constants, then

$\sum_{i=1}^{n} (a x_i + b y_i) = a \sum_{i=1}^{n} x_i + b \sum_{i=1}^{n} y_i.$   (A.4)

It is also important to be aware of some things that cannot be done with the summation operator. Let $\{(x_i, y_i): i = 1, 2, \ldots, n\}$ again be a set of $n$ pairs of numbers with $y_i \neq 0$ for each $i$. Then,

$\sum_{i=1}^{n} (x_i / y_i) \neq \left( \sum_{i=1}^{n} x_i \right) \Big/ \left( \sum_{i=1}^{n} y_i \right).$

In other words, the sum of the ratios is not the ratio of the sums. In the $n = 2$ case, the application of familiar elementary algebra also reveals this lack of equality: $x_1/y_1 + x_2/y_2 \neq (x_1 + x_2)/(y_1 + y_2)$. Similarly, the sum of the squares is not the square of the sum: $\sum_{i=1}^{n} x_i^2 \neq \left( \sum_{i=1}^{n} x_i \right)^2$, except in special cases. That these two quantities are not generally equal is easiest to see when $n = 2$: $x_1^2 + x_2^2 \neq (x_1 + x_2)^2 = x_1^2 + 2 x_1 x_2 + x_2^2$.
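These properties, and the two "non-properties," are easy to verify numerically; here is a minimal sketch with made-up numbers:

```python
import numpy as np

x = np.array([6.0, 1.0, -2.0, 0.0, 5.0])
y = np.array([2.0, 3.0, 1.0, 4.0, 2.0])
a, b, c = 2.0, -1.0, 3.0

# Properties Sum.1 through Sum.3
print(np.sum(np.full(5, c)), 5 * c)                          # (A.2): sum of a constant is n*c
print(np.sum(c * x), c * np.sum(x))                          # (A.3): constants factor out
print(np.sum(a * x + b * y), a * np.sum(x) + b * np.sum(y))  # (A.4)

# What cannot be done: sum of ratios vs. ratio of sums,
# and sum of squares vs. square of the sum
print(np.sum(x / y), np.sum(x) / np.sum(y))
print(np.sum(x ** 2), np.sum(x) ** 2)
```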
Given $n$ numbers $\{x_i: i = 1, \ldots, n\}$, we compute their average or mean by adding them up and dividing by $n$:

$\bar{x} = (1/n) \sum_{i=1}^{n} x_i.$   (A.5)

When the $x_i$ are a sample of data on a particular variable (such as years of education), we often call this the sample average, or sample mean, to emphasize that it is computed from a particular set of data. The sample average is an example of a descriptive statistic; in this case, the statistic describes the central tendency of the set of points $x_i$.
There are some basic properties about averages that are important to understand. First, suppose we take each observation on $x$ and subtract off the average: $d_i \equiv x_i - \bar{x}$ (the "d" here stands for deviation from the average). Then, the sum of these deviations is always zero:

$\sum_{i=1}^{n} d_i = \sum_{i=1}^{n} (x_i - \bar{x}) = \sum_{i=1}^{n} x_i - \sum_{i=1}^{n} \bar{x} = \sum_{i=1}^{n} x_i - n\bar{x} = n\bar{x} - n\bar{x} = 0.$

We summarize this as

$\sum_{i=1}^{n} (x_i - \bar{x}) = 0.$   (A.6)

A simple numerical example shows how this works. Suppose $n = 5$ and $x_1 = 6$, $x_2 = 1$, $x_3 = -2$, $x_4 = 0$, and $x_5 = 5$. Then, $\bar{x} = 2$, and the demeaned sample is $\{4, -1, -4, -2, 3\}$. Adding these gives zero, which is just what equation (A.6) says.
In our treatment of regression analysis in Chapter 2, we need to know some additional algebraic facts involving deviations from sample averages. An important one is that the sum of squared deviations is the sum of the squared $x_i$ minus $n$ times the square of $\bar{x}$:

$\sum_{i=1}^{n} (x_i - \bar{x})^2 = \sum_{i=1}^{n} x_i^2 - n(\bar{x})^2.$   (A.7)

This can be shown using basic properties of the summation operator:

$\sum_{i=1}^{n} (x_i - \bar{x})^2 = \sum_{i=1}^{n} (x_i^2 - 2 x_i \bar{x} + \bar{x}^2) = \sum_{i=1}^{n} x_i^2 - 2\bar{x} \sum_{i=1}^{n} x_i + n(\bar{x})^2 = \sum_{i=1}^{n} x_i^2 - 2n(\bar{x})^2 + n(\bar{x})^2 = \sum_{i=1}^{n} x_i^2 - n(\bar{x})^2.$

Given a data set on two variables, $\{(x_i, y_i): i = 1, 2, \ldots, n\}$, it can also be shown that

$\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) = \sum_{i=1}^{n} x_i (y_i - \bar{y}) = \sum_{i=1}^{n} (x_i - \bar{x}) y_i = \sum_{i=1}^{n} x_i y_i - n(\bar{x}\bar{y});$   (A.8)

this is a generalization of equation (A.7). [There, $y_i = x_i$ for all $i$.]
The average is the measure of central tendency that we will focus on in most of this text. However, it is sometimes informative to use the median (or sample median) to describe the central value. To obtain the median of the $n$ numbers $\{x_1, \ldots, x_n\}$, we first order the values of the $x_i$ from smallest to largest. Then, if $n$ is odd, the sample median is the middle number of the ordered observations. For example, given the numbers $\{-4, 8, 2, 0, 21, -10, 18\}$, the median value is 2 (because the ordered sequence is $\{-10, -4, 0, 2, 8, 18, 21\}$). If we change the largest number in this list, 21, to twice its value, 42, the median is still 2. By contrast, the sample average would increase from 5 to 8, a sizable change. Generally, the median is less sensitive than the average to changes in the extreme values (large or small) in a list of numbers. This is why median incomes or median housing values are often reported, rather than averages, when summarizing income or housing values in a city or county.
If $n$ is even, there is no unique way to define the median because there are two numbers at the center. Usually, the median is defined to be the average of the two middle values (again, after ordering the numbers from smallest to largest). Using this rule, the median for the set of numbers $\{4, 12, 2, 6\}$ would be $(4 + 6)/2 = 5$.
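Equations (A.6) and (A.7), and the relative robustness of the median, can be checked with the same numbers used in the examples above; a minimal sketch:

```python
import numpy as np

x = np.array([6.0, 1.0, -2.0, 0.0, 5.0])
xbar = x.mean()                        # 2.0

d = x - xbar                           # demeaned sample: 4, -1, -4, -2, 3
print(d.sum())                         # equation (A.6): exactly zero

# Equation (A.7): sum of squared deviations = sum of squares - n * xbar^2
print(np.sum(d ** 2), np.sum(x ** 2) - len(x) * xbar ** 2)

# The median resists extreme values; the mean does not
z = np.array([-4.0, 8.0, 2.0, 0.0, 21.0, -10.0, 18.0])
print(np.median(z), z.mean())          # 2.0 and 5.0
z[np.argmax(z)] = 42.0                 # change the largest value, 21, to 42
print(np.median(z), z.mean())          # median still 2.0; mean jumps to 8.0
```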
A.2 Properties of Linear Functions
Linear functions play an important role in econometrics because they are simple to interpret and manipulate. If $x$ and $y$ are two variables related by

$y = \beta_0 + \beta_1 x,$   (A.9)

then we say that $y$ is a linear function of $x$, and $\beta_0$ and $\beta_1$ are two parameters (numbers) describing this relationship. The intercept is $\beta_0$, and the slope is $\beta_1$.
The defining feature of a linear function is that the change in $y$ is always $\beta_1$ times the change in $x$:

$\Delta y = \beta_1 \Delta x,$   (A.10)

where $\Delta$ denotes "change." In other words, the marginal effect of $x$ on $y$ is constant and equal to $\beta_1$.

Example A.1: Linear Housing Expenditure Function
Suppose that the relationship between monthly housing expenditure and monthly income is

$housing = 164 + .27\, income.$   (A.11)

Then, for each additional dollar of income, 27 cents is spent on housing. If family income increases by $200, then housing expenditure increases by $(.27)(200) = \$54$. This function is graphed in Figure A.1.

[Figure A.1: Graph of housing = 164 + .27 income; the slope is Δhousing/Δincome = .27.]

According to equation (A.11), a family with no income spends $164 on housing, which of course cannot be literally true. For low levels of income, this linear function would not describe the relationship between housing and income very well, which is why we will eventually have to use other types of functions to describe such relationships.
In (A.11), the marginal propensity to consume (MPC) housing out of income is .27. This is different from the average propensity to consume (APC), which is

$housing/income = 164/income + .27.$

The APC is not constant: it is always larger than the MPC, and it gets closer to the MPC as income increases.
Linear functions are easily defined for more than two variables. Suppose that $y$ is related to two variables, $x_1$ and $x_2$, in the general form

$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2.$   (A.12)

It is rather difficult to envision this function because its graph is three-dimensional. Nevertheless, $\beta_0$ is still the intercept (the value of $y$ when $x_1 = 0$ and $x_2 = 0$), and $\beta_1$ and $\beta_2$ measure particular slopes. From (A.12), the change in $y$, for given changes in $x_1$ and $x_2$, is

$\Delta y = \beta_1 \Delta x_1 + \beta_2 \Delta x_2.$   (A.13)

If $x_2$ does not change, that is, $\Delta x_2 = 0$, then we have $\Delta y = \beta_1 \Delta x_1$ if $\Delta x_2 = 0$, so that $\beta_1$ is the slope of the relationship in the direction of $x_1$:

$\beta_1 = \Delta y / \Delta x_1$ if $\Delta x_2 = 0.$

Because it measures how $y$ changes with $x_1$, holding $x_2$ fixed, $\beta_1$ is often called the partial effect of $x_1$ on $y$. Because the partial effect involves holding other factors fixed, it is closely linked to the notion of ceteris paribus. The parameter $\beta_2$ has a similar interpretation: $\beta_2 = \Delta y / \Delta x_2$ if $\Delta x_1 = 0$, so that $\beta_2$ is the partial effect of $x_2$ on $y$.
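A minimal sketch makes the partial effect concrete; it extends the housing function from Example A.1 with a hypothetical second regressor (family size, with a made-up coefficient of 15, purely for illustration):

```python
def housing(income, famsize):
    # Equation (A.11) plus a hypothetical famsize term for illustration
    return 164 + 0.27 * income + 15.0 * famsize

# Partial effect of income: raise income by 200 while holding famsize fixed
print(housing(1200, 4) - housing(1000, 4))   # 54.0 = (.27)(200), at any income level

# Partial effect of famsize: raise famsize by 1 while holding income fixed
print(housing(1000, 5) - housing(1000, 4))   # 15.0, the made-up coefficient
```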
Example A.2: Demand for Compact Discs
For college students, suppose that the monthly quantity demanded of compact discs is related to the price of compact discs and monthly discretionary income by

$quantity = 120 - 9.8\, price + .03\, income,$

where price is dollars per disc and income is measured in dollars. The demand curve is the relationship between quantity and price, holding income (and other factors) fixed. This is graphed in two dimensions in Figure A.2 at an income level of $900. The slope of the demand curve, $-9.8$, is the partial effect of price on quantity: holding income fixed, if the price of compact discs increases by one dollar, then the quantity demanded falls by 9.8. (We abstract from the fact that CDs can only be purchased in discrete units.) An increase in income simply shifts the demand curve up (changes the intercept), but the slope remains the same.

[Figure A.2: Graph of quantity = 120 − 9.8 price + .03 income, with income fixed at $900; the quantity intercept is 147, the price intercept is 15, and Δquantity/Δprice = −9.8.]

A.3 Proportions and Percentages
Proportions and percentages play such an important role in applied economics that it is necessary to become very comfortable in working with them. Many quantities reported in the popular press are in the form of percentages; a few examples are interest rates, unemployment rates, and high school graduation rates.
An important skill is being able to convert proportions to percentages and vice versa. A percentage is easily obtained by multiplying a proportion by 100. For example, if the proportion of adults in a county with a high school degree is .82, then we say that 82% (82 percent) of adults have a high school degree. Another way to think of percentages and proportions is that a proportion is the decimal form of a percentage. For example, if the marginal tax rate for a family earning $30,000 per year is reported as 28%, then the proportion of the next dollar of income that is paid in income taxes is .28 (or 28 cents).
When using percentages, we often need to convert them to decimal form. For example, if a state sales tax is 6% and $200 is spent on a taxable item, then the sales tax paid is 200(.06) = 12 dollars. If the annual return on a certificate of deposit (CD) is 7.6% and we invest $3,000 in such a CD at the beginning of the year, then our interest income is 3,000(.076) = 228 dollars. As much as we would like it, the interest income is not obtained by multiplying 3,000 by 7.6.
We must be wary of proportions that are sometimes incorrectly reported as percentages in the popular media. If we read, "The percentage of high school students who drink alcohol is .57," we know that this really means 57% (not just over one-half of a percent, as the statement literally implies). College volleyball fans are probably familiar with press clips containing statements such as "Her hitting percentage was .372." This really means that her hitting percentage was 37.2%.
In econometrics, we are often interested in measuring the changes in various quantities. Let $x$ denote some variable, such as an individual's income, the number of crimes committed in a community, or the profits of a firm. Let $x_0$ and $x_1$ denote two values for $x$: $x_0$ is the initial value, and $x_1$ is the subsequent value. For example, $x_0$ could be the annual income of an individual in 1994 and $x_1$ the income of the same individual in 1995. The proportionate change in $x$ in moving from $x_0$ to $x_1$, sometimes called the relative change, is simply

$(x_1 - x_0)/x_0 = \Delta x / x_0,$   (A.14)

assuming, of course, that $x_0 \neq 0$.
In other words, to get the proportionate change, we simply divide the change in $x$ by its initial value. This is a way of standardizing the change so that it is free of units. For example, if an individual's income goes from $30,000 per year to $36,000 per year, then the proportionate change is $6{,}000/30{,}000 = .20$.
It is more common to state changes in terms of percentages. The percentage change in $x$ in going from $x_0$ to $x_1$ is simply 100 times the proportionate change:

$\%\Delta x = 100 (\Delta x / x_0);$   (A.15)

the notation "$\%\Delta x$" is read as "the percentage change in $x$." For example, when income goes from $30,000 to $33,750, income has increased by 12.5%; to get this, we simply multiply the proportionate change, .125, by 100.
Again, we must be on guard for proportionate changes that are reported as percentage changes. In the previous example, for instance, reporting the percentage change in income as .125 is incorrect and could lead to confusion.
When we look at changes in things like dollar amounts or population, there is no ambiguity about what is meant by a percentage change. By contrast, interpreting percentage change calculations can be tricky when the variable of interest is itself a percentage, something that happens often in economics and other social sciences. To illustrate, let $x$ denote the percentage of adults in a particular city having a college education. Suppose the initial value is $x_0 = 24$ (24% have a college education), and the new value is $x_1 = 30$. We can compute two quantities to describe how the percentage of college-educated people has changed. The first is the change in $x$, $\Delta x$. In this case, $\Delta x = x_1 - x_0 = 6$: the percentage of people with a college education has increased by six percentage points. On the other hand, we can compute the percentage change in $x$ using equation (A.15): $\%\Delta x = 100[(30 - 24)/24] = 25$.
In this example, the percentage point change and the percentage change are very different. The percentage point change is just the change in the percentages. The percentage change is the change relative to the initial value. Generally, we must pay close attention to which number is being computed. The careful researcher makes this distinction perfectly clear; unfortunately, in the popular press as well as in academic research, the type of reported change is often unclear.

Example A.3: Michigan Sales Tax Increase
In March 1994, Michigan voters approved a sales tax increase from 4% to 6%. In political advertisements, supporters of the measure referred to this as a two percentage point increase, or an increase of two cents on the dollar. Opponents of the tax increase called it a 50% increase in the sales tax rate. Both claims are correct; they are simply different ways of measuring the increase in the sales tax. Naturally, each group reported the measure that made its position most favorable.

For a variable such as salary, it makes no sense to talk of a "percentage point change in salary" because salary is not measured as a percentage. We can describe a change in salary either in dollar or percentage terms.
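Because the two measures are so easily confused, a short sketch makes both computations explicit for the college-education example (24 to 30) and for the sales tax change in Example A.3 (4 to 6):

```python
def percentage_point_change(x0, x1):
    return x1 - x0                        # the change in the percentage itself

def percentage_change(x0, x1):
    return 100.0 * (x1 - x0) / x0         # the change relative to the initial value

print(percentage_point_change(24, 30), percentage_change(24, 30))  # 6 points vs. 25%
print(percentage_point_change(4, 6), percentage_change(4, 6))      # 2 points vs. 50%
```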
A.4 Some Special Functions and Their Properties
In Section A.2, we reviewed the basic properties of linear functions. We already indicated one important feature of functions like $y = \beta_0 + \beta_1 x$: a one-unit change in $x$ results in the same change in $y$, regardless of the initial value of $x$. As we noted earlier, this is the same as saying the marginal effect of $x$ on $y$ is constant, something that is not realistic for many economic relationships. For example, the important economic notion of diminishing marginal returns is not consistent with a linear relationship.
In order to model a variety of economic phenomena, we need to study several nonlinear functions. A nonlinear function is characterized by the fact that the change in $y$ for a given change in $x$ depends on the starting value of $x$. Certain nonlinear functions appear frequently in empirical economics, so it is important to know how to interpret them. A complete understanding of nonlinear functions takes us into the realm of calculus. Here, we simply summarize the most significant aspects of the functions, leaving the details of some derivations for Section A.5.

A.4a Quadratic Functions
One simple way to capture diminishing returns is to add a quadratic term to a linear relationship. Consider the equation

$y = \beta_0 + \beta_1 x + \beta_2 x^2,$   (A.16)

where $\beta_0$, $\beta_1$, and $\beta_2$ are parameters. When $\beta_1 > 0$ and $\beta_2 < 0$, the relationship between $y$ and $x$ has the parabolic shape given in Figure A.3, where $\beta_0 = 6$, $\beta_1 = 8$, and $\beta_2 = -2$.
When $\beta_1 > 0$ and $\beta_2 < 0$, it can be shown (using calculus in the next section) that the maximum of the function occurs at the point

$x^* = -\beta_1/(2\beta_2).$   (A.17)

For example, if $y = 6 + 8x - 2x^2$ (so $\beta_1 = 8$ and $\beta_2 = -2$), then the largest value of $y$ occurs at $x^* = 8/4 = 2$, and this value is $6 + 8(2) - 2(2)^2 = 14$ (see Figure A.3).
The fact that equation (A.16) implies a diminishing marginal effect of $x$ on $y$ is easily seen from its graph. Suppose we start at a low value of $x$ and then increase $x$ by some amount, say, $c$. This has a larger effect on $y$ than if we start at a higher value of $x$ and increase $x$ by the same amount $c$. In fact, once $x > x^*$, an increase in $x$ actually decreases $y$.
The statement that $x$ has a diminishing marginal effect on $y$ is the same as saying that the slope of the function in Figure A.3 decreases as $x$ increases. Although this is clear from looking at the graph, we usually want to quantify how quickly the slope is changing. An application of calculus gives the approximate slope of the quadratic function as

slope $= \Delta y / \Delta x \approx \beta_1 + 2\beta_2 x,$   (A.18)

for "small" changes in $x$. [The right-hand side of equation (A.18) is the derivative of the function in equation (A.16) with respect to $x$.] Another way to write this is

$\Delta y \approx (\beta_1 + 2\beta_2 x)\Delta x$ for "small" $\Delta x$.   (A.19)

To see how well this approximation works, consider again the function $y = 6 + 8x - 2x^2$. Then, according to equation (A.19), $\Delta y \approx (8 - 4x)\Delta x$. Now, suppose we start at $x = 1$ and change $x$ by $\Delta x = .1$. Using (A.19), $\Delta y \approx (8 - 4)(.1) = .4$. Of course, we can compute the change exactly by finding the values of $y$ when $x = 1$ and $x = 1.1$: $y_0 = 6 + 8(1) - 2(1)^2 = 12$ and $y_1 = 6 + 8(1.1) - 2(1.1)^2 = 12.38$, so the exact change in $y$ is .38. The approximation is pretty close in this case.
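The quality of the approximation in (A.19) is easy to explore numerically. A minimal sketch for $y = 6 + 8x - 2x^2$ reproduces the comparison just made, along with the larger change considered next:

```python
def f(x):
    return 6 + 8 * x - 2 * x ** 2

def approx_change(x, dx):
    # Equation (A.19): dy ~ (b1 + 2*b2*x) * dx, which is (8 - 4x) * dx here
    return (8 - 4 * x) * dx

for dx in (0.1, 0.5):
    exact = f(1 + dx) - f(1)
    print(dx, approx_change(1, dx), round(exact, 4))
# dx = 0.1: approximation .4 vs. exact .38 (close)
# dx = 0.5: approximation 2.0 vs. exact 1.5 (worse for the larger change)
```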
Now, suppose we start at $x = 1$ but change $x$ by a larger amount: $\Delta x = .5$. Then, the approximation gives $\Delta y \approx 4(.5) = 2$. The exact change is determined by finding the difference in $y$ when $x = 1$ and $x = 1.5$. The former value of $y$ was 12, and the latter value is $6 + 8(1.5) - 2(1.5)^2 = 13.5$, so the actual change is 1.5 (not 2). The approximation is worse in this case because the change in $x$ is larger.

[Figure A.3: Graph of y = 6 + 8x − 2x².]

For many applications, equation (A.19) can be used to compute the approximate marginal effect of $x$ on $y$ for any initial value of $x$ and small changes. And, we can always compute the exact change if necessary.

Example A.4: A Quadratic Wage Function
Suppose the relationship between hourly wages and years in the workforce (exper) is given by

$wage = 5.25 + .48\, exper - .008\, exper^2.$   (A.20)

This function has the same general shape as the one in Figure A.3. Using equation (A.17), exper has a positive effect on wage up to the turning point, $exper^* = .48/[2(.008)] = 30$. The first year of experience is worth approximately .48, or 48 cents [see (A.19) with $x = 0$, $\Delta x = 1$]. Each additional year of experience increases wage by less than the previous year, reflecting a diminishing marginal return to experience. At 30 years, an additional year of experience would actually lower the wage. This is not very realistic, but it is one of the consequences of using a quadratic function to capture a diminishing marginal effect: at some point, the function must reach a maximum and curve downward. For practical purposes, the point at which this happens is often large enough to be inconsequential, but not always.

The graph of the quadratic function in (A.16) has a U-shape if $\beta_1 < 0$ and $\beta_2 > 0$, in which case there is an increasing marginal return. The minimum of the function is at the point $-\beta_1/(2\beta_2)$.

A.4b The Natural Logarithm
The nonlinear function that plays the most important role in econometric analysis is the natural logarithm. In this text, we denote the natural logarithm, which we often refer to simply as the log function, as

$y = \log(x).$   (A.21)

You might remember learning different symbols for the natural log; $\ln(x)$ or $\log_e(x)$ are the most common. These different notations are useful when logarithms with several different bases are being used. For our purposes, only the natural logarithm is important, and so $\log(x)$ denotes the natural logarithm throughout this text. This corresponds to the notational usage in many statistical packages, although some use $\ln(x)$ (and most calculators use ln(x)). Economists use both $\log(x)$ and $\ln(x)$, which is useful to know when you are reading papers in applied economics.
The function $y = \log(x)$ is defined only for $x > 0$, and it is plotted in Figure A.4. It is not very important to know how the values of $\log(x)$ are obtained. For our purposes, the function can be thought of as a black box: we can plug in any $x > 0$ and obtain $\log(x)$ from a calculator or a computer.

[Figure A.4: Graph of y = log(x).]

Several things are apparent from Figure A.4. First, when $y = \log(x)$, the relationship between $y$ and $x$ displays diminishing marginal returns. One important difference between the log and the quadratic function in Figure A.3 is that when $y = \log(x)$, the effect of $x$ on $y$ never becomes negative: the slope of the function gets closer and closer to zero as $x$ gets large, but the slope never quite reaches zero and certainly never becomes negative.
The following are also apparent from Figure A.4:

$\log(x) < 0$ for $0 < x < 1$
$\log(1) = 0$
$\log(x) > 0$ for $x > 1.$

In particular, $\log(x)$ can be positive or negative. Some useful algebraic facts about the log function are

$\log(x_1 \cdot x_2) = \log(x_1) + \log(x_2)$, $x_1, x_2 > 0$
$\log(x_1 / x_2) = \log(x_1) - \log(x_2)$, $x_1, x_2 > 0$
$\log(x^c) = c \log(x)$, $x > 0$, $c$ any number.

Occasionally, we will need to rely on these properties.
The logarithm can be used for various approximations that arise in econometric applications. First, $\log(1 + x) \approx x$ for $x \approx 0$. You can try this with $x = .02$, .1, and .5 to see how the quality of the approximation deteriorates as $x$ gets larger. Even more useful is the fact that the difference in logs can be used to approximate proportionate changes. Let $x_0$ and $x_1$ be positive values. Then, it can be shown (using calculus) that

$\log(x_1) - \log(x_0) \approx (x_1 - x_0)/x_0 = \Delta x / x_0$   (A.22)

for small changes in $x$. If we multiply equation (A.22) by 100 and write $\Delta \log(x) = \log(x_1) - \log(x_0)$, then

$100 \cdot \Delta \log(x) \approx \%\Delta x$   (A.23)

for small changes in $x$. The meaning of "small" depends on the context, and we will encounter several examples throughout this text.
Why should we approximate the percentage change using (A.23) when the exact percentage change is so easy to compute? Momentarily, we will see why the approximation in (A.23) is useful in econometrics. First, let us see how good the approximation is in two examples.
First, suppose $x_0 = 40$ and $x_1 = 41$. Then, the percentage change in $x$ in moving from $x_0$ to $x_1$ is 2.5%, using $100(x_1 - x_0)/x_0$. Now, $\log(41) - \log(40) = .0247$ to four decimal places, which when multiplied by 100 is very close to 2.5. The approximation works pretty well. Now, consider a much bigger change: $x_0 = 40$ and $x_1 = 60$. The exact percentage change is 50%. However, $\log(60) - \log(40) \approx .4055$, so the approximation gives 40.55%, which is much farther off.
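Both numerical checks are immediate on a computer; the sketch below also shows how the approximation error grows with the size of the change:

```python
import numpy as np

for x0, x1 in ((40, 41), (40, 60)):
    exact = 100 * (x1 - x0) / x0
    approx = 100 * (np.log(x1) - np.log(x0))
    print(f"{x0} -> {x1}: exact {exact}%, log approximation {approx:.2f}%")
# 40 -> 41: exact 2.5%, log approximation 2.47%
# 40 -> 60: exact 50.0%, log approximation 40.55%
```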
Why is the approximation in (A.23) useful if it is only satisfactory for small changes? To build up to the answer, we first define the elasticity of $y$ with respect to $x$ as

$\frac{\Delta y}{\Delta x} \cdot \frac{x}{y} = \frac{\%\Delta y}{\%\Delta x}.$   (A.24)

In other words, the elasticity of $y$ with respect to $x$ is the percentage change in $y$ when $x$ increases by 1%. This notion should be familiar from introductory economics.
If $y$ is a linear function of $x$, $y = \beta_0 + \beta_1 x$, then the elasticity is

$\frac{\Delta y}{\Delta x} \cdot \frac{x}{y} = \beta_1 \cdot \frac{x}{y} = \frac{\beta_1 x}{\beta_0 + \beta_1 x},$   (A.25)

which clearly depends on the value of $x$. (This is a generalization of the well-known result from basic demand theory: the elasticity is not constant along a straight-line demand curve.) Elasticities are of critical importance in many areas of applied economics, not just in demand theory.
It is convenient in many situations to have constant elasticity models, and the log function allows us to specify such models. If we use the approximation in (A.23) for both $x$ and $y$, then the elasticity is approximately equal to $\Delta \log(y) / \Delta \log(x)$. Thus, a constant elasticity model is approximated by the equation

$\log(y) = \beta_0 + \beta_1 \log(x),$   (A.26)

and $\beta_1$ is the elasticity of $y$ with respect to $x$ (assuming that $x, y > 0$).

Example A.5: Constant Elasticity Demand Function
If $q$ is quantity demanded and $p$ is price, and these variables are related by

$\log(q) = 4.7 - 1.25 \log(p),$

then the price elasticity of demand is $-1.25$. Roughly, a 1% increase in price leads to a 1.25% fall in the quantity demanded.

For our purposes, the fact that $\beta_1$ in (A.26) is only close to the elasticity is not important. In fact, when the elasticity is defined using calculus, as in Section A.5, the definition is exact. For the purposes of econometric analysis, (A.26) defines a constant elasticity model. Such models play a large role in empirical economics.
Other possibilities for using the log function often arise in empirical work. Suppose that $y > 0$ and

$\log(y) = \beta_0 + \beta_1 x.$   (A.27)

Then, $\Delta \log(y) = \beta_1 \Delta x$, so $100 \cdot \Delta \log(y) = (100 \cdot \beta_1)\Delta x$. It follows that, when $y$ and $x$ are related by equation (A.27),

$\%\Delta y \approx (100 \cdot \beta_1)\Delta x.$   (A.28)

Example A.6: Logarithmic Wage Equation
Suppose that hourly wage and years of education are related by

$\log(wage) = 2.78 + .094\, educ.$

Then, using equation (A.28),

$\%\Delta wage \approx 100(.094)\Delta educ = 9.4\, \Delta educ.$

It follows that one more year of education increases hourly wage by about 9.4%.

Generally, the quantity $\%\Delta y / \Delta x$ is called the semi-elasticity of $y$ with respect to $x$. The semi-elasticity is the percentage change in $y$ when $x$ increases by one unit. What we have just shown is that, in model (A.27), the semi-elasticity is constant and equal to $100 \cdot \beta_1$. In Example A.6, we can conveniently summarize the relationship between wages and education by saying that one more year of education, starting from any amount of education, increases the wage by about 9.4%. This is why such models play an important role in economics.
Another relationship of some interest in applied economics is

$y = \beta_0 + \beta_1 \log(x),$   (A.29)

where $x > 0$. How can we interpret this equation? If we take the change in $y$, we get $\Delta y = \beta_1 \Delta \log(x)$, which can be rewritten as $\Delta y = (\beta_1/100)[100 \cdot \Delta \log(x)]$. Thus, using the approximation in (A.23), we have

$\Delta y \approx (\beta_1/100)(\%\Delta x).$   (A.30)

In other words, $\beta_1/100$ is the unit change in $y$ when $x$ increases by 1%.

Example A.7: Labor Supply Function
Assume that the labor supply of a worker can be described by

$hours = 33 + 45.1 \log(wage),$

where wage is hourly wage and hours is hours worked per week. Then, from (A.30),

$\Delta hours \approx (45.1/100)(\%\Delta wage) = .451\, \%\Delta wage.$

In other words, a 1% increase in wage increases the weekly hours worked by about .45, or slightly less than one-half hour. If the wage increases by 10%, then $\Delta hours = .451(10) = 4.51$, or about four and one-half hours. We would not want to use this approximation for much larger percentage changes in wages.
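A brief sketch contrasts the three log-model interpretations, using the equations from Examples A.5 through A.7 and comparing each approximation with an exact calculation:

```python
import numpy as np

# Example A.5: log(q) = 4.7 - 1.25 log(p); the elasticity is -1.25 by construction
q = lambda p: np.exp(4.7 - 1.25 * np.log(p))
print(100 * (q(10.1) / q(10.0) - 1))      # about -1.24% for a 1% price increase

# Example A.6: log(wage) = 2.78 + .094 educ; semi-elasticity of about 9.4% per year
wage = lambda educ: np.exp(2.78 + 0.094 * educ)
print(100 * (wage(13) / wage(12) - 1))    # exact: about 9.86%; approximation: 9.4%

# Example A.7: hours = 33 + 45.1 log(wage); effect of a 10% wage increase
hours = lambda w: 33 + 45.1 * np.log(w)
print(hours(11.0) - hours(10.0))          # exact: about 4.30 hours; approximation: 4.51
```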
A.4c The Exponential Function

Before leaving this section, we need to discuss a special function that is related to the log. As motivation, consider equation (A.27). There, log(y) is a linear function of x. But how do we find y itself as a function of x? The answer is given by the exponential function.

We will write the exponential function as $y = \exp(x)$, which is graphed in Figure A.5. From Figure A.5, we see that exp(x) is defined for any value of x and is always greater than zero. Sometimes, the exponential function is written as $y = e^x$, but we will not use this notation. Two important values of the exponential function are $\exp(0) = 1$ and $\exp(1) = 2.7183$ (to four decimal places).

[Figure A.5: Graph of $y = \exp(x)$.]

The exponential function is the inverse of the log function in the following sense: $\log[\exp(x)] = x$ for all x, and $\exp[\log(x)] = x$ for $x > 0$. In other words, the log "undoes" the exponential, and vice versa. (This is why the exponential function is sometimes called the antilog function.) In particular, note that $\log(y) = \beta_0 + \beta_1 x$ is equivalent to

$y = \exp(\beta_0 + \beta_1 x)$.

If $\beta_1 > 0$, the relationship between x and y has the same shape as in Figure A.5. Thus, if $\log(y) = \beta_0 + \beta_1 x$ with $\beta_1 > 0$, then x has an increasing marginal effect on y. In Example A.6, this means that another year of education leads to a larger change in wage than the previous year of education.

Two useful facts about the exponential function are $\exp(x_1 + x_2) = \exp(x_1)\exp(x_2)$ and $\exp[c \log(x)] = x^c$.

A.5 Differential Calculus

In the previous section, we asserted several approximations that have foundations in calculus. Let $y = f(x)$ for some function f. Then, for small changes in x,

$\Delta y \approx \dfrac{df}{dx} \Delta x$,  (A.31)

where $df/dx$ is the derivative of the function f, evaluated at the initial point $x_0$. We also write the derivative as $dy/dx$.

For example, if $y = \log(x)$, then $dy/dx = 1/x$. Using (A.31), with $dy/dx$ evaluated at $x_0$, we have $\Delta y \approx (1/x_0)\Delta x$, or $\Delta \log(x) \approx \Delta x / x_0$, which is the approximation given in (A.22).

In applying econometrics, it helps to recall the derivatives of a handful of functions because we use the derivative to define the slope of a function at a given point. We can then use (A.31) to find the approximate change in y for small changes in x. In the linear case, the derivative is simply the slope of the line, as we would hope: if $y = \beta_0 + \beta_1 x$, then $dy/dx = \beta_1$.

If $y = x^c$, then $dy/dx = c x^{c-1}$. The derivative of a sum of two functions is the sum of the derivatives: $d[f(x) + g(x)]/dx = df(x)/dx + dg(x)/dx$. The derivative of a constant times any function is that same constant times the derivative of the function: $d[cf(x)]/dx = c[df(x)/dx]$. These simple rules allow us to find derivatives of more complicated functions.
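Both the inverse relationship between log and exp and the derivative approximation (A.31) can be verified numerically; a small sketch follows, using nothing beyond the standard library:

```python
import math

# log undoes exp, and vice versa
assert abs(math.log(math.exp(2.0)) - 2.0) < 1e-12
assert abs(math.exp(math.log(5.0)) - 5.0) < 1e-12

# (A.31): dy is approximately (dy/dx) * dx for y = log(x) at x0 = 40, dx = 1
x0, dx = 40.0, 1.0
exact_change = math.log(x0 + dx) - math.log(x0)   # 0.0247...
approx_change = (1.0 / x0) * dx                   # derivative 1/x times dx = 0.025
print(exact_change, approx_change)
```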
Other rules, such as the product, quotient, and chain rules, will be familiar to those who have taken calculus, but we will not review those here.

Some functions that are often used in economics, along with their derivatives, are

$y = \beta_0 + \beta_1 x + \beta_2 x^2$;  $dy/dx = \beta_1 + 2\beta_2 x$
$y = \beta_0 + \beta_1 / x$;  $dy/dx = -\beta_1 / x^2$
$y = \beta_0 + \beta_1 \sqrt{x}$;  $dy/dx = (\beta_1 / 2) x^{-1/2}$
$y = \beta_0 + \beta_1 \log(x)$;  $dy/dx = \beta_1 / x$
$y = \exp(\beta_0 + \beta_1 x)$;  $dy/dx = \beta_1 \exp(\beta_0 + \beta_1 x)$.

If $\beta_0 = 0$ and $\beta_1 = 1$ in this last expression, we get $dy/dx = \exp(x)$ when $y = \exp(x)$.

In Section A.4, we noted that equation (A.26) defines a constant elasticity model when calculus is used. The calculus definition of elasticity is $(dy/dx)(x/y)$. It can be shown using properties of logs and exponentials that, when (A.26) holds, $(dy/dx)(x/y) = \beta_1$.

When y is a function of multiple variables, the notion of a partial derivative becomes important. Suppose that

$y = f(x_1, x_2)$.  (A.32)

Then, there are two partial derivatives, one with respect to $x_1$ and one with respect to $x_2$. The partial derivative of y with respect to $x_1$, denoted here by $\partial y / \partial x_1$, is just the usual derivative of (A.32) with respect to $x_1$, where $x_2$ is treated as a constant. Similarly, $\partial y / \partial x_2$ is just the derivative of (A.32) with respect to $x_2$, holding $x_1$ fixed.

Partial derivatives are useful for much the same reason as ordinary derivatives. We can approximate the change in y as

$\Delta y \approx \dfrac{\partial y}{\partial x_1} \Delta x_1$, holding $x_2$ fixed.  (A.33)

Thus, calculus allows us to define partial effects in nonlinear models just as we could in linear models. In fact, if $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2$, then $\partial y / \partial x_1 = \beta_1$ and $\partial y / \partial x_2 = \beta_2$. These can be recognized as the partial effects defined in Section A.2.

A more complicated example is

$y = 5 + 4x_1 + x_1^2 - 3x_2 + 7x_1 x_2$.  (A.34)

Now, the derivative of (A.34), with respect to $x_1$, treating $x_2$ as a constant, is simply $\partial y / \partial x_1 = 4 + 2x_1 + 7x_2$; note how this depends on $x_1$ and $x_2$. The derivative of (A.34) with respect to $x_2$ is $\partial y / \partial x_2 = -3 + 7x_1$, so this depends only on $x_1$.

Example A.8 Wage Function with Interaction

A function relating wages to years of education and experience is

$wage = 3.10 + .41\, educ + .19\, exper - .004\, exper^2 + .007\, educ \cdot exper$.  (A.35)

The partial effect of exper on wage is the partial derivative of (A.35):

$\dfrac{\partial\, wage}{\partial\, exper} = .19 - .008\, exper + .007\, educ$.

This is the approximate change in wage due to increasing experience by one year. Notice that this partial effect depends on the initial level of exper and educ. For example, for a worker who is starting with educ = 12 and exper = 5, the next year of experience increases wage by about $.19 - .008(5) + .007(12) = .234$, or 23.4 cents per hour. The exact change can be calculated by computing (A.35) at exper = 5, educ = 12 and at exper = 6, educ = 12, and then taking the difference. This turns out to be .23, which is very close to the approximation.

Differential calculus plays an important role in minimizing and maximizing functions of one or more variables. If $f(x_1, x_2, \ldots, x_k)$ is a differentiable function of k variables, then a necessary condition for $x_1^*, x_2^*, \ldots, x_k^*$ to either minimize or maximize f over all possible values of $x_j$ is

$\dfrac{\partial f}{\partial x_j}(x_1^*, x_2^*, \ldots, x_k^*) = 0,\quad j = 1, 2, \ldots, k$.  (A.36)

In other words, all of the partial derivatives of f must be zero when they are evaluated at the $x_j^*$. These are called the first order conditions for minimizing or maximizing a function.
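A quick computation reproduces the approximate and exact partial effects in Example A.8; the function names below are ours, while the coefficients come from (A.35):

```python
def wage(educ, exper):
    # Equation (A.35)
    return 3.10 + 0.41*educ + 0.19*exper - 0.004*exper**2 + 0.007*educ*exper

def partial_exper(educ, exper):
    # Analytical partial effect: d(wage)/d(exper)
    return 0.19 - 0.008*exper + 0.007*educ

educ, exper = 12, 5
print(round(partial_exper(educ, exper), 3))               # 0.234 (approximate effect)
print(round(wage(educ, exper + 1) - wage(educ, exper), 3))  # 0.23 (exact change)
```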
Practically, we hope to solve equation (A.36) for the $x_j^*$. Then, we can use other criteria to determine whether we have minimized or maximized the function. We will not need those here. (See Sydsaeter and Hammond [1995] for a discussion of multivariable calculus and its use in optimizing functions.)

Summary

The math tools reviewed here are crucial for understanding regression analysis and the probability and statistics that are covered in Appendices B and C. The material on nonlinear functions, especially quadratic, logarithmic, and exponential functions, is critical for understanding modern applied economic research. The level of comprehension required of these functions does not include a deep knowledge of calculus, although calculus is needed for certain derivations.

Key Terms

Average; Ceteris Paribus; Constant Elasticity Model; Derivative; Descriptive Statistic; Diminishing Marginal Effect; Elasticity; Exponential Function; Intercept; Linear Function; Log Function; Marginal Effect; Median; Natural Logarithm; Nonlinear Function; Partial Derivative; Partial Effect; Percentage Change; Percentage Point Change; Proportionate Change; Relative Change; Semi-Elasticity; Slope; Summation Operator

Problems

1 The following table contains monthly housing expenditures for 10 families.

Family | Monthly Housing Expenditures (Dollars)
1 | 300
2 | 440
3 | 350
4 | 1,100
5 | 640
6 | 480
7 | 450
8 | 700
9 | 670
10 | 530

(i) Find the average monthly housing expenditure.
(ii) Find the median monthly housing expenditure.
(iii) If monthly housing expenditures were measured in hundreds of dollars, rather than in dollars, what would be the average and median expenditures?
(iv) Suppose that family number 8 increases its monthly housing expenditure to $900, but the expenditures of all other families remain the same. Compute the average and median housing expenditures.

2 Suppose the following equation describes the relationship between the average number of classes missed during a semester (missed) and the distance from school (distance, measured in miles):

$missed = 3 + 0.2\, distance$.

(i) Sketch this line, being sure to label the axes. How do you interpret the intercept in this equation?
(ii) What is the average number of classes missed for someone who lives five miles away?
(iii) What is the difference in the average number of classes missed for someone who lives 10 miles away and someone who lives 20 miles away?

3 In Example A.2, quantity of compact discs was related to price and income by $quantity = 120 - 9.8\, price + .03\, income$. What is the demand for CDs if price = 15 and income = 200? What does this suggest about using linear functions to describe demand curves?

4 Suppose the unemployment rate in the United States goes from 6.4% in one year to 5.6% in the next.
(i) What is the percentage point decrease in the unemployment rate?
(ii) By what percentage has the unemployment rate fallen?

5 Suppose that the return from holding a particular firm's stock goes from 15% in one year to 18% in the following year. The majority shareholder claims that "the stock return only increased by 3%," while the chief executive officer claims that "the return on the firm's stock increased by 20%." Reconcile their disagreement.

6 Suppose that Person A earns $35,000 per year and Person B earns $42,000.
(i) Find the exact percentage by which Person B's salary exceeds Person A's.
(ii) Now, use the difference in natural logs to find the approximate percentage difference.

7 Suppose the following model describes the relationship between annual salary (salary) and the number of previous years of labor market experience (exper):

$\log(salary) = 10.6 + .027\, exper$.

(i) What is salary when exper = 0? When exper = 5? (Hint: You will need to exponentiate.)
(ii) Use equation (A.28) to approximate the percentage increase in salary when exper increases by five years.
(iii) Use the results of part (i) to compute the exact percentage difference in salary when exper = 5 and exper = 0. Comment on how this compares with the approximation in part (ii).

8 Let grthemp denote the proportionate growth in employment, at the county level, from 1990 to 1995, and let salestax denote the county sales tax rate, stated as a proportion. Interpret the intercept and slope in the equation

$grthemp = .043 - .78\, salestax$.

9 Suppose the yield of a certain crop (in bushels per acre) is related to fertilizer amount (in pounds per acre) as

$yield = 120 + 19\sqrt{fertilizer}$.

(i) Graph this relationship by plugging in several values for fertilizer.
(ii) Describe how the shape of this relationship compares with a linear relationship between yield and fertilizer.

10 Suppose that in a particular state a standardized test is given to all graduating seniors. Let score denote a student's score on the test. Someone discovers that performance on the test is related to the size of the student's graduating high school class. The relationship is quadratic:

$score = 45.6 + .082\, class - .000147\, class^2$,

where class is the number of students in the graduating class.
(i) How do you literally interpret the value 45.6 in the equation? By itself, is it of much interest? Explain.
(ii) From the equation, what is the optimal size of the graduating class (the size that maximizes the test score)? (Round your answer to the nearest integer.) What is the highest achievable test score?
(iii) Sketch a graph that illustrates your solution in part (ii).
(iv) Does it seem likely that score and class would have a deterministic relationship? That is, is it realistic to think that once you know the size of a student's graduating class you know, with certainty, his or her test score? Explain.

11 Consider the line $y = \beta_0 + \beta_1 x$.
(i) Let $(x_1, y_1)$ and $(x_2, y_2)$ be two points on the line. Show that $(\bar{x}, \bar{y})$ is also on the line, where $\bar{x} = (x_1 + x_2)/2$ is the average of the two values and $\bar{y} = (y_1 + y_2)/2$.
(ii) Extend the result of part (i) to n points on the line, $\{(x_i, y_i): i = 1, \ldots, n\}$.
Appendix B

Fundamentals of Probability

This appendix covers key concepts from basic probability. Appendices B and C are primarily for review; they are not intended to replace a course in probability and statistics. However, all of the probability and statistics concepts that we use in the text are covered in these appendices.

Probability is of interest in its own right for students in business, economics, and other social sciences. For example, consider the problem of an airline trying to decide how many reservations to accept for a flight that has 100 available seats. If fewer than 100 people want reservations, then these should all be accepted. But what if more than 100 people request reservations? A safe solution is to accept at most 100 reservations. However, because some people book reservations and then do not show up for the flight, there is some chance that the plane will not be full even if 100 reservations are booked. This results in lost revenue to the airline. A different strategy is to book more than 100 reservations and to hope that some people do not show up, so that the final number of passengers is as close to 100 as possible. This policy runs the risk of the airline having to compensate people who are necessarily bumped from an overbooked flight.

A natural question in this context is: Can we decide on the optimal (or best) number of reservations the airline should make? This is a nontrivial problem. Nevertheless, given certain information (on airline costs and how frequently people show up for reservations), we can use basic probability to arrive at a solution.

B.1 Random Variables and Their Probability Distributions

Suppose that we flip a coin 10 times and count the number of times the coin turns up heads. This is an example of an experiment. Generally, an experiment is any procedure that can, at least in theory, be infinitely repeated and has a well-defined set of outcomes. We could, in principle, carry out the coin-flipping procedure again and again. Before we flip the coin, we know that the number of heads appearing is an integer from 0 to 10, so the outcomes of the experiment are well defined.

A random variable is one that takes on numerical values and has an outcome that is determined by an experiment. In the coin-flipping example, the number of heads appearing in 10 flips of a coin is an example of a random variable. Before we flip the coin 10 times, we do not know how many times the coin will come up heads. Once we flip the coin 10 times and count the number of heads, we obtain the outcome of the random variable for this particular trial of the experiment. Another trial can produce a different outcome.

In the airline reservation example mentioned earlier, the number of people showing up for their flight is a random variable: before any particular flight, we do not know how many people will show up.

To analyze data collected in business and the social sciences, it is important to have a basic understanding of random variables and their properties.
Following the usual conventions in probability and statistics, throughout Appendices B and C we denote random variables by uppercase letters, usually W, X, Y, and Z; particular outcomes of random variables are denoted by the corresponding lowercase letters, w, x, y, and z. For example, in the coin-flipping experiment, let X denote the number of heads appearing in 10 flips of a coin. Then, X is not associated with any particular value, but we know X will take on a value in the set $\{0, 1, 2, \ldots, 10\}$. A particular outcome is, say, $x = 6$.

We indicate large collections of random variables by using subscripts. For example, if we record last year's income of 20 randomly chosen households in the United States, we might denote these random variables by $X_1, X_2, \ldots, X_{20}$; the particular outcomes would be denoted $x_1, x_2, \ldots, x_{20}$.

As stated in the definition, random variables are always defined to take on numerical values, even when they describe qualitative events. For example, consider tossing a single coin, where the two outcomes are heads and tails. We can define a random variable as follows: $X = 1$ if the coin turns up heads, and $X = 0$ if the coin turns up tails.

A random variable that can only take on the values zero and one is called a Bernoulli (or binary) random variable. In basic probability, it is traditional to call the event $X = 1$ a "success" and the event $X = 0$ a "failure." For a particular application, the success/failure nomenclature might not correspond to our notion of a success or failure, but it is a useful terminology that we will adopt.

B.1a Discrete Random Variables

A discrete random variable is one that takes on only a finite or countably infinite number of values. The notion of "countably infinite" means that even though an infinite number of values can be taken on by a random variable, those values can be put in a one-to-one correspondence with the positive integers. Because the distinction between "countably infinite" and "uncountably infinite" is somewhat subtle, we will concentrate on discrete random variables that take on only a finite number of values. (Larsen and Marx [1986, Chapter 3] provide a detailed treatment.)

A Bernoulli random variable is the simplest example of a discrete random variable. The only thing we need to completely describe the behavior of a Bernoulli random variable is the probability that it takes on the value one. In the coin-flipping example, if the coin is "fair," then $P(X = 1) = 1/2$ (read as "the probability that X equals one is one-half"). Because probabilities must sum to one, $P(X = 0) = 1/2$, also.

Social scientists are interested in more than flipping coins, so we must allow for more general situations. Again, consider the example where the airline must decide how many people to book for a flight with 100 available seats. This problem can be analyzed in the context of several Bernoulli random variables as follows: for a randomly selected customer, define a Bernoulli random variable as $X = 1$ if the person shows up for the reservation, and $X = 0$ if not. There is no reason to think that the probability of any particular customer showing up is 1/2; in principle, the probability can be any number between 0 and 1. Call this number $\theta$, so that

$P(X = 1) = \theta$  (B.1)
$P(X = 0) = 1 - \theta$.  (B.2)

For example, if $\theta = .75$, then there is a 75% chance that a customer shows up after making a reservation and a 25% chance that the customer does not show up. Intuitively, the value of $\theta$ is crucial in determining the airline's strategy for booking reservations. Methods for estimating $\theta$, given historical data on airline reservations, are a subject of mathematical statistics, something we turn to in Appendix C.
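As a preview of that estimation problem, one can simulate Bernoulli($\theta$) show-up outcomes and recover $\theta$ from the sample frequency. A sketch, assuming NumPy is available (the seed and sample size are arbitrary choices of ours):

```python
import numpy as np

rng = np.random.default_rng(0)
theta = 0.75                            # true show-up probability
shows = rng.random(10_000) < theta      # 10,000 simulated Bernoulli outcomes
print(shows.mean())                     # sample frequency, close to 0.75
```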
More generally, any discrete random variable is completely described by listing its possible values and the associated probability that it takes on each value. If X takes on the k possible values $\{x_1, \ldots, x_k\}$, then the probabilities $p_1, p_2, \ldots, p_k$ are defined by

$p_j = P(X = x_j),\quad j = 1, 2, \ldots, k$,  (B.3)

where each $p_j$ is between 0 and 1, and

$p_1 + p_2 + \cdots + p_k = 1$.  (B.4)

Equation (B.3) is read as: "The probability that X takes on the value $x_j$ is equal to $p_j$."

Equations (B.1) and (B.2) show that the probabilities of success and failure for a Bernoulli random variable are determined entirely by the value of $\theta$. Because Bernoulli random variables are so prevalent, we have a special notation for them: $X \sim \text{Bernoulli}(\theta)$ is read as "X has a Bernoulli distribution with probability of success equal to $\theta$."

The probability density function (pdf) of X summarizes the information concerning the possible outcomes of X and the corresponding probabilities:

$f(x_j) = p_j,\quad j = 1, 2, \ldots, k$,  (B.5)

with $f(x) = 0$ for any x not equal to $x_j$ for some j. In other words, for any real number x, $f(x)$ is the probability that the random variable X takes on the particular value x. When dealing with more than one random variable, it is sometimes useful to subscript the pdf in question: $f_X$ is the pdf of X, $f_Y$ is the pdf of Y, and so on.

Given the pdf of any discrete random variable, it is simple to compute the probability of any event involving that random variable. For example, suppose that X is the number of free throws made by a basketball player out of two attempts, so that X can take on the three values $\{0, 1, 2\}$. Assume that the pdf of X is given by

$f(0) = .20$, $f(1) = .44$, and $f(2) = .36$.

The three probabilities sum to one, as they must. Using this pdf, we can calculate the probability that the player makes at least one free throw:

$P(X \ge 1) = P(X = 1) + P(X = 2) = .44 + .36 = .80$.

The pdf of X is shown in Figure B.1.

[Figure B.1: The pdf of the number of free throws made out of two attempts, with f(0) = .20, f(1) = .44, and f(2) = .36.]
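Event probabilities for a discrete pdf are just sums over the qualifying outcomes. A minimal sketch using the free throw pdf above:

```python
# pdf of free throws made out of two attempts (values from the text)
pdf = {0: 0.20, 1: 0.44, 2: 0.36}

assert abs(sum(pdf.values()) - 1.0) < 1e-12   # probabilities sum to one

# P(X >= 1): sum the pdf over the outcomes in the event
p_at_least_one = sum(p for x, p in pdf.items() if x >= 1)
print(p_at_least_one)   # 0.8
```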
B.1b Continuous Random Variables

A variable X is a continuous random variable if it takes on any real value with zero probability. This definition is somewhat counterintuitive because, in any application, we eventually observe some outcome for a random variable. The idea is that a continuous random variable X can take on so many possible values that we cannot count them or match them up with the positive integers, so logical consistency dictates that X can take on each value with probability zero.

While measurements are always discrete in practice, random variables that take on numerous values are best treated as continuous. For example, the most refined measure of the price of a good is in terms of cents. We can imagine listing all possible values of price in order (even though the list may continue indefinitely), which technically makes price a discrete random variable. However, there are so many possible values of price that using the mechanics of discrete random variables is not feasible.

We can define a probability density function for continuous random variables, and, as with discrete random variables, the pdf provides information on the likely outcomes of the random variable. However, because it makes no sense to discuss the probability that a continuous random variable takes on a particular value, we use the pdf of a continuous random variable only to compute events involving a range of values. For example, if a and b are constants where $a < b$, the probability that X lies between the numbers a and b, $P(a \le X \le b)$, is the area under the pdf between points a and b, as shown in Figure B.2. If you are familiar with calculus, you recognize this as the integral of the function f between the points a and b. The entire area under the pdf must always equal one.

[Figure B.2: The probability that X lies between the points a and b.]

When computing probabilities for continuous random variables, it is easiest to work with the cumulative distribution function (cdf). If X is any random variable, then its cdf is defined for any real number x by

$F(x) \equiv P(X \le x)$.  (B.6)

For discrete random variables, (B.6) is obtained by summing the pdf over all values $x_j$ such that $x_j \le x$. For a continuous random variable, $F(x)$ is the area under the pdf, f, to the left of the point x. Because $F(x)$ is simply a probability, it is always between 0 and 1. Further, if $x_1 < x_2$, then $P(X \le x_1) \le P(X \le x_2)$, that is, $F(x_1) \le F(x_2)$. This means that a cdf is an increasing (or at least a nondecreasing) function of x.

Two important properties of cdfs that are useful for computing probabilities are the following:

For any number c, $P(X > c) = 1 - F(c)$.  (B.7)
For any numbers $a < b$, $P(a < X \le b) = F(b) - F(a)$.  (B.8)

In our study of econometrics, we will use cdfs to compute probabilities only for continuous random variables, in which case it does not matter whether inequalities in probability statements are strict or not. That is, for a continuous random variable X,

$P(X \ge c) = P(X > c)$,  (B.9)

and

$P(a < X < b) = P(a \le X \le b) = P(a \le X < b) = P(a < X \le b)$.  (B.10)

Combined with (B.7) and (B.8), equations (B.9) and (B.10) greatly expand the probability calculations that can be done using continuous cdfs. Cumulative distribution functions have been tabulated for all of the important continuous distributions in probability and statistics. The most well known of these is the normal distribution, which we cover, along with some related distributions, in Section B.5.
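Properties (B.7) and (B.8) are easy to illustrate with a tabulated continuous cdf. The sketch below uses the standard normal cdf from SciPy; the choice of scipy.stats here is ours, not the text's:

```python
from scipy.stats import norm   # standard normal cdf as an example

F = norm.cdf
# (B.7): P(X > 1) = 1 - F(1)
print(1 - F(1.0))              # about 0.1587
# (B.8): P(-1 < X <= 1) = F(1) - F(-1)
print(F(1.0) - F(-1.0))        # about 0.6827
```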
B.2 Joint Distributions, Conditional Distributions, and Independence

In economics, we are usually interested in the occurrence of events involving more than one random variable. For example, in the airline reservation example referred to earlier, the airline might be interested in the probability that a person who makes a reservation shows up and is a business traveler; this is an example of a joint probability. Or, the airline might be interested in the following conditional probability: conditional on the person being a business traveler, what is the probability of his or her showing up? In the next two subsections, we formalize the notions of joint and conditional distributions and the important notion of independence of random variables.

B.2a Joint Distributions and Independence

Let X and Y be discrete random variables. Then, (X, Y) have a joint distribution, which is fully described by the joint probability density function of (X, Y):

$f_{X,Y}(x, y) = P(X = x, Y = y)$,  (B.11)

where the right-hand side is the probability that $X = x$ and $Y = y$. When X and Y are continuous, a joint pdf can also be defined, but we will not cover such details because joint pdfs for continuous random variables are not used explicitly in this text.

In one case, it is easy to obtain the joint pdf if we are given the pdfs of X and Y. In particular, random variables X and Y are said to be independent if, and only if,

$f_{X,Y}(x, y) = f_X(x) f_Y(y)$  (B.12)

for all x and y, where $f_X$ is the pdf of X and $f_Y$ is the pdf of Y. In the context of more than one random variable, the pdfs $f_X$ and $f_Y$ are often called marginal probability density functions to distinguish them from the joint pdf, $f_{X,Y}$. This definition of independence is valid for discrete and continuous random variables.

To understand the meaning of (B.12), it is easiest to deal with the discrete case. If X and Y are discrete, then (B.12) is the same as

$P(X = x, Y = y) = P(X = x) P(Y = y)$;  (B.13)

in other words, the probability that $X = x$ and $Y = y$ is the product of the two probabilities $P(X = x)$ and $P(Y = y)$. One implication of (B.13) is that joint probabilities are fairly easy to compute, since they only require knowledge of $P(X = x)$ and $P(Y = y)$. If random variables are not independent, then they are said to be dependent.

Example B.1 Free Throw Shooting

Consider a basketball player shooting two free throws. Let X be the Bernoulli random variable equal to one if she or he makes the first free throw, and zero otherwise. Let Y be a Bernoulli random variable equal to one if he or she makes the second free throw. Suppose that she or he is an 80% free throw shooter, so that $P(X = 1) = P(Y = 1) = .8$. What is the probability of the player making both free throws?

If X and Y are independent, we can easily answer this question: $P(X = 1, Y = 1) = P(X = 1) P(Y = 1) = (.8)(.8) = .64$. Thus, there is a 64% chance of making both free throws. If the chance of making the second free throw depends on whether the first was made (that is, X and Y are not independent), then this simple calculation is not valid.

Independence of random variables is a very important concept. In the next subsection, we will show that, if X and Y are independent, then knowing the outcome of X does not change the probabilities of the possible outcomes of Y, and vice versa. One useful fact about independence is that if X and Y are independent and we define new random variables $g(X)$ and $h(Y)$ for any functions g and h, then these new random variables are also independent.
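Independence can also be checked by simulation: with many repetitions, the relative frequency of making both free throws should settle near .64. A sketch assuming NumPy (the seed and sample size are ours):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000
x = rng.random(n) < 0.8       # first free throw made
y = rng.random(n) < 0.8       # second free throw, drawn independently of the first
print((x & y).mean())         # relative frequency of making both; about 0.64
```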
There is no need to stop at two random variables. If $X_1, X_2, \ldots, X_n$ are discrete random variables, then their joint pdf is

$f(x_1, x_2, \ldots, x_n) = P(X_1 = x_1, X_2 = x_2, \ldots, X_n = x_n)$.

The random variables $X_1, X_2, \ldots, X_n$ are independent random variables if, and only if, their joint pdf is the product of the individual pdfs for any $(x_1, x_2, \ldots, x_n)$. This definition of independence also holds for continuous random variables.

The notion of independence plays an important role in obtaining some of the classic distributions in probability and statistics. Earlier, we defined a Bernoulli random variable as a zero-one random variable indicating whether or not some event occurs. Often, we are interested in the number of successes in a sequence of independent Bernoulli trials. A standard example of independent Bernoulli trials is flipping a coin again and again. Because the outcome on any particular flip has nothing to do with the outcomes on other flips, independence is an appropriate assumption.

Independence is often a reasonable approximation in more complicated situations. In the airline reservation example, suppose that the airline accepts n reservations for a particular flight. For each $i = 1, 2, \ldots, n$, let $Y_i$ denote the Bernoulli random variable indicating whether customer i shows up: $Y_i = 1$ if customer i appears, and $Y_i = 0$ otherwise. Letting $\theta$ again denote the probability of success (using a reservation), each $Y_i$ has a Bernoulli($\theta$) distribution. As an approximation, we might assume that the $Y_i$ are independent of one another, although this is not exactly true in reality: some people travel in groups, which means that whether or not a person shows up is not truly independent of whether all others show up. Modeling this kind of dependence is complex, however, so we might be willing to use independence as an approximation.

The variable of primary interest is the total number of customers showing up out of the n reservations; call this variable X. Since each $Y_i$ is unity when a person shows up, we can write $X = Y_1 + Y_2 + \cdots + Y_n$. Now, assuming that each $Y_i$ has probability of success $\theta$ and that the $Y_i$ are independent, X can be shown to have a binomial distribution. That is, the probability density function of X is

$f(x) = \dbinom{n}{x} \theta^x (1 - \theta)^{n - x},\quad x = 0, 1, 2, \ldots, n$,  (B.14)

where $\binom{n}{x} = \frac{n!}{x!(n - x)!}$, and for any integer n, n! (read "n factorial") is defined as $n! = n \cdot (n - 1) \cdot (n - 2) \cdots 1$. By convention, $0! = 1$. When a random variable X has the pdf given in (B.14), we write $X \sim \text{Binomial}(n, \theta)$.

Equation (B.14) can be used to compute $P(X = x)$ for any value of x from 0 to n. If the flight has 100 available seats, the airline is interested in $P(X > 100)$. Suppose, initially, that $n = 120$, so that the airline accepts 120 reservations, and the probability that each person shows up is $\theta = .85$. Then, $P(X > 100) = P(X = 101) + P(X = 102) + \cdots + P(X = 120)$, and each of the probabilities in the sum can be found from equation (B.14) with $n = 120$, $\theta = .85$, and the appropriate value of x (101 to 120). This is a difficult hand calculation, but many statistical packages have commands for computing this kind of probability. In this case, the probability that more than 100 people will show up is about .659, which is probably more risk of overbooking than the airline wants to tolerate. If, instead, the number of reservations is 110, the probability of more than 100 passengers showing up is only about .024.
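Equation (B.14) turns the airline calculation into a few lines of code. A sketch using only the standard library (the function names are ours):

```python
from math import comb

def binom_pmf(x, n, theta):
    # Equation (B.14)
    return comb(n, x) * theta**x * (1 - theta)**(n - x)

def p_over_100(n, theta=0.85):
    # P(X > 100) = sum of the pmf from x = 101 to n
    return sum(binom_pmf(x, n, theta) for x in range(101, n + 1))

print(round(p_over_100(120), 3))   # about 0.659
print(round(p_over_100(110), 3))   # about 0.024
```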
B.2b Conditional Distributions

In econometrics, we are usually interested in how one random variable, call it Y, is related to one or more other variables. For now, suppose that there is only one variable whose effects we are interested in, call it X. The most we can know about how X affects Y is contained in the conditional distribution of Y given X. This information is summarized by the conditional probability density function, defined by

$f_{Y|X}(y|x) = f_{X,Y}(x, y)/f_X(x)$  (B.15)

for all values of x such that $f_X(x) > 0$. The interpretation of (B.15) is most easily seen when X and Y are discrete. Then,

$f_{Y|X}(y|x) = P(Y = y | X = x)$,  (B.16)

where the right-hand side is read as "the probability that $Y = y$ given that $X = x$." When Y is continuous, $f_{Y|X}(y|x)$ is not interpretable directly as a probability, for the reasons discussed earlier, but conditional probabilities are found by computing areas under the conditional pdf.

An important feature of conditional distributions is that, if X and Y are independent random variables, knowledge of the value taken on by X tells us nothing about the probability that Y takes on various values (and vice versa). That is, $f_{Y|X}(y|x) = f_Y(y)$, and $f_{X|Y}(x|y) = f_X(x)$.

Example B.2 Free Throw Shooting

Consider again the basketball-shooting example, where two free throws are to be attempted. Assume that the conditional density is

$f_{Y|X}(1|1) = .85$, $f_{Y|X}(0|1) = .15$
$f_{Y|X}(1|0) = .70$, $f_{Y|X}(0|0) = .30$.

This means that the probability of the player making the second free throw depends on whether the first free throw was made: if the first free throw is made, the chance of making the second is .85; if the first free throw is missed, the chance of making the second is .70. This implies that X and Y are not independent; they are dependent.

We can still compute $P(X = 1, Y = 1)$, provided we know $P(X = 1)$. Assume that the probability of making the first free throw is .8, that is, $P(X = 1) = .8$. Then, from (B.15), we have

$P(X = 1, Y = 1) = P(Y = 1 | X = 1) \cdot P(X = 1) = (.85)(.8) = .68$.
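Rearranging (B.15) as joint = conditional × marginal gives the .68 directly; a tiny sketch with the numbers from Example B.2:

```python
# Conditional pdf of the second free throw given the first (Example B.2)
p_y1_given_x = {1: 0.85, 0: 0.70}   # P(Y = 1 | X = x)
p_x1 = 0.8                          # P(X = 1)

# Rearranging (B.15): joint probability = conditional * marginal
p_both = p_y1_given_x[1] * p_x1
print(p_both)                       # 0.68
```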
B.3 Features of Probability Distributions

For many purposes, we will be interested in only a few aspects of the distributions of random variables. The features of interest can be put into three categories: measures of central tendency, measures of variability or spread, and measures of association between two random variables. We cover the last of these in Section B.4.

B.3a A Measure of Central Tendency: The Expected Value

The expected value is one of the most important probabilistic concepts that we will encounter in our study of econometrics. If X is a random variable, the expected value (or expectation) of X, denoted $E(X)$ and sometimes $\mu_X$ or simply $\mu$, is a weighted average of all possible values of X. The weights are determined by the probability density function. Sometimes, the expected value is called the population mean, especially when we want to emphasize that X represents some variable in a population.

The precise definition of expected value is simplest in the case that X is a discrete random variable taking on a finite number of values, say $\{x_1, \ldots, x_k\}$. Let $f(x)$ denote the probability density function of X. The expected value of X is the weighted average

$E(X) = x_1 f(x_1) + x_2 f(x_2) + \cdots + x_k f(x_k) \equiv \sum_{j=1}^{k} x_j f(x_j)$.  (B.17)

This is easily computed given the values of the pdf at each possible outcome of X.

Example B.3 Computing an Expected Value

Suppose that X takes on the values $-1$, 0, and 2 with probabilities 1/8, 1/2, and 3/8, respectively. Then,

$E(X) = (-1)(1/8) + 0(1/2) + 2(3/8) = 5/8$.

This example illustrates something curious about expected values: the expected value of X can be a number that is not even a possible outcome of X. We know that X takes on the values $-1$, 0, or 2, yet its expected value is 5/8. This makes the expected value deficient for summarizing the central tendency of certain discrete random variables, but calculations such as those just mentioned can be useful, as we will see later.

If X is a continuous random variable, then $E(X)$ is defined as an integral:

$E(X) = \int_{-\infty}^{\infty} x f(x)\,dx$,  (B.18)

which we assume is well defined. This can still be interpreted as a weighted average. For the most common continuous distributions, $E(X)$ is a number that is a possible outcome of X. In this text, we will not need to compute expected values using integration, although we will draw on some well-known results from probability for expected values of special random variables.

Given a random variable X and a function g, we can create a new random variable $g(X)$. For example, if X is a random variable, then so are $X^2$ and $\log(X)$ (if $X > 0$). The expected value of $g(X)$ is, again, simply a weighted average:

$E[g(X)] = \sum_{j=1}^{k} g(x_j) f_X(x_j)$  (B.19)

or, for a continuous random variable,

$E[g(X)] = \int_{-\infty}^{\infty} g(x) f_X(x)\,dx$.  (B.20)

Example B.4 Expected Value of X²

For the random variable in Example B.3, let $g(X) = X^2$. Then,

$E(X^2) = (-1)^2(1/8) + (0)^2(1/2) + (2)^2(3/8) = 13/8$.

In Example B.3, we computed $E(X) = 5/8$, so that $[E(X)]^2 = 25/64$. This shows that $E(X^2)$ is not the same as $[E(X)]^2$. In fact, for a nonlinear function $g(X)$, $E[g(X)] \ne g[E(X)]$ (except in very special cases).

If X and Y are random variables, then $g(X, Y)$ is a random variable for any function g, and so we can define its expectation. When X and Y are both discrete, taking on values $\{x_1, x_2, \ldots, x_k\}$ and $\{y_1, y_2, \ldots, y_m\}$, respectively, the expected value is

$E[g(X, Y)] = \sum_{h=1}^{k} \sum_{j=1}^{m} g(x_h, y_j) f_{X,Y}(x_h, y_j)$,

where $f_{X,Y}$ is the joint pdf of (X, Y). The definition is more complicated for continuous random variables, since it involves integration; we do not need it here. The extension to more than two random variables is straightforward.

B.3b Properties of Expected Values

In econometrics, we are not so concerned with computing expected values from various distributions; the major calculations have been done many times, and we will largely take these on faith. We will need to manipulate some expected values using a few simple rules. These are so important that we give them labels.

Property E.1: For any constant c, $E(c) = c$.

Property E.2: For any constants a and b, $E(aX + b) = aE(X) + b$.

One useful implication of E.2 is that, if $\mu = E(X)$, and we define a new random variable as $Y = X - \mu$, then $E(Y) = 0$; in E.2, take $a = 1$ and $b = -\mu$.
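The weighted averages in (B.17) and (B.19) can be computed exactly with rational arithmetic. A sketch using the pdf from Examples B.3 and B.4:

```python
from fractions import Fraction as F

# pdf from Example B.3: P(X = -1) = 1/8, P(X = 0) = 1/2, P(X = 2) = 3/8
pdf = {-1: F(1, 8), 0: F(1, 2), 2: F(3, 8)}

EX = sum(x * p for x, p in pdf.items())        # (B.17)
EX2 = sum(x**2 * p for x, p in pdf.items())    # (B.19) with g(X) = X^2
print(EX, EX2, EX**2)                          # 5/8, 13/8, 25/64
```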
As an example of Property E.2, let X be the temperature measured in Celsius at noon on a particular day at a given location; suppose the expected temperature is $E(X) = 25$. If Y is the temperature measured in Fahrenheit, then $Y = 32 + (9/5)X$. From Property E.2, the expected temperature in Fahrenheit is $E(Y) = 32 + (9/5) E(X) = 32 + (9/5) \cdot 25 = 77$.

Generally, it is easy to compute the expected value of a linear function of many random variables.

Property E.3: If $\{a_1, a_2, \ldots, a_n\}$ are constants and $\{X_1, X_2, \ldots, X_n\}$ are random variables, then $E(a_1 X_1 + a_2 X_2 + \cdots + a_n X_n) = a_1 E(X_1) + a_2 E(X_2) + \cdots + a_n E(X_n)$. Or, using summation notation,

$E\left(\sum_{i=1}^{n} a_i X_i\right) = \sum_{i=1}^{n} a_i E(X_i)$.  (B.21)

As a special case of this, we have (with each $a_i = 1$)

$E\left(\sum_{i=1}^{n} X_i\right) = \sum_{i=1}^{n} E(X_i)$,  (B.22)

so that the expected value of the sum is the sum of expected values. This property is used often for derivations in mathematical statistics.

Example B.5 Finding Expected Revenue

Let $X_1$, $X_2$, and $X_3$ be the numbers of small, medium, and large pizzas, respectively, sold during the day at a pizza parlor. These are random variables with expected values $E(X_1) = 25$, $E(X_2) = 57$, and $E(X_3) = 40$. The prices of small, medium, and large pizzas are $5.50, $7.60, and $9.15. Therefore, the expected revenue from pizza sales on a given day is

$E(5.50 X_1 + 7.60 X_2 + 9.15 X_3) = 5.50 E(X_1) + 7.60 E(X_2) + 9.15 E(X_3) = 5.50(25) + 7.60(57) + 9.15(40) = 936.70$,

that is, $936.70. The actual revenue on any particular day will generally differ from this value, but this is the expected revenue.

We can also use Property E.3 to show that if $X \sim \text{Binomial}(n, \theta)$, then $E(X) = n\theta$. That is, the expected number of successes in n Bernoulli trials is simply the number of trials times the probability of success on any particular trial. This is easily seen by writing X as $X = Y_1 + Y_2 + \cdots + Y_n$, where each $Y_i \sim \text{Bernoulli}(\theta)$. Then,

$E(X) = \sum_{i=1}^{n} E(Y_i) = \sum_{i=1}^{n} \theta = n\theta$.

We can apply this to the airline reservation example, where the airline makes $n = 120$ reservations and the probability of showing up is $\theta = .85$. The expected number of people showing up is $120(.85) = 102$. Therefore, if there are 100 seats available, the expected number of people showing up is too large; this has some bearing on whether it is a good idea for the airline to make 120 reservations.

Actually, what the airline should do is define a profit function that accounts for the net revenue earned per seat sold and the cost per passenger bumped from the flight. This profit function is random because the actual number of people showing up is random. Let r be the net revenue from each passenger. (You can think of this as the price of the ticket, for simplicity.) Let c be the compensation owed to any passenger bumped from the flight. Neither r nor c is random; these are assumed to be known to the airline. Let Y denote profits for the flight. Then, with 100 seats available,

$Y = rX$ if $X \le 100$
$Y = 100r - c(X - 100)$ if $X > 100$.

The first equation gives profit if no more than 100 people show up for the flight; the second equation is profit if more than 100 people show up. (In the latter case, the net revenue from ticket sales is 100r, since all 100 seats are sold, and then $c(X - 100)$ is the cost of making more than 100 reservations.) Using the fact that X has a Binomial(n, .85) distribution, where n is the number of reservations made, expected profits, $E(Y)$, can be found as a function of n (and r and c). Computing $E(Y)$ directly would be quite difficult, but it can be found quickly using a computer. Once values for r and c are given, the value of n that maximizes expected profits can be found by searching over different values of n.
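The text notes that E(Y) is easy to obtain by computer. The sketch below performs that search; the revenue r = 300 and compensation c = 600 are hypothetical inputs of ours, not values from the text:

```python
from math import comb

def binom_pmf(x, n, theta):
    return comb(n, x) * theta**x * (1 - theta)**(n - x)

def expected_profit(n, r, c, theta=0.85, seats=100):
    # E(Y) = sum over x of profit(x) * P(X = x), with Y as defined above
    total = 0.0
    for x in range(n + 1):
        profit = r * x if x <= seats else seats * r - c * (x - seats)
        total += profit * binom_pmf(x, n, theta)
    return total

r, c = 300, 600    # hypothetical ticket revenue and bumping cost
best_n = max(range(100, 131), key=lambda n: expected_profit(n, r, c))
print(best_n, round(expected_profit(best_n, r, c), 2))
```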
B.3c Another Measure of Central Tendency: The Median

The expected value is only one possibility for defining the central tendency of a random variable. Another measure of central tendency is the median. A general definition of median is too complicated for our purposes. If X is continuous, then the median of X, say m, is the value such that one-half of the area under the pdf is to the left of m, and one-half of the area is to the right of m.

When X is discrete and takes on a finite, odd number of values, the median is obtained by ordering the possible values of X and then selecting the value in the middle. For example, if X can take on the values $\{-4, 0, 2, 8, 10, 13, 17\}$, then the median value of X is 8. If X takes on an even number of values, there are really two median values; sometimes, these are averaged to get a unique median value. Thus, if X takes on the values $\{-5, 3, 9, 17\}$, then the median values are 3 and 9; if we average these, we get a median equal to 6.

In general, the median, sometimes denoted Med(X), and the expected value, $E(X)$, are different. Neither is "better" than the other as a measure of central tendency; they are both valid ways to measure the center of the distribution of X. In one special case, the median and expected value (or mean) are the same. If X has a symmetric distribution about the value $\mu$, then $\mu$ is both the expected value and the median. Mathematically, the condition is $f(\mu + x) = f(\mu - x)$ for all x. This case is illustrated in Figure B.3.

[Figure B.3: A symmetric probability distribution.]
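Both median rules can be confirmed with a standard library call; a sketch assuming NumPy:

```python
import numpy as np

print(np.median([-4, 0, 2, 8, 10, 13, 17]))   # 8.0 (odd number of values: middle value)
print(np.median([-5, 3, 9, 17]))              # 6.0 (even number: average of 3 and 9)
```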
B.3d Measures of Variability: Variance and Standard Deviation

Although the central tendency of a random variable is valuable, it does not tell us everything we want to know about the distribution of a random variable. Figure B.4 shows the pdfs of two random variables with the same mean. Clearly, the distribution of X is more tightly centered about its mean than is the distribution of Y. We would like to have a simple way of summarizing differences in the spreads of distributions.

[Figure B.4: Random variables with the same mean but different distributions.]

B.3e Variance

For a random variable X, let $\mu = E(X)$. There are various ways to measure how far X is from its expected value, but the simplest one to work with algebraically is the squared difference, $(X - \mu)^2$. (The squaring eliminates the sign from the distance measure; the resulting positive value corresponds to our intuitive notion of distance and treats values above and below $\mu$ symmetrically.) This distance is itself a random variable, since it can change with every outcome of X. Just as we needed a number to summarize the central tendency of X, we need a number that tells us how far X is from $\mu$, on average. One such number is the variance, which tells us the expected distance from X to its mean:

$\text{Var}(X) \equiv E[(X - \mu)^2]$.  (B.23)

Variance is sometimes denoted $\sigma_X^2$, or simply $\sigma^2$, when the context is clear. From (B.23), it follows that the variance is always nonnegative.

As a computational device, it is useful to observe that

$\sigma^2 = E(X^2 - 2X\mu + \mu^2) = E(X^2) - 2\mu^2 + \mu^2 = E(X^2) - \mu^2$.  (B.24)

In using either (B.23) or (B.24), we need not distinguish between discrete and continuous random variables: the definition of variance is the same in either case. Most often, we first compute $E(X)$, then $E(X^2)$, and then we use the formula in (B.24). For example, if $X \sim \text{Bernoulli}(\theta)$, then $E(X) = \theta$, and, since $X^2 = X$, $E(X^2) = \theta$. It follows from equation (B.24) that $\text{Var}(X) = E(X^2) - \mu^2 = \theta - \theta^2 = \theta(1 - \theta)$.

Two important properties of the variance follow.

Property VAR.1: $\text{Var}(X) = 0$ if, and only if, there is a constant c such that $P(X = c) = 1$, in which case $E(X) = c$. This first property says that the variance of any constant is zero, and if a random variable has zero variance, then it is essentially constant.

Property VAR.2: For any constants a and b, $\text{Var}(aX + b) = a^2 \text{Var}(X)$. This means that adding a constant to a random variable does not change the variance, but multiplying a random variable by a constant increases the variance by a factor equal to the square of that constant. For example, if X denotes temperature in Celsius and $Y = 32 + (9/5)X$ is temperature in Fahrenheit, then $\text{Var}(Y) = (9/5)^2 \text{Var}(X) = (81/25)\text{Var}(X)$.

B.3f Standard Deviation

The standard deviation of a random variable, denoted sd(X), is simply the positive square root of the variance: $\text{sd}(X) \equiv +\sqrt{\text{Var}(X)}$. The standard deviation is sometimes denoted $\sigma_X$, or simply $\sigma$, when the random variable is understood. Two standard deviation properties immediately follow from Properties VAR.1 and VAR.2.

Property SD.1: For any constant c, $\text{sd}(c) = 0$.

Property SD.2: For any constants a and b, $\text{sd}(aX + b) = |a| \cdot \text{sd}(X)$. In particular, if $a > 0$, then $\text{sd}(aX) = a \cdot \text{sd}(X)$.

This last property makes the standard deviation more natural to work with than the variance. For example, suppose that X is a random variable measured in thousands of dollars, say, income. If we define $Y = 1{,}000 X$, then Y is income measured in dollars. Suppose that $E(X) = 20$ and $\text{sd}(X) = 6$. Then, $E(Y) = 1{,}000 \cdot E(X) = 20{,}000$ and $\text{sd}(Y) = 1{,}000 \cdot \text{sd}(X) = 6{,}000$, so that the expected value and standard deviation both increase by the same factor, 1,000. If we worked with variance, we would have $\text{Var}(Y) = (1{,}000)^2 \text{Var}(X)$, so that the variance of Y is one million times larger than the variance of X.
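The scaling rules VAR.2 and SD.2 show up clearly in simulation. A sketch of the income example, assuming NumPy (the normal distribution is just a convenient choice of ours):

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(20, 6, 1_000_000)   # income in thousands: mean 20, sd 6
y = 1000 * x                       # the same income, measured in dollars

print(x.mean(), x.std())           # about 20 and 6
print(y.mean(), y.std())           # about 20,000 and 6,000 (both scale by 1,000)
print(y.var() / x.var())           # about 1,000,000 = 1000**2
```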
B.3g Standardizing a Random Variable

As an application of the properties of variance and standard deviation (and a topic of practical interest in its own right), suppose that, given a random variable X, we define a new random variable by subtracting off its mean $\mu$ and dividing by its standard deviation $\sigma$:

$Z \equiv \dfrac{X - \mu}{\sigma}$,  (B.25)

which we can write as $Z = aX + b$, where $a \equiv 1/\sigma$ and $b \equiv -\mu/\sigma$. Then, from Property E.2,

$E(Z) = aE(X) + b = (\mu/\sigma) - (\mu/\sigma) = 0$.

From Property VAR.2,

$\text{Var}(Z) = a^2 \text{Var}(X) = \sigma^2/\sigma^2 = 1$.

Thus, the random variable Z has a mean of zero and a variance (and therefore a standard deviation) equal to one. This procedure is sometimes known as standardizing the random variable X, and Z is called a standardized random variable. (In introductory statistics courses, it is sometimes called the z-transform of X.) It is important to remember that the standard deviation, not the variance, appears in the denominator of (B.25). As we will see, this transformation is frequently used in statistical inference.

As a specific example, suppose that $E(X) = 2$ and $\text{Var}(X) = 9$. Then, $Z = (X - 2)/3$ has expected value zero and variance one.

B.3h Skewness and Kurtosis

We can use the standardized version of a random variable to define other features of the distribution of a random variable. These features are described by using what are called higher order moments. For example, the third moment of the random variable Z in (B.25) is used to determine whether a distribution is symmetric about its mean. We can write

$E(Z^3) = E[(X - \mu)^3]/\sigma^3$.

If X has a symmetric distribution about $\mu$, then Z has a symmetric distribution about zero. (The division by $\sigma^3$ does not change whether the distribution is symmetric.) That means the density of Z at any two points z and $-z$ is the same, so that, in computing $E(Z^3)$, positive values $z^3$ when $z > 0$ are exactly offset by the negative values $(-z)^3 = -z^3$. It follows that, if X is symmetric about its mean, then $E(Z^3) = 0$. Generally, $E[(X - \mu)^3]/\sigma^3$ is viewed as a measure of skewness in the distribution of X. In a statistical setting, we might use data to estimate $E(Z^3)$ to determine whether an underlying population distribution appears to be symmetric. (Computer Exercise C5.4 in Chapter 5 provides an illustration.)

It also can be informative to compute the fourth moment of Z,

$E(Z^4) = E[(X - \mu)^4]/\sigma^4$.

Because $Z^4 \ge 0$, $E(Z^4) \ge 0$ (and, in any interesting case, strictly greater than zero). Without having a reference value, it is difficult to interpret values of $E(Z^4)$, but larger values mean that the tails in the distribution of X are thicker. The fourth moment $E(Z^4)$ is called a measure of kurtosis in the distribution of X. In Section B.5, we will obtain $E(Z^4)$ for the normal distribution.

B.4 Features of Joint and Conditional Distributions

B.4a Measures of Association: Covariance and Correlation

While the joint pdf of two random variables completely describes the relationship between them, it is useful to have summary measures of how, on average, two random variables vary with one another. As with the expected value and variance, this is similar to using a single number to summarize something about an entire distribution, which in this case is a joint distribution of two random variables.
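Sample analogues of E(Z³) and E(Z⁴) are easy to compute. A sketch for a simulated symmetric distribution, assuming NumPy (that the kurtosis of a normal distribution is 3 anticipates the Section B.5 result):

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(2.0, 3.0, 1_000_000)   # symmetric distribution with mu = 2, sigma = 3

z = (x - x.mean()) / x.std()          # standardize as in (B.25)
print(z.mean(), z.var())              # about 0 and 1
print(np.mean(z**3))                  # skewness estimate, about 0 (symmetry)
print(np.mean(z**4))                  # kurtosis estimate, about 3 for a normal
```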
B.4b Covariance

Let $\mu_X = E(X)$ and $\mu_Y = E(Y)$, and consider the random variable $(X - \mu_X)(Y - \mu_Y)$. Now, if X is above its mean and Y is above its mean, then $(X - \mu_X)(Y - \mu_Y) > 0$. This is also true if $X < \mu_X$ and $Y < \mu_Y$. On the other hand, if $X > \mu_X$ and $Y < \mu_Y$, or vice versa, then $(X - \mu_X)(Y - \mu_Y) < 0$. How, then, can this product tell us anything about the relationship between X and Y?

The covariance between two random variables X and Y, sometimes called the population covariance to emphasize that it concerns the relationship between two variables describing a population, is defined as the expected value of the product $(X - \mu_X)(Y - \mu_Y)$:

$\text{Cov}(X, Y) \equiv E[(X - \mu_X)(Y - \mu_Y)]$,  (B.26)

which is sometimes denoted $\sigma_{XY}$. If $\sigma_{XY} > 0$, then, on average, when X is above its mean, Y is also above its mean. If $\sigma_{XY} < 0$, then, on average, when X is above its mean, Y is below its mean.

Several expressions useful for computing Cov(X, Y) are as follows:

$\text{Cov}(X, Y) = E[(X - \mu_X)(Y - \mu_Y)] = E[(X - \mu_X)Y] = E[X(Y - \mu_Y)] = E(XY) - \mu_X \mu_Y$.  (B.27)

It follows from (B.27) that, if $E(X) = 0$ or $E(Y) = 0$, then $\text{Cov}(X, Y) = E(XY)$.

Covariance measures the amount of linear dependence between two random variables. A positive covariance indicates that two random variables move in the same direction, while a negative covariance indicates they move in opposite directions. Interpreting the magnitude of a covariance can be a little tricky, as we will see shortly.

Because covariance is a measure of how two random variables are related, it is natural to ask how covariance is related to the notion of independence. This is given by the following property.

Property COV.1: If X and Y are independent, then $\text{Cov}(X, Y) = 0$. This property follows from equation (B.27) and the fact that $E(XY) = E(X)E(Y)$ when X and Y are independent. It is important to remember that the converse of COV.1 is not true: zero covariance between X and Y does not imply that X and Y are independent. In fact, there are random variables X such that, if $Y = X^2$, $\text{Cov}(X, Y) = 0$. (Any random variable with $E(X) = 0$ and $E(X^3) = 0$ has this property.) If $Y = X^2$, then X and Y are clearly not independent: once we know X, we know Y. It seems rather strange that X and $X^2$ could have zero covariance, and this reveals a weakness of covariance as a general measure of association between random variables. The covariance is useful in contexts when relationships are at least approximately linear.

The second major property of covariance involves covariances between linear functions.

Property COV.2: For any constants $a_1$, $b_1$, $a_2$, and $b_2$,

$\text{Cov}(a_1 X + b_1, a_2 Y + b_2) = a_1 a_2 \text{Cov}(X, Y)$.  (B.28)

An important implication of COV.2 is that the covariance between two random variables can be altered simply by multiplying one or both of the random variables by a constant. This is important in economics because monetary variables, inflation rates, and so on can be defined with different units of measurement without changing their meaning.

Finally, it is useful to know that the absolute value of the covariance between any two random variables is bounded by the product of their standard deviations; this is known as the Cauchy-Schwartz inequality.
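The Y = X² example can be seen in simulation: the sample covariance is near zero even though Y is an exact function of X. A sketch assuming NumPy:

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.normal(0.0, 1.0, 1_000_000)   # E(X) = 0 and E(X^3) = 0 (symmetric about 0)
y = x**2                              # Y is an exact function of X, so not independent

print(np.cov(x, y)[0, 1])             # sample covariance, close to 0
```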
Finally, it is useful to know that the absolute value of the covariance between any two random variables is bounded by the product of their standard deviations; this is known as the Cauchy-Schwartz inequality.

Property COV.3: $|\mathrm{Cov}(X, Y)| \leq \mathrm{sd}(X)\,\mathrm{sd}(Y)$.

B.4c Correlation Coefficient

Suppose we want to know the relationship between amount of education and annual earnings in the working population. We could let $X$ denote education and $Y$ denote earnings and then compute their covariance. But the answer we get will depend on how we choose to measure education and earnings. Property COV.2 implies that the covariance between education and earnings depends on whether earnings are measured in dollars or thousands of dollars, or whether education is measured in months or years. It is pretty clear that how we measure these variables has no bearing on how strongly they are related. But the covariance between them does depend on the units of measurement.

The fact that the covariance depends on units of measurement is a deficiency that is overcome by the correlation coefficient between $X$ and $Y$:

$\mathrm{Corr}(X, Y) \equiv \dfrac{\mathrm{Cov}(X, Y)}{\mathrm{sd}(X)\,\mathrm{sd}(Y)} = \dfrac{\sigma_{XY}}{\sigma_X \sigma_Y}$;   (B.29)

the correlation coefficient between $X$ and $Y$ is sometimes denoted $\rho_{XY}$ (and is sometimes called the population correlation).

Because $\sigma_X$ and $\sigma_Y$ are positive, $\mathrm{Cov}(X, Y)$ and $\mathrm{Corr}(X, Y)$ always have the same sign, and $\mathrm{Corr}(X, Y) = 0$ if and only if $\mathrm{Cov}(X, Y) = 0$. Some of the properties of covariance carry over to correlation. If $X$ and $Y$ are independent, then $\mathrm{Corr}(X, Y) = 0$, but zero correlation does not imply independence. (Like the covariance, the correlation coefficient is also a measure of linear dependence.) However, the magnitude of the correlation coefficient is easier to interpret than the size of the covariance, due to the following property.

Property CORR.1: $-1 \leq \mathrm{Corr}(X, Y) \leq 1$.

If $\mathrm{Corr}(X, Y) = 0$, or equivalently $\mathrm{Cov}(X, Y) = 0$, then there is no linear relationship between $X$ and $Y$, and $X$ and $Y$ are said to be uncorrelated random variables; otherwise, $X$ and $Y$ are correlated. $\mathrm{Corr}(X, Y) = 1$ implies a perfect positive linear relationship, which means that we can write $Y = a + bX$ for some constant $a$ and some constant $b > 0$. $\mathrm{Corr}(X, Y) = -1$ implies a perfect negative linear relationship, so that $Y = a + bX$ for some $b < 0$. The extreme cases of positive or negative 1 rarely occur. Values of $\rho_{XY}$ closer to 1 or $-1$ indicate stronger linear relationships.

As mentioned earlier, the correlation between $X$ and $Y$ is invariant to the units of measurement of either $X$ or $Y$. This is stated more generally as follows.

Property CORR.2: For constants $a_1$, $b_1$, $a_2$, and $b_2$, with $a_1 a_2 > 0$,

$\mathrm{Corr}(a_1 X + b_1, a_2 Y + b_2) = \mathrm{Corr}(X, Y)$.

If $a_1 a_2 < 0$, then $\mathrm{Corr}(a_1 X + b_1, a_2 Y + b_2) = -\mathrm{Corr}(X, Y)$.

As an example, suppose that the correlation between earnings and education in the working population is .15. This measure does not depend on whether earnings are measured in dollars, thousands of dollars, or any other unit; it also does not depend on whether education is measured in years, quarters, months, and so on.
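Property CORR.2 is likewise easy to confirm numerically. In this hypothetical sketch (the variables and parameters are invented for illustration), changing units leaves the correlation unchanged, and reversing a sign flips it:

```python
import numpy as np

rng = np.random.default_rng(2)
educ = rng.normal(13, 2, size=100_000)                       # years of education
earn = 2_000 * educ + rng.normal(0, 20_000, size=educ.size)  # annual earnings, dollars

r = np.corrcoef(educ, earn)[0, 1]
r_units = np.corrcoef(12 * educ, earn / 1_000)[0, 1]  # months vs. thousands of dollars
r_flip = np.corrcoef(-educ, earn)[0, 1]               # a1*a2 < 0 flips the sign

print(round(r, 3), round(r_units, 3), round(r_flip, 3))
```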
B.4d Variance of Sums of Random Variables

Now that we have defined covariance and correlation, we can complete our list of major properties of the variance.

Property VAR.3: For constants $a$ and $b$,

$\mathrm{Var}(aX + bY) = a^2 \mathrm{Var}(X) + b^2 \mathrm{Var}(Y) + 2ab\,\mathrm{Cov}(X, Y)$.

It follows immediately that, if $X$ and $Y$ are uncorrelated, so that $\mathrm{Cov}(X, Y) = 0$, then

$\mathrm{Var}(X + Y) = \mathrm{Var}(X) + \mathrm{Var}(Y)$   (B.30)

and

$\mathrm{Var}(X - Y) = \mathrm{Var}(X) + \mathrm{Var}(Y)$.   (B.31)

In the latter case, note how the variance of the difference is the sum of the variances, not the difference in the variances.

As an example of (B.30), let $X$ denote profits earned by a restaurant during a Friday night and let $Y$ be profits earned on the following Saturday night. Then, $Z = X + Y$ is profits for the two nights. Suppose $X$ and $Y$ each have an expected value of $300 and a standard deviation of $15 (so that the variance is 225). Expected profits for the two nights is $E(Z) = E(X) + E(Y) = 2 \cdot (300) = 600$ dollars. If $X$ and $Y$ are independent, and therefore uncorrelated, then the variance of total profits is the sum of the variances: $\mathrm{Var}(Z) = \mathrm{Var}(X) + \mathrm{Var}(Y) = 2 \cdot (225) = 450$. It follows that the standard deviation of total profits is $\sqrt{450}$, or about $21.21.

Expressions (B.30) and (B.31) extend to more than two random variables. To state this extension, we need a definition. The random variables $\{X_1, \ldots, X_n\}$ are pairwise uncorrelated random variables if each variable in the set is uncorrelated with every other variable in the set. That is, $\mathrm{Cov}(X_i, X_j) = 0$ for all $i \neq j$.

Property VAR.4: If $\{X_1, \ldots, X_n\}$ are pairwise uncorrelated random variables and $\{a_i : i = 1, \ldots, n\}$ are constants, then

$\mathrm{Var}(a_1 X_1 + \cdots + a_n X_n) = a_1^2 \mathrm{Var}(X_1) + \cdots + a_n^2 \mathrm{Var}(X_n)$.

In summation notation, we can write

$\mathrm{Var}\left(\sum_{i=1}^n a_i X_i\right) = \sum_{i=1}^n a_i^2 \mathrm{Var}(X_i)$.   (B.32)

A special case of Property VAR.4 occurs when we take $a_i = 1$ for all $i$. Then, for pairwise uncorrelated random variables, the variance of the sum is the sum of the variances:

$\mathrm{Var}\left(\sum_{i=1}^n X_i\right) = \sum_{i=1}^n \mathrm{Var}(X_i)$.   (B.33)

Because independent random variables are uncorrelated (see Property COV.1), the variance of a sum of independent random variables is the sum of the variances.

If the $X_i$ are not pairwise uncorrelated, then the expression for $\mathrm{Var}(\sum_{i=1}^n a_i X_i)$ is much more complicated; we must add to the right-hand side of (B.32) the terms $2a_i a_j \mathrm{Cov}(X_i, X_j)$ for all $i > j$.

We can use (B.33) to derive the variance for a binomial random variable. Let $X \sim \mathrm{Binomial}(n, \theta)$ and write $X = Y_1 + \cdots + Y_n$, where the $Y_i$ are independent Bernoulli($\theta$) random variables. Then, by (B.33), $\mathrm{Var}(X) = \mathrm{Var}(Y_1) + \cdots + \mathrm{Var}(Y_n) = n\theta(1 - \theta)$.

In the airline reservation example with $n = 120$ and $\theta = .85$, the variance of the number of passengers arriving for their reservations is $120(.85)(.15) = 15.3$, so the standard deviation is about 3.9.
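The binomial calculation is easy to check by simulating the airline example directly (a toy sketch; the replication count and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(3)
n, theta = 120, 0.85

# Each row holds 120 Bernoulli(0.85) draws; row sums are Binomial(120, 0.85) draws.
shows = rng.random((200_000, n)) < theta
x = shows.sum(axis=1)

print(x.var())  # close to n*theta*(1-theta) = 15.3
print(x.std())  # close to 3.9
```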
B.4e Conditional Expectation

Covariance and correlation measure the linear relationship between two random variables and treat them symmetrically. More often in the social sciences, we would like to explain one variable, called $Y$, in terms of another variable, say, $X$. Further, if $Y$ is related to $X$ in a nonlinear fashion, we would like to know this. Call $Y$ the explained variable and $X$ the explanatory variable. For example, $Y$ might be hourly wage, and $X$ might be years of formal education.

We have already introduced the notion of the conditional probability density function of $Y$ given $X$. Thus, we might want to see how the distribution of wages changes with education level. However, we usually want to have a simple way of summarizing this distribution. A single number will no longer suffice, since the distribution of $Y$ given $X = x$ generally depends on the value of $x$. Nevertheless, we can summarize the relationship between $Y$ and $X$ by looking at the conditional expectation of $Y$ given $X$, sometimes called the conditional mean. The idea is this. Suppose we know that $X$ has taken on a particular value, say, $x$. Then, we can compute the expected value of $Y$, given that we know this outcome of $X$. We denote this expected value by $E(Y \mid X = x)$, or sometimes $E(Y \mid x)$ for shorthand. Generally, as $x$ changes, so does $E(Y \mid x)$.

When $Y$ is a discrete random variable taking on values $\{y_1, \ldots, y_m\}$, then

$E(Y \mid x) = \sum_{j=1}^m y_j f_{Y \mid X}(y_j \mid x)$.

When $Y$ is continuous, $E(Y \mid x)$ is defined by integrating $y f_{Y \mid X}(y \mid x)$ over all possible values of $y$. As with unconditional expectations, the conditional expectation is a weighted average of possible values of $Y$, but now the weights reflect the fact that $X$ has taken on a specific value. Thus, $E(Y \mid x)$ is just some function of $x$, which tells us how the expected value of $Y$ varies with $x$.

As an example, let $(X, Y)$ represent the population of all working individuals, where $X$ is years of education and $Y$ is hourly wage. Then, $E(Y \mid X = 12)$ is the average hourly wage for all people in the population with 12 years of education (roughly a high school education). $E(Y \mid X = 16)$ is the average hourly wage for all people with 16 years of education. Tracing out the expected value for various levels of education provides important information on how wages and education are related. See Figure B.5 for an illustration.

[Figure B.5: The expected value of hourly wage given various levels of education.]

In principle, the expected value of hourly wage can be found at each level of education, and these expectations can be summarized in a table. Because education can vary widely (and can even be measured in fractions of a year), this is a cumbersome way to show the relationship between average wage and amount of education. In econometrics, we typically specify simple functions that capture this relationship. As an example, suppose that the expected value of WAGE given EDUC is the linear function

$E(WAGE \mid EDUC) = 1.05 + .45\,EDUC$.

If this relationship holds in the population of working people, the average wage for people with eight years of education is $1.05 + .45(8) = 4.65$, or $4.65. The average wage for people with 16 years of education is 8.25, or $8.25. The coefficient on EDUC implies that each year of education increases the expected hourly wage by .45, or 45 cents.
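One way to make the conditional-mean idea concrete is to simulate a population in which $E(WAGE \mid EDUC) = 1.05 + .45\,EDUC$ holds and then average wages within education levels. This sketch uses invented parameters that simply match the example above:

```python
import numpy as np

rng = np.random.default_rng(4)
educ = rng.integers(8, 21, size=500_000)                      # years of education
wage = 1.05 + 0.45 * educ + rng.normal(0, 2, size=educ.size)  # linear conditional mean

# The average wage within each education level approximates E(WAGE | EDUC = x).
for x in (8, 12, 16):
    print(x, round(wage[educ == x].mean(), 2))  # about 4.65, 6.45, 8.25
```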
Conditional expectations can also be nonlinear functions. For example, suppose that $E(Y \mid x) = 10/x$, where $X$ is a random variable that is always greater than zero. This function is graphed in Figure B.6. This could represent a demand function, where $Y$ is quantity demanded and $X$ is price. If $Y$ and $X$ are related in this way, an analysis of linear association, such as correlation analysis, would be incomplete.

[Figure B.6: Graph of $E(Y \mid x) = 10/x$.]

B.4f Properties of Conditional Expectation

Several basic properties of conditional expectations are useful for derivations in econometric analysis.

Property CE.1: $E[c(X) \mid X] = c(X)$, for any function $c(X)$.

This first property means that functions of $X$ behave as constants when we compute expectations conditional on $X$. For example, $E(X^2 \mid X) = X^2$. Intuitively, this simply means that if we know $X$, then we also know $X^2$.

Property CE.2: For functions $a(X)$ and $b(X)$,

$E[a(X)Y + b(X) \mid X] = a(X)E(Y \mid X) + b(X)$.

For example, we can easily compute the conditional expectation of a function such as $XY + 2X^2$: $E(XY + 2X^2 \mid X) = XE(Y \mid X) + 2X^2$.

The next property ties together the notions of independence and conditional expectations.

Property CE.3: If $X$ and $Y$ are independent, then $E(Y \mid X) = E(Y)$.

This property means that, if $X$ and $Y$ are independent, then the expected value of $Y$ given $X$ does not depend on $X$, in which case $E(Y \mid X)$ always equals the (unconditional) expected value of $Y$. In the wage and education example, if wages were independent of education, then the average wages of high school and college graduates would be the same. Since this is almost certainly false, we cannot assume that wage and education are independent.

A special case of Property CE.3 is the following: if $U$ and $X$ are independent and $E(U) = 0$, then $E(U \mid X) = 0$.

There are also properties of the conditional expectation that have to do with the fact that $E(Y \mid X)$ is a function of $X$, say, $E(Y \mid X) = \mu(X)$. Because $X$ is a random variable, $\mu(X)$ is also a random variable. Furthermore, $\mu(X)$ has a probability distribution and, therefore, an expected value. Generally, the expected value of $\mu(X)$ could be very difficult to compute directly. The law of iterated expectations says that the expected value of $\mu(X)$ is simply equal to the expected value of $Y$. We write this as follows.

Property CE.4: $E[E(Y \mid X)] = E(Y)$.

This property is a little hard to grasp at first. It means that, if we first obtain $E(Y \mid X)$ as a function of $X$ and take the expected value of this (with respect to the distribution of $X$, of course), then we end up with $E(Y)$. This is hardly obvious, but it can be derived using the definition of expected values.

As an example of how to use Property CE.4, let $Y = WAGE$ and $X = EDUC$, where WAGE is measured in dollars per hour and EDUC is measured in years. Suppose the expected value of WAGE given EDUC is $E(WAGE \mid EDUC) = 4 + .60\,EDUC$. Further, $E(EDUC) = 11.5$. Then, the law of iterated expectations implies that $E(WAGE) = E(4 + .60\,EDUC) = 4 + .60\,E(EDUC) = 4 + .60(11.5) = 10.90$, or $10.90 an hour.
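The law of iterated expectations is easy to verify by simulation. The following sketch (hypothetical distributional choices; the conditional mean matches the example above) averages $\mu(EDUC)$ over the distribution of EDUC:

```python
import numpy as np

rng = np.random.default_rng(5)
educ = rng.normal(11.5, 2.5, size=1_000_000)               # E(EDUC) = 11.5
wage = 4 + 0.60 * educ + rng.normal(0, 3, size=educ.size)  # E(WAGE|EDUC) = 4 + .60*EDUC

m = 4 + 0.60 * educ  # mu(EDUC) = E(WAGE | EDUC), itself a random variable
print(m.mean())      # E[E(WAGE|EDUC)] ~ 10.90
print(wage.mean())   # E(WAGE) ~ 10.90, as Property CE.4 promises
```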
The next property states a more general version of the law of iterated expectations.

Property CE.4′: $E(Y \mid X) = E[E(Y \mid X, Z) \mid X]$.

In other words, we can find $E(Y \mid X)$ in two steps. First, find $E(Y \mid X, Z)$ for any other random variable $Z$. Then, find the expected value of $E(Y \mid X, Z)$, conditional on $X$.

Property CE.5: If $E(Y \mid X) = E(Y)$, then $\mathrm{Cov}(X, Y) = 0$ (and so $\mathrm{Corr}(X, Y) = 0$). In fact, every function of $X$ is uncorrelated with $Y$.

This property means that, if knowledge of $X$ does not change the expected value of $Y$, then $X$ and $Y$ must be uncorrelated, which implies that if $X$ and $Y$ are correlated, then $E(Y \mid X)$ must depend on $X$. The converse of Property CE.5 is not true: if $X$ and $Y$ are uncorrelated, $E(Y \mid X)$ could still depend on $X$. For example, suppose $Y = X^2$. Then, $E(Y \mid X) = X^2$, which is clearly a function of $X$. However, as we mentioned in our discussion of covariance and correlation, it is possible that $X$ and $X^2$ are uncorrelated. The conditional expectation captures the nonlinear relationship between $X$ and $Y$ that correlation analysis would miss entirely.

Properties CE.4 and CE.5 have two important implications: if $U$ and $X$ are random variables such that $E(U \mid X) = 0$, then $E(U) = 0$, and $U$ and $X$ are uncorrelated.

Property CE.6: If $E(Y^2) < \infty$ and $E[g(X)^2] < \infty$ for some function $g$, then

$E\{[Y - \mu(X)]^2 \mid X\} \leq E\{[Y - g(X)]^2 \mid X\}$

and

$E\{[Y - \mu(X)]^2\} \leq E\{[Y - g(X)]^2\}$.

Property CE.6 is very useful in predicting or forecasting contexts. The first inequality says that, if we measure prediction inaccuracy as the expected squared prediction error, conditional on $X$, then the conditional mean is better than any other function of $X$ for predicting $Y$. The conditional mean also minimizes the unconditional expected squared prediction error.

B.4g Conditional Variance

Given random variables $X$ and $Y$, the variance of $Y$, conditional on $X = x$, is simply the variance associated with the conditional distribution of $Y$, given $X = x$: $E\{[Y - E(Y \mid x)]^2 \mid x\}$. The formula

$\mathrm{Var}(Y \mid X = x) = E(Y^2 \mid x) - [E(Y \mid x)]^2$

is often useful for calculations. Only occasionally will we have to compute a conditional variance. But we will have to make assumptions about and manipulate conditional variances for certain topics in regression analysis.

As an example, let $Y = SAVING$ and $X = INCOME$ (both of these measured annually for the population of all families). Suppose that

$\mathrm{Var}(SAVING \mid INCOME) = 400 + .25\,INCOME$.

This says that, as income increases, the variance in saving levels also increases. It is important to see that the relationship between the variance of SAVING and INCOME is totally separate from that between the expected value of SAVING and INCOME.

We state one useful property about the conditional variance.

Property CV.1: If $X$ and $Y$ are independent, then $\mathrm{Var}(Y \mid X) = \mathrm{Var}(Y)$.

This property is pretty clear, since the distribution of $Y$ given $X$ does not depend on $X$, and $\mathrm{Var}(Y \mid X)$ is just one feature of this distribution.
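A short simulation shows what a conditional variance like the SAVING example means in practice: the spread of $Y$ within a narrow slice of $X$ values grows with $X$. Everything here (the conditional mean, units, and ranges) is invented for illustration; only the variance function comes from the example above:

```python
import numpy as np

rng = np.random.default_rng(6)
income = rng.uniform(100, 900, size=1_000_000)
# SAVING has conditional variance 400 + .25*INCOME around a made-up conditional mean.
saving = 0.1 * income + np.sqrt(400 + 0.25 * income) * rng.normal(size=income.size)

for lo, hi in ((100, 200), (800, 900)):
    sl = (income >= lo) & (income < hi)
    print(lo, hi, round(saving[sl].var(), 1))  # roughly 400 + .25 * (slice midpoint)
```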
B.5 The Normal and Related Distributions

B.5a The Normal Distribution

The normal distribution, and those derived from it, are the most widely used distributions in statistics and econometrics. Assuming that random variables defined over populations are normally distributed simplifies probability calculations. In addition, we will rely heavily on the normal and related distributions to conduct inference in statistics and econometrics, even when the underlying population is not necessarily normal. We must postpone the details, but be assured that these distributions will arise many times throughout this text.

A normal random variable is a continuous random variable that can take on any value. Its probability density function has the familiar bell shape graphed in Figure B.7. Mathematically, the pdf of $X$ can be written as

$f(x) = \dfrac{1}{\sigma\sqrt{2\pi}} \exp\left[-\dfrac{(x - \mu)^2}{2\sigma^2}\right]$, $-\infty < x < \infty$,   (B.34)

where $\mu = E(X)$ and $\sigma^2 = \mathrm{Var}(X)$. We say that $X$ has a normal distribution with expected value $\mu$ and variance $\sigma^2$, written as $X \sim \mathrm{Normal}(\mu, \sigma^2)$. Because the normal distribution is symmetric about $\mu$, $\mu$ is also the median of $X$. The normal distribution is sometimes called the Gaussian distribution, after the famous mathematician C. F. Gauss.

[Figure B.7: The general shape of the normal probability density function.]

Certain random variables appear to roughly follow a normal distribution. Human heights and weights, test scores, and county unemployment rates have pdfs roughly the shape in Figure B.7. Other distributions, such as income distributions, do not appear to follow the normal probability function. In most countries, income is not symmetrically distributed about any value; the distribution is skewed toward the upper tail. In some cases, a variable can be transformed to achieve normality. A popular transformation is the natural log, which makes sense for positive random variables. If $X$ is a positive random variable, such as income, and $Y = \log(X)$ has a normal distribution, then we say that $X$ has a lognormal distribution. It turns out that the lognormal distribution fits income distribution pretty well in many countries. Other variables, such as prices of goods, appear to be well described as lognormally distributed.

B.5b The Standard Normal Distribution

One special case of the normal distribution occurs when the mean is zero and the variance (and, therefore, the standard deviation) is unity. If a random variable $Z$ has a $\mathrm{Normal}(0, 1)$ distribution, then we say it has a standard normal distribution. The pdf of a standard normal random variable is denoted $\phi(z)$; from (B.34), with $\mu = 0$ and $\sigma^2 = 1$, it is given by

$\phi(z) = \dfrac{1}{\sqrt{2\pi}} \exp(-z^2/2)$, $-\infty < z < \infty$.   (B.35)

The standard normal cumulative distribution function is denoted $\Phi(z)$ and is obtained as the area under $\phi$, to the left of $z$; see Figure B.8. Recall that $\Phi(z) = P(Z \leq z)$; because $Z$ is continuous, $\Phi(z) = P(Z < z)$ as well.

[Figure B.8: The standard normal cumulative distribution function.]

No simple formula can be used to obtain the values of $\Phi(z)$ because $\Phi(z)$ is the integral of the function in (B.35), and this integral has no closed form. Nevertheless, the values for $\Phi(z)$ are easily tabulated; they are given for $z$ between $-3.1$ and $3.1$ in Table G.1 in Appendix G. For $z < -3.1$, $\Phi(z)$ is less than .001, and for $z > 3.1$, $\Phi(z)$ is greater than .999. Most statistics and econometrics software packages include simple commands for computing values of the standard normal cdf, so we can often avoid printed tables entirely and obtain the probabilities for any value of $z$.
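For example, in Python the standard normal cdf is available through scipy.stats (norm.cdf is a real SciPy function; the particular values below are just illustrations):

```python
from scipy.stats import norm

# Phi(z) = P(Z <= z) for a standard normal Z
print(norm.cdf(0.0))   # 0.5
print(norm.cdf(1.96))  # about 0.975
print(norm.cdf(-3.1))  # below .001, matching the tail comment above
```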
Using basic facts from probability (and, in particular, properties (B.7) and (B.8) concerning cdfs), we can use the standard normal cdf for computing the probability of any event involving a standard normal random variable. The most important formulas are

$P(Z > z) = 1 - \Phi(z)$,   (B.36)

$P(Z < -z) = P(Z > z)$,   (B.37)

and

$P(a \leq Z \leq b) = \Phi(b) - \Phi(a)$.   (B.38)

Because $Z$ is a continuous random variable, all three formulas hold whether or not the inequalities are strict. Some examples include $P(Z > .44) = 1 - .67 = .33$, $P(Z < -.92) = P(Z > .92) = 1 - .821 = .179$, and $P(-1 < Z \leq .5) = .692 - .159 = .533$.

Another useful expression is that, for any $c > 0$,

$P(|Z| > c) = P(Z > c) + P(Z < -c) = 2 \cdot P(Z > c) = 2[1 - \Phi(c)]$.   (B.39)

Thus, the probability that the absolute value of $Z$ is bigger than some positive constant $c$ is simply twice the probability $P(Z > c)$; this reflects the symmetry of the standard normal distribution.

In most applications, we start with a normally distributed random variable, $X \sim \mathrm{Normal}(\mu, \sigma^2)$, where $\mu$ is different from zero and $\sigma^2 \neq 1$. Any normal random variable can be turned into a standard normal using the following property.

Property Normal.1: If $X \sim \mathrm{Normal}(\mu, \sigma^2)$, then $(X - \mu)/\sigma \sim \mathrm{Normal}(0, 1)$.

Property Normal.1 shows how to turn any normal random variable into a standard normal. Thus, suppose $X \sim \mathrm{Normal}(3, 4)$, and we would like to compute $P(X \leq 1)$. The steps always involve the normalization of $X$ to a standard normal:

$P(X \leq 1) = P(X - 3 \leq 1 - 3) = P\left(\dfrac{X - 3}{2} \leq -1\right) = P(Z \leq -1) = \Phi(-1) = .159$.

Example B.6 (Probabilities for a Normal Random Variable):

First, let us compute $P(2 < X \leq 6)$ when $X \sim \mathrm{Normal}(4, 9)$ (whether we use $<$ or $\leq$ is irrelevant because $X$ is a continuous random variable). Now,

$P(2 < X \leq 6) = P\left(\dfrac{2 - 4}{3} < \dfrac{X - 4}{3} \leq \dfrac{6 - 4}{3}\right) = P(-2/3 < Z \leq 2/3) = \Phi(.67) - \Phi(-.67) = .749 - .251 = .498$.

Now, let us compute $P(|X| > 2)$:

$P(|X| > 2) = P(X > 2) + P(X < -2) = P[(X - 4)/3 > (2 - 4)/3] + P[(X - 4)/3 < (-2 - 4)/3] = 1 - \Phi(-2/3) + \Phi(-2) = 1 - .251 + .023 = .772$.
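These two calculations can be checked directly with SciPy's normal cdf (a real API; the numbers simply reproduce Example B.6):

```python
from scipy.stats import norm

# X ~ Normal(4, 9), so sd = 3; scipy parameterizes by mean and standard deviation.
x = norm(loc=4, scale=3)

print(x.cdf(6) - x.cdf(2))         # P(2 < X <= 6), about .498
print((1 - x.cdf(2)) + x.cdf(-2))  # P(|X| > 2), about .772
```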
B.5c Additional Properties of the Normal Distribution

We end this subsection by collecting several other facts about normal distributions that we will later use.

Property Normal.2: If $X \sim \mathrm{Normal}(\mu, \sigma^2)$, then $aX + b \sim \mathrm{Normal}(a\mu + b, a^2\sigma^2)$.

Thus, if $X \sim \mathrm{Normal}(1, 9)$, then $Y = 2X + 3$ is distributed as normal with mean $2E(X) + 3 = 5$ and variance $2^2 \cdot 9 = 36$; $\mathrm{sd}(Y) = 2\,\mathrm{sd}(X) = 2 \cdot 3 = 6$.

Earlier, we discussed how, in general, zero correlation and independence are not the same. In the case of normally distributed random variables, it turns out that zero correlation suffices for independence.

Property Normal.3: If $X$ and $Y$ are jointly normally distributed, then they are independent if, and only if, $\mathrm{Cov}(X, Y) = 0$.

Property Normal.4: Any linear combination of independent, identically distributed normal random variables has a normal distribution.

For example, let $X_i$, for $i = 1, 2, 3$, be independent random variables distributed as $\mathrm{Normal}(\mu, \sigma^2)$. Define $W = X_1 + 2X_2 - 3X_3$. Then, $W$ is normally distributed; we must simply find its mean and variance. Now,

$E(W) = E(X_1) + 2E(X_2) - 3E(X_3) = \mu + 2\mu - 3\mu = 0$.

Also,

$\mathrm{Var}(W) = \mathrm{Var}(X_1) + 4\mathrm{Var}(X_2) + 9\mathrm{Var}(X_3) = 14\sigma^2$.

Property Normal.4 also implies that the average of independent, normally distributed random variables has a normal distribution. If $Y_1, Y_2, \ldots, Y_n$ are independent random variables and each is distributed as $\mathrm{Normal}(\mu, \sigma^2)$, then

$\bar{Y} \sim \mathrm{Normal}(\mu, \sigma^2/n)$.   (B.40)

This result is critical for statistical inference about the mean in a normal population.

Other features of the normal distribution are worth knowing, although they do not play a central role in the text. Because a normal random variable is symmetric about its mean, it has zero skewness, that is, $E[(X - \mu)^3] = 0$. Further, it can be shown that $E[(X - \mu)^4]/\sigma^4 = 3$, or $E(Z^4) = 3$, where $Z$ has a standard normal distribution. Because the normal distribution is so prevalent in probability and statistics, the measure of kurtosis for any given random variable $X$ (whose fourth moment exists) is often defined to be $E[(X - \mu)^4]/\sigma^4 - 3$, that is, relative to the value for the standard normal distribution. If $E[(X - \mu)^4]/\sigma^4 > 3$, then the distribution of $X$ has fatter tails than the normal distribution (a somewhat common occurrence, such as with the t distribution to be introduced shortly); if $E[(X - \mu)^4]/\sigma^4 < 3$, then the distribution has thinner tails than the normal (a rarer situation).

B.5d The Chi-Square Distribution

The chi-square distribution is obtained directly from independent, standard normal random variables. Let $Z_i$, $i = 1, 2, \ldots, n$, be independent random variables, each distributed as standard normal. Define a new random variable as the sum of the squares of the $Z_i$:

$X = \sum_{i=1}^n Z_i^2$.   (B.41)

Then, $X$ has what is known as a chi-square distribution with $n$ degrees of freedom (or df, for short). We write this as $X \sim \chi^2_n$. The df in a chi-square distribution corresponds to the number of terms in the sum in (B.41). The concept of degrees of freedom will play an important role in our statistical and econometric analyses.

The pdf for chi-square distributions with varying degrees of freedom is given in Figure B.9; we will not need the formula for this pdf, and so we do not reproduce it here. From equation (B.41), it is clear that a chi-square random variable is always nonnegative and that, unlike the normal distribution, the chi-square distribution is not symmetric about any point. It can be shown that if $X \sim \chi^2_n$, then the expected value of $X$ is $n$ (the number of terms in (B.41)), and the variance of $X$ is $2n$.

[Figure B.9: The chi-square distribution with various degrees of freedom (df = 2, 4, 8).]
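The construction in (B.41) can be mimicked in a few lines (a toy check; the df, seed, and replication count are arbitrary) to confirm that the mean and variance come out to $n$ and $2n$:

```python
import numpy as np

rng = np.random.default_rng(7)
n = 8  # degrees of freedom

# Each chi-square draw is the sum of n squared standard normals, as in (B.41).
z = rng.normal(size=(500_000, n))
x = (z**2).sum(axis=1)

print(x.mean())  # close to n = 8
print(x.var())   # close to 2n = 16
```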
B.5e The t Distribution

The t distribution is the workhorse in classical statistics and multiple regression analysis. We obtain a t distribution from a standard normal and a chi-square random variable. Let $Z$ have a standard normal distribution and let $X$ have a chi-square distribution with $n$ degrees of freedom. Further, assume that $Z$ and $X$ are independent. Then, the random variable

$T = \dfrac{Z}{\sqrt{X/n}}$   (B.42)

has a t distribution with $n$ degrees of freedom. We will denote this by $T \sim t_n$. The t distribution gets its degrees of freedom from the chi-square random variable in the denominator of (B.42).

The pdf of the t distribution has a shape similar to that of the standard normal distribution, except that it is more spread out and therefore has more area in the tails. The expected value of a t distributed random variable is zero (strictly speaking, the expected value exists only for $n > 1$), and the variance is $n/(n - 2)$ for $n > 2$. (The variance does not exist for $n \leq 2$ because the distribution is so spread out.) The pdf of the t distribution is plotted in Figure B.10 for various degrees of freedom. As the degrees of freedom gets large, the t distribution approaches the standard normal distribution.

[Figure B.10: The t distribution with various degrees of freedom (df = 1, 2, 24).]

B.5f The F Distribution

Another important distribution for statistics and econometrics is the F distribution. In particular, the F distribution will be used for testing hypotheses in the context of multiple regression analysis.

To define an F random variable, let $X_1 \sim \chi^2_{k_1}$ and $X_2 \sim \chi^2_{k_2}$ and assume that $X_1$ and $X_2$ are independent. Then, the random variable

$F = \dfrac{X_1/k_1}{X_2/k_2}$   (B.43)

has an F distribution with $(k_1, k_2)$ degrees of freedom. We denote this as $F \sim F_{k_1, k_2}$. The pdf of the F distribution with different degrees of freedom is given in Figure B.11.

The order of the degrees of freedom in $F_{k_1, k_2}$ is critical. The integer $k_1$ is called the numerator degrees of freedom because it is associated with the chi-square variable in the numerator. Likewise, the integer $k_2$ is called the denominator degrees of freedom because it is associated with the chi-square variable in the denominator. This can be a little tricky because (B.43) can also be written as $(X_1 k_2)/(X_2 k_1)$, so that $k_1$ appears in the denominator. Just remember that the numerator df is the integer associated with the chi-square variable in the numerator of (B.43), and similarly for the denominator df.

[Figure B.11: The $F_{k_1, k_2}$ distribution for various degrees of freedom, $k_1$ and $k_2$ (df = (2, 8), (6, 8), (6, 20)).]
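Both constructions are easy to reproduce by simulation. This sketch (arbitrary df choices and seed) builds t and F draws from their normal and chi-square ingredients and compares a tail probability with SciPy's exact distributions:

```python
import numpy as np
from scipy.stats import t, f

rng = np.random.default_rng(8)
reps, n, k1, k2 = 500_000, 5, 3, 10

z = rng.normal(size=reps)
x = (rng.normal(size=(reps, n))**2).sum(axis=1)    # chi-square(n), independent of z
t_draws = z / np.sqrt(x / n)                       # equation (B.42)

x1 = (rng.normal(size=(reps, k1))**2).sum(axis=1)  # chi-square(k1)
x2 = (rng.normal(size=(reps, k2))**2).sum(axis=1)  # chi-square(k2)
f_draws = (x1 / k1) / (x2 / k2)                    # equation (B.43)

# Tail probabilities from the simulated draws match the exact distributions.
print((t_draws > 2.0).mean(), 1 - t.cdf(2.0, df=n))
print((f_draws > 2.0).mean(), 1 - f.cdf(2.0, k1, k2))
```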
Summary

In this appendix, we have reviewed the probability concepts that are needed in econometrics. Most of the concepts should be familiar from your introductory course in probability and statistics. Some of the more advanced topics, such as features of conditional expectations, do not need to be mastered now; there is time for that when these concepts arise in the context of regression analysis in Part 1.

In an introductory statistics course, the focus is on calculating means, variances, covariances, and so on for particular distributions. In Part 1, we will not need such calculations: we mostly rely on the properties of expectations, variances, and so on that have been stated in this appendix.

Key Terms

Bernoulli (or Binary) Random Variable; Binomial Distribution; Chi-Square Distribution; Conditional Distribution; Conditional Expectation; Continuous Random Variable; Correlation Coefficient; Covariance; Cumulative Distribution Function (cdf); Degrees of Freedom; Discrete Random Variable; Expected Value; Experiment; F Distribution; Independent Random Variables; Joint Distribution; Kurtosis; Law of Iterated Expectations; Median; Normal Distribution; Pairwise Uncorrelated Random Variables; Probability Density Function (pdf); Random Variable; Skewness; Standard Deviation; Standard Normal Distribution; Standardized Random Variable; Symmetric Distribution; t Distribution; Uncorrelated Random Variables; Variance

Problems

1. Suppose that a high school student is preparing to take the SAT exam. Explain why his or her eventual SAT score is properly viewed as a random variable.

2. Let X be a random variable distributed as Normal(5, 4). Find the probabilities of the following events:
(i) $P(X \leq 6)$.
(ii) $P(X > 4)$.
(iii) $P(|X - 5| > 1)$.

3. Much is made of the fact that certain mutual funds outperform the market year after year (that is, the return from holding shares in the mutual fund is higher than the return from holding a portfolio such as the S&P 500). For concreteness, consider a 10-year period and let the population be the 4,170 mutual funds reported in The Wall Street Journal on January 1, 1995. By saying that performance relative to the market is random, we mean that each fund has a 50-50 chance of outperforming the market in any year and that performance is independent from year to year.
(i) If performance relative to the market is truly random, what is the probability that any particular fund outperforms the market in all 10 years?
(ii) Of the 4,170 mutual funds, what is the expected number of funds that will outperform the market in all 10 years?
(iii) Find the probability that at least one fund out of 4,170 funds outperforms the market in all 10 years. What do you make of your answer?
(iv) If you have a statistical package that computes binomial probabilities, find the probability that at least five funds outperform the market in all 10 years.

4. For a randomly selected county in the United States, let X represent the proportion of adults over age 65 who are employed, or the elderly employment rate. Then, X is restricted to a value between zero and one. Suppose that the cumulative distribution function for X is given by $F(x) = 3x^2 - 2x^3$ for $0 \leq x \leq 1$. Find the probability that the elderly employment rate is at least .6 (60%).
5. Just prior to jury selection for O. J. Simpson's murder trial in 1995, a poll found that about 20% of the adult population believed Simpson was innocent (after much of the physical evidence in the case had been revealed to the public). Ignore the fact that this 20% is an estimate based on a subsample from the population; for illustration, take it as the true percentage of people who thought Simpson was innocent prior to jury selection. Assume that the 12 jurors were selected randomly and independently from the population (although this turned out not to be true).
(i) Find the probability that the jury had at least one member who believed in Simpson's innocence prior to jury selection. [Hint: Define the Binomial(12, .20) random variable X to be the number of jurors believing in Simpson's innocence.]
(ii) Find the probability that the jury had at least two members who believed in Simpson's innocence. [Hint: $P(X \geq 2) = 1 - P(X \leq 1)$, and $P(X \leq 1) = P(X = 0) + P(X = 1)$.]

6. (Requires calculus) Let X denote the prison sentence, in years, for people convicted of auto theft in a particular state in the United States. Suppose that the pdf of X is given by $f(x) = (1/9)x^2$, $0 < x < 3$. Use integration to find the expected prison sentence.

7. If a basketball player is a 74% free throw shooter, then, on average, how many free throws will he or she make in a game with eight free throw attempts?

8. Suppose that a college student is taking three courses: a two-credit course, a three-credit course, and a four-credit course. The expected grade in the two-credit course is 3.5, while the expected grade in the three- and four-credit courses is 3.0. What is the expected overall grade point average for the semester? (Remember that each course grade is weighted by its share of the total number of units.)

9. Let X denote the annual salary of university professors in the United States, measured in thousands of dollars. Suppose that the average salary is 52.3, with a standard deviation of 14.6. Find the mean and standard deviation when salary is measured in dollars.

10. Suppose that, at a large university, college grade point average, GPA, and SAT score, SAT, are related by the conditional expectation $E(GPA \mid SAT) = .70 + .002\,SAT$.
(i) Find the expected GPA when SAT = 800. Find $E(GPA \mid SAT = 1{,}400)$. Comment on the difference.
(ii) If the average SAT in the university is 1,100, what is the average GPA? (Hint: Use Property CE.4.)
(iii) If a student's SAT score is 1,100, does this mean he or she will have the GPA found in part (ii)? Explain.

11. (i) Let X be a random variable taking on the values $-1$ and 1, each with probability 1/2. Find $E(X)$ and $E(X^2)$.
(ii) Now, let X be a random variable taking on the values 1 and 2, each with probability 1/2. Find $E(X)$ and $E(1/X)$.
(iii) Conclude from parts (i) and (ii) that, in general, $E[g(X)] \neq g[E(X)]$ for a nonlinear function $g(\cdot)$.
(iv) Given the definition of the F random variable in equation (B.43), show that $E(F) = E[1/(X_2/k_2)]$. Can you conclude that $E(F) = 1$?

Appendix C: Fundamentals of Mathematical Statistics

C.1 Populations, Parameters, and Random Sampling

Statistical inference involves learning something about a population given the availability of a sample from that population. By population, we mean any well-defined group of subjects, which could
be individuals, firms, cities, or many other possibilities. By learning, we can mean several things, which are broadly divided into the categories of estimation and hypothesis testing.

A couple of examples may help you understand these terms. In the population of all working adults in the United States, labor economists are interested in learning about the return to education, as measured by the average percentage increase in earnings given another year of education. It would be impractical and costly to obtain information on earnings and education for the entire working population in the United States, but we can obtain data on a subset of the population. Using the data collected, a labor economist may report that his or her best estimate of the return to another year of education is 7.5%. This is an example of a point estimate. Or, she or he may report a range, such as "the return to education is between 5.6% and 9.4%." This is an example of an interval estimate.

An urban economist might want to know whether neighborhood crime watch programs are associated with lower crime rates. After comparing crime rates of neighborhoods with and without such programs in a sample from the population, he or she can draw one of two conclusions: neighborhood watch programs do affect crime, or they do not. This example falls under the rubric of hypothesis testing.

The first step in statistical inference is to identify the population of interest. This may seem obvious, but it is important to be very specific. Once we have identified the population, we can specify a model for the population relationship of interest. Such models involve probability distributions or features of probability distributions, and these depend on unknown parameters. Parameters are simply constants that determine the directions and strengths of relationships among variables. In the labor economics example just presented, the parameter of interest is the return to education in the population.

C.1a Sampling

For reviewing statistical inference, we focus on the simplest possible setting. Let Y be a random variable representing a population with a probability density function $f(y; \theta)$, which depends on the single parameter $\theta$. The probability density function (pdf) of Y is assumed to be known except for the value of $\theta$; different values of $\theta$ imply different population distributions, and therefore we are interested in the value of $\theta$. If we can obtain certain kinds of samples from the population, then we can learn something about $\theta$. The easiest sampling scheme to deal with is random sampling.

Random Sampling: If $Y_1, Y_2, \ldots, Y_n$ are independent random variables with a common probability density function $f(y; \theta)$, then $\{Y_1, \ldots, Y_n\}$ is said to be a random sample from $f(y; \theta)$ [or a random sample from the population represented by $f(y; \theta)$].

When $\{Y_1, \ldots, Y_n\}$ is a random sample from the density $f(y; \theta)$, we also say that the $Y_i$ are independent, identically distributed (or i.i.d.) random variables from $f(y; \theta)$. In some cases, we will not need to entirely specify what the common
distribution is.

The random nature of $Y_1, Y_2, \ldots, Y_n$ in the definition of random sampling reflects the fact that many different outcomes are possible before the sampling is actually carried out. For example, if family income is obtained for a sample of $n = 100$ families in the United States, the incomes we observe will usually differ for each different sample of 100 families. Once a sample is obtained, we have a set of numbers, say, $\{y_1, y_2, \ldots, y_n\}$, which constitute the data that we work with. Whether or not it is appropriate to assume the sample came from a random sampling scheme requires knowledge about the actual sampling process.

Random samples from a Bernoulli distribution are often used to illustrate statistical concepts, and they also arise in empirical applications. If $Y_1, Y_2, \ldots, Y_n$ are independent random variables and each is distributed as Bernoulli($\theta$), so that $P(Y_i = 1) = \theta$ and $P(Y_i = 0) = 1 - \theta$, then $\{Y_1, Y_2, \ldots, Y_n\}$ constitutes a random sample from the Bernoulli($\theta$) distribution. As an illustration, consider the airline reservation example carried along in Appendix B. Each $Y_i$ denotes whether customer $i$ shows up for his or her reservation; $Y_i = 1$ if passenger $i$ shows up, and $Y_i = 0$ otherwise. Here, $\theta$ is the probability that a randomly drawn person from the population of all people who make airline reservations shows up for his or her reservation.

For many other applications, random samples can be assumed to be drawn from a normal distribution. If $\{Y_1, \ldots, Y_n\}$ is a random sample from the $\mathrm{Normal}(\mu, \sigma^2)$ population, then the population is characterized by two parameters, the mean $\mu$ and the variance $\sigma^2$. Primary interest usually lies in $\mu$, but $\sigma^2$ is of interest in its own right because making inferences about $\mu$ often requires learning about $\sigma^2$.

C.2 Finite Sample Properties of Estimators

In this section, we study what are called finite sample properties of estimators. The term "finite sample" comes from the fact that the properties hold for a sample of any size, no matter how large or small. Sometimes, these are called small sample properties. In Section C.3, we cover asymptotic properties, which have to do with the behavior of estimators as the sample size grows without bound.

C.2a Estimators and Estimates

To study properties of estimators, we must define what we mean by an estimator. Given a random sample $\{Y_1, Y_2, \ldots, Y_n\}$ drawn from a population distribution that depends on an unknown parameter $\theta$, an estimator of $\theta$ is a rule that assigns each possible outcome of the sample a value of $\theta$. The rule is specified before any sampling is carried out; in particular, the rule is the same regardless of the data actually obtained.

As an example of an estimator, let $\{Y_1, \ldots, Y_n\}$ be a random sample from a population with mean $\mu$. A natural estimator of $\mu$ is the average of the random sample:

$\bar{Y} = n^{-1} \sum_{i=1}^n Y_i$.   (C.1)

$\bar{Y}$ is called the sample average but, unlike in Appendix A, where we defined the sample average of a set of numbers as a descriptive statistic, $\bar{Y}$ is now viewed as an estimator. Given any outcome of the random variables $Y_1, \ldots, Y_n$, we use the same rule to estimate $\mu$: we simply average them. For actual data outcomes $\{y_1, \ldots, y_n\}$, the estimate is just the average in the sample: $\bar{y} = (y_1 + y_2 + \cdots + y_n)/n$.
Example C.1 (City Unemployment Rates):

Suppose we obtain the following sample of unemployment rates for 10 cities in the United States:

City   Unemployment Rate
1      5.1
2      6.4
3      9.2
4      4.1
5      7.5
6      8.3
7      2.6
8      3.5
9      5.8
10     7.5

Our estimate of the average city unemployment rate in the United States is $\bar{y} = 6.0$. Each sample generally results in a different estimate. But the rule for obtaining the estimate is the same, regardless of which cities appear in the sample, or how many.

More generally, an estimator $W$ of a parameter $\theta$ can be expressed as an abstract mathematical formula:

$W = h(Y_1, Y_2, \ldots, Y_n)$,   (C.2)

for some known function $h$ of the random variables $Y_1, Y_2, \ldots, Y_n$. As with the special case of the sample average, $W$ is a random variable because it depends on the random sample: as we obtain different random samples from the population, the value of $W$ can change. When a particular set of numbers, say, $\{y_1, y_2, \ldots, y_n\}$, is plugged into the function $h$, we obtain an estimate of $\theta$, denoted $w = h(y_1, \ldots, y_n)$. Sometimes, $W$ is called a point estimator and $w$ a point estimate to distinguish these from interval estimators and estimates, which we will come to in Section C.5.

For evaluating estimation procedures, we study various properties of the probability distribution of the random variable $W$. The distribution of an estimator is often called its sampling distribution because this distribution describes the likelihood of various outcomes of $W$ across different random samples. Because there are unlimited rules for combining data to estimate parameters, we need some sensible criteria for choosing among estimators, or at least for eliminating some estimators from consideration. Therefore, we must leave the realm of descriptive statistics, where we compute things such as the sample average to simply summarize a body of data. In mathematical statistics, we study the sampling distributions of estimators.

C.2b Unbiasedness

In principle, the entire sampling distribution of $W$ can be obtained given the probability distribution of $Y_i$ and the function $h$. It is usually easier to focus on a few features of the distribution of $W$ in evaluating it as an estimator of $\theta$. The first important property of an estimator involves its expected value.

Unbiased Estimator: An estimator, $W$ of $\theta$, is an unbiased estimator if

$E(W) = \theta$   (C.3)

for all possible values of $\theta$.

If an estimator is unbiased, then its probability distribution has an expected value equal to the parameter it is supposed to be estimating. Unbiasedness does not mean that the estimate we get with any particular sample is equal to $\theta$, or even very close to $\theta$. Rather, if we could indefinitely draw random samples on Y from the population, compute an estimate each time, and then average these estimates over all random samples, we would obtain $\theta$. This thought experiment is abstract because, in most applications, we just have one random sample to work with.
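Because unbiasedness is a statement about repeated sampling, it is natural to check it by simulation. This sketch (arbitrary population parameters and seed) draws many random samples and averages the resulting estimates:

```python
import numpy as np

rng = np.random.default_rng(9)
mu, n, reps = 6.0, 10, 100_000

# Each row is one random sample of size n; each row mean is one estimate of mu.
samples = rng.normal(mu, 2.0, size=(reps, n))
ybar = samples.mean(axis=1)

print(ybar.mean())  # close to mu = 6.0: the estimates average out to mu
```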
For an estimator that is not unbiased, we define its bias as follows.

Bias of an Estimator: If $W$ is a biased estimator of $\theta$, its bias is defined as

$\mathrm{Bias}(W) \equiv E(W) - \theta$.   (C.4)

Figure C.1 shows two estimators; the first one is unbiased, and the second one has a positive bias.

[Figure C.1: An unbiased estimator, W1, and an estimator with positive bias, W2.]

The unbiasedness of an estimator and the size of any possible bias depend on the distribution of Y and on the function h. The distribution of Y is usually beyond our control (although we often choose a model for this distribution): it may be determined by nature or social forces. But the choice of the rule h is ours, and if we want an unbiased estimator, then we must choose h accordingly.

Some estimators can be shown to be unbiased quite generally. We now show that the sample average $\bar{Y}$ is an unbiased estimator of the population mean $\mu$, regardless of the underlying population distribution. We use the properties of expected values (E.1 and E.2) that we covered in Section B.3:

$E(\bar{Y}) = E\left((1/n)\sum_{i=1}^n Y_i\right) = (1/n)E\left(\sum_{i=1}^n Y_i\right) = (1/n)\left(\sum_{i=1}^n E(Y_i)\right) = (1/n)\left(\sum_{i=1}^n \mu\right) = (1/n)(n\mu) = \mu$.

For hypothesis testing, we will need to estimate the variance $\sigma^2$ from a population with mean $\mu$. Letting $\{Y_1, \ldots, Y_n\}$ denote the random sample from the population with $E(Y) = \mu$ and $\mathrm{Var}(Y) = \sigma^2$, define the estimator as

$S^2 = \dfrac{1}{n - 1} \sum_{i=1}^n (Y_i - \bar{Y})^2$,   (C.5)

which is usually called the sample variance. It can be shown that $S^2$ is unbiased for $\sigma^2$: $E(S^2) = \sigma^2$. The division by $n - 1$, rather than $n$, accounts for the fact that the mean $\mu$ is estimated rather than known. If $\mu$ were known, an unbiased estimator of $\sigma^2$ would be $n^{-1}\sum_{i=1}^n (Y_i - \mu)^2$, but $\mu$ is rarely known in practice.

Although unbiasedness has a certain appeal as a property for an estimator (indeed, its antonym, "biased," has decidedly negative connotations), it is not without its problems. One weakness of unbiasedness is that some reasonable, and even some very good, estimators are not unbiased. We will see an example shortly.

Another important weakness of unbiasedness is that unbiased estimators exist that are actually quite poor estimators. Consider estimating the mean $\mu$ from a population. Rather than using the sample average $\bar{Y}$ to estimate $\mu$, suppose that, after collecting a sample of size $n$, we discard all of the observations except the first. That is, our estimator of $\mu$ is simply $W \equiv Y_1$. This estimator is unbiased because $E(Y_1) = \mu$. Hopefully, you sense that ignoring all but the first observation is not a prudent approach to estimation: it throws out most of the information in the sample. For example, with $n = 100$, we obtain 100 outcomes of the random variable Y, but then we use only the first of these to estimate $E(Y)$.

C.2d The Sampling Variance of Estimators

The example at the end of the previous subsection shows that we need additional criteria to evaluate estimators. Unbiasedness only ensures that the sampling distribution of an estimator has a mean value equal to the parameter it is supposed to be estimating. This is fine, but we also need to know how spread out the distribution of an estimator is. An estimator can be equal to $\theta$, on average, but it can also be very far away with large probability. In Figure C.2, $W_1$ and $W_2$ are both unbiased estimators of $\theta$. But the
distribution of $W_1$ is more tightly centered about $\theta$: the probability that $W_1$ is greater than any given distance from $\theta$ is less than the probability that $W_2$ is greater than that same distance from $\theta$. Using $W_1$ as our estimator means that it is less likely that we will obtain a random sample that yields an estimate very far from $\theta$.

To summarize the situation shown in Figure C.2, we rely on the variance (or standard deviation) of an estimator. Recall that this gives a single measure of the dispersion in the distribution. The variance of an estimator is often called its sampling variance because it is the variance associated with a sampling distribution. Remember, the sampling variance is not a random variable; it is a constant, but it might be unknown.

We now obtain the variance of the sample average for estimating the mean $\mu$ from a population:

$\mathrm{Var}(\bar{Y}) = \mathrm{Var}\left((1/n)\sum_{i=1}^n Y_i\right) = (1/n^2)\mathrm{Var}\left(\sum_{i=1}^n Y_i\right) = (1/n^2)\left(\sum_{i=1}^n \mathrm{Var}(Y_i)\right) = (1/n^2)\left(\sum_{i=1}^n \sigma^2\right) = (1/n^2)(n\sigma^2) = \sigma^2/n$.   (C.6)

Notice how we used the properties of variance from Sections B.3 and B.4 (VAR.2 and VAR.4), as well as the independence of the $Y_i$. To summarize: If $\{Y_i : i = 1, 2, \ldots, n\}$ is a random sample from a population with mean $\mu$ and variance $\sigma^2$, then $\bar{Y}$ has the same mean as the population, but its sampling variance equals the population variance, $\sigma^2$, divided by the sample size.

An important implication of $\mathrm{Var}(\bar{Y}) = \sigma^2/n$ is that it can be made very close to zero by increasing the sample size $n$. This is a key feature of a reasonable estimator, and we return to it in Section C.3.

As suggested by Figure C.2, among unbiased estimators, we prefer the estimator with the smallest variance. This allows us to eliminate certain estimators from consideration. For a random sample from a population with mean $\mu$ and variance $\sigma^2$, we know that $\bar{Y}$ is unbiased and $\mathrm{Var}(\bar{Y}) = \sigma^2/n$. What about the estimator $Y_1$, which is just the first observation drawn? Because $Y_1$ is a random draw from the population, $\mathrm{Var}(Y_1) = \sigma^2$. Thus, the difference between $\mathrm{Var}(Y_1)$ and $\mathrm{Var}(\bar{Y})$ can be large even for small sample sizes. If $n = 10$, then $\mathrm{Var}(Y_1)$ is 10 times as large as $\mathrm{Var}(\bar{Y}) = \sigma^2/10$. This gives us a formal way of excluding $Y_1$ as an estimator of $\mu$.

To emphasize this point, Table C.1 contains the outcome of a small simulation study. Using the statistical package Stata, 20 random samples of size 10 were generated from a normal distribution with $\mu = 2$ and $\sigma^2 = 1$; we are interested in estimating $\mu$ here. For each of the 20 random samples, we compute two estimates, $y_1$ and $\bar{y}$; these values are listed in Table C.1. As can be seen from the table, the values for $y_1$ are much more spread out than those for $\bar{y}$: $y_1$ ranges from $-0.64$ to $4.27$, while $\bar{y}$ ranges only from 1.16 to 2.58. Further, in 16 out of 20 cases, $\bar{y}$ is closer than $y_1$ to $\mu = 2$. The average of $y_1$ across the simulations is about 1.89, while that for $\bar{y}$ is 1.96. The fact that these averages are close to 2 illustrates the unbiasedness of both estimators (and we could get these averages closer to 2 by doing more than 20 replications). But comparing just the average outcomes across random draws masks the fact that the sample
average $\bar{Y}$ is far superior to $Y_1$ as an estimator of $\mu$.

C.2e Efficiency

Comparing the variances of $\bar{Y}$ and $Y_1$ in the previous subsection is an example of a general approach to comparing different unbiased estimators.

Relative Efficiency: If $W_1$ and $W_2$ are two unbiased estimators of $\theta$, $W_1$ is efficient relative to $W_2$ when $\mathrm{Var}(W_1) \leq \mathrm{Var}(W_2)$ for all $\theta$, with strict inequality for at least one value of $\theta$.

[Figure C.2: The sampling distributions of two unbiased estimators of $\theta$.]

Table C.1: Simulation of Estimators for a Normal($\mu$, 1) Distribution with $\mu = 2$

Replication   y1      ybar
1            -0.64   1.98
2             1.06   1.43
3             4.27   1.65
4             1.03   1.88
5             3.16   2.34
6             2.77   2.58
7             1.68   1.58
8             2.98   2.23
9             2.25   1.96
10            2.04   2.11
11            0.95   2.15
12            1.36   1.93
13            2.62   2.02
14            2.97   2.10
15            1.93   2.18
16            1.14   2.10
17            2.08   1.94
18            1.52   2.21
19            1.33   1.16
20            1.21   1.75

Earlier, we showed that, for estimating the population mean $\mu$, $\mathrm{Var}(\bar{Y}) < \mathrm{Var}(Y_1)$ for any value of $\sigma^2$ whenever $n > 1$. Thus, $\bar{Y}$ is efficient relative to $Y_1$ for estimating $\mu$. We cannot always choose between unbiased estimators based on the smallest variance criterion: given two unbiased estimators of $\theta$, one can have smaller variance for some values of $\theta$, while the other can have smaller variance for other values of $\theta$.

If we restrict our attention to a certain class of estimators, we can show that the sample average has the smallest variance. Problem C.2 asks you to show that $\bar{Y}$ has the smallest variance among all unbiased estimators that are also linear functions of $Y_1, Y_2, \ldots, Y_n$. The assumptions are that the $Y_i$ have common mean and variance, and that they are pairwise uncorrelated.

If we do not restrict our attention to unbiased estimators, then comparing variances is meaningless. For example, when estimating the population mean $\mu$, we can use a trivial estimator that is equal to zero, regardless of the sample that we draw. Naturally, the variance of this estimator is zero (since it is the same value for every random sample). But the bias of this estimator is $-\mu$, so it is a very poor estimator when $|\mu|$ is large.

One way to compare estimators that are not necessarily unbiased is to compute the mean squared error (MSE) of the estimators. If $W$ is an estimator of $\theta$, then the MSE of $W$ is defined as $\mathrm{MSE}(W) = E[(W - \theta)^2]$. The MSE measures how far, on average, the estimator is away from $\theta$. It can be shown that $\mathrm{MSE}(W) = \mathrm{Var}(W) + [\mathrm{Bias}(W)]^2$, so that $\mathrm{MSE}(W)$ depends on the variance and bias (if any is present). This allows us to compare two estimators when one or both are biased.
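The Table C.1 experiment is easy to replicate in a few lines; this sketch (Python rather than Stata, with an arbitrary seed and many more replications) also estimates the two sampling variances, which should be near $\sigma^2 = 1$ and $\sigma^2/10 = .1$:

```python
import numpy as np

rng = np.random.default_rng(10)
mu, n, reps = 2.0, 10, 20_000

samples = rng.normal(mu, 1.0, size=(reps, n))
y1 = samples[:, 0]           # estimator: first observation only
ybar = samples.mean(axis=1)  # estimator: sample average

print(y1.mean(), ybar.mean())  # both near 2: both estimators are unbiased
print(y1.var(), ybar.var())    # near 1 and 0.1: Var(Y1) = 10 * Var(Ybar)
```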
C.3 Asymptotic or Large Sample Properties of Estimators

In Section C.2, we encountered the estimator $Y_1$ for the population mean μ, and we saw that, even though it is unbiased, it is a poor estimator because its variance can be much larger than that of the sample mean. One notable feature of $Y_1$ is that it has the same variance for any sample size. It seems reasonable to require any estimation procedure to improve as the sample size increases. For estimating a population mean μ, $\bar Y$ improves in the sense that its variance gets smaller as n gets larger; $Y_1$ does not improve in this sense.

We can rule out certain silly estimators by studying the asymptotic or large sample properties of estimators. In addition, we can say something positive about estimators that are not unbiased and whose variances are not easily found. Asymptotic analysis involves approximating the features of the sampling distribution of an estimator. These approximations depend on the size of the sample. Unfortunately, we are necessarily limited in what we can say about how large a sample size is needed for asymptotic analysis to be appropriate; this depends on the underlying population distribution. But large sample approximations have been known to work well for sample sizes as small as n = 20.

C.3a Consistency

The first asymptotic property of estimators concerns how far the estimator is likely to be from the parameter it is supposed to be estimating as we let the sample size increase indefinitely.

Consistency. Let $W_n$ be an estimator of θ based on a sample $Y_1, Y_2, \ldots, Y_n$ of size n. Then, $W_n$ is a consistent estimator of θ if, for every ε > 0,

$$P(|W_n - \theta| > \varepsilon) \to 0 \quad \text{as } n \to \infty. \tag{C.7}$$

If $W_n$ is not consistent for θ, then we say it is inconsistent. When $W_n$ is consistent, we also say that θ is the probability limit of $W_n$, written as $\mathrm{plim}(W_n) = \theta$.

Unlike unbiasedness, which is a feature of an estimator for a given sample size, consistency involves the behavior of the sampling distribution of the estimator as the sample size n gets large. To emphasize this, we have indexed the estimator by the sample size in stating this definition, and we will continue with this convention throughout this section.

Equation (C.7) looks technical, and it can be rather difficult to establish based on fundamental probability principles. By contrast, interpreting (C.7) is straightforward. It means that the distribution of $W_n$ becomes more and more concentrated about θ, which roughly means that, for larger sample sizes, $W_n$ is less and less likely to be very far from θ. This tendency is illustrated in Figure C.3.

If an estimator is not consistent, then it does not help us to learn about θ, even with an unlimited amount of data. For this reason, consistency is a minimal requirement of an estimator used in statistics or econometrics. We will encounter estimators that are consistent under certain assumptions and inconsistent when those assumptions fail. When estimators are inconsistent, we can usually find their probability limits, and it will be important to know how far these probability limits are from θ.

As we noted earlier, unbiased estimators are not necessarily consistent, but those whose variances shrink to zero as the sample size grows are consistent. This can be stated formally: if $W_n$ is an unbiased estimator of θ and $\mathrm{Var}(W_n) \to 0$ as $n \to \infty$, then $\mathrm{plim}(W_n) = \theta$. Unbiased estimators that use the entire data sample will usually have a variance that shrinks to zero as the sample size grows, thereby being consistent.

A good example of a consistent estimator is the average of a random sample drawn from a population with mean μ and variance σ². We have already shown that the sample average is unbiased for μ.
In equation (C.6), we derived $\mathrm{Var}(\bar Y_n) = \sigma^2/n$ for any sample size n. Therefore, $\mathrm{Var}(\bar Y_n) \to 0$ as $n \to \infty$, so $\bar Y_n$ is a consistent estimator of μ (in addition to being unbiased). The conclusion that $\bar Y_n$ is consistent for μ holds even if $\mathrm{Var}(\bar Y_n)$ does not exist. This classic result is known as the law of large numbers (LLN).

Law of Large Numbers. Let $Y_1, Y_2, \ldots, Y_n$ be independent, identically distributed random variables with mean μ. Then,

$$\mathrm{plim}(\bar Y_n) = \mu. \tag{C.8}$$

The law of large numbers means that, if we are interested in estimating the population average μ, we can get arbitrarily close to μ by choosing a sufficiently large sample. This fundamental result can be combined with basic properties of plims to show that fairly complicated estimators are consistent.

Property PLIM.1. Let θ be a parameter and define a new parameter, γ = g(θ), for some continuous function g(θ). Suppose that $\mathrm{plim}(W_n) = \theta$. Define an estimator of γ by $G_n = g(W_n)$. Then,

$$\mathrm{plim}(G_n) = \gamma. \tag{C.9}$$

This is often stated as

$$\mathrm{plim}\; g(W_n) = g(\mathrm{plim}\; W_n) \tag{C.10}$$

for a continuous function g(θ).

The assumption that g(θ) is continuous is a technical requirement that has often been described nontechnically as "a function that can be graphed without lifting your pencil from the paper." Because all the functions we encounter in this text are continuous, we do not provide a formal definition of a continuous function. Examples of continuous functions are $g(\theta) = a + b\theta$ for constants a and b, $g(\theta) = \theta^2$, $g(\theta) = 1/\theta$, $g(\theta) = \sqrt{\theta}$, $g(\theta) = \exp(\theta)$, and many variants on these. We will not need to mention the continuity assumption again.

[Figure C.3: The sampling distributions of a consistent estimator for three sample sizes (n = 4, n = 16, and n = 40); each is centered near θ, tightening as n grows.]

As an important example of a consistent but biased estimator, consider estimating the standard deviation, σ, from a population with mean μ and variance σ². We already claimed that the sample variance $S_n^2 = (n-1)^{-1}\sum_{i=1}^n (Y_i - \bar Y_n)^2$ is unbiased for σ². Using the law of large numbers and some algebra, $S_n^2$ can also be shown to be consistent for σ². The natural estimator of $\sigma = \sqrt{\sigma^2}$ is $S_n = \sqrt{S_n^2}$ (where the square root is always the positive square root). $S_n$, which is called the sample standard deviation, is not an unbiased estimator because the expected value of the square root is not the square root of the expected value (see Section B.3). Nevertheless, by PLIM.1, $\mathrm{plim}\; S_n = \sqrt{\mathrm{plim}\; S_n^2} = \sqrt{\sigma^2} = \sigma$, so $S_n$ is a consistent estimator of σ.

Here are some other useful properties of the probability limit:

Property PLIM.2. If $\mathrm{plim}(T_n) = \alpha$ and $\mathrm{plim}(U_n) = \beta$, then
(i) $\mathrm{plim}(T_n + U_n) = \alpha + \beta$;
(ii) $\mathrm{plim}(T_n U_n) = \alpha\beta$;
(iii) $\mathrm{plim}(T_n/U_n) = \alpha/\beta$, provided $\beta \ne 0$.

These three facts about probability limits allow us to combine consistent estimators in a variety of ways to get other consistent estimators.
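Consistency, and the consistency of $S_n$ via PLIM.1, can likewise be illustrated numerically. A minimal NumPy sketch (the population, tolerance ε, and seed are arbitrary choices, not from the text):

```python
import numpy as np

# Consistency by simulation: for several sample sizes, estimate
# P(|Ybar_n - mu| > eps) and track the sample standard deviation S_n.
rng = np.random.default_rng(2)
mu, sigma, eps, reps = 2.0, 1.0, 0.1, 10_000

for n in (10, 100, 1000):
    samples = rng.normal(mu, sigma, size=(reps, n))
    ybar = samples.mean(axis=1)
    s = samples.std(axis=1, ddof=1)  # S_n: biased for sigma but consistent
    tail = np.mean(np.abs(ybar - mu) > eps)
    print(f"n={n:5d}  P(|ybar - mu| > {eps}) ~ {tail:.3f}   avg S_n ~ {s.mean():.4f}")
# The tail probability heads toward 0 (consistency of Ybar), and S_n
# concentrates around sigma = 1, illustrating plim S_n = sigma via PLIM.1.
```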
For example, let $\{Y_1, \ldots, Y_n\}$ be a random sample of size n on annual earnings from the population of workers with a high school education, and denote the population mean by $\mu_Y$. Let $\{Z_1, \ldots, Z_n\}$ be a random sample on annual earnings from the population of workers with a college education, and denote the population mean by $\mu_Z$. We wish to estimate the percentage difference in annual earnings between the two groups, which is $\gamma = 100 \cdot (\mu_Z - \mu_Y)/\mu_Y$. (This is the percentage by which average earnings for college graduates differs from average earnings for high school graduates.) Because $\bar Y_n$ is consistent for $\mu_Y$ and $\bar Z_n$ is consistent for $\mu_Z$, it follows from PLIM.1 and part (iii) of PLIM.2 that $G_n \equiv 100 \cdot (\bar Z_n - \bar Y_n)/\bar Y_n$ is a consistent estimator of γ. $G_n$ is just the percentage difference between $\bar Z_n$ and $\bar Y_n$ in the sample, so it is a natural estimator. $G_n$ is not an unbiased estimator of γ, but it is still a good estimator except possibly when n is small.

C.3b Asymptotic Normality

Consistency is a property of point estimators. Although it does tell us that the distribution of the estimator is collapsing around the parameter as the sample size gets large, it tells us essentially nothing about the shape of that distribution for a given sample size. For constructing interval estimators and testing hypotheses, we need a way to approximate the distribution of our estimators. Most econometric estimators have distributions that are well approximated by a normal distribution for large samples, which motivates the following definition.

Asymptotic Normality. Let $\{Z_n: n = 1, 2, \ldots\}$ be a sequence of random variables, such that for all numbers z,

$$P(Z_n \le z) \to \Phi(z) \quad \text{as } n \to \infty, \tag{C.11}$$

where $\Phi(z)$ is the standard normal cumulative distribution function. Then, $Z_n$ is said to have an asymptotic standard normal distribution. In this case, we often write $Z_n \overset{a}{\sim} \mathrm{Normal}(0, 1)$. (The "a" above the tilde stands for "asymptotically" or "approximately.")

Property (C.11) means that the cumulative distribution function for $Z_n$ gets closer and closer to the cdf of the standard normal distribution as the sample size n gets large. When asymptotic normality holds, for large n we have the approximation $P(Z_n \le z) \approx \Phi(z)$. Thus, probabilities concerning $Z_n$ can be approximated by standard normal probabilities.
The central limit theorem (CLT) is one of the most powerful results in probability and statistics. It states that the average from a random sample for any population (with finite variance), when standardized, has an asymptotic standard normal distribution.

Central Limit Theorem. Let $\{Y_1, Y_2, \ldots, Y_n\}$ be a random sample with mean μ and variance σ². Then,

$$Z_n = \frac{\bar Y_n - \mu}{\sigma/\sqrt{n}} \tag{C.12}$$

has an asymptotic standard normal distribution.

The variable $Z_n$ in (C.12) is the standardized version of $\bar Y_n$: we have subtracted off $E(\bar Y_n) = \mu$ and divided by $\mathrm{sd}(\bar Y_n) = \sigma/\sqrt{n}$. Thus, regardless of the population distribution of Y, $Z_n$ has mean zero and variance one, which coincides with the mean and variance of the standard normal distribution. Remarkably, the entire distribution of $Z_n$ gets arbitrarily close to the standard normal distribution as n gets large.

We can write the standardized variable in equation (C.12) as $\sqrt{n}(\bar Y_n - \mu)/\sigma$, which shows that we must multiply the difference between the sample mean and the population mean by the square root of the sample size in order to obtain a useful limiting distribution. Without the multiplication by $\sqrt{n}$, we would just have $(\bar Y_n - \mu)/\sigma$, which converges in probability to zero. In other words, the distribution of $(\bar Y_n - \mu)/\sigma$ simply collapses to a single point as $n \to \infty$, which we know cannot be a good approximation to the distribution of $Z_n$ for reasonable sample sizes. Multiplying by $\sqrt{n}$ ensures that the variance of $Z_n$ remains constant. Practically, we often treat $\bar Y_n$ as being approximately normally distributed with mean μ and variance σ²/n, and this gives us the correct statistical procedures because it leads to the standardized variable in equation (C.12).

Most estimators encountered in statistics and econometrics can be written as functions of sample averages, in which case we can apply the law of large numbers and the central limit theorem. When two consistent estimators have asymptotic normal distributions, we choose the estimator with the smallest asymptotic variance.

In addition to the standardized sample average in (C.12), many other statistics that depend on sample averages turn out to be asymptotically normal. An important one is obtained by replacing σ with its consistent estimator, $S_n$, in equation (C.12):

$$\frac{\bar Y_n - \mu}{S_n/\sqrt{n}} \tag{C.13}$$

also has an approximate standard normal distribution for large n. The exact (finite sample) distributions of (C.12) and (C.13) are definitely not the same, but the difference is often small enough to be ignored for large n.

Throughout this section, each estimator has been subscripted by n to emphasize the nature of asymptotic or large sample analysis. Continuing this convention clutters the notation without providing additional insight, once the fundamentals of asymptotic analysis are understood. Henceforth, we drop the n subscript and rely on you to remember that estimators depend on the sample size, and that properties such as consistency and asymptotic normality refer to the growth of the sample size without bound.
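The CLT can be seen at work even for a markedly nonnormal population. The sketch below, an illustration assuming an exponential population (mean 1, variance 1, heavily skewed), checks how quickly a tail probability of the standardized mean in (C.12) approaches its standard normal value:

```python
import numpy as np

# CLT by simulation: standardize sample means from an exponential population
# and compare a tail frequency with the standard normal value it should approach.
rng = np.random.default_rng(3)
mu, sigma, reps = 1.0, 1.0, 20_000

for n in (2, 10, 50, 500):
    z = (rng.exponential(mu, size=(reps, n)).mean(axis=1) - mu) / (sigma / np.sqrt(n))
    # P(Z <= 1.96) should approach Phi(1.96) = .975 as n grows
    print(f"n={n:4d}  P(Z <= 1.96) ~ {np.mean(z <= 1.96):.4f}")
```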
C.4 General Approaches to Parameter Estimation

Until this point, we have used the sample average to illustrate the finite and large sample properties of estimators. It is natural to ask: Are there general approaches to estimation that produce estimators with good properties, such as unbiasedness, consistency, and efficiency? The answer is yes. A detailed treatment of various approaches to estimation is beyond the scope of this text; here, we provide only an informal discussion. A thorough discussion is given in Larsen and Marx (1986, Chapter 5).

C.4a Method of Moments

Given a parameter θ appearing in a population distribution, there are usually many ways to obtain unbiased and consistent estimators of θ. Trying all different possibilities and comparing them on the basis of the criteria in Sections C.2 and C.3 is not practical. Fortunately, some methods have been shown to have good general properties, and, for the most part, the logic behind them is intuitively appealing.

In the previous sections, we have studied the sample average as an unbiased estimator of the population average and the sample variance as an unbiased estimator of the population variance. These estimators are examples of method of moments estimators. Generally, method of moments estimation proceeds as follows. The parameter θ is shown to be related to some expected value in the distribution of Y, usually E(Y) or E(Y²) (although more exotic choices are sometimes used). Suppose, for example, that the parameter of interest, θ, is related to the population mean as θ = g(μ) for some function g. Because the sample average $\bar Y$ is an unbiased and consistent estimator of μ, it is natural to replace μ with $\bar Y$, which gives us the estimator $g(\bar Y)$ of θ. The estimator $g(\bar Y)$ is consistent for θ, and, if g(μ) is a linear function of μ, then $g(\bar Y)$ is unbiased as well. What we have done is replace the population moment, μ, with its sample counterpart, $\bar Y$. This is where the name "method of moments" comes from.

We cover two additional method of moments estimators that will be useful for our discussion of regression analysis. Recall that the covariance between two random variables X and Y is defined as $\sigma_{XY} = E[(X - \mu_X)(Y - \mu_Y)]$. The method of moments suggests estimating $\sigma_{XY}$ by $n^{-1}\sum_{i=1}^n (X_i - \bar X)(Y_i - \bar Y)$. This is a consistent estimator of $\sigma_{XY}$, but it turns out to be biased, for essentially the same reason that the sample variance is biased if n, rather than n − 1, is used as the divisor. The sample covariance is defined as

$$S_{XY} = \frac{1}{n-1}\sum_{i=1}^n (X_i - \bar X)(Y_i - \bar Y). \tag{C.14}$$

It can be shown that this is an unbiased estimator of $\sigma_{XY}$. (Replacing n with n − 1 makes no difference as the sample size grows indefinitely, so this estimator is still consistent.)

As we discussed in Section B.4, the covariance between two variables is often difficult to interpret; usually, we are more interested in correlation. Because the population correlation is $\rho_{XY} = \sigma_{XY}/(\sigma_X \sigma_Y)$, the method of moments suggests estimating $\rho_{XY}$ as

$$R_{XY} = \frac{S_{XY}}{S_X S_Y} = \frac{\sum_{i=1}^n (X_i - \bar X)(Y_i - \bar Y)}{\Big(\sum_{i=1}^n (X_i - \bar X)^2\Big)^{1/2}\Big(\sum_{i=1}^n (Y_i - \bar Y)^2\Big)^{1/2}}, \tag{C.15}$$

which is called the sample correlation coefficient (or sample correlation, for short). Notice that we have canceled the division by n − 1 in the sample covariance and the sample standard deviations. In fact, we could divide each of these by n, and we would arrive at the same final formula.

It can be shown that the sample correlation coefficient is always in the interval [−1, 1], as it should be. Because $S_{XY}$, $S_X$, and $S_Y$ are consistent for the corresponding population parameters, $R_{XY}$ is a consistent estimator of the population correlation, $\rho_{XY}$. However, $R_{XY}$ is a biased estimator for two reasons. First, $S_X$ and $S_Y$ are biased estimators of $\sigma_X$ and $\sigma_Y$, respectively. Second, $R_{XY}$ is a ratio of estimators, so it would not be unbiased even if $S_X$ and $S_Y$ were. For our purposes, this is not important, although the fact that no unbiased estimator of $\rho_{XY}$ exists is a classical result in mathematical statistics.
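In practice, (C.14) and (C.15) are one-liners. A sketch with made-up data (the `ddof=1` argument gives the n − 1 divisor of the sample covariance):

```python
import numpy as np

# Sample covariance (C.14) and sample correlation (C.15) for illustrative data.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.2, 1.9, 3.4, 3.9, 5.1])
n = len(x)

s_xy = np.sum((x - x.mean()) * (y - y.mean())) / (n - 1)   # equation (C.14)
r_xy = s_xy / (x.std(ddof=1) * y.std(ddof=1))              # equation (C.15)

print("sample covariance :", s_xy)
print("sample correlation:", r_xy)
# Cross-checks against the built-in library routines:
assert np.isclose(s_xy, np.cov(x, y)[0, 1])
assert np.isclose(r_xy, np.corrcoef(x, y)[0, 1])
```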
C.4b Maximum Likelihood

Another general approach to estimation is the method of maximum likelihood, a topic covered in many introductory statistics courses. A brief summary in the simplest case will suffice here. Let $\{Y_1, Y_2, \ldots, Y_n\}$ be a random sample from the population distribution $f(y; \theta)$. Because of the random sampling assumption, the joint distribution of $\{Y_1, Y_2, \ldots, Y_n\}$ is simply the product of the densities: $f(y_1; \theta) f(y_2; \theta) \cdots f(y_n; \theta)$. In the discrete case, this is $P(Y_1 = y_1, Y_2 = y_2, \ldots, Y_n = y_n)$. Now, define the likelihood function as

$$L(\theta; Y_1, \ldots, Y_n) = f(Y_1; \theta) f(Y_2; \theta) \cdots f(Y_n; \theta),$$

which is a random variable because it depends on the outcome of the random sample $\{Y_1, Y_2, \ldots, Y_n\}$. The maximum likelihood estimator of θ, call it W, is the value of θ that maximizes the likelihood function. (This is why we write L as a function of θ, followed by the random sample.) Clearly, this value depends on the random sample. The maximum likelihood principle says that, out of all the possible values for θ, the value that makes the likelihood of the observed data largest should be chosen. Intuitively, this is a reasonable approach to estimating θ.

Usually, it is more convenient to work with the log-likelihood function, which is obtained by taking the natural log of the likelihood function:

$$\log[L(\theta; Y_1, \ldots, Y_n)] = \sum_{i=1}^n \log[f(Y_i; \theta)], \tag{C.16}$$

where we use the fact that the log of the product is the sum of the logs. Because (C.16) is the sum of independent, identically distributed random variables, analyzing estimators that come from (C.16) is relatively easy.

Maximum likelihood estimation (MLE) is usually consistent and sometimes unbiased. But so are many other estimators. The widespread appeal of MLE is that it is generally the most asymptotically efficient estimator when the population model $f(y; \theta)$ is correctly specified. In addition, the MLE is sometimes the minimum variance unbiased estimator; that is, it has the smallest variance among all unbiased estimators of θ. [See Larsen and Marx (1986, Chapter 5) for verification of these claims.]

In Chapter 17, we will need maximum likelihood to estimate the parameters of more advanced econometric models. In econometrics, we are almost always interested in the distribution of Y conditional on a set of explanatory variables, say, $X_1, X_2, \ldots, X_k$. Then, we replace the density in (C.16) with $f(Y_i \mid X_{i1}, \ldots, X_{ik}; \theta_1, \ldots, \theta_p)$, where this density is allowed to depend on p parameters, $\theta_1, \ldots, \theta_p$. Fortunately, for successful application of maximum likelihood methods, we do not need to delve much into the computational issues or the large-sample statistical theory. Wooldridge (2010, Chapter 13) covers the theory of MLE.

C.4c Least Squares

A third kind of estimator, and one that plays a major role throughout the text, is called a least squares estimator. We have already seen an example of least squares: the sample mean, $\bar Y$, is a least squares estimator of the population mean, μ. We already know $\bar Y$ is a method of moments estimator. What makes it a least squares estimator? It can be shown that the value of m that makes the sum of squared deviations

$$\sum_{i=1}^n (Y_i - m)^2$$

as small as possible is $m = \bar Y$. Showing this is not difficult, but we omit the algebra.

For some important distributions, including the normal and the Bernoulli, the sample average $\bar Y$ is also the maximum likelihood estimator of the population mean μ. Thus, the principles of least squares, method of moments, and maximum likelihood often result in the same estimator. In other cases, the estimators are similar but not identical.
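The least squares claim is easy to verify numerically: over a grid of candidate values m, the sum of squared deviations bottoms out at the sample average. A minimal sketch with made-up data:

```python
import numpy as np

# Check that sum((y_i - m)^2) is minimized at m = ybar.
y = np.array([2.1, 0.4, 3.3, 1.8, 2.9, 1.1])
grid = np.linspace(0, 4, 4001)  # candidate values of m

ssd = ((y[:, None] - grid[None, :]) ** 2).sum(axis=0)  # objective at each m
m_star = grid[ssd.argmin()]

print("grid minimizer m* =", m_star)
print("sample average    =", y.mean())  # the two agree (up to grid spacing)
```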
C.5 Interval Estimation and Confidence Intervals

C.5a The Nature of Interval Estimation

A point estimate obtained from a particular sample does not, by itself, provide enough information for testing economic theories or for informing policy discussions. A point estimate may be the researcher's best guess at the population value, but, by its nature, it provides no information about how close the estimate is "likely" to be to the population parameter. As an example, suppose a researcher reports, on the basis of a random sample of workers, that job training grants increase hourly wage by 6.4%. How are we to know whether or not this is close to the effect in the population of workers who could have been trained? Because we do not know the population value, we cannot know how close an estimate is for a particular sample. However, we can make statements involving probabilities, and this is where interval estimation comes in.

We already know one way of assessing the uncertainty in an estimator: find its sampling standard deviation. Reporting the standard deviation of the estimator, along with the point estimate, provides some information on the accuracy of our estimate. However, even if the problem of the standard deviation's dependence on unknown population parameters is ignored, reporting the standard deviation along with the point estimate makes no direct statement about where the population value is likely to lie in relation to the estimate. This limitation is overcome by constructing a confidence interval.

We illustrate the concept of a confidence interval with an example. Suppose the population has a Normal(μ, 1) distribution and let $\{Y_1, \ldots, Y_n\}$ be a random sample from this population. (We assume that the variance of the population is known and equal to unity for the sake of illustration; we then show what to do in the more realistic case that the variance is unknown.) The sample average, $\bar Y$, has a normal distribution with mean μ and variance 1/n: $\bar Y \sim \mathrm{Normal}(\mu, 1/n)$. From this, we can standardize $\bar Y$, and, because the standardized version of $\bar Y$ has a standard normal distribution, we have

$$P\Big(-1.96 < \frac{\bar Y - \mu}{1/\sqrt{n}} < 1.96\Big) = .95.$$

The event in parentheses is identical to the event $\bar Y - 1.96/\sqrt{n} < \mu < \bar Y + 1.96/\sqrt{n}$, so

$$P(\bar Y - 1.96/\sqrt{n} < \mu < \bar Y + 1.96/\sqrt{n}) = .95. \tag{C.17}$$

Equation (C.17) is interesting because it tells us that the probability that the random interval $[\bar Y - 1.96/\sqrt{n},\; \bar Y + 1.96/\sqrt{n}]$ contains the population mean μ is .95, or 95%. This information allows us to construct an interval estimate of μ, which is obtained by plugging in the sample outcome of the average, $\bar y$. Thus,

$$[\bar y - 1.96/\sqrt{n},\; \bar y + 1.96/\sqrt{n}] \tag{C.18}$$

is an example of an interval estimate of μ. It is also called a 95% confidence interval. A shorthand notation for this interval is $\bar y \pm 1.96/\sqrt{n}$.

The confidence interval in equation (C.18) is easy to compute, once the sample data $\{y_1, y_2, \ldots, y_n\}$ are observed; $\bar y$ is the only factor that depends on the data. For example, suppose that n = 16 and the average of the 16 data points is 7.3. Then, the 95% confidence interval for μ is $7.3 \pm 1.96/\sqrt{16} = 7.3 \pm .49$, which we can write in interval form as [6.81, 7.79]. By construction, $\bar y = 7.3$ is in the center of this interval.
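As a sketch of this computation (plain Python; the numbers are those of the n = 16 example above, and the 1.96 critical value is hard-coded):

```python
import math

# 95% confidence interval for mu when the population is Normal(mu, 1),
# as in equation (C.18): ybar +/- 1.96/sqrt(n).
n, ybar = 16, 7.3
half_width = 1.96 / math.sqrt(n)

print(f"95% CI: [{ybar - half_width:.2f}, {ybar + half_width:.2f}]")
# Prints [6.81, 7.79], matching the example in the text.
```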
Unlike its computation, the meaning of a confidence interval is more difficult to understand. When we say that equation (C.18) is a 95% confidence interval for μ, we mean that the random interval

$$[\bar Y - 1.96/\sqrt{n},\; \bar Y + 1.96/\sqrt{n}] \tag{C.19}$$

contains μ with probability .95. In other words, before the random sample is drawn, there is a 95% chance that (C.19) contains μ. Equation (C.19) is an example of an interval estimator. It is a random interval, since the endpoints change with different samples.

A confidence interval is often interpreted as follows: "The probability that μ is in the interval (C.18) is .95." This is incorrect. Once the sample has been observed and $\bar y$ has been computed, the limits of the confidence interval are simply numbers (6.81 and 7.79 in the example just given). The population parameter, μ, though unknown, is also just some number. Therefore, μ either is or is not in the interval (C.18), and we will never know with certainty which is the case. Probability plays no role once the confidence interval is computed for the particular data at hand. The probabilistic interpretation comes from the fact that, for 95% of all random samples, the constructed confidence interval will contain μ.

To emphasize the meaning of a confidence interval, Table C.2 contains calculations for 20 random samples (or replications) from the Normal(2, 1) distribution with sample size n = 10. For each of the 20 samples, $\bar y$ is obtained, and (C.18) is computed as $\bar y \pm 1.96/\sqrt{10} = \bar y \pm .62$ (each rounded to two decimals). As you can see, the interval changes with each random sample. Nineteen of the 20 intervals contain the population value of μ. Only for replication number 19 is μ not in the confidence interval. In other words, 95% of the samples result in a confidence interval that contains μ. This did not have to be the case with only 20 replications, but it worked out that way for this particular simulation.

Table C.2  Simulated Confidence Intervals from a Normal(μ, 1) Distribution with μ = 2

Replication     ȳ       95% Interval      Contains μ?
 1             1.98    (1.36, 2.60)      Yes
 2             1.43    (0.81, 2.05)      Yes
 3             1.65    (1.03, 2.27)      Yes
 4             1.88    (1.26, 2.50)      Yes
 5             2.34    (1.72, 2.96)      Yes
 6             2.58    (1.96, 3.20)      Yes
 7             1.58    (0.96, 2.20)      Yes
 8             2.23    (1.61, 2.85)      Yes
 9             1.96    (1.34, 2.58)      Yes
10             2.11    (1.49, 2.73)      Yes
11             2.15    (1.53, 2.77)      Yes
12             1.93    (1.31, 2.55)      Yes
13             2.02    (1.40, 2.64)      Yes
14             2.10    (1.48, 2.72)      Yes
15             2.18    (1.56, 2.80)      Yes
16             2.10    (1.48, 2.72)      Yes
17             1.94    (1.32, 2.56)      Yes
18             2.21    (1.59, 2.83)      Yes
19             1.16    (0.54, 1.78)      No
20             1.75    (1.13, 2.37)      Yes
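Rerunning the Table C.2 experiment with many more replications shows the 95% coverage property emerging. A minimal NumPy sketch (arbitrary seed):

```python
import numpy as np

# Coverage of the interval ybar +/- 1.96/sqrt(n) for a Normal(2, 1) population.
rng = np.random.default_rng(4)
mu, n, reps = 2.0, 10, 10_000

ybar = rng.normal(mu, 1.0, size=(reps, n)).mean(axis=1)
half = 1.96 / np.sqrt(n)
covered = (ybar - half < mu) & (mu < ybar + half)

print("fraction of intervals containing mu:", covered.mean())  # ~ 0.95
```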
C.5b Confidence Intervals for the Mean from a Normally Distributed Population

The confidence interval derived in equation (C.18) helps illustrate how to construct and interpret confidence intervals. In practice, equation (C.18) is not very useful for the mean of a normal population because it assumes that the variance is known to be unity. It is easy to extend (C.18) to the case where the standard deviation, σ, is known to be any value: the 95% confidence interval is

$$[\bar y - 1.96\sigma/\sqrt{n},\; \bar y + 1.96\sigma/\sqrt{n}]. \tag{C.20}$$

Therefore, provided σ is known, a confidence interval for μ is readily constructed. To allow for unknown σ, we must use an estimate. Let

$$s = \Big(\frac{1}{n-1}\sum_{i=1}^n (y_i - \bar y)^2\Big)^{1/2} \tag{C.21}$$

denote the sample standard deviation. Then, we obtain a confidence interval that depends entirely on the observed data by replacing σ in equation (C.20) with its estimate, s. Unfortunately, this does not preserve the 95% level of confidence because s depends on the particular sample. In other words, the random interval $[\bar Y \pm 1.96(S/\sqrt{n})]$ no longer contains μ with probability .95 because the constant σ has been replaced with the random variable S.

How should we proceed? Rather than using the standard normal distribution, we must rely on the t distribution. The t distribution arises from the fact that

$$\frac{\bar Y - \mu}{S/\sqrt{n}} \sim t_{n-1}, \tag{C.22}$$

where $\bar Y$ is the sample average and S is the sample standard deviation of the random sample $\{Y_1, \ldots, Y_n\}$. We will not prove (C.22); a careful proof can be found in a variety of places [for example, Larsen and Marx (1986, Chapter 7)].

To construct a 95% confidence interval, let c denote the 97.5th percentile in the $t_{n-1}$ distribution. In other words, c is the value such that 95% of the area in the $t_{n-1}$ distribution is between −c and c: $P(-c < t_{n-1} < c) = .95$. The value of c depends on the degrees of freedom, n − 1, but we do not make this explicit. The choice of c is illustrated in Figure C.4.

[Figure C.4: The 97.5th percentile, c, in a t distribution; the area between −c and c is .95, with area .025 in each tail.]

Once c has been properly chosen, the random interval $[\bar Y - c \cdot S/\sqrt{n},\; \bar Y + c \cdot S/\sqrt{n}]$ contains μ with probability .95. For a particular sample, the 95% confidence interval is calculated as

$$[\bar y - c \cdot s/\sqrt{n},\; \bar y + c \cdot s/\sqrt{n}]. \tag{C.23}$$

The values of c for various degrees of freedom can be obtained from Table G.2 in Appendix G. For example, if n = 20, so that the df is n − 1 = 19, then c = 2.093. Thus, the 95% confidence interval is $[\bar y \pm 2.093(s/\sqrt{20})]$, where $\bar y$ and s are the values obtained from the sample. Even if s = σ (which is very unlikely), the confidence interval in (C.23) is wider than that in (C.20) because c > 1.96. For small degrees of freedom, (C.23) is much wider.

More generally, let $c_\alpha$ denote the 100(1 − α) percentile in the $t_{n-1}$ distribution. Then, a 100(1 − α)% confidence interval is obtained as

$$[\bar y - c_{\alpha/2}\, s/\sqrt{n},\; \bar y + c_{\alpha/2}\, s/\sqrt{n}]. \tag{C.24}$$

Obtaining $c_{\alpha/2}$ requires choosing α and knowing the degrees of freedom, n − 1; then, Table G.2 can be used. For the most part, we will concentrate on 95% confidence intervals.

There is a simple way to remember how to construct a confidence interval for the mean of a normal distribution. Recall that $\mathrm{sd}(\bar Y) = \sigma/\sqrt{n}$. Thus, $s/\sqrt{n}$ is the point estimate of $\mathrm{sd}(\bar Y)$. The associated random variable, $S/\sqrt{n}$, is sometimes called the standard error of $\bar Y$. Because what shows up in formulas is the point estimate $s/\sqrt{n}$, we define the standard error of $\bar y$ as $\mathrm{se}(\bar y) = s/\sqrt{n}$. Then, (C.24) can be written in shorthand as

$$[\bar y \pm c_{\alpha/2}\, \mathrm{se}(\bar y)]. \tag{C.25}$$

This equation shows why the notion of the standard error of an estimate plays an important role in econometrics.
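In software, the percentile c comes from the t quantile function rather than a table. A sketch using SciPy (the data are made up; `stats.t.ppf(0.975, df)` returns the c of (C.23)):

```python
import numpy as np
from scipy import stats

# 95% t-based confidence interval for a population mean, equations (C.23)/(C.25).
y = np.array([6.9, 7.8, 7.1, 6.5, 8.0, 7.4, 7.2, 6.8])
n = len(y)

ybar = y.mean()
se = y.std(ddof=1) / np.sqrt(n)      # se(ybar) = s / sqrt(n)
c = stats.t.ppf(0.975, df=n - 1)     # 97.5th percentile of t_{n-1}

print(f"ybar = {ybar:.3f}, se = {se:.3f}, c = {c:.3f}")
print(f"95% CI: [{ybar - c*se:.3f}, {ybar + c*se:.3f}]")
```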
Example C.2  Effect of Job Training Grants on Worker Productivity

Holzer, Block, Cheatham, and Knott (1993) studied the effects of job training grants on worker productivity by collecting information on "scrap rates" for a sample of Michigan manufacturing firms receiving job training grants in 1988. Table C.3 lists the scrap rates, measured as number of items per 100 produced that are not usable and therefore need to be scrapped, for 20 firms. Each of these firms received a job training grant in 1988; there were no grants awarded in 1987. We are interested in constructing a confidence interval for the change in the scrap rate from 1987 to 1988 for the population of all manufacturing firms that could have received grants.

Table C.3  Scrap Rates for 20 Michigan Manufacturing Firms

Firm      1987      1988      Change
 1       10         3        −7
 2        1         1         0
 3        6         5        −1
 4         .45       .5        .05
 5        1.25      1.54       .29
 6        1.3       1.5        .2
 7        1.06       .8       −.26
 8        3         2        −1
 9        8.18       .67     −7.51
10        1.67      1.17      −.5
11         .98       .51      −.47
12        1          .5       −.5
13         .45       .61       .16
14        5.03      6.7       1.67
15        8         4        −4
16        9         7        −2
17       18        19         1
18         .28       .2       −.08
19        7         5        −2
20        3.97      3.83      −.14
Average   4.38      3.23     −1.15

We assume that the change in scrap rates has a normal distribution. Since n = 20, a 95% confidence interval for the mean change in scrap rates, μ, is $[\bar y \pm 2.093\, \mathrm{se}(\bar y)]$, where $\mathrm{se}(\bar y) = s/\sqrt{n}$. The value 2.093 is the 97.5th percentile in a $t_{19}$ distribution. For the particular sample values, $\bar y = -1.15$ and $\mathrm{se}(\bar y) = .54$ (each rounded to two decimals), so the 95% confidence interval is [−2.28, −.02]. The value zero is excluded from this interval, so we conclude that, with 95% confidence, the average change in scrap rates in the population is not zero.

At this point, Example C.2 is mostly illustrative because it has some potentially serious flaws as an econometric analysis. Most importantly, it assumes that any systematic reduction in scrap rates is due to the job training grants. But many things can happen over the course of the year to change worker productivity. From this analysis, we have no way of knowing whether the fall in average scrap rates is attributable to the job training grants or if, at least partly, some external force is responsible.

C.5c A Simple Rule of Thumb for a 95% Confidence Interval

The confidence interval in (C.25) can be computed for any sample size and any confidence level. As we saw in Section B.5, the t distribution approaches the standard normal distribution as the degrees of freedom gets large. In particular, for α = .05, $c_{\alpha/2} \to 1.96$ as $n \to \infty$, although $c_{\alpha/2}$ is always greater than 1.96 for each n. A rule of thumb for an approximate 95% confidence interval is

$$[\bar y \pm 2\, \mathrm{se}(\bar y)]. \tag{C.26}$$

In other words, we obtain $\bar y$ and its standard error and then compute $\bar y$ plus or minus twice its standard error to obtain the confidence interval. This is slightly too wide for very large n, and it is too narrow for small n. As we can see from Example C.2, even for n as small as 20, (C.26) is in the ballpark for a 95% confidence interval for the mean from a normal distribution. This means we can get pretty close to a 95% confidence interval without having to refer to t tables.
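The interval in Example C.2 can be reproduced from the Change column of Table C.3. A sketch (the data are transcribed from the table):

```python
import numpy as np
from scipy import stats

# Change in scrap rates, 1987 to 1988, for the 20 firms in Table C.3.
change = np.array([-7, 0, -1, .05, .29, .2, -.26, -1, -7.51, -.5,
                   -.47, -.5, .16, 1.67, -4, -2, 1, -.08, -2, -.14])
n = len(change)

ybar = change.mean()                      # about -1.15
se = change.std(ddof=1) / np.sqrt(n)      # about 0.54
c = stats.t.ppf(0.975, df=n - 1)          # 2.093 for 19 df

# Approximately [-2.28, -0.03]; the text reports [-2.28, -.02] from rounded inputs.
print(f"95% CI:        [{ybar - c*se:.2f}, {ybar + c*se:.2f}]")
print(f"rule of thumb: [{ybar - 2*se:.2f}, {ybar + 2*se:.2f}]")   # eq. (C.26)
```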
C.5d Asymptotic Confidence Intervals for Nonnormal Populations

In some applications, the population is clearly nonnormal. A leading case is the Bernoulli distribution, where the random variable takes on only the values zero and one. In other cases, the nonnormal population has no standard distribution. This does not matter, provided the sample size is sufficiently large for the central limit theorem to give a good approximation for the distribution of the sample average, $\bar Y$. For large n, an approximate 95% confidence interval is

$$[\bar y \pm 1.96\, \mathrm{se}(\bar y)], \tag{C.27}$$

where the value 1.96 is the 97.5th percentile in the standard normal distribution. Mechanically, computing an approximate confidence interval does not differ from the normal case. A slight difference is that the number multiplying the standard error comes from the standard normal distribution, rather than the t distribution, because we are using asymptotics. Because the t distribution approaches the standard normal as the df increases, equation (C.25) is also perfectly legitimate as an approximate 95% interval; some prefer this to (C.27) because the former is exact for normal populations.

Example C.3  Race Discrimination in Hiring

The Urban Institute conducted a study in 1988 in Washington, D.C., to examine the extent of race discrimination in hiring. Five pairs of people interviewed for several jobs. In each pair, one person was black and the other person was white. They were given résumés indicating that they were virtually the same in terms of experience, education, and other factors that determine job qualification. The idea was to make individuals as similar as possible, with the exception of race. Each person in a pair interviewed for the same job, and the researchers recorded which applicant received a job offer. This is an example of a matched pairs analysis, where each trial consists of data on two people (or two firms, two cities, and so on) that are thought to be similar in many respects but different in one important characteristic.

Let $\theta_B$ denote the probability that the black person is offered a job, and let $\theta_W$ be the probability that the white person is offered a job. We are primarily interested in the difference, $\theta_B - \theta_W$. Let $B_i$ denote a Bernoulli variable equal to one if the black person gets a job offer from employer i, and zero otherwise. Similarly, $W_i = 1$ if the white person gets a job offer from employer i, and zero otherwise. Pooling across the five pairs of people, there were a total of n = 241 trials (pairs of interviews with employers). Unbiased estimators of $\theta_B$ and $\theta_W$ are $\bar B$ and $\bar W$, the fractions of interviews for which blacks and whites were offered jobs, respectively.

To put this into the framework of computing a confidence interval for a population mean, define a new variable $Y_i = B_i - W_i$. Now, $Y_i$ can take on three values: −1 if the black person did not get the job but the white person did, 0 if both people either did or did not get the job, and 1 if the black person got the job and the white person did not. Then, $\mu \equiv E(Y_i) = E(B_i) - E(W_i) = \theta_B - \theta_W$.

The distribution of $Y_i$ is certainly not normal; it is discrete and takes on only three values. Nevertheless, an approximate confidence interval for $\theta_B - \theta_W$ can be obtained by using large sample methods.

The data from the Urban Institute audit study are in the file AUDIT. Using the 241 observed data points, $\bar b = .224$ and $\bar w = .357$, so $\bar y = .224 - .357 = -.133$. Thus, 22.4% of black applicants were offered jobs, while 35.7% of white applicants were offered jobs. This is prima facie evidence of discrimination against blacks, but we can learn much more by computing a confidence interval for μ. To compute an approximate 95% confidence interval, we need the sample standard deviation. This turns out to be s = .482 [using equation (C.21)]. Using (C.27), we obtain a 95% CI for $\mu = \theta_B - \theta_W$ as $-.133 \pm 1.96(.482/\sqrt{241}) = -.133 \pm .061 = [-.194, -.072]$. The approximate 99% CI is $-.133 \pm 2.58(.482/\sqrt{241}) = [-.213, -.053]$. Naturally, this contains a wider range of values than the 95% CI. But even the 99% CI does not contain the value zero. Thus, we are very confident that the population difference $\theta_B - \theta_W$ is not zero.
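Given the summary statistics reported in Example C.3, the intervals follow in a few lines. A sketch (plain Python; the inputs .224 − .357 = −.133 and s = .482 are taken from the text rather than recomputed from the AUDIT file):

```python
import math

# Approximate CIs for theta_B - theta_W in the audit study (Example C.3).
n, ybar, s = 241, -0.133, 0.482
se = s / math.sqrt(n)                      # about 0.031

for label, z in [("95%", 1.96), ("99%", 2.58)]:
    print(f"{label} CI: [{ybar - z*se:.3f}, {ybar + z*se:.3f}]")
# 95%: about [-0.194, -0.072]; 99%: about [-0.213, -0.053]. Neither contains 0.
```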
Before we turn to hypothesis testing, it is useful to review the various population and sample quantities that measure the spreads in the population distributions and the sampling distributions of the estimators. These quantities appear often in statistical analysis, and extensions of them are important for the regression analysis in the main text. The quantity σ is the (unknown) population standard deviation; it is a measure of the spread in the distribution of Y. When we divide σ by $\sqrt{n}$, we obtain the sampling standard deviation of $\bar Y$ (the sample average). While σ is a fixed feature of the population, $\mathrm{sd}(\bar Y) = \sigma/\sqrt{n}$ shrinks to zero as $n \to \infty$: our estimator of μ gets more and more precise as the sample size grows.

The estimate of σ for a particular sample, s, is called the sample standard deviation because it is obtained from the sample. (We also call the underlying random variable, S, which changes across different samples, the sample standard deviation.) Like $\bar y$ as an estimate of μ, s is our "best guess" at σ, given the sample at hand. The quantity $s/\sqrt{n}$ is what we call the standard error of $\bar y$, and it is our best estimate of $\sigma/\sqrt{n}$. Confidence intervals for the population parameter μ depend directly on $\mathrm{se}(\bar y) = s/\sqrt{n}$. Because this standard error shrinks to zero as the sample size grows, a larger sample size generally means a smaller confidence interval. Thus, we see clearly that one benefit of more data is that they result in narrower confidence intervals. The notion of the standard error of an estimate, which in the vast majority of cases shrinks to zero at the rate $1/\sqrt{n}$, plays a fundamental role in hypothesis testing (as we will see in the next section) and for confidence intervals and testing in the context of multiple regression (as discussed in Chapter 4).

C.6 Hypothesis Testing

So far, we have reviewed how to evaluate point estimators, and we have seen, in the case of a population mean, how to construct and interpret confidence intervals. But sometimes the question we are interested in has a definite yes or no answer. Here are some examples: (1) Does a job training program effectively increase average worker productivity? (see Example C.2); (2) Are blacks discriminated against in hiring? (see Example C.3); (3) Do stiffer state drunk driving laws reduce the number of drunk driving arrests? Devising methods for answering such questions, using a sample of data, is known as hypothesis testing.

C.6a Fundamentals of Hypothesis Testing
To illustrate the issues involved with hypothesis testing, consider an election example. Suppose there are two candidates in an election, Candidates A and B. Candidate A is reported to have received 42% of the popular vote, while Candidate B received 58%. These are supposed to represent the true percentages in the voting population, and we treat them as such.

Candidate A is convinced that more people must have voted for him, so he would like to investigate whether the election was rigged. Knowing something about statistics, Candidate A hires a consulting agency to randomly sample 100 voters to record whether or not each person voted for him. Suppose that, for the sample collected, 53 people voted for Candidate A. This sample estimate of 53% clearly exceeds the reported population value of 42%. Should Candidate A conclude that the election was indeed a fraud?

While it appears that the votes for Candidate A were undercounted, we cannot be certain. Even if only 42% of the population voted for Candidate A, it is possible that, in a sample of 100, we observe 53 people who did vote for Candidate A. The question is: How strong is the sample evidence against the officially reported percentage of 42%?
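One informal way to gauge the strength of this evidence is to ask how likely a count of 53 or more would be in a sample of 100 if the reported 42% were the truth. The sketch below uses SciPy's binomial distribution; it previews, but is not part of, the formal testing framework developed next:

```python
from scipy import stats

# If theta = .42 truly, how often would a random sample of 100 voters
# contain 53 or more votes for Candidate A?
n, theta, observed = 100, 0.42, 53

p = stats.binom.sf(observed - 1, n, theta)  # P(X >= 53), X ~ Binomial(100, .42)
print(f"P(at least {observed} of {n} | theta = {theta}) = {p:.4f}")
# This probability is small (on the order of .02), so such a sample would be
# fairly unusual if the reported percentage were correct.
```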
alternative hypotheses so that they are exhaustive in which case our null hypothesis should be H0 u 42 Stated in this way the null hypothesis is a composite null hypothesis because it allows for more than one value under H0 By contrast equa tion C28 is an example of a simple null hypothesis For these kinds of examples it does not mat ter whether we state the null as in C28 or as a composite null the most difficult value to reject if u 42 is u 5 42 That is if we reject the value u 5 42 against u 42 then logically we must reject any value less than 42 Therefore our testing procedure based on C28 leads to the same test as if H0 u 42 In this text we always state a null hypothesis as a simple null hypothesis In hypothesis testing we can make two kinds of mistakes First we can reject the null hypothesis when it is in fact true This is called a Type I error In the election example a Type I error occurs if we reject H0 when the true proportion of people voting for Candidate A is in fact 42 The second kind of error is failing to reject H0 when it is actually false This is called a Type II error In the election example a Type II error occurs if u 42 but we fail to reject H0 After we have made the decision of whether or not to reject the null hypothesis we have either decided correctly or we have committed an error We will never know with certainty whether an error was committed However we can compute the probability of making either a Type I or a Type II error Hypothesis testing rules are constructed to make the probability of committing a Type I error fairly small Generally we define the significance level or simply the level of a test as the probability of a Type I error it is typically denoted by a Symbolically we have a 5 P1Reject H0 0 H02 C30 The righthand side is read as The probability of rejecting H0 given that H0 is true Classical hypothesis testing requires that we initially specify a significance level for a test When we specify a value for a we are essentially quantifying our tolerance for a Type I error Common val ues for a are 10 05 and 01 If a 5 05 then the researcher is willing to falsely reject H0 5 of the time in order to detect deviations from H0 Once we have chosen the significance level we would then like to minimize the probability of a Type II error Alternatively we would like to maximize the power of a test against all relevant alter natives The power of a test is just one minus the probability of a Type II error Mathematically p1u2 5 P1Reject H0 0 u2 5 1 2 P1Type II 0 u2 Copyright 2016 Cengage Learning All Rights Reserved May not be copied scanned or duplicated in whole or in part Due to electronic rights some third party content may be suppressed from the eBook andor eChapters Editorial review has deemed that any suppressed content does not materially affect the overall learning experience Cengage Learning reserves the right to remove additional content at any time if subsequent rights restrictions require it Appendix C Fundamentals of Mathematical Statistics 695 where u denotes the actual value of the parameter Naturally we would like the power to equal unity whenever the null hypothesis is false But this is impossible to achieve while keeping the significance level small Instead we choose our tests to maximize the power for a given significance level C6b Testing Hypotheses about the Mean in a Normal Population In order to test a null hypothesis against an alternative we need to choose a test statistic or statistic for short and a critical value The choices for the 
statistic and critical value are based on convenience and on the desire to maximize power given a significance level for the test In this subsection we re view how to test hypotheses for the mean of a normal population A test statistic denoted T is some function of the random sample When we compute the sta tistic for a particular outcome we obtain an outcome of the test statistic which we will denote by t Given a test statistic we can define a rejection rule that determines when H0 is rejected in fa vor of H1 In this text all rejection rules are based on comparing the value of a test statistic t to a critical value c The values of t that result in rejection of the null hypothesis are collectively known as the rejection region To determine the critical value we must first decide on a significance level of the test Then given a the critical value associated with a is determined by the distribution of T assuming that H0 is true We will write this critical value as c suppressing the fact that it depends on a Testing hypotheses about the mean m from a Normal1m s22 population is straightforward The null hypothesis is stated as H0 m 5 m0 C31 where m0 is a value that we specify In the majority of applications m0 5 0 but the general case is no more difficult The rejection rule we choose depends on the nature of the alternative hypothesis The three alter natives of interest are H1 m m0 C32 H1 m m0 C33 and H1 m 2 m0 C34 Equation C32 gives a onesided alternative as does C33 When the alternative hypothesis is C32 the null is effectively H0 m m0 since we reject H0 only when m m0 This is appropriate when we are interested in the value of m only when m is at least as large as m0 Equation C34 is a twosided alternative This is appropriate when we are interested in any departure from the null hypothesis Consider first the alternative in C32 Intuitively we should reject H0 in favor of H1 when the value of the sample average y is sufficiently greater than m0 But how should we determine when y is large enough for H0 to be rejected at the chosen significance level This requires knowing the prob ability of rejecting the null hypothesis when it is true Rather than working directly with y we use its standardized version where s is replaced with the sample standard deviation s t 5 n1y 2 m02s 5 1y 2 m02se1y2 C35 where se1y2 5 sn is the standard error of y Given the sample of data it is easy to obtain t We work with t because under the null hypothesis the random variable T 5 n1Y 2 m02S Copyright 2016 Cengage Learning All Rights Reserved May not be copied scanned or duplicated in whole or in part Due to electronic rights some third party content may be suppressed from the eBook andor eChapters Editorial review has deemed that any suppressed content does not materially affect the overall learning experience Cengage Learning reserves the right to remove additional content at any time if subsequent rights restrictions require it Appendices 696 has a tn21 distribution Now suppose we have settled on a 5 significance level Then the critical value c is chosen so that P1T c 0 H02 5 05 that is the probability of a Type I error is 5 Once we have found c the rejection rule is t c C36 where c is the 10011 2 a2 percentile in a tn21 distribution as a percent the significance level is 100 a This is an example of a onetailed test because the rejection region is in one tail of the t dis tribution For a 5 significance level c is the 95th percentile in the tn21 distribution this is illustrated in Figure C5 A different significance level 
leads to a different critical value The statistic in equation C35 is often called the t statistic for testing H0 m 5 m0 The t statistic measures the distance from y to m0 relative to the standard error of y se1y2 ExamplE C4 Effect of Enterprise Zones on Business Investments In the population of cities granted enterprise zones in a particular state see Papke 1994 for Indiana let Y denote the percentage change in investment from the year before to the year after a city became an enterprise zone Assume that Y has a Normal1m s22 distribution The null hypothesis that enter prise zones have no effect on business investment is H0 m 5 0 the alternative that they have a posi tive effect is H1 m 0 We assume that they do not have a negative effect Suppose that we wish to test H0 at the 5 level The test statistic in this case is t 5 y sn 5 y se1y2 C37 Suppose that we have a sample of 36 cities that are granted enterprise zones Then the critical value is c 5 169 see Table G2 and we reject H0 in favor of H1 if t 169 Suppose that the sample yields y 5 82 and s 5 239 Then t 206 and H0 is therefore rejected at the 5 level Thus we conclude 0 c rejection area 05 area 95 Figure C5 Rejection region for a 5 significance level test against the onesided alternative m m0 Copyright 2016 Cengage Learning All Rights Reserved May not be copied scanned or duplicated in whole or in part Due to electronic rights some third party content may be suppressed from the eBook andor eChapters Editorial review has deemed that any suppressed content does not materially affect the overall learning experience Cengage Learning reserves the right to remove additional content at any time if subsequent rights restrictions require it Appendix C Fundamentals of Mathematical Statistics 697 that at the 5 significance level enterprise zones have an effect on average investment The 1 criti cal value is 244 so H0 is not rejected at the 1 level The same caveat holds here as in Example C2 we have not controlled for other factors that might affect investment in cities over time so we cannot claim that the effect is causal The rejection rule is similar for the onesided alternative C33 A test with a significance level of 100 a rejects H0 against C33 whenever t 2 c C38 in other words we are looking for negative values of the t statisticwhich implies y m0that are sufficiently far from zero to reject H0 For twosided alternatives we must be careful to choose the critical value so that the significance level of the test is still a If H1 is given by H1 m 2 m0 then we reject H0 if y is far from m0 in abso lute value a y much larger or much smaller than m0 provides evidence against H0 in favor of H1 A 100 a level test is obtained from the rejection rule 0t0 c C39 where 0t0 is the absolute value of the t statistic in C35 This gives a twotailed test We must now be careful in choosing the critical value c is the 10011 2 a22 percentile in the tn21 distribution For ex ample if a 5 05 then the critical value is the 975th percentile in the tn21 distribution This ensures that H0 is rejected only 5 of the time when it is true see Figure C6 For example if n 5 22 then the critical value is c 5 208 the 975th percentile in a t21 distribution see Table G2 The absolute value of the t statistic must exceed 208 in order to reject H0 against H1 at the 5 level It is important to know the proper language of hypothesis testing Sometimes the appropriate phrase we fail to reject H0 in favor of H1 at the 5 significance level is replaced with we accept H0 at the 5 significance 
level The latter wording is incorrect With the same set of data there are 0 c area 025 area 025 c area 95 rejection region rejection region Figure C6 Rejection region for a 5 significance level test against the twosided alternative H1 m 2 m0 Copyright 2016 Cengage Learning All Rights Reserved May not be copied scanned or duplicated in whole or in part Due to electronic rights some third party content may be suppressed from the eBook andor eChapters Editorial review has deemed that any suppressed content does not materially affect the overall learning experience Cengage Learning reserves the right to remove additional content at any time if subsequent rights restrictions require it Appendices 698 usually many hypotheses that cannot be rejected In the earlier election example it would be logically inconsistent to say that H0 u 5 42 and H0 u 5 43 are both accepted since only one of these can be true But it is entirely possible that neither of these hypotheses is rejected For this reason we al ways say fail to reject H0 rather than accept H0 C6c Asymptotic Tests for Nonnormal Populations If the sample size is large enough to invoke the central limit theorem see Section C3 the mechanics of hypothesis testing for population means are the same whether or not the population distribution is normal The theoretical justification comes from the fact that under the null hypothesis T 5 n1Y 2 m02S a Normal1012 Therefore with large n we can compare the t statistic in C35 with the critical values from a stan dard normal distribution Because the tn21 distribution converges to the standard normal distribution as n gets large the t and standard normal critical values will be very close for extremely large n Be cause asymptotic theory is based on n increasing without bound it cannot tell us whether the standard normal or t critical values are better For moderate values of n say between 30 and 60 it is traditional to use the t distribution because we know this is correct for normal populations For n 120 the choice between the t and standard normal distributions is largely irrelevant because the critical values are practically the same Because the critical values chosen using either the standard normal or t distribution are only approximately valid for nonnormal populations our chosen significance levels are also only approxi mate thus for nonnormal populations our significance levels are really asymptotic significance lev els Thus if we choose a 5 significance level but our population is nonnormal then the actual significance level will be larger or smaller than 5 and we cannot know which is the case When the sample size is large the actual significance level will be very close to 5 Practically speaking the distinction is not important so we will now drop the qualifier asymptotic ExamplE C5 Race Discrimination in Hiring In the Urban Institute study of discrimination in hiring see Example C3 using the data in AUDIT we are primarily interested in testing H0 m 5 0 against H1 m 0 where m 5 uB 2 uW is the differ ence in probabilities that blacks and whites receive job offers Recall that m is the population mean of the variable Y 5 B 2 W where B and W are binary indicators Using the n 5 241 paired compari sons in the data file AUDIT we obtained y 5 2133 and se1y2 5 482241 031 The t statis tic for testing H0 m 5 0 is t 5 2133031 2429 You will remember from Appendix B that the standard normal distribution is for practical purposes indistinguishable from the t distribution with 240 degrees of freedom The value 2429 is 
Example C.5 Race Discrimination in Hiring

In the Urban Institute study of discrimination in hiring (see Example C.3), using the data in AUDIT, we are primarily interested in testing H0: μ = 0 against H1: μ < 0, where μ = θB − θW is the difference in probabilities that blacks and whites receive job offers. Recall that μ is the population mean of the variable Y = B − W, where B and W are binary indicators. Using the n = 241 paired comparisons in the data file AUDIT, we obtained ȳ = −.133 and se(ȳ) = .482/√241 ≈ .031. The t statistic for testing H0: μ = 0 is t = −.133/.031 ≈ −4.29. You will remember from Appendix B that the standard normal distribution is, for practical purposes, indistinguishable from the t distribution with 240 degrees of freedom. The value −4.29 is so far out in the left tail of the distribution that we reject H0 at any reasonable significance level. In fact, the .005 (one-half of a percent) critical value for the one-sided test is about −2.58. A t value of −4.29 is very strong evidence against H0 in favor of H1. Hence, we conclude that there is discrimination in hiring.

C.6d Computing and Using p-Values

The traditional requirement of choosing a significance level ahead of time means that different researchers, using the same data and same procedure to test the same hypothesis, could wind up with different conclusions. Reporting the significance level at which we are carrying out the test solves this problem to some degree, but it does not completely remove the problem.

To provide more information, we can ask the following question: What is the largest significance level at which we could carry out the test and still fail to reject the null hypothesis? This value is known as the p-value of a test (sometimes called the prob-value). Compared with choosing a significance level ahead of time and obtaining a critical value, computing a p-value is somewhat more difficult. But with the advent of quick and inexpensive computing, p-values are now fairly easy to obtain.

As an illustration, consider the problem of testing H0: μ = 0 in a Normal(μ, σ²) population. Our test statistic in this case is T = √n·Ȳ/S, and we assume that n is large enough to treat T as having a standard normal distribution under H0. Suppose that the observed value of T for our sample is t = 1.52. (Note how we have skipped the step of choosing a significance level.) Now that we have seen the value t, we can find the largest significance level at which we would fail to reject H0. This is the significance level associated with using t as our critical value. Because our test statistic T has a standard normal distribution under H0, we have

p-value = P(T > 1.52 | H0) = 1 − Φ(1.52) = .065,  (C.40)

where Φ(·) denotes the standard normal cdf. In other words, the p-value in this example is simply the area to the right of 1.52, the observed value of the test statistic, in a standard normal distribution.

[Figure C.7: The p-value when t = 1.52 for the one-sided alternative μ > μ0; the p-value is the area (.065) to the right of 1.52 under the standard normal density.]

Because the p-value = .065, the largest significance level at which we can carry out this test and fail to reject is 6.5%. If we carry out the test at a level below 6.5% (such as at 5%), we fail to reject H0. If we carry out the test at a level larger than 6.5% (such as 10%), we reject H0. With the p-value at hand, we can carry out the test at any level.

The p-value in this example has another useful interpretation: it is the probability that we observe a value of T as large as 1.52 when the null hypothesis is true. If the null hypothesis is actually true, we would observe a value of T as large as 1.52 due to chance only 6.5% of the time. Whether this is small enough to reject H0 depends on our tolerance for a Type I error. The p-value has a similar interpretation in all other cases, as we will see.
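The calculation in (C.40) takes one line in most statistical packages. A minimal sketch in Python with scipy (our code, not the text's):

    from scipy import stats

    t = 1.52
    p = stats.norm.sf(t)     # 1 - Phi(1.52): area to the right of 1.52
    print(round(p, 3))       # 0.064, i.e., about .065 after rounding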
Generally, small p-values are evidence against H0, since they indicate that the outcome of the data occurs with small probability if H0 is true. In the previous example, if t had been a larger value, say t = 2.85, then the p-value would be 1 − Φ(2.85) ≈ .002. This means that, if the null hypothesis were true, we would observe a value of T as large as 2.85 with probability .002. How do we interpret this? Either we obtained a very unusual sample or the null hypothesis is false. Unless we have a very small tolerance for Type I error, we would reject the null hypothesis. On the other hand, a large p-value is weak evidence against H0. If we had gotten t = .47 in the previous example, then the p-value = 1 − Φ(.47) = .32. Observing a value of T larger than .47 happens with probability .32, even when H0 is true; this is large enough so that there is insufficient doubt about H0, unless we have a very high tolerance for Type I error.

For hypothesis testing about a population mean using the t distribution, we need detailed tables in order to compute p-values. Table G.2 only allows us to put bounds on p-values. Fortunately, many statistics and econometrics packages now compute p-values routinely, and they also provide calculation of cdfs for the t and other distributions used for computing p-values.

Example C.6 Effect of Job Training Grants on Worker Productivity

Consider again the Holzer et al. (1993) data in Example C.2. From a policy perspective, there are two questions of interest. First, what is our best estimate of the mean change in scrap rates, μ? We have already obtained this for the sample of 20 firms listed in Table C.3: the sample average of the change in scrap rates is −1.15. Relative to the initial average scrap rate in 1987, this represents a fall in the scrap rate of about 26.3% (−1.15/4.38 ≈ −.263), which is a nontrivial effect.

We would also like to know whether the sample provides strong evidence for an effect in the population of manufacturing firms that could have received grants. The null hypothesis is H0: μ = 0, and we test this against H1: μ < 0, where μ is the average change in scrap rates. Under the null, the job training grants have no effect on average scrap rates. The alternative states that there is an effect. We do not care about the alternative μ > 0, so the null hypothesis is effectively H0: μ ≥ 0.

Since ȳ = −1.15 and se(ȳ) = .54, t = −1.15/.54 = −2.13. This is below the 5% critical value of −1.73 (from a t_19 distribution) but above the 1% critical value, −2.54. The p-value in this case is computed as

p-value = P(T_19 < −2.13),  (C.41)

where T_19 represents a t distributed random variable with 19 degrees of freedom. The inequality is reversed from (C.40) because the alternative has the form in (C.33). The probability in (C.41) is the area to the left of −2.13 in a t_19 distribution (see Figure C.8).

[Figure C.8: The p-value when t = −2.13 with 19 degrees of freedom, for the one-sided alternative μ < 0; the p-value is the area (.023) to the left of −2.13.]

Using Table G.2, the most we can say is that the p-value is between .01 and .025, but it is closer to .025 (since the 97.5th percentile is about 2.09). Using a statistical package, such as Stata, we can compute the exact p-value. It turns out to be about .023, which is reasonable evidence against H0. This is certainly enough evidence to reject the null hypothesis that the training grants had no effect at the 2.5% significance level (and therefore at the 5% level).
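The exact p-value in (C.41) is just as easy to obtain outside of Stata. A sketch in Python with scipy (our code; the numbers are the ones from Example C.6):

    from scipy import stats

    t = -1.15 / 0.54                 # about -2.13
    p = stats.t.cdf(t, df=19)        # area to the left of t in a t_19 distribution
    print(round(p, 3))               # about 0.023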
Computing a p-value for a two-sided test is similar, but we must account for the two-sided nature of the rejection rule. For t testing about population means, the p-value is computed as

P(|T_{n−1}| > |t|) = 2P(T_{n−1} > |t|),  (C.42)

where t is the value of the test statistic and T_{n−1} is a t random variable. (For large n, replace T_{n−1} with a standard normal random variable.) Thus, compute the absolute value of the t statistic, find the area to the right of this value in a t_{n−1} distribution, and multiply the area by two.

For nonnormal populations, the exact p-value can be difficult to obtain. Nevertheless, we can find asymptotic p-values by using the same calculations. These p-values are valid for large sample sizes. For n larger than, say, 120, we might as well use the standard normal distribution. Table G.1 is detailed enough to get accurate p-values, but we can also use a statistics or econometrics program.

Example C.7 Race Discrimination in Hiring

Using the matched pairs data from the Urban Institute in the AUDIT data file (n = 241), we obtained t = −4.29. If Z is a standard normal random variable, P(Z < −4.29) is, for practical purposes, zero. In other words, the asymptotic p-value for this example is essentially zero. This is very strong evidence against H0.

Summary of How to Use p-Values:
(i) Choose a test statistic T and decide on the nature of the alternative. This determines whether the rejection rule is t > c, t < −c, or |t| > c.
(ii) Use the observed value of the t statistic as the critical value and compute the corresponding significance level of the test. This is the p-value. If the rejection rule is of the form t > c, then p-value = P(T > t). If the rejection rule is t < −c, then p-value = P(T < t); if the rejection rule is |t| > c, then p-value = P(|T| > |t|).
(iii) If a significance level α has been chosen, then we reject H0 at the 100·α% level if p-value < α. If p-value ≥ α, then we fail to reject H0 at the 100·α% level. Therefore, it is a small p-value that leads to rejection of the null hypothesis.
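The three rejection rules in the summary translate directly into code. The helper below is a minimal sketch in Python with scipy (the function name and labels are ours, not from the text):

    from scipy import stats

    def p_value(t, df, alternative):
        """p-value for a t test; alternative is 'greater', 'less', or 'two-sided'."""
        if alternative == "greater":          # rejection rule t > c
            return stats.t.sf(t, df)
        if alternative == "less":             # rejection rule t < -c
            return stats.t.cdf(t, df)
        return 2 * stats.t.sf(abs(t), df)     # rejection rule |t| > c

    print(p_value(-2.13, 19, "less"))         # about .023, as in Example C.6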
C.6e The Relationship between Confidence Intervals and Hypothesis Testing

Because constructing confidence intervals and hypothesis tests both involve probability statements, it is natural to think that they are somehow linked. It turns out that they are. After a confidence interval has been constructed, we can carry out a variety of hypothesis tests.

The confidence intervals we have discussed are all two-sided by nature. (In this text, we will have no need to construct one-sided confidence intervals.) Thus, confidence intervals can be used to test against two-sided alternatives. In the case of a population mean, the null is given by (C.31), and the alternative is (C.34). Suppose we have constructed a 95% confidence interval for μ. Then, if the hypothesized value of μ under H0, μ0, is not in the confidence interval, then H0: μ = μ0 is rejected against H1: μ ≠ μ0 at the 5% level. If μ0 lies in this interval, then we fail to reject H0 at the 5% level. Notice how any value for μ0 can be tested once a confidence interval is constructed, and since a confidence interval contains more than one value, there are many null hypotheses that will not be rejected.

Example C.8 Training Grants and Worker Productivity

In the Holzer et al. example, we constructed a 95% confidence interval for the mean change in scrap rate, μ, as [−2.28, −.02]. Since zero is excluded from this interval, we reject H0: μ = 0 against H1: μ ≠ 0 at the 5% level. This 95% confidence interval also means that we fail to reject H0: μ = −2 at the 5% level. In fact, there is a continuum of null hypotheses that are not rejected given this confidence interval.

C.6f Practical versus Statistical Significance

In the examples covered so far, we have produced three kinds of evidence concerning population parameters: point estimates, confidence intervals, and hypothesis tests. These tools for learning about population parameters are equally important. There is an understandable tendency for students to focus on confidence intervals and hypothesis tests because these are things to which we can attach confidence or significance levels. But in any study, we must also interpret the magnitudes of point estimates.

The sign and magnitude of ȳ determine its practical significance and allow us to discuss the direction of an intervention or policy effect, and whether the estimated effect is large or small. On the other hand, the statistical significance of ȳ depends on the magnitude of its t statistic. For testing H0: μ = 0, the t statistic is simply t = ȳ/se(ȳ). In other words, statistical significance depends on the ratio of ȳ to its standard error. Consequently, a t statistic can be large because ȳ is large or se(ȳ) is small. In applications, it is important to discuss both practical and statistical significance, being aware that an estimate can be statistically significant without being especially large in a practical sense. Whether an estimate is practically important depends on the context as well as on one's judgment, so there are no set rules for determining practical significance.

Example C.9 Effect of Freeway Width on Commute Time

Let Y denote the change in commute time, measured in minutes, for commuters in a metropolitan area from before a freeway was widened to after the freeway was widened. Assume that Y ~ Normal(μ, σ²). The null hypothesis that the widening did not reduce average commute time is H0: μ = 0; the alternative that it reduced average commute time is H1: μ < 0. Suppose a random sample of commuters of size n = 900 is obtained to determine the effectiveness of the freeway project. The average change in commute time is computed to be ȳ = −3.6, and the sample standard deviation is s = 32.7; thus, se(ȳ) = 32.7/√900 = 1.09. The t statistic is t = −3.6/1.09 ≈ −3.30, which is very statistically significant; the p-value is about .0005. Thus, we conclude that the freeway widening had a statistically significant effect on average commute time.
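For concreteness, here is a sketch in Python (our code) that reproduces the Example C.9 calculations along with the 95% confidence interval, which can in turn be used to test H0: μ = 0 as described in Section C.6e:

    import numpy as np
    from scipy import stats

    n, ybar, s = 900, -3.6, 32.7
    se = s / np.sqrt(n)                        # 1.09
    t = ybar / se                              # about -3.30
    p = stats.norm.cdf(t)                      # one-sided p-value, about .0005

    ci = (ybar - 1.96 * se, ybar + 1.96 * se)  # about (-5.74, -1.46); excludes zero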
If the outcome of the hypothesis test is all that were reported from the study, it would be misleading. Reporting only statistical significance masks the fact that the estimated reduction in average commute time, 3.6 minutes, seems pretty meager, although this depends to some extent on what the average commute time was prior to widening the freeway. To be up front, we should report the point estimate of −3.6 along with the significance test.

Finding point estimates that are statistically significant without being practically significant can occur when we are working with large samples. To discuss why this happens, it is useful to have the following definition.

Test Consistency. A consistent test rejects H0 with probability approaching one as the sample size grows whenever H1 is true.

Another way to say that a test is consistent is that, as the sample size tends to infinity, the power of the test gets closer and closer to unity whenever H1 is true. All of the tests we cover in this text have this property. In the case of testing hypotheses about a population mean, test consistency follows because the variance of Ȳ converges to zero as the sample size gets large. The t statistic for testing H0: μ = 0 is T = Ȳ/(S/√n). Since plim(Ȳ) = μ and plim(S) = σ, it follows that if, say, μ > 0, then T gets larger and larger (with high probability) as n → ∞. In other words, no matter how close μ is to zero, we can be almost certain to reject H0: μ = 0 given a large enough sample size. This says nothing about whether μ is large in a practical sense.
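A small simulation makes test consistency concrete. The sketch below is entirely ours (the choices μ = .1 and σ = 1 are arbitrary illustrations); it estimates the rejection probability of a one-sided 5% test at several sample sizes, and the power climbs toward one as n grows even though μ is small:

    import numpy as np

    rng = np.random.default_rng(0)
    mu, reps, crit = 0.1, 2000, 1.645          # asymptotic 5% one-sided critical value
    for n in (25, 100, 400, 1600, 6400):
        y = rng.normal(mu, 1.0, size=(reps, n))
        t = y.mean(axis=1) / (y.std(axis=1, ddof=1) / np.sqrt(n))
        print(n, (t > crit).mean())            # rejection frequency rises toward 1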
C.7 Remarks on Notation

In our review of probability and statistics here and in Appendix B, we have been careful to use standard conventions to denote random variables, estimators, and test statistics. For example, we have used W to indicate an estimator (random variable) and w to denote a particular estimate (outcome of the random variable W). Distinguishing between an estimator and an estimate is important for understanding various concepts in estimation and hypothesis testing. However, making this distinction quickly becomes a burden in econometric analysis because the models are more complicated: many random variables and parameters will be involved, and being true to the usual conventions from probability and statistics requires many extra symbols.

In the main text, we use a simpler convention that is widely used in econometrics. If θ is a population parameter, the notation θ̂ ("theta hat") will be used to denote both an estimator and an estimate of θ. This notation is useful in that it provides a simple way of attaching an estimator to the population parameter it is supposed to be estimating. Thus, if the population parameter is β, then β̂ denotes an estimator or estimate of β; if the parameter is σ², σ̂² is an estimator or estimate of σ²; and so on. Sometimes, we will discuss two estimators of the same parameter, in which case we will need a different notation, such as θ̃ ("theta tilde").

Although dropping the conventions from probability and statistics to indicate estimators, random variables, and test statistics puts additional responsibility on you, it is not a big deal once the difference between an estimator and an estimate is understood. If we are discussing statistical properties of θ̂ (such as deriving whether or not it is unbiased or consistent), then we are necessarily viewing θ̂ as an estimator. On the other hand, if we write something like θ̂ = 1.73, then we are clearly denoting a point estimate from a given sample of data. The confusion that can arise by using θ̂ to denote both should be minimal once you have a good understanding of probability and statistics.

Summary

We have discussed topics from mathematical statistics that are heavily relied upon in econometric analysis. The notion of an estimator, which is simply a rule for combining data to estimate a population parameter, is fundamental. We have covered various properties of estimators. The most important small sample properties are unbiasedness and efficiency, the latter of which depends on comparing variances when estimators are unbiased. Large sample properties concern the sequence of estimators obtained as the sample size grows, and they are also depended upon in econometrics. Any useful estimator is consistent. The central limit theorem implies that, in large samples, the sampling distribution of most estimators is approximately normal.

The sampling distribution of an estimator can be used to construct confidence intervals. We saw this for estimating the mean from a normal distribution and for computing approximate confidence intervals in nonnormal cases. Classical hypothesis testing, which requires specifying a null hypothesis, an alternative hypothesis, and a significance level, is carried out by comparing a test statistic to a critical value. Alternatively, a p-value can be computed that allows us to carry out a test at any significance level.

Key Terms

Alternative Hypothesis; Asymptotic Normality; Bias; Biased Estimator; Central Limit Theorem (CLT); Confidence Interval; Consistent Estimator; Consistent Test; Critical Value; Estimate; Estimator; Hypothesis Test; Inconsistent; Interval Estimator; Law of Large Numbers (LLN); Least Squares Estimator; Maximum Likelihood Estimator; Mean Squared Error (MSE); Method of Moments; Minimum Variance Unbiased Estimator; Null Hypothesis; One-Sided Alternative; One-Tailed Test; Population; Power of a Test; Practical Significance; Probability Limit; p-Value; Random Sample; Rejection Region; Sample Average; Sample Correlation Coefficient; Sample Covariance; Sample Standard Deviation; Sample Variance; Sampling Distribution; Sampling Standard Deviation; Sampling Variance; Significance Level; Standard Error; Statistical Significance; t Statistic; Test Statistic; Two-Sided Alternative; Two-Tailed Test; Type I Error; Type II Error; Unbiased Estimator

Problems

1. Let Y1, Y2, Y3, and Y4 be independent, identically distributed random variables from a population with mean μ and variance σ². Let Ȳ = (1/4)(Y1 + Y2 + Y3 + Y4) denote the average of these four random variables.
(i) What are the expected value and variance of Ȳ in terms of μ and σ²?
(ii) Now, consider a different estimator of μ: W = (1/8)Y1 + (1/8)Y2 + (1/4)Y3 + (1/2)Y4. This is an example of a weighted average of the Yi. Show that W is also an unbiased estimator of μ. Find the variance of W.
(iii) Based on your answers to parts (i) and (ii), which estimator of μ do you prefer, Ȳ or W?
2. This is a more general version of Problem C.1. Let Y1, Y2, …, Yn be n pairwise uncorrelated random variables with common mean μ and common variance σ². Let Ȳ denote the sample average.
(i) Define the class of linear estimators of μ by Wa = a1Y1 + a2Y2 + … + anYn, where the ai are constants. What restriction on the ai is needed for Wa to be an unbiased estimator of μ?
(ii) Find Var(Wa).
(iii) For any numbers a1, a2, …, an, the following inequality holds: (a1 + a2 + … + an)²/n ≤ a1² + a2² + … + an². Use this, along with parts (i) and (ii), to show that Var(Wa) ≥ Var(Ȳ) whenever Wa is unbiased, so that Ȳ is the best linear unbiased estimator. [Hint: What does the inequality become when the ai satisfy the restriction from part (i)?]

3. Let Ȳ denote the sample average from a random sample with mean μ and variance σ². Consider two alternative estimators of μ: W1 = [(n − 1)/n]Ȳ and W2 = Ȳ/2.
(i) Show that W1 and W2 are both biased estimators of μ and find the biases. What happens to the biases as n → ∞? Comment on any important differences in bias for the two estimators as the sample size gets large.
(ii) Find the probability limits of W1 and W2. {Hint: Use Properties PLIM.1 and PLIM.2; for W1, note that plim[(n − 1)/n] = 1.} Which estimator is consistent?
(iii) Find Var(W1) and Var(W2).
(iv) Argue that W1 is a better estimator than Ȳ if μ is "close" to zero. (Consider both bias and variance.)

4. For positive random variables X and Y, suppose the expected value of Y given X is E(Y|X) = θX. The unknown parameter θ shows how the expected value of Y changes with X.
(i) Define the random variable Z = Y/X. Show that E(Z) = θ. [Hint: Use Property CE.2 along with the law of iterated expectations, Property CE.4. In particular, first show that E(Z|X) = θ and then use CE.4.]
(ii) Use part (i) to prove that the estimator W1 = n⁻¹ Σᵢ (Yi/Xi) is unbiased for θ, where {(Xi, Yi): i = 1, 2, …, n} is a random sample.
(iii) Explain why the estimator W2 = Ȳ/X̄, where the overbars denote sample averages, is not the same as W1. Nevertheless, show that W2 is also unbiased for θ.
(iv) The following table contains data on corn yields for several counties in Iowa. The USDA predicts the number of hectares of corn in each county based on satellite photos. Researchers count the number of "pixels" of corn in the satellite picture (as opposed to, for example, the number of pixels of soybeans or of uncultivated land) and use these to predict the actual number of hectares. To develop a prediction equation to be used for counties in general, the USDA surveyed farmers in selected counties to obtain corn yields in hectares. Let Yi = corn yield in county i and let Xi = number of corn pixels in the satellite picture for county i. There are n = 17 observations for eight counties. Use this sample to compute the estimates of θ devised in parts (ii) and (iii). Are the estimates similar?

Plot | Corn Yield | Corn Pixels
1  | 165.76 | 374
2  |  96.32 | 209
3  |  76.08 | 253
4  | 185.35 | 432
5  | 116.43 | 367
6  | 162.08 | 361
7  | 152.04 | 288
8  | 161.75 | 369
9  |  92.88 | 206
10 | 149.94 | 316
11 |  64.75 | 145
12 | 127.07 | 355
13 | 133.55 | 295
14 |  77.70 | 223
15 | 206.39 | 459
16 | 108.33 | 290
17 | 118.17 | 307
5. Let Y denote a Bernoulli(θ) random variable with 0 < θ < 1. Suppose we are interested in estimating the odds ratio γ = θ/(1 − θ), which is the probability of success over the probability of failure. Given a random sample {Y1, …, Yn}, we know that an unbiased and consistent estimator of θ is Ȳ, the proportion of successes in n trials. A natural estimator of γ is G = Ȳ/(1 − Ȳ), the proportion of successes over the proportion of failures in the sample.
(i) Why is G not an unbiased estimator of γ?
(ii) Use PLIM.2(iii) to show that G is a consistent estimator of γ.

6. You are hired by the governor to study whether a tax on liquor has decreased average liquor consumption in your state. You are able to obtain, for a sample of individuals selected at random, the difference in liquor consumption (in ounces) for the years before and after the tax. For person i, who is sampled randomly from the population, Yi denotes the change in liquor consumption. Treat these as a random sample from a Normal(μ, σ²) distribution.
(i) The null hypothesis is that there was no change in average liquor consumption. State this formally in terms of μ.
(ii) The alternative is that there was a decline in liquor consumption; state the alternative in terms of μ.
(iii) Now, suppose your sample size is n = 900 and you obtain the estimates ȳ = −32.8 and s = 466.4. Calculate the t statistic for testing H0 against H1; obtain the p-value for the test. (Because of the large sample size, just use the standard normal distribution tabulated in Table G.1.) Do you reject H0 at the 5% level? At the 1% level?
(iv) Would you say that the estimated fall in consumption is large in magnitude? Comment on the practical versus statistical significance of this estimate.
(v) What has been implicitly assumed in your analysis about other determinants of liquor consumption over the two-year period in order to infer causality from the tax change to liquor consumption?

7. The new management at a bakery claims that workers are now more productive than they were under old management, which is why wages have "generally increased." Let Wbᵢ be Worker i's wage under the old management and let Waᵢ be Worker i's wage after the change. The difference is Dᵢ = Waᵢ − Wbᵢ. Assume that the Dᵢ are a random sample from a Normal(μ, σ²) distribution.
(i) Using the following data on 15 workers, construct an exact 95% confidence interval for μ.
(ii) Formally state the null hypothesis that there has been no change in average wages. In particular, what is E(Dᵢ) under H0? If you are hired to examine the validity of the new management's claim, what is the relevant alternative hypothesis in terms of μ = E(Dᵢ)?
(iii) Test the null hypothesis from part (ii) against the stated alternative at the 5% and 1% levels.
(iv) Obtain the p-value for the test in part (iii).

Worker | Wage Before | Wage After
1  |  8.30 |  9.25
2  |  9.40 |  9.00
3  |  9.00 |  9.25
4  | 10.50 | 10.00
5  | 11.40 | 12.00
6  |  8.75 |  9.50
7  | 10.00 | 10.25
8  |  9.50 |  9.50
9  | 10.80 | 11.50
10 | 12.55 | 13.10
11 | 12.00 | 11.50
12 |  8.65 |  9.00
13 |  7.75 |  7.75
14 | 11.25 | 11.50
15 | 12.65 | 13.00
8. The New York Times (2/5/90) reported three-point shooting performance for the top 10 three-point shooters in the NBA. The following table summarizes these data:

Player | FGA-FGM
Mark Price | 429-188
Trent Tucker | 833-345
Dale Ellis | 1149-472
Craig Hodges | 1016-396
Danny Ainge | 1051-406
Byron Scott | 676-260
Reggie Miller | 416-159
Larry Bird | 1206-455
Jon Sundvold | 440-166
Brian Taylor | 417-157

Note: FGA = field goals attempted and FGM = field goals made.

For a given player, the outcome of a particular shot can be modeled as a Bernoulli (zero-one) variable: if Yi is the outcome of shot i, then Yi = 1 if the shot is made and Yi = 0 if the shot is missed. Let θ denote the probability of making any particular three-point shot attempt. The natural estimator of θ is Ȳ = FGM/FGA.
(i) Estimate θ for Mark Price.
(ii) Find the standard deviation of the estimator Ȳ in terms of θ and the number of shot attempts, n.
(iii) The asymptotic distribution of (Ȳ − θ)/se(Ȳ) is standard normal, where se(Ȳ) = √(Ȳ(1 − Ȳ)/n). Use this fact to test H0: θ = .5 against H1: θ < .5 for Mark Price. Use a 1% significance level.

9. Suppose that a military dictator in an unnamed country holds a plebiscite (a yes/no vote of confidence) and claims that he was supported by 65% of the voters. A human rights group suspects foul play and hires you to test the validity of the dictator's claim. You have a budget that allows you to randomly sample 200 voters from the country.
(i) Let X be the number of yes votes obtained from a random sample of 200 out of the entire voting population. What is the expected value of X if, in fact, 65% of all voters supported the dictator?
(ii) What is the standard deviation of X, again assuming that the true fraction voting yes in the plebiscite is .65?
(iii) Now, you collect your sample of 200, and you find that 115 people actually voted yes. Use the CLT to approximate the probability that you would find 115 or fewer yes votes from a random sample of 200 if, in fact, 65% of the entire population voted yes.
(iv) How would you explain the relevance of the number in part (iii) to someone who does not have training in statistics?

10. Before a strike prematurely ended the 1994 major league baseball season, Tony Gwynn of the San Diego Padres had 165 hits in 419 at bats, for a .394 batting average. There was discussion about whether Gwynn was a potential .400 hitter that year. This issue can be couched in terms of Gwynn's probability of getting a hit on a particular at bat, call it θ. Let Yi be the Bernoulli(θ) indicator equal to unity if Gwynn gets a hit during his ith at bat, and zero otherwise. Then, Y1, Y2, …, Yn is a random sample from a Bernoulli(θ) distribution, where θ is the probability of success, and n = 419.
Our best point estimate of θ is Gwynn's batting average, which is just the proportion of successes: ȳ = .394. Using the fact that se(ȳ) = √(ȳ(1 − ȳ)/n), construct an approximate 95% confidence interval for θ, using the standard normal distribution. Would you say there is strong evidence against Gwynn's being a potential .400 hitter? Explain.

11. Suppose that, between their first and second years in college, 400 students are randomly selected and given a university grant to purchase a new computer. For student i, yi denotes the change in GPA from the first year to the second year. If the average change is ȳ = .132 with standard deviation s = 1.27, is the average change in GPAs statistically greater than zero?

Appendix D Summary of Matrix Algebra

This appendix summarizes the matrix algebra concepts, including the algebra of probability, needed for the study of multiple linear regression models using matrices in Appendix E. None of this material is used in the main text.

D.1 Basic Definitions

Definition D.1 (Matrix). A matrix is a rectangular array of numbers. More precisely, an m × n matrix has m rows and n columns. The positive integer m is called the row dimension, and n is called the column dimension.

We use uppercase boldface letters to denote matrices. We can write an m × n matrix generically as

A = [aij] =
[ a11 a12 a13 … a1n ]
[ a21 a22 a23 … a2n ]
[ ⋮                  ]
[ am1 am2 am3 … amn ]

where aij represents the element in the ith row and the jth column. For example, a25 stands for the number in the second row and the fifth column of A. A specific example of a 2 × 3 matrix is

A = [  2 −1 7 ]
    [ −4  5 0 ]   (D.1)

where a13 = 7. The shorthand A = [aij] is often used to define matrix operations.

Definition D.2 (Square Matrix). A square matrix has the same number of rows and columns. The dimension of a square matrix is its number of rows and columns.

Definition D.3 (Vectors).
(i) A 1 × m matrix is called a row vector (of dimension m) and can be written as x ≡ (x1, x2, …, xm).
(ii) An n × 1 matrix is called a column vector and can be written as

y ≡
[ y1 ]
[ y2 ]
[ ⋮  ]
[ yn ]

Definition D.4 (Diagonal Matrix). A square matrix A is a diagonal matrix when all of its off-diagonal elements are zero, that is, aij = 0 for all i ≠ j. We can always write a diagonal matrix as

A =
[ a11  0   0  …  0  ]
[  0  a22  0  …  0  ]
[ ⋮                  ]
[  0   0   0  … ann ]

Definition D.5 (Identity and Zero Matrices).
(i) The n × n identity matrix, denoted I, or sometimes In to emphasize its dimension, is the diagonal matrix with unity (one) in each diagonal position, and zero elsewhere:

I ≡ In ≡
[ 1 0 0 … 0 ]
[ 0 1 0 … 0 ]
[ ⋮           ]
[ 0 0 0 … 1 ]

(ii) The m × n zero matrix, denoted 0, is the m × n matrix with zero for all entries. This need not be a square matrix.

D.2 Matrix Operations

D.2a Matrix Addition

Two matrices A and B, each having dimension m × n, can be added element by element: A + B = [aij + bij]. More precisely,

A + B =
[ a11 + b11  a12 + b12  …  a1n + b1n ]
[ a21 + b21  a22 + b22  …  a2n + b2n ]
[ ⋮                                   ]
[ am1 + bm1  am2 + bm2  …  amn + bmn ]

For example,
[  2 −1 7 ]   [ 1 0 −4 ]   [ 3 −1 3 ]
[ −4  5 0 ] + [ 4 2  3 ] = [ 0  7 3 ]

Matrices of different dimensions cannot be added.

D.2b Scalar Multiplication

Given any real number γ (often called a scalar), scalar multiplication is defined as γA ≡ [γaij], or

γA =
[ γa11 γa12 … γa1n ]
[ γa21 γa22 … γa2n ]
[ ⋮                 ]
[ γam1 γam2 … γamn ]

For example, if γ = 2 and A is the matrix in equation (D.1), then

γA = [  4 −2 14 ]
     [ −8 10  0 ]

D.2c Matrix Multiplication

To multiply matrix A by matrix B to form the product AB, the column dimension of A must equal the row dimension of B. Therefore, let A be an m × n matrix and let B be an n × p matrix. Then matrix multiplication is defined as

AB = [ Σ_{k=1}^{n} aik bkj ].

In other words, the (i, j)th element of the new matrix AB is obtained by multiplying each element in the ith row of A by the corresponding element in the jth column of B and adding these n products together. A schematic may help make this process more transparent: the (i, j)th element of AB is the product of the ith row of A and the jth column of B, where, by the definition of the summation operator in Appendix A,

Σ_{k=1}^{n} aik bkj = ai1 b1j + ai2 b2j + … + ain bnj.

For example,

[  2 −1 0 ]   [  0 1 6 0 ]   [  1  0  12 −1 ]
[ −4  1 0 ] × [ −1 2 0 1 ] = [ −1 −2 −24  1 ]
              [  3 0 0 0 ]

We can also multiply a matrix and a vector. If A is an n × m matrix and y is an m × 1 vector, then Ay is an n × 1 vector. If x is a 1 × n vector, then xA is a 1 × m vector.

Matrix addition, scalar multiplication, and matrix multiplication can be combined in various ways, and these operations satisfy several rules that are familiar from basic operations on numbers. In the following list of properties, A, B, and C are matrices with appropriate dimensions for applying each operation, and α and β are real numbers. Most of these properties are easy to illustrate from the definitions.

Properties of Matrix Operations. (1) (α + β)A = αA + βA; (2) α(A + B) = αA + αB; (3) (αβ)A = α(βA); (4) α(AB) = (αA)B; (5) A + B = B + A; (6) (A + B) + C = A + (B + C); (7) (AB)C = A(BC); (8) A(B + C) = AB + AC; (9) (A + B)C = AC + BC; (10) IA = AI = A; (11) A + 0 = 0 + A = A; (12) A − A = 0; (13) A0 = 0A = 0; and (14) AB ≠ BA, even when both products are defined.

The last property deserves further comment. If A is n × m and B is m × p, then AB is defined, but BA is defined only if n = p (the row dimension of A equals the column dimension of B). If A is m × n and B is n × m, then AB and BA are both defined, but they are not usually the same; in fact, they have different dimensions, unless A and B are both square matrices. Even when A and B are both square, AB ≠ BA, except under special circumstances.
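All of these operations are available in numpy, which makes the dimension rules easy to experiment with. A minimal sketch (our code, not the text's), using the matrices from the examples above:

    import numpy as np

    A = np.array([[2, -1, 7], [-4, 5, 0]])
    B = np.array([[1, 0, -4], [4, 2, 3]])
    print(A + B)          # [[3 -1 3], [0 7 3]]: element-by-element addition
    print(2 * A)          # scalar multiplication, as in the gamma = 2 example

    C = np.array([[2, -1, 0], [-4, 1, 0]])
    D = np.array([[0, 1, 6, 0], [-1, 2, 0, 1], [3, 0, 0, 0]])
    print(C @ D)          # the (2 x 3)(3 x 4) product from the text, a 2 x 4 matrix

    # Property (14): even square matrices generally do not commute.
    S = np.array([[1, 2], [0, 1]])
    T = np.array([[1, 0], [3, 1]])
    print(S @ T)          # [[7 2], [3 1]]
    print(T @ S)          # [[1 2], [3 7]]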
D.2d Transpose

Definition D.6 (Transpose). Let A = [aij] be an m × n matrix. The transpose of A, denoted A′ (called A prime), is the n × m matrix obtained by interchanging the rows and columns of A. We can write this as A′ ≡ [aji].

For example,

A = [  2 −1 7 ]        A′ = [  2 −4 ]
    [ −4  5 0 ]             [ −1  5 ]
                            [  7  0 ]

Properties of Transpose. (1) (A′)′ = A; (2) (αA)′ = αA′ for any scalar α; (3) (A + B)′ = A′ + B′; (4) (AB)′ = B′A′, where A is m × n and B is n × k; (5) x′x = Σᵢ xᵢ², where x is an n × 1 vector; and (6) if A is an n × k matrix with rows given by the 1 × k vectors a1, a2, …, an, so that we can write A as those rows stacked on top of one another, then A′ = (a1′, a2′, …, an′).

Definition D.7 (Symmetric Matrix). A square matrix A is a symmetric matrix if, and only if, A′ = A.

If X is any n × k matrix, then X′X is always defined and is a symmetric matrix, as can be seen by applying the first and fourth transpose properties (see Problem 3).

D.2e Partitioned Matrix Multiplication

Let A be an n × k matrix with rows given by the 1 × k vectors a1, a2, …, an, and let B be an n × m matrix with rows given by the 1 × m vectors b1, b2, …, bn. Then

A′B = Σ_{i=1}^{n} aᵢ′bᵢ,

where, for each i, aᵢ′bᵢ is a k × m matrix. Therefore, A′B can be written as the sum of n matrices, each of which is k × m. As a special case, we have

A′A = Σ_{i=1}^{n} aᵢ′aᵢ,

where aᵢ′aᵢ is a k × k matrix for all i.

A more general form of partitioned matrix multiplication holds when we have matrices A (m × n) and B (n × p) written as

A = [ A11 A12 ]      B = [ B11 B12 ]
    [ A21 A22 ]          [ B21 B22 ]

where A11 is m1 × n1, A12 is m1 × n2, A21 is m2 × n1, A22 is m2 × n2, B11 is n1 × p1, B12 is n1 × p2, B21 is n2 × p1, and B22 is n2 × p2. Naturally, m1 + m2 = m, n1 + n2 = n, and p1 + p2 = p. When we form the product AB, the expression looks just as it would if the entries were scalars:

AB = [ A11B11 + A12B21   A11B12 + A12B22 ]
     [ A21B11 + A22B21   A21B12 + A22B22 ]

Note that each of the matrix multiplications that form the partition on the right is well defined because the column and row dimensions are compatible for multiplication.

D.2f Trace

The trace of a matrix is a very simple operation defined only for square matrices.

Definition D.8 (Trace). For any n × n matrix A, the trace of a matrix A, denoted tr(A), is the sum of its diagonal elements. Mathematically,

tr(A) = Σ_{i=1}^{n} aii.

Properties of Trace. (1) tr(In) = n; (2) tr(A′) = tr(A); (3) tr(A + B) = tr(A) + tr(B); (4) tr(αA) = α·tr(A), for any scalar α; and (5) tr(AB) = tr(BA), where A is m × n and B is n × m.

D.2g Inverse

The notion of a matrix inverse is very important for square matrices.

Definition D.9 (Inverse). An n × n matrix A has an inverse, denoted A⁻¹, provided that A⁻¹A = In and AA⁻¹ = In. In this case, A is said to be invertible or nonsingular. Otherwise, it is said to be noninvertible or singular.

Properties of Inverse. (1) If an inverse exists, it is unique; (2) (αA)⁻¹ = (1/α)A⁻¹, if α ≠ 0 and A is invertible; (3) (AB)⁻¹ = B⁻¹A⁻¹, if A and B are both n × n and invertible; and (4) (A′)⁻¹ = (A⁻¹)′.

We will not be concerned with the mechanics of calculating the inverse of a matrix. Any matrix algebra text contains detailed examples of such calculations.
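The transpose, trace, and inverse properties can all be checked numerically. A sketch in Python with numpy (our code; the random square matrices below are invertible with probability one):

    import numpy as np

    rng = np.random.default_rng(1)
    A = rng.normal(size=(3, 4))
    B = rng.normal(size=(4, 3))
    print(np.allclose((A @ B).T, B.T @ A.T))              # (AB)' = B'A'
    print(np.isclose(np.trace(A @ B), np.trace(B @ A)))   # tr(AB) = tr(BA)

    C, D = rng.normal(size=(3, 3)), rng.normal(size=(3, 3))
    print(np.allclose(np.linalg.inv(C @ D),
                      np.linalg.inv(D) @ np.linalg.inv(C)))  # (CD)^(-1) = D^(-1)C^(-1)

    X = rng.normal(size=(5, 3))
    print(np.allclose(X.T @ X, (X.T @ X).T))              # X'X is symmetric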
D.3 Linear Independence and Rank of a Matrix

For a set of vectors having the same dimension, it is important to know whether one vector can be expressed as a linear combination of the remaining vectors.

Definition D.10 (Linear Independence). Let {x1, x2, …, xr} be a set of n × 1 vectors. These are linearly independent vectors if, and only if,

α1x1 + α2x2 + … + αrxr = 0  (D.2)

implies that α1 = α2 = … = αr = 0. If (D.2) holds for a set of scalars that are not all zero, then {x1, x2, …, xr} is linearly dependent.

The statement that {x1, x2, …, xr} is linearly dependent is equivalent to saying that at least one vector in this set can be written as a linear combination of the others.

Definition D.11 (Rank).
(i) Let A be an n × m matrix. The rank of a matrix A, denoted rank(A), is the maximum number of linearly independent columns of A.
(ii) If A is n × m and rank(A) = m, then A has full column rank.

If A is n × m, its rank can be at most m. A matrix has full column rank if its columns form a linearly independent set. For example, the 3 × 2 matrix

[ 1 3 ]
[ 2 6 ]
[ 0 0 ]

can have at most rank two. In fact, its rank is only one because the second column is three times the first column.

Properties of Rank. (1) rank(A′) = rank(A); (2) if A is n × k, then rank(A) ≤ min(n, k); and (3) if A is k × k and rank(A) = k, then A is invertible.

D.4 Quadratic Forms and Positive Definite Matrices

Definition D.12 (Quadratic Form). Let A be an n × n symmetric matrix. The quadratic form associated with the matrix A is the real-valued function defined for all n × 1 vectors x:

f(x) = x′Ax = Σ_{i=1}^{n} aii xi² + 2 Σ_{i=1}^{n} Σ_{j>i} aij xi xj.

Definition D.13 (Positive Definite and Positive Semi-Definite).
(i) A symmetric matrix A is said to be positive definite (p.d.) if x′Ax > 0 for all n × 1 vectors x except x = 0.
(ii) A symmetric matrix A is positive semi-definite (p.s.d.) if x′Ax ≥ 0 for all n × 1 vectors.

If a matrix is positive definite or positive semi-definite, it is automatically assumed to be symmetric.

Properties of Positive Definite and Positive Semi-Definite Matrices. (1) A p.d. matrix has diagonal elements that are strictly positive, while a p.s.d. matrix has nonnegative diagonal elements; (2) if A is p.d., then A⁻¹ exists and is p.d.; (3) if X is n × k, then X′X and XX′ are p.s.d.; and (4) if X is n × k and rank(X) = k, then X′X is p.d. (and therefore nonsingular).
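Rank and positive definiteness are also easy to check numerically. A sketch with numpy (our code), using the rank-one matrix from the text:

    import numpy as np

    A = np.array([[1, 3], [2, 6], [0, 0]])
    print(np.linalg.matrix_rank(A))                 # 1: column 2 is 3 times column 1

    rng = np.random.default_rng(2)
    X = rng.normal(size=(10, 3))                    # full column rank w.p. 1
    print(np.linalg.matrix_rank(X))                 # 3
    print(np.all(np.linalg.eigvalsh(X.T @ X) > 0))  # X'X is positive definite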
D.5 Idempotent Matrices

Definition D.14 (Idempotent Matrix). Let A be an n × n symmetric matrix. Then A is said to be an idempotent matrix if, and only if, AA = A.

For example,

[ 1 0 0 ]
[ 0 0 0 ]
[ 0 0 1 ]

is an idempotent matrix, as direct multiplication verifies.

Properties of Idempotent Matrices. Let A be an n × n idempotent matrix. (1) rank(A) = tr(A), and (2) A is positive semi-definite.

We can construct idempotent matrices very generally. Let X be an n × k matrix with rank(X) = k. Define

P ≡ X(X′X)⁻¹X′
M ≡ In − X(X′X)⁻¹X′ = In − P.

Then P and M are symmetric, idempotent matrices with rank(P) = k and rank(M) = n − k. The ranks are most easily obtained by using Property 1: tr(P) = tr[(X′X)⁻¹X′X] (from Property 5 for trace) = tr(Ik) = k (by Property 1 for trace). It easily follows that tr(M) = tr(In) − tr(P) = n − k.
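These two matrices reappear throughout Appendix E, so it is worth verifying their properties numerically. A sketch with numpy (our code), under the assumption that the simulated X has full column rank, which holds with probability one here:

    import numpy as np

    rng = np.random.default_rng(3)
    n, k = 20, 4
    X = rng.normal(size=(n, k))

    P = X @ np.linalg.inv(X.T @ X) @ X.T
    M = np.eye(n) - P
    print(np.allclose(P @ P, P), np.allclose(M @ M, M))   # both idempotent
    print(round(np.trace(P)), round(np.trace(M)))         # k = 4 and n - k = 16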
D.6 Differentiation of Linear and Quadratic Forms

For a given n × 1 vector a, consider the linear function defined by f(x) = a′x, for all n × 1 vectors x. The derivative of f with respect to x is the 1 × n vector of partial derivatives, which is simply ∂f(x)/∂x = a′.

For an n × n symmetric matrix A, define the quadratic form g(x) = x′Ax. Then ∂g(x)/∂x = 2x′A, which is a 1 × n vector.

D.7 Moments and Distributions of Random Vectors

In order to derive the expected value and variance of the OLS estimators using matrices, we need to define the expected value and variance of a random vector. As its name suggests, a random vector is simply a vector of random variables. We also need to define the multivariate normal distribution. These concepts are simply extensions of those covered in Appendix B.

D.7a Expected Value

Definition D.15 (Expected Value).
(i) If y is an n × 1 random vector, the expected value of y, denoted E(y), is the vector of expected values: E(y) = [E(y1), E(y2), …, E(yn)]′.
(ii) If Z is an n × m random matrix, E(Z) is the n × m matrix of expected values: E(Z) = [E(zij)].

Properties of Expected Value. (1) If A is an m × n matrix and b is an n × 1 vector, where both are nonrandom, then E(Ay + b) = AE(y) + b; and (2) if A is p × n and B is m × k, where both are nonrandom, then E(AZB) = AE(Z)B.

D.7b Variance-Covariance Matrix

Definition D.16 (Variance-Covariance Matrix). If y is an n × 1 random vector, its variance-covariance matrix, denoted Var(y), is defined as

Var(y) =
[ σ1²  σ12  …  σ1n ]
[ σ21  σ2²  …  σ2n ]
[ ⋮                 ]
[ σn1  σn2  …  σn² ]

where σj² = Var(yj) and σij = Cov(yi, yj). In other words, the variance-covariance matrix has the variances of each element of y down its diagonal, with covariance terms in the off diagonals. Because Cov(yi, yj) = Cov(yj, yi), it immediately follows that a variance-covariance matrix is symmetric.

Properties of Variance. (1) If a is an n × 1 nonrandom vector, then Var(a′y) = a′[Var(y)]a ≥ 0; (2) if Var(a′y) > 0 for all a ≠ 0, Var(y) is positive definite; (3) Var(y) = E[(y − μ)(y − μ)′], where μ = E(y); (4) if the elements of y are uncorrelated, Var(y) is a diagonal matrix, and if, in addition, Var(yj) = σ² for j = 1, 2, …, n, then Var(y) = σ²In; and (5) if A is an m × n nonrandom matrix and b is an n × 1 nonrandom vector, then Var(Ay + b) = A[Var(y)]A′.

D.7c Multivariate Normal Distribution

The normal distribution for a random variable was discussed at some length in Appendix B. We need to extend the normal distribution to random vectors. We will not provide an expression for the probability distribution function, as we do not need it. It is important to know that a multivariate normal random vector is completely characterized by its mean and its variance-covariance matrix. Therefore, if y is an n × 1 multivariate normal random vector with mean μ and variance-covariance matrix Σ, we write y ~ Normal(μ, Σ). We now state several useful properties of the multivariate normal distribution.

Properties of the Multivariate Normal Distribution. (1) If y ~ Normal(μ, Σ), then each element of y is normally distributed; (2) if y ~ Normal(μ, Σ), then yi and yj, any two elements of y, are independent if, and only if, they are uncorrelated, that is, σij = 0; (3) if y ~ Normal(μ, Σ), then Ay + b ~ Normal(Aμ + b, AΣA′), where A and b are nonrandom; (4) if y ~ Normal(0, Σ), then, for nonrandom matrices A and B, Ay and By are independent if, and only if, AΣB′ = 0 (in particular, if Σ = σ²In, then AB′ = 0 is necessary and sufficient for independence of Ay and By); (5) if y ~ Normal(0, σ²In), A is a k × n nonrandom matrix, and B is an n × n symmetric, idempotent matrix, then Ay and y′By are independent if, and only if, AB = 0; and (6) if y ~ Normal(0, σ²In) and A and B are nonrandom symmetric, idempotent matrices, then y′Ay and y′By are independent if, and only if, AB = 0.

D.7d Chi-Square Distribution

In Appendix B, we defined a chi-square random variable as the sum of squared independent standard normal random variables. In vector notation, if u ~ Normal(0, In), then u′u ~ χ²_n.

Properties of the Chi-Square Distribution. (1) If u ~ Normal(0, In) and A is an n × n symmetric, idempotent matrix with rank(A) = q, then u′Au ~ χ²_q; (2) if u ~ Normal(0, In) and A and B are n × n symmetric, idempotent matrices such that AB = 0, then u′Au and u′Bu are independent, chi-square random variables; and (3) if z ~ Normal(0, C), where C is an m × m nonsingular matrix, then z′C⁻¹z ~ χ²_m.

D.7e t Distribution

We also defined the t distribution in Appendix B. Now we add an important property.

Property of the t Distribution. If u ~ Normal(0, In), c is an n × 1 nonrandom vector, A is a nonrandom n × n symmetric, idempotent matrix with rank q, and Ac = 0, then {c′u/(c′c)^(1/2)}/(u′Au/q)^(1/2) ~ t_q.

D.7f F Distribution

Recall that an F random variable is obtained by taking two independent chi-square random variables and finding the ratio of each, standardized by degrees of freedom.

Property of the F Distribution. If u ~ Normal(0, In) and A and B are n × n nonrandom symmetric, idempotent matrices with rank(A) = k1, rank(B) = k2, and AB = 0, then (u′Au/k1)/(u′Bu/k2) ~ F_{k1,k2}.
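Property 1 of the chi-square distribution can be illustrated by simulation, using the idempotent matrix M from Section D.5. The sketch below is entirely ours (sample sizes and seeds are arbitrary); it checks that u′Mu has the mean q and variance 2q of a chi-square random variable with q = n − k degrees of freedom:

    import numpy as np

    rng = np.random.default_rng(4)
    n, k, reps = 15, 5, 100_000
    X = rng.normal(size=(n, k))
    M = np.eye(n) - X @ np.linalg.inv(X.T @ X) @ X.T   # symmetric idempotent, rank 10

    u = rng.normal(size=(reps, n))                     # draws of u ~ Normal(0, I_n)
    q_form = np.einsum('ri,ij,rj->r', u, M, u)         # u'Mu, one value per draw
    print(q_form.mean(), q_form.var())                 # close to 10 and 2*10 = 20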
Summary

This appendix contains a condensed form of the background information needed to study the classical linear model using matrices. Although the material here is self-contained, it is primarily intended as a review for readers who are familiar with matrix algebra and multivariate statistics, and it will be used extensively in Appendix E.

Key Terms

Chi-Square Random Variable; Column Vector; Diagonal Matrix; Expected Value; F Random Variable; Idempotent Matrix; Identity Matrix; Inverse; Linearly Independent Vectors; Matrix; Matrix Multiplication; Multivariate Normal Distribution; Positive Definite (p.d.); Positive Semi-Definite (p.s.d.); Quadratic Form; Random Vector; Rank of a Matrix; Row Vector; Scalar Multiplication; Square Matrix; Symmetric Matrix; t Distribution; Trace of a Matrix; Transpose; Variance-Covariance Matrix; Zero Matrix

Problems

1. (i) Find the product AB using

A = [  2 −1 7 ]      B = [ 0 1 6 ]
    [ −4  5 0 ]          [ 1 8 0 ]
                         [ 3 0 0 ]

(ii) Does BA exist?

2. If A and B are n × n diagonal matrices, show that AB = BA.

3. Let X be any n × k matrix. Show that X′X is a symmetric matrix.

4. (i) Use the properties of trace to argue that tr(A′A) = tr(AA′) for any n × m matrix A.
(ii) For A = [ 2 0 −1; 0 3 0 ], verify that tr(A′A) = tr(AA′).

5. (i) Use the definition of inverse to prove the following: if A and B are n × n nonsingular matrices, then (AB)⁻¹ = B⁻¹A⁻¹.
(ii) If A, B, and C are all n × n nonsingular matrices, find (ABC)⁻¹ in terms of A⁻¹, B⁻¹, and C⁻¹.

6. (i) Show that if A is an n × n symmetric, positive definite matrix, then A must have strictly positive diagonal elements.
(ii) Write down a 2 × 2 symmetric matrix with strictly positive diagonal elements that is not positive definite.

7. Let A be an n × n symmetric, positive definite matrix. Show that if P is any n × n nonsingular matrix, then P′AP is positive definite.

8. Prove Property 5 of variances for vectors, using Property 3.

9. Let a be an n × 1 nonrandom vector and let u be an n × 1 random vector with E(uu′) = In. Show that E[tr(auu′a′)] = Σ_{i=1}^{n} ai².

10. Take as given the properties of the chi-square distribution listed in the text. Show how those properties, along with the definition of an F random variable, imply the stated property of the F distribution (concerning ratios of quadratic forms).

11. Let X be an n × k matrix partitioned as X = (X1 | X2), where X1 is n × k1 and X2 is n × k2.
(i) Show that

X′X = [ X1′X1  X1′X2 ]
      [ X2′X1  X2′X2 ]

What are the dimensions of each of the matrices?
(ii) Let b be a k × 1 vector, partitioned as b = (b1; b2), where b1 is k1 × 1 and b2 is k2 × 1. Show that

(X′X)b = [ (X1′X1)b1 + (X1′X2)b2 ]
         [ (X2′X1)b1 + (X2′X2)b2 ]

Appendix E The Linear Regression Model in Matrix Form

This appendix derives various results for ordinary least squares estimation of the multiple linear regression model using matrix notation and matrix algebra (see Appendix D for a summary). The material presented here is much more advanced than that in the text.

E.1 The Model and Ordinary Least Squares Estimation

Throughout this appendix, we use the t subscript to index observations and an n to denote the sample size. It is useful to write the multiple linear regression model with k parameters as follows:

yt = β0 + β1xt1 + β2xt2 + … + βkxtk + ut,  t = 1, 2, …, n,  (E.1)
where yt is the dependent variable for observation t, and xtj, j = 1, 2, …, k, are the independent variables. As usual, β0 is the intercept and β1, …, βk denote the slope parameters.

For each t, define a 1 × (k + 1) vector xt = (1, xt1, …, xtk), and let β = (β0, β1, …, βk)′ be the (k + 1) × 1 vector of all parameters. Then, we can write (E.1) as

yt = xtβ + ut,  t = 1, 2, …, n.  (E.2)

[Some authors prefer to define xt as a column vector, in which case xt is replaced with xt′ in (E.2). Mathematically, it makes more sense to define it as a row vector.] We can write (E.2) in full matrix notation by appropriately defining data vectors and matrices. Let y denote the n × 1 vector of observations on y: the tth element of y is yt. Let X be the n × (k + 1) matrix of observations on the explanatory variables. In other words, the tth row of X consists of the vector xt. Written out in detail,

X [n × (k + 1)] ≡
[ x1 ]   [ 1 x11 x12 … x1k ]
[ x2 ] = [ 1 x21 x22 … x2k ]
[ ⋮  ]   [ ⋮                ]
[ xn ]   [ 1 xn1 xn2 … xnk ]

Finally, let u be the n × 1 vector of unobservable errors, or disturbances. Then, we can write (E.2) for all n observations in matrix notation:

y = Xβ + u.  (E.3)

Remember, because X is n × (k + 1) and β is (k + 1) × 1, Xβ is n × 1.

Estimation of β proceeds by minimizing the sum of squared residuals, as in Section 3.2. Define the sum of squared residuals function for any possible (k + 1) × 1 parameter vector b as

SSR(b) ≡ Σ_{t=1}^{n} (yt − xtb)².

The (k + 1) × 1 vector of ordinary least squares estimates, β̂ = (β̂0, β̂1, …, β̂k)′, minimizes SSR(b) over all possible (k + 1) × 1 vectors b. This is a problem in multivariable calculus. For β̂ to minimize the sum of squared residuals, it must solve the first order condition

∂SSR(β̂)/∂b ≡ 0.  (E.4)

Using the fact that the derivative of (yt − xtb)² with respect to b is the 1 × (k + 1) vector −2(yt − xtb)xt, (E.4) is equivalent to

Σ_{t=1}^{n} xt′(yt − xtβ̂) ≡ 0.  (E.5)

(We have divided by 2 and taken the transpose.) We can write this first order condition as

Σ_{t=1}^{n} (yt − β̂0 − β̂1xt1 − … − β̂kxtk) = 0
Σ_{t=1}^{n} xt1(yt − β̂0 − β̂1xt1 − … − β̂kxtk) = 0
⋮
Σ_{t=1}^{n} xtk(yt − β̂0 − β̂1xt1 − … − β̂kxtk) = 0,

which is identical to the first order conditions in equation (3.13). We want to write these in matrix form to make them easier to manipulate. Using the formula for partitioned multiplication in Appendix D, we see that (E.5) is equivalent to

X′(y − Xβ̂) = 0  (E.6)

or

(X′X)β̂ = X′y.  (E.7)

It can be shown that (E.7) always has at least one solution. Multiple solutions do not help us, as we are looking for a unique set of OLS estimates given our data set. Assuming that the (k + 1) × (k + 1) symmetric matrix X′X is nonsingular, we can premultiply both sides of (E.7) by (X′X)⁻¹ to solve for the OLS estimator β̂:

β̂ = (X′X)⁻¹X′y.  (E.8)

This is the critical formula for matrix analysis of the multiple linear regression model. The assumption that X′X is invertible is equivalent to the assumption that rank(X) = k + 1, which means that the columns of X must be linearly independent. This is the matrix version of MLR.3 in Chapter 3.
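Equation (E.8) translates directly into code. The following sketch in Python with numpy (our code, on simulated data with made-up parameter values) computes β̂ by solving the normal equations (E.7), which is numerically preferable to forming (X′X)⁻¹ explicitly:

    import numpy as np

    rng = np.random.default_rng(5)
    n = 500
    x1, x2 = rng.normal(size=n), rng.normal(size=n)
    y = 1.0 + 2.0 * x1 - 3.0 * x2 + rng.normal(size=n)

    X = np.column_stack([np.ones(n), x1, x2])        # n x (k+1), first column all ones
    beta_hat = np.linalg.solve(X.T @ X, X.T @ y)     # solves (X'X) b = X'y
    print(beta_hat)                                  # close to [1, 2, -3]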
Before we continue, (E.8) warrants a word of warning. It is tempting to simplify the formula for $\hat\beta$ as follows:

$$\hat\beta = (X'X)^{-1}X'y = X^{-1}(X')^{-1}X'y = X^{-1}y.$$

The flaw in this reasoning is that $X$ is usually not a square matrix, so it cannot be inverted. In other words, we cannot write $(X'X)^{-1} = X^{-1}(X')^{-1}$ unless $n = k+1$, a case that virtually never arises in practice.

The $n \times 1$ vectors of OLS fitted values and residuals are given by

$$\hat y = X\hat\beta, \quad \hat u = y - \hat y = y - X\hat\beta,$$

respectively. From (E.6) and the definition of $\hat u$, we can see that the first order condition for $\hat\beta$ is the same as

$$X'\hat u = 0. \tag{E.9}$$

Because the first column of $X$ consists entirely of ones, (E.9) implies that the OLS residuals always sum to zero when an intercept is included in the equation and that the sample covariance between each independent variable and the OLS residuals is zero. (We discussed both of these properties in Chapter 3.)

The sum of squared residuals can be written as

$$\text{SSR} = \sum_{t=1}^n \hat u_t^2 = \hat u'\hat u = (y - X\hat\beta)'(y - X\hat\beta). \tag{E.10}$$

All of the algebraic properties from Chapter 3 can be derived using matrix algebra. For example, we can show that the total sum of squares is equal to the explained sum of squares plus the sum of squared residuals [see (3.27)]. The use of matrices does not provide a simpler proof than summation notation, so we do not provide another derivation.

The matrix approach to multiple regression can be used as the basis for a geometrical interpretation of regression. This involves mathematical concepts that are even more advanced than those we covered in Appendix D. [See Goldberger (1991) or Greene (1997).]

E.1a The Frisch–Waugh Theorem

In Section 3.2, we described a "partialling out" interpretation of the ordinary least squares estimates. We can establish the partialling out interpretation very generally using matrix notation. Partition the $n \times (k+1)$ matrix $X$ as $X = (X_1 \mid X_2)$, where $X_1$ is $n \times (k_1+1)$ and includes the intercept (although that is not required for the result to hold) and $X_2$ is $n \times k_2$. We still assume that $X$ has rank $k+1$, which means $X_1$ has rank $k_1+1$ and $X_2$ has rank $k_2$.

Consider the OLS estimates $\hat\beta_1$ and $\hat\beta_2$ from the "long" regression

$$y \text{ on } X_1, X_2.$$

As we know, the multiple regression coefficient vector on $X_2$, $\hat\beta_2$, generally differs from $\tilde\beta_2$ from the "short" regression $y$ on $X_2$. One way to describe the difference is to understand that we can obtain $\hat\beta_2$ from a shorter regression, but first we must "partial out" $X_1$ from $X_2$. Consider the following two-step method:

(i) Regress each column of $X_2$ on $X_1$ and obtain the matrix of residuals, say $\ddot X_2$. We can write $\ddot X_2$ as

$$\ddot X_2 = [I_n - X_1(X_1'X_1)^{-1}X_1']X_2 = (I_n - P_1)X_2 = M_1X_2,$$

where $P_1 \equiv X_1(X_1'X_1)^{-1}X_1'$ and $M_1 \equiv I_n - P_1$ are $n \times n$ symmetric, idempotent matrices.

(ii) Regress $y$ on $\ddot X_2$ and call the $k_2 \times 1$ vector of coefficients $\ddot\beta_2$.

The Frisch–Waugh (FW) theorem states that

$$\ddot\beta_2 = \hat\beta_2.$$
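The theorem is easy to confirm numerically. A minimal sketch (Python with NumPy; the simulated design and seeds are ours) runs the long regression and the two-step partialling-out method and checks that the coefficients on $X_2$ coincide:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 300
X1 = np.column_stack([np.ones(n), rng.standard_normal(n)])   # intercept + one regressor
X2 = rng.standard_normal((n, 2)) + X1[:, [1]]                # deliberately correlated with X1
y = X1 @ np.array([1.0, 2.0]) + X2 @ np.array([0.5, -1.0]) + rng.standard_normal(n)

# Long regression: y on (X1, X2); keep the coefficients on X2.
X = np.column_stack([X1, X2])
b_long = np.linalg.lstsq(X, y, rcond=None)[0][X1.shape[1]:]

# Step (i): partial X1 out of X2 using the annihilator matrix M1.
M1 = np.eye(n) - X1 @ np.linalg.solve(X1.T @ X1, X1.T)
X2dd = M1 @ X2
# Step (ii): regress y on the residuals.
b_fw = np.linalg.lstsq(X2dd, y, rcond=None)[0]

print(np.allclose(b_long, b_fw))   # True: the FW theorem in action
```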
Importantly, the FW theorem generally says nothing about equality of the estimates from the long regression, $\hat\beta_2$, and those from the short regression, $\tilde\beta_2$. Usually, $\tilde\beta_2 \ne \hat\beta_2$. However, if $X_1'X_2 = 0$, then $\ddot X_2 = M_1X_2 = X_2$, in which case $\tilde\beta_2 = \ddot\beta_2$; then $\tilde\beta_2 = \hat\beta_2$ follows from FW. It is also worth noting that we obtain $\hat\beta_2$ if we also partial $X_1$ out of $y$. In other words, let $\ddot y$ be the residuals from regressing $y$ on $X_1$, so that $\ddot y = M_1y$. Then $\hat\beta_2$ is obtained from the regression $\ddot y$ on $\ddot X_2$. It is important to understand that it is not enough to only partial out $X_1$ from $y$: the important step is partialling out $X_1$ from $X_2$. Problem 6 at the end of this appendix asks you to derive the FW theorem and to investigate some related issues.

Another useful algebraic result is that when we regress $\ddot y$ on $\ddot X_2$ and save the residuals, say $\ddot u$, these are identical to the OLS residuals from the original (long) regression:

$$\ddot u = \ddot y - \ddot X_2\ddot\beta_2 = \hat u = y - X_1\hat\beta_1 - X_2\hat\beta_2,$$

where we have used the FW result, $\ddot\beta_2 = \hat\beta_2$. We do not obtain the original OLS residuals if we regress $y$ on $\ddot X_2$, but we do obtain $\hat\beta_2$.

Before the advent of powerful computers, the Frisch–Waugh result was sometimes used as a computational device. Today, the result is more of theoretical interest, and it is very helpful in understanding the mechanics of OLS. For example, recall that in Chapter 10 we used the FW theorem to establish that adding a time trend to a multiple regression is algebraically equivalent to first linearly detrending all of the explanatory variables before running the regression. The FW theorem also can be used in Chapter 14 to establish that the fixed effects estimator, which we introduced as being obtained from OLS on time-demeaned data, can also be obtained from the "long" dummy variable regression.

E.2 Finite Sample Properties of OLS

Deriving the expected value and variance of the OLS estimator $\hat\beta$ is facilitated by matrix algebra, but we must show some care in stating the assumptions.

Assumption E.1 (Linear in Parameters): The model can be written as in (E.3), where $y$ is an observed $n \times 1$ vector, $X$ is an $n \times (k+1)$ observed matrix, and $u$ is an $n \times 1$ vector of unobserved errors, or disturbances.

Assumption E.2 (No Perfect Collinearity): The matrix $X$ has rank $k+1$.

This is a careful statement of the assumption that rules out linear dependencies among the explanatory variables. Under Assumption E.2, $X'X$ is nonsingular, so $\hat\beta$ is unique and can be written as in (E.8).

Assumption E.3 (Zero Conditional Mean): Conditional on the entire matrix $X$, each error $u_t$ has zero mean: $E(u_t|X) = 0$, $t = 1, 2, \dots, n$.

In vector form, Assumption E.3 can be written as

$$E(u|X) = 0. \tag{E.11}$$

This assumption is implied by MLR.4 under the random sampling assumption, MLR.2. In time series applications, Assumption E.3 imposes strict exogeneity on the explanatory variables, something discussed at length in Chapter 10. This rules out explanatory variables whose future values are correlated with $u_t$; in particular, it eliminates lagged dependent variables. Under Assumption E.3, we can condition on the $x_{tj}$ when we compute the expected value of $\hat\beta$.
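Assumption E.2 is also easy to check mechanically. A two-line sketch (Python with NumPy; the collinear design is a made-up illustration) shows how a rank deficiency reveals itself:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100
x1 = rng.standard_normal(n)
X = np.column_stack([np.ones(n), x1, 3 * x1])  # third column = 3 x second: perfect collinearity

# Assumption E.2 requires rank(X) = k + 1; here rank is 2 < 3, so X'X is singular.
print(np.linalg.matrix_rank(X))           # 2
print(np.linalg.matrix_rank(X[:, :2]))    # 2: dropping the redundant column restores full column rank
```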
Theorem E.1 (Unbiasedness of OLS): Under Assumptions E.1, E.2, and E.3, the OLS estimator $\hat\beta$ is unbiased for $\beta$.

PROOF: Use Assumptions E.1 and E.2 and simple algebra to write

$$\hat\beta = (X'X)^{-1}X'y = (X'X)^{-1}X'(X\beta + u) = (X'X)^{-1}(X'X)\beta + (X'X)^{-1}X'u = \beta + (X'X)^{-1}X'u, \tag{E.12}$$

where we use the fact that $(X'X)^{-1}(X'X) = I_{k+1}$. Taking the expectation conditional on $X$ gives

$$E(\hat\beta|X) = \beta + (X'X)^{-1}X'E(u|X) = \beta + (X'X)^{-1}X'\cdot 0 = \beta,$$

because $E(u|X) = 0$ under Assumption E.3. This argument clearly does not depend on the value of $\beta$, so we have shown that $\hat\beta$ is unbiased.

To obtain the simplest form of the variance-covariance matrix of $\hat\beta$, we impose the assumptions of homoskedasticity and no serial correlation.

Assumption E.4 (Homoskedasticity and No Serial Correlation): (i) $\text{Var}(u_t|X) = \sigma^2$, $t = 1, 2, \dots, n$; (ii) $\text{Cov}(u_t, u_s|X) = 0$, for all $t \ne s$. In matrix form, we can write these two assumptions as

$$\text{Var}(u|X) = \sigma^2I_n, \tag{E.13}$$

where $I_n$ is the $n \times n$ identity matrix.

Part (i) of Assumption E.4 is the homoskedasticity assumption: the variance of $u_t$ cannot depend on any element of $X$, and the variance must be constant across observations, $t$. Part (ii) is the no serial correlation assumption: the errors cannot be correlated across observations. Under random sampling, and in any other cross-sectional sampling schemes with independent observations, part (ii) of Assumption E.4 automatically holds. For time series applications, part (ii) rules out correlation in the errors over time (both conditional on $X$ and unconditionally). Because of (E.13), we often say that $u$ has a scalar variance-covariance matrix when Assumption E.4 holds. We can now derive the variance-covariance matrix of the OLS estimator.

Theorem E.2 (Variance-Covariance Matrix of the OLS Estimator): Under Assumptions E.1 through E.4,

$$\text{Var}(\hat\beta|X) = \sigma^2(X'X)^{-1}. \tag{E.14}$$

PROOF: From the last formula in equation (E.12), we have

$$\text{Var}(\hat\beta|X) = \text{Var}[(X'X)^{-1}X'u|X] = (X'X)^{-1}X'[\text{Var}(u|X)]X(X'X)^{-1}.$$

Now, we use Assumption E.4 to get

$$\text{Var}(\hat\beta|X) = (X'X)^{-1}X'(\sigma^2I_n)X(X'X)^{-1} = \sigma^2(X'X)^{-1}X'X(X'X)^{-1} = \sigma^2(X'X)^{-1}.$$

Formula (E.14) means that the variance of $\hat\beta_j$ (conditional on $X$) is obtained by multiplying $\sigma^2$ by the $j$th diagonal element of $(X'X)^{-1}$. For the slope coefficients, we gave an interpretable formula in equation (3.51). Equation (E.14) also tells us how to obtain the covariance between any two OLS estimates: multiply $\sigma^2$ by the appropriate off-diagonal element of $(X'X)^{-1}$. In Chapter 4, we showed how to avoid explicitly finding covariances for obtaining confidence intervals and hypothesis tests by appropriately rewriting the model.
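Theorems E.1 and E.2 are easy to see in a small Monte Carlo experiment. A sketch (Python with NumPy; the design, parameter values, and replication count are ours), holding $X$ fixed across replications as the conditioning in the theorems requires:

```python
import numpy as np

rng = np.random.default_rng(3)
n, sigma = 150, 2.0
X = np.column_stack([np.ones(n), rng.standard_normal((n, 2))])  # X held fixed across draws
beta = np.array([1.0, 0.5, -0.5])
XtX_inv = np.linalg.inv(X.T @ X)

draws = np.empty((5000, 3))
for r in range(5000):
    y = X @ beta + sigma * rng.standard_normal(n)
    draws[r] = XtX_inv @ (X.T @ y)

print(draws.mean(axis=0))        # close to beta (Theorem E.1)
print(np.cov(draws.T))           # close to sigma^2 (X'X)^{-1} (Theorem E.2)
print(sigma**2 * XtX_inv)
```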
The Gauss-Markov Theorem, in its full generality, can now be proven.

Theorem E.3 (Gauss-Markov Theorem): Under Assumptions E.1 through E.4, $\hat\beta$ is the best linear unbiased estimator.

PROOF: Any other linear estimator of $\beta$ can be written as

$$\tilde\beta = A'y, \tag{E.15}$$

where $A$ is an $n \times (k+1)$ matrix. In order for $\tilde\beta$ to be unbiased conditional on $X$, $A$ can consist of nonrandom numbers and functions of $X$. (For example, $A$ cannot be a function of $y$.) To see what further restrictions on $A$ are needed, write

$$\tilde\beta = A'(X\beta + u) = (A'X)\beta + A'u. \tag{E.16}$$

Then,

$$E(\tilde\beta|X) = A'X\beta + E(A'u|X) = A'X\beta + A'E(u|X), \text{ because } A \text{ is a function of } X,$$
$$= A'X\beta, \text{ because } E(u|X) = 0.$$

For $\tilde\beta$ to be an unbiased estimator of $\beta$, it must be true that $E(\tilde\beta|X) = \beta$ for all $(k+1) \times 1$ vectors $\beta$, that is,

$$A'X\beta = \beta \text{ for all } (k+1) \times 1 \text{ vectors } \beta. \tag{E.17}$$

Because $A'X$ is a $(k+1) \times (k+1)$ matrix, (E.17) holds if and only if $A'X = I_{k+1}$. Equations (E.15) and (E.17) characterize the class of linear, unbiased estimators for $\beta$.

Next, from (E.16), we have

$$\text{Var}(\tilde\beta|X) = A'[\text{Var}(u|X)]A = \sigma^2A'A,$$

by Assumption E.4. Therefore,

$$\text{Var}(\tilde\beta|X) - \text{Var}(\hat\beta|X) = \sigma^2[A'A - (X'X)^{-1}]$$
$$= \sigma^2[A'A - A'X(X'X)^{-1}X'A], \text{ because } A'X = I_{k+1},$$
$$= \sigma^2A'[I_n - X(X'X)^{-1}X']A \equiv \sigma^2A'MA,$$

where $M \equiv I_n - X(X'X)^{-1}X'$. Because $M$ is symmetric and idempotent, $A'MA$ is positive semi-definite for any $n \times (k+1)$ matrix $A$. This establishes that the OLS estimator $\hat\beta$ is BLUE. Why is this important? Let $c$ be any $(k+1) \times 1$ vector and consider the linear combination

$$c'\beta = c_0\beta_0 + c_1\beta_1 + \dots + c_k\beta_k,$$

which is a scalar. The unbiased estimators of $c'\beta$ are $c'\hat\beta$ and $c'\tilde\beta$. But

$$\text{Var}(c'\tilde\beta|X) - \text{Var}(c'\hat\beta|X) = c'[\text{Var}(\tilde\beta|X) - \text{Var}(\hat\beta|X)]c \ge 0,$$

because $[\text{Var}(\tilde\beta|X) - \text{Var}(\hat\beta|X)]$ is p.s.d. Therefore, when it is used for estimating any linear combination of $\beta$, OLS yields the smallest variance. In particular, $\text{Var}(\hat\beta_j|X) \le \text{Var}(\tilde\beta_j|X)$ for any other linear, unbiased estimator of $\beta_j$.

The unbiased estimator of the error variance $\sigma^2$ can be written as

$$\hat\sigma^2 = \hat u'\hat u/(n - k - 1),$$

which is the same as equation (3.56).

Theorem E.4 (Unbiasedness of $\hat\sigma^2$): Under Assumptions E.1 through E.4, $\hat\sigma^2$ is unbiased for $\sigma^2$: $E(\hat\sigma^2|X) = \sigma^2$ for all $\sigma^2 > 0$.

PROOF: Write $\hat u = y - X\hat\beta = y - X(X'X)^{-1}X'y = My = Mu$, where $M = I_n - X(X'X)^{-1}X'$, and the last equality follows because $MX = 0$. Because $M$ is symmetric and idempotent,

$$\hat u'\hat u = u'M'Mu = u'Mu.$$

Because $u'Mu$ is a scalar, it equals its trace. Therefore,

$$E(u'Mu|X) = E[\text{tr}(u'Mu)|X] = E[\text{tr}(Muu')|X] = \text{tr}[ME(uu'|X)] = \text{tr}(M\sigma^2I_n) = \sigma^2\text{tr}(M) = \sigma^2(n - k - 1).$$

The last equality follows from $\text{tr}(M) = \text{tr}(I_n) - \text{tr}[X(X'X)^{-1}X'] = n - \text{tr}[(X'X)^{-1}X'X] = n - \text{tr}(I_{k+1}) = n - (k+1) = n - k - 1$. Therefore,

$$E(\hat\sigma^2|X) = E(u'Mu|X)/(n - k - 1) = \sigma^2.$$
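The two matrix facts that drive this proof, that $M$ is symmetric and idempotent with $\text{tr}(M) = n - k - 1$, can be verified directly. A minimal sketch (Python with NumPy; the design matrix is simulated and purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)
n, k = 60, 3
X = np.column_stack([np.ones(n), rng.standard_normal((n, k))])

# M = I - X(X'X)^{-1}X' is symmetric and idempotent with trace n - k - 1,
# the key facts behind Theorem E.4.
M = np.eye(n) - X @ np.linalg.solve(X.T @ X, X.T)
print(np.allclose(M, M.T), np.allclose(M @ M, M))  # True, True
print(np.trace(M), n - k - 1)                      # both 56
```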
E.3 Statistical Inference

When we add the final classical linear model assumption, $\hat\beta$ has a multivariate normal distribution, which leads to the $t$ and $F$ distributions for the standard test statistics covered in Chapter 4.

Assumption E.5 (Normality of Errors): Conditional on $X$, the $u_t$ are independent and identically distributed as Normal$(0, \sigma^2)$. Equivalently, $u$ given $X$ is distributed as multivariate normal with mean zero and variance-covariance matrix $\sigma^2I_n$: $u \sim \text{Normal}(0, \sigma^2I_n)$.

Under Assumption E.5, each $u_t$ is independent of the explanatory variables for all $t$. In a time series setting, this is essentially the strict exogeneity assumption.

Theorem E.5 (Normality of $\hat\beta$): Under the classical linear model Assumptions E.1 through E.5, $\hat\beta$ conditional on $X$ is distributed as multivariate normal with mean $\beta$ and variance-covariance matrix $\sigma^2(X'X)^{-1}$.

Theorem E.5 is the basis for statistical inference involving $\beta$. In fact, along with the properties of the chi-square, $t$, and $F$ distributions that we summarized in Appendix D, we can use Theorem E.5 to establish that $t$ statistics have a $t$ distribution under Assumptions E.1 through E.5 (under the null hypothesis), and likewise for $F$ statistics. We illustrate with a proof for the $t$ statistics.

Theorem E.6 (Distribution of $t$ Statistic): Under Assumptions E.1 through E.5,

$$(\hat\beta_j - \beta_j)/\text{se}(\hat\beta_j) \sim t_{n-k-1}, \quad j = 0, 1, \dots, k.$$

PROOF: The proof requires several steps; the following statements are initially conditional on $X$. First, by Theorem E.5, $(\hat\beta_j - \beta_j)/\text{sd}(\hat\beta_j) \sim \text{Normal}(0, 1)$, where $\text{sd}(\hat\beta_j) = \sigma\sqrt{c_{jj}}$, and $c_{jj}$ is the $j$th diagonal element of $(X'X)^{-1}$. Next, under Assumptions E.1 through E.5, conditional on $X$,

$$(n - k - 1)\hat\sigma^2/\sigma^2 \sim \chi^2_{n-k-1}. \tag{E.18}$$

This follows because $(n - k - 1)\hat\sigma^2/\sigma^2 = (u/\sigma)'M(u/\sigma)$, where $M$ is the $n \times n$ symmetric, idempotent matrix defined in Theorem E.4. But $u/\sigma \sim \text{Normal}(0, I_n)$ by Assumption E.5. It follows from Property 1 for the chi-square distribution in Appendix D that $(u/\sigma)'M(u/\sigma) \sim \chi^2_{n-k-1}$ (because $M$ has rank $n - k - 1$).

We also need to show that $\hat\beta$ and $\hat\sigma^2$ are independent. But $\hat\beta = \beta + (X'X)^{-1}X'u$, and $\hat\sigma^2 = u'Mu/(n - k - 1)$. Now, $[(X'X)^{-1}X']M = 0$ because $X'M = 0$. It follows from Property 5 of the multivariate normal distribution in Appendix D that $\hat\beta$ and $Mu$ are independent. Because $\hat\sigma^2$ is a function of $Mu$, $\hat\beta$ and $\hat\sigma^2$ are also independent.

Finally, we can write

$$(\hat\beta_j - \beta_j)/\text{se}(\hat\beta_j) = [(\hat\beta_j - \beta_j)/\text{sd}(\hat\beta_j)]/(\hat\sigma^2/\sigma^2)^{1/2},$$

which is the ratio of a standard normal random variable and the square root of a $\chi^2_{n-k-1}/(n - k - 1)$ random variable. We just showed that these are independent, so, by definition of a $t$ random variable, $(\hat\beta_j - \beta_j)/\text{se}(\hat\beta_j)$ has the $t_{n-k-1}$ distribution. Because this distribution does not depend on $X$, it is the unconditional distribution of $(\hat\beta_j - \beta_j)/\text{se}(\hat\beta_j)$ as well.

From this theorem, we can plug in any hypothesized value for $\beta_j$ and use the $t$ statistic for testing hypotheses, as usual.

Under Assumptions E.1 through E.5, we can compute what is known as the Cramér-Rao lower bound for the variance-covariance matrix of unbiased estimators of $\beta$ (again conditional on $X$) [see Greene (1997, Chapter 4)]. This can be shown to be $\sigma^2(X'X)^{-1}$, which is exactly the variance-covariance matrix of the OLS estimator. This implies that $\hat\beta$ is the minimum variance unbiased estimator of $\beta$ (conditional on $X$): $\text{Var}(\tilde\beta|X) - \text{Var}(\hat\beta|X)$ is positive semi-definite for any other unbiased estimator $\tilde\beta$; we no longer have to restrict our attention to estimators linear in $y$.
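The quantities in Theorem E.6 are exactly what regression software reports. A compact sketch (Python with NumPy and SciPy; the data are simulated and the variable names are ours) builds the standard errors from the diagonal of $\hat\sigma^2(X'X)^{-1}$ and forms the $t$ statistics for $H_0$: $\beta_j = 0$:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
n = 120
X = np.column_stack([np.ones(n), rng.standard_normal((n, 2))])
y = X @ np.array([1.0, 0.0, 0.8]) + rng.standard_normal(n)

k = X.shape[1] - 1
b = np.linalg.solve(X.T @ X, X.T @ y)
resid = y - X @ b
sigma2_hat = resid @ resid / (n - k - 1)
se = np.sqrt(sigma2_hat * np.diag(np.linalg.inv(X.T @ X)))

t_stats = b / se                                         # tests of H0: beta_j = 0
p_vals = 2 * stats.t.sf(np.abs(t_stats), df=n - k - 1)   # two-sided p-values
print(t_stats, p_vals)
```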
It is easy to show that the OLS estimator is, in fact, the maximum likelihood estimator of $\beta$ under Assumption E.5. For each $t$, the distribution of $y_t$ given $X$ is Normal$(x_t\beta, \sigma^2)$. Because the $y_t$ are independent conditional on $X$, the likelihood function for the sample is obtained from the product of the densities:

$$\prod_{t=1}^n (2\pi\sigma^2)^{-1/2}\exp[-(y_t - x_t\beta)^2/(2\sigma^2)],$$

where $\prod$ denotes product. Maximizing this function with respect to $\beta$ and $\sigma^2$ is the same as maximizing its natural logarithm:

$$\sum_{t=1}^n [-(1/2)\log(2\pi\sigma^2) - (y_t - x_t\beta)^2/(2\sigma^2)].$$

For obtaining $\hat\beta$, this is the same as minimizing $\sum_{t=1}^n (y_t - x_t\beta)^2$ (the division by $2\sigma^2$ does not affect the optimization), which is just the problem that OLS solves. The estimator of $\sigma^2$ that we have used, $\text{SSR}/(n - k - 1)$, turns out not to be the MLE of $\sigma^2$; the MLE is $\text{SSR}/n$, which is a biased estimator. Because the unbiased estimator of $\sigma^2$ results in $t$ and $F$ statistics with exact $t$ and $F$ distributions under the null, it is always used instead of the MLE.

That the OLS estimator is the MLE under Assumption E.5 implies an interesting robustness property of the MLE based on the normal distribution. The reasoning is simple. We know that the OLS estimator is unbiased under Assumptions E.1 to E.3; normality of the errors is used nowhere in the proof, and neither is Assumption E.4. As the next section shows, the OLS estimator is also consistent without normality, provided the law of large numbers holds (as is widely true). These statistical properties of the OLS estimator imply that the MLE based on the normal log-likelihood function is robust to distributional specification: the distribution can be (almost) anything and yet we still obtain a consistent (and, under E.1 to E.3, unbiased) estimator. As discussed in Section 17.3, a maximum likelihood estimator obtained without assuming the distribution is correctly specified is often called a quasi-maximum likelihood estimator (QMLE).

Generally, consistency of the MLE relies on having a correctly specified distribution. We have just seen that the normal distribution is a notable exception. There are some other distributions that share this property, including the Poisson distribution, as discussed in Section 17.3. Wooldridge (2010, Chapter 18) discusses some other useful examples.

E.4 Some Asymptotic Analysis

The matrix approach to the multiple regression model can also make derivations of asymptotic properties more concise. In fact, we can give general proofs of the claims in Chapter 11. We begin by proving the consistency result of Theorem 11.1. Recall that these assumptions contain, as a special case, the assumptions for cross-sectional analysis under random sampling.

Proof of Theorem 11.1. As in Problem E.1 and using Assumption TS.1', we write the OLS estimator as

$$\hat\beta = \left(\sum_{t=1}^n x_t'x_t\right)^{-1}\left(\sum_{t=1}^n x_t'y_t\right) = \left(\sum_{t=1}^n x_t'x_t\right)^{-1}\left(\sum_{t=1}^n x_t'(x_t\beta + u_t)\right)$$
$$= \beta + \left(\sum_{t=1}^n x_t'x_t\right)^{-1}\left(\sum_{t=1}^n x_t'u_t\right) = \beta + \left(n^{-1}\sum_{t=1}^n x_t'x_t\right)^{-1}\left(n^{-1}\sum_{t=1}^n x_t'u_t\right). \tag{E.19}$$

Now, by the law of large numbers,

$$n^{-1}\sum_{t=1}^n x_t'x_t \stackrel{p}{\to} A \quad \text{and} \quad n^{-1}\sum_{t=1}^n x_t'u_t \stackrel{p}{\to} 0, \tag{E.20}$$

where $A = E(x_t'x_t)$ is a $(k+1) \times (k+1)$ nonsingular matrix under Assumption TS.2' and we have used the fact that $E(x_t'u_t) = 0$ under Assumption TS.3'. Now, we must use a matrix version of Property PLIM.1 in Appendix C. Namely, because $A$ is nonsingular,

$$\left(n^{-1}\sum_{t=1}^n x_t'x_t\right)^{-1} \stackrel{p}{\to} A^{-1}. \tag{E.21}$$

[Wooldridge (2010, Chapter 3) contains a discussion of these kinds of convergence results.] It now follows from (E.19), (E.20), and (E.21) that

$$\text{plim}(\hat\beta) = \beta + A^{-1}\cdot 0 = \beta.$$

This completes the proof.
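Consistency is also easy to visualize by letting the sample size grow. A minimal simulation sketch (Python with NumPy; the model and sample sizes are ours, chosen only for illustration):

```python
import numpy as np

rng = np.random.default_rng(6)
beta = np.array([1.0, -0.5])

# The OLS estimate approaches beta as n grows: plim(beta_hat) = beta.
for n in (50, 500, 5000, 50000):
    X = np.column_stack([np.ones(n), rng.standard_normal(n)])
    y = X @ beta + rng.standard_normal(n)
    b = np.linalg.solve(X.T @ X, X.T @ y)
    print(n, np.abs(b - beta).max())   # maximum estimation error shrinks with n
```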
Next, we sketch a proof of the asymptotic normality result in Theorem 11.2.

Proof of Theorem 11.2. From equation (E.19), we can write

$$\sqrt{n}(\hat\beta - \beta) = \left(n^{-1}\sum_{t=1}^n x_t'x_t\right)^{-1}\left(n^{-1/2}\sum_{t=1}^n x_t'u_t\right) = A^{-1}\left(n^{-1/2}\sum_{t=1}^n x_t'u_t\right) + o_p(1), \tag{E.22}$$

where the term "$o_p(1)$" is a remainder term that converges in probability to zero. This term is equal to $[(n^{-1}\sum_{t=1}^n x_t'x_t)^{-1} - A^{-1}](n^{-1/2}\sum_{t=1}^n x_t'u_t)$. The term in brackets converges in probability to zero (by the same argument used in the proof of Theorem 11.1), while $(n^{-1/2}\sum_{t=1}^n x_t'u_t)$ is bounded in probability because it converges to a multivariate normal distribution by the central limit theorem. A well-known result in asymptotic theory is that the product of such terms converges in probability to zero. Further, $\sqrt{n}(\hat\beta - \beta)$ inherits its asymptotic distribution from $A^{-1}(n^{-1/2}\sum_{t=1}^n x_t'u_t)$. [See Wooldridge (2010, Chapter 3) for more details on the convergence results used in this proof.]

By the central limit theorem, $n^{-1/2}\sum_{t=1}^n x_t'u_t$ has an asymptotic normal distribution with mean zero and, say, $(k+1) \times (k+1)$ variance-covariance matrix $B$. Then, $\sqrt{n}(\hat\beta - \beta)$ has an asymptotic multivariate normal distribution with mean zero and variance-covariance matrix $A^{-1}BA^{-1}$. We now show that, under Assumptions TS.4' and TS.5', $B = \sigma^2A$. (The general expression is useful because it underlies heteroskedasticity-robust and serial correlation-robust standard errors for OLS, of the kind discussed in Chapter 12.)

First, under Assumption TS.5', $x_t'u_t$ and $x_s'u_s$ are uncorrelated for $t \ne s$. Why? Suppose $s < t$ for concreteness. Then, by the law of iterated expectations,

$$E(x_t'u_tu_sx_s) = E[E(u_t|x_t, u_s, x_s)u_sx_t'x_s] = E[0\cdot u_sx_t'x_s] = 0.$$

The zero covariances imply that the variance of the sum is the sum of the variances. But $\text{Var}(x_t'u_t) = E(x_t'u_tu_tx_t) = E(u_t^2x_t'x_t)$. By the law of iterated expectations,

$$E(u_t^2x_t'x_t) = E[E(u_t^2|x_t)x_t'x_t] = E[\sigma^2x_t'x_t] = \sigma^2E(x_t'x_t) = \sigma^2A,$$

where we use $E(u_t^2|x_t) = \sigma^2$ under Assumptions TS.3' and TS.4'. This shows that $B = \sigma^2A$, and so, under Assumptions TS.1' to TS.5', we have

$$\sqrt{n}(\hat\beta - \beta) \stackrel{a}{\sim} \text{Normal}(0, \sigma^2A^{-1}). \tag{E.23}$$

This completes the proof.

From equation (E.23), we treat $\hat\beta$ as if it is approximately normally distributed with mean $\beta$ and variance-covariance matrix $\sigma^2A^{-1}/n$. The division by the sample size, $n$, is expected here: the approximation to the variance-covariance matrix of $\hat\beta$ shrinks to zero at the rate $1/n$. When we replace $\sigma^2$ with its consistent estimator, $\hat\sigma^2 = \text{SSR}/(n - k - 1)$, and replace $A$ with its consistent estimator, $n^{-1}\sum_{t=1}^n x_t'x_t = X'X/n$, we obtain an estimator for the asymptotic variance of $\hat\beta$:

$$\widehat{\text{Avar}}(\hat\beta) = \hat\sigma^2(X'X)^{-1}. \tag{E.24}$$

Notice how the two divisions by $n$ cancel, and the right-hand side of (E.24) is just the usual way we estimate the variance matrix of the OLS estimator under the Gauss-Markov assumptions. To summarize, we have shown that, under Assumptions TS.1' to TS.5' (which contain MLR.1 to MLR.5 as special cases), the usual standard errors and $t$ statistics are asymptotically valid. It is perfectly legitimate to use the usual $t$ distribution to obtain critical values and p-values for testing a single hypothesis.
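The practical content of (E.23) and (E.24) is that the standardized estimate behaves like a standard normal even with non-normal errors. A simulation sketch (Python with NumPy; the error distribution, sample size, and replication count are ours) uses centered chi-square errors to make the point:

```python
import numpy as np

rng = np.random.default_rng(7)
n, reps = 400, 4000
beta1 = 0.5
z = np.empty(reps)

# Standardize the slope estimate; by Theorem 11.2 it should be roughly N(0,1)
# even though the errors are skewed (centered chi-square) rather than normal.
for r in range(reps):
    x = rng.standard_normal(n)
    u = rng.chisquare(1, n) - 1.0
    y = 1.0 + beta1 * x + u
    X = np.column_stack([np.ones(n), x])
    b = np.linalg.solve(X.T @ X, X.T @ y)
    resid = y - X @ b
    se = np.sqrt(resid @ resid / (n - 2) * np.linalg.inv(X.T @ X)[1, 1])
    z[r] = (b[1] - beta1) / se

print(z.mean(), z.std())   # approximately 0 and 1
```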
Interestingly, in the general setup of Chapter 11, assuming normality of the errors (say, $u_t$ given $x_t, u_{t-1}, x_{t-1}, \dots, u_1, x_1$ is distributed as Normal$(0, \sigma^2)$) does not necessarily help, as the $t$ statistics would not generally have exact $t$ distributions under this kind of normality assumption. When we do not assume strict exogeneity of the explanatory variables, exact distributional results are difficult, if not impossible, to obtain.

If we modify the argument above, we can derive a heteroskedasticity-robust variance-covariance matrix. The key is that we must estimate $E(u_t^2x_t'x_t)$ separately because this matrix no longer equals $\sigma^2E(x_t'x_t)$. But, if the $\hat u_t$ are the OLS residuals, a consistent estimator is

$$(n - k - 1)^{-1}\sum_{t=1}^n \hat u_t^2x_t'x_t, \tag{E.25}$$

where the division by $n - k - 1$, rather than $n$, is a degrees of freedom adjustment that typically helps the finite sample properties of the estimator. When we use the expression in equation (E.25), we obtain

$$\widehat{\text{Avar}}(\hat\beta) = [n/(n - k - 1)](X'X)^{-1}\left(\sum_{t=1}^n \hat u_t^2x_t'x_t\right)(X'X)^{-1}. \tag{E.26}$$

The square roots of the diagonal elements of this matrix are the same heteroskedasticity-robust standard errors we obtained in Section 8.2 for the pure cross-sectional case. A matrix extension of the serial correlation- (and heteroskedasticity-) robust standard errors we obtained in Section 12.5 is also available, but the matrix that must replace (E.25) is complicated because of the serial correlation. [See, for example, Hamilton (1994, Section 10.5).]

E.5 Wald Statistics for Testing Multiple Hypotheses

Similar arguments can be used to obtain the asymptotic distribution of the Wald statistic for testing multiple hypotheses. Let $R$ be a $q \times (k+1)$ matrix, with $q \le k+1$. Assume that the $q$ restrictions on the $(k+1) \times 1$ vector of parameters, $\beta$, can be expressed as $H_0$: $R\beta = r$, where $r$ is a $q \times 1$ vector of known constants. Under Assumptions TS.1' to TS.5', it can be shown that, under $H_0$,

$$[\sqrt{n}(R\hat\beta - r)]'(\sigma^2RA^{-1}R')^{-1}[\sqrt{n}(R\hat\beta - r)] \stackrel{a}{\sim} \chi^2_q, \tag{E.27}$$

where $A = E(x_t'x_t)$, as in the proofs of Theorems 11.1 and 11.2. The intuition behind equation (E.27) is simple. Because $\sqrt{n}(\hat\beta - \beta)$ is roughly distributed as Normal$(0, \sigma^2A^{-1})$, $R[\sqrt{n}(\hat\beta - \beta)] = \sqrt{n}R(\hat\beta - \beta)$ is approximately Normal$(0, \sigma^2RA^{-1}R')$ by Property 3 of the multivariate normal distribution in Appendix D. Under $H_0$, $R\beta = r$, so $\sqrt{n}(R\hat\beta - r) \stackrel{a}{\sim} \text{Normal}(0, \sigma^2RA^{-1}R')$ under $H_0$. By Property 3 of the chi-square distribution, $z'(\sigma^2RA^{-1}R')^{-1}z \sim \chi^2_q$ if $z \sim \text{Normal}(0, \sigma^2RA^{-1}R')$. To obtain the final result formally, we need to use an asymptotic version of this property, which can be found in Wooldridge (2010, Chapter 3).

Given the result in (E.27), we obtain a computable statistic by replacing $A$ and $\sigma^2$ with their consistent estimators; doing so does not change the asymptotic distribution. The result is the so-called Wald statistic, which, after canceling the sample sizes and doing a little algebra, can be written as

$$W = (R\hat\beta - r)'[R(X'X)^{-1}R']^{-1}(R\hat\beta - r)/\hat\sigma^2. \tag{E.28}$$

Under $H_0$, $W \stackrel{a}{\sim} \chi^2_q$, where we recall that $q$ is the number of restrictions being tested. If $\hat\sigma^2 = \text{SSR}/(n - k - 1)$, it can be shown that $W/q$ is exactly the $F$ statistic we obtained in Chapter 4 for testing multiple linear restrictions. [See, for example, Greene (1997, Chapter 7).] Therefore, under the classical linear model assumptions TS.1 to TS.6 in Chapter 10, $W/q$ has an exact $F_{q,n-k-1}$ distribution. Under Assumptions TS.1' to TS.5', we only have the asymptotic result in (E.27). Nevertheless, it is appropriate, and common, to treat the usual $F$ statistic as having an approximate $F_{q,n-k-1}$ distribution.
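Formula (E.28) is short enough to code directly. A sketch (Python with NumPy and SciPy; the model, restriction matrix, and data are ours) tests two exclusion restrictions and confirms that $W/q$ is the familiar $F$ statistic:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(8)
n, k = 200, 3
X = np.column_stack([np.ones(n), rng.standard_normal((n, k))])
y = X @ np.array([1.0, 0.5, 0.0, 0.0]) + rng.standard_normal(n)  # H0 true by construction

b = np.linalg.solve(X.T @ X, X.T @ y)
resid = y - X @ b
sigma2_hat = resid @ resid / (n - k - 1)

# H0: beta_2 = 0 and beta_3 = 0, i.e., R beta = r with q = 2 restrictions.
R = np.array([[0.0, 0.0, 1.0, 0.0],
              [0.0, 0.0, 0.0, 1.0]])
r = np.zeros(2)
diff = R @ b - r
W = diff @ np.linalg.solve(R @ np.linalg.inv(X.T @ X) @ R.T, diff) / sigma2_hat
q = 2
print(W, W / q, stats.f.sf(W / q, q, n - k - 1))   # W/q is the usual F statistic
```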
A Wald statistic that is robust to heteroskedasticity of unknown form is obtained by using the matrix in (E.26) in place of $\hat\sigma^2(X'X)^{-1}$, and similarly for a test statistic robust to both heteroskedasticity and serial correlation. The robust versions of the test statistics cannot be computed via sums of squared residuals or R-squareds from the restricted and unrestricted regressions.

Summary

This appendix has provided a brief treatment of the linear regression model using matrix notation. This material is included for more advanced classes that use matrix algebra, but it is not needed to read the text. In effect, this appendix proves some of the results that we either stated without proof, proved only in special cases, or proved through a more cumbersome method of proof. Other topics, such as asymptotic properties, instrumental variables estimation, and panel data models, can be given concise treatments using matrices. Advanced texts in econometrics, including Davidson and MacKinnon (1993), Greene (1997), Hayashi (2000), and Wooldridge (2010), can be consulted for details.

Key Terms

First Order Condition; Frisch–Waugh (FW) Theorem; Matrix Notation; Minimum Variance Unbiased Estimator; Quasi-Maximum Likelihood Estimator (QMLE); Scalar Variance-Covariance Matrix; Variance-Covariance Matrix of the OLS Estimator; Wald Statistic

Problems

1. Let $x_t$ be the $1 \times (k+1)$ vector of explanatory variables for observation $t$. Show that the OLS estimator $\hat\beta$ can be written as

$$\hat\beta = \left(\sum_{t=1}^n x_t'x_t\right)^{-1}\left(\sum_{t=1}^n x_t'y_t\right).$$

Dividing each summation by $n$ shows that $\hat\beta$ is a function of sample averages.

2. Let $\hat\beta$ be the $(k+1) \times 1$ vector of OLS estimates.

(i) Show that for any $(k+1) \times 1$ vector $b$, we can write the sum of squared residuals as

$$\text{SSR}(b) = \hat u'\hat u + (\hat\beta - b)'X'X(\hat\beta - b).$$

[Hint: Write $(y - Xb)'(y - Xb) = [\hat u + X(\hat\beta - b)]'[\hat u + X(\hat\beta - b)]$ and use the fact that $X'\hat u = 0$.]
from part i to find the relationship between the b j and the b j v Assuming the setup of part iv use part iii to show that se1b j2 5 se1b j20aj0 vi Assuming the setup of part iv show that the absolute values of the t statistics for b j and b j are identical 4 Assume that the model y 5 Xb 1 u satisfies the GaussMarkov assumptions let G be a 1k 1 12 3 1k 1 12 nonsingular nonrandom matrix and define d 5 Gb so that d is also a 1k 1 12 3 1 vector Let b be the 1k 1 12 3 1 vector of OLS estimators and define d 5 Gb as the OLS estimator of d i Show that E1d 0X2 5 d ii Find Var 1d 0X2 in terms of s2 X and G iii Use Problem E3 to verify that d and the appropriate estimate of Var1d 0X2 are obtained from the regression of y on XG21 iv Now let c be a 1k 1 12 3 1 vector with at least one nonzero entry For concreteness assume that ck 2 0 Define u 5 crb so that u is a scalar Define dj 5 bj j 5 0 1 p k 2 1 and dk 5 u Show how to define a 1k 1 12 3 1k 1 12 nonsingular matrix G so that d 5 Gb Hint Each of the first k rows of G should contain k zeros and a one What is the last row v Show that for the choice of G in part iv G21 5 G 1 0 0 0 0 1 0 0 0 0 1 0 2c0ck 2c1ck 2ck21ck 1ck W Use this expression for G1 and part iii to conclude that u and its standard error are obtained as the coefficient on xtk ck in the regression of yt on 31 2 1c0 ck2xtk4 3xt1 2 1c1 ck2xtk4 p 3xt k21 2 1ck21 ck2xtk4 xtk ck t 5 1 p n This regression is exactly the one obtained by writing bk in terms of u and b0 b1 p bk21 plugging the result into the original model and rearranging Therefore we can formally justify the trick we use throughout the text for obtaining the standard error of a linear combination of parameters Copyright 2016 Cengage Learning All Rights Reserved May not be copied scanned or duplicated in whole or in part Due to electronic rights some third party content may be suppressed from the eBook andor eChapters Editorial review has deemed that any suppressed content does not materially affect the overall learning experience Cengage Learning reserves the right to remove additional content at any time if subsequent rights restrictions require it Appendix E The Linear Regression Model in Matrix Form 733 5 Assume that the model y 5 Xb 1 u satisfies the GaussMarkov assumptions and let b be the OLS estimator of b Let Z 5 G1X2 be an n 3 1k 1 12 matrix function of X and assume that ZrX3a 1k 1 12 3 1k 1 12 matrix4 is nonsingular Define a new estimator of b by b 5 1ZrX2 21Zry i Show that E1b0X2 5 b so that b is also unbiased conditional on X ii Find Var1b0X2 Make sure this is a symmetric 1k 1 12 3 1k 1 12 matrix that depends on Z X and s2 iii Which estimator do you prefer b or b Explain 6 Consider the setup of the FrischWaugh Theorem i Using partitioned matrices show that the first order conditions 1XrX2b 5 Xry can be written as Xr1X1b 1 1 Xr1X2b 2 5 Xr1y Xr2 X1b 1 1 Xr2X2b 2 5 Xr2y ii Multiply the first set of equations by Xr2X11Xr1X12 21 and subtract the result from the second set of equations to show that 1Xr2M1X22b 2 5 Xr2M1y where In 2 X11Xr1X12 21Xr1 Conclude that b 2 5 1X r2X 22 21X r2y iii Use part ii to show that b 2 5 1X r2X 22 21X r2 y iv Use the fact that M1X1 5 0 to show that the residuals u from the regression y on X 2 are identical to the residuals û from the regression y on X1 X2 Hint By definition and the FW theorem u 5 y 2 X 2b 2 5 M11y 2 X2b 22 5 M11y 2 X1b 1 2 X2b 22 Now you do the rest 7 Suppose that the linear model written in matrix notation y 5 Xb 1 u satisfies Assumptions E1 E2 and E3 Partition the model as 
$$y = X_1\beta_1 + X_2\beta_2 + u,$$

where $X_1$ is $n \times (k_1+1)$ and $X_2$ is $n \times k_2$.

(i) Consider the following proposal for estimating $\beta_2$. First, regress $y$ on $X_1$ and obtain the residuals, say $\ddot y$. Then, regress $\ddot y$ on $X_2$ to get $\breve\beta_2$. Show that $\breve\beta_2$ is generally biased and show what the bias is. [You should find $E(\breve\beta_2|X)$ in terms of $\beta_2$, $X_2$, and the residual-making matrix $M_1$.]
(ii) As a special case, write $y = X_1\beta_1 + \beta_kX_k + u$, where $X_k$ is an $n \times 1$ vector on the variable $x_{tk}$. Show that

$$E(\breve\beta_k|X) = \left(\frac{\text{SSR}_k}{\sum_{t=1}^n x_{tk}^2}\right)\beta_k,$$

where $\text{SSR}_k$ is the sum of squared residuals from regressing $x_{tk}$ on $1, x_{t1}, x_{t2}, \dots, x_{t,k-1}$. How come the factor multiplying $\beta_k$ is never greater than one?
(iii) Suppose you know $\beta_1$. Show that the regression of $y - X_1\beta_1$ on $X_2$ produces an unbiased estimator of $\beta_2$ (conditional on $X$).

Appendix F Answers to Chapter Questions

Chapter 2

Question 2.1: When student ability, motivation, age, and other factors in $u$ are not related to attendance, equation (2.6) would hold. This seems unlikely to be the case.

Question 2.2: About $11.05. To see this, from the average wages measured in 1976 and 2003 dollars, we can get the CPI deflator as 190.6/59.0 ≈ 3.23. When we multiply 3.42 by 3.23, we obtain about 11.05.

Question 2.3: 54.65, as can be seen by plugging shareA = 60 into equation (2.28). This is not unreasonable: if Candidate A spends 60% of the total money spent, he or she is predicted to receive almost 55% of the vote.

Question 2.4: The equation will be $\widehat{salaryhun} = 9{,}631.91 + 185.01\,roe$, as is easily seen by multiplying equation (2.39) by 10.

Question 2.5: Equation (2.58) can be written as

$$\text{Var}(\hat\beta_0) = (\sigma^2n^{-1})\left(\sum_{i=1}^n x_i^2\right)\Big/\left(\sum_{i=1}^n (x_i - \bar x)^2\right),$$

where the term multiplying $\sigma^2n^{-1}$ is greater than or equal to one, but it is equal to one if and only if $\bar x = 0$. In this case, the variance is as small as it can possibly be: $\text{Var}(\hat\beta_0) = \sigma^2/n$.

Chapter 3

Question 3.1: Just a few factors include age and gender distribution, size of the police force (or, more generally, resources devoted to crime fighting), population, and general historical factors. These factors certainly might be correlated with prbconv and avgsen, which means equation (3.5) would not hold. For example, size of the police force is possibly correlated with both prbconv and avgsen, as some cities put more effort into crime prevention and law enforcement. We should try to bring as many of these factors into the equation as possible.

Question 3.2: We use the third property of OLS concerning predicted values and residuals: when we plug the average values of all independent variables into the OLS regression line, we obtain the average value of the dependent variable. So $\overline{colGPA} = 1.29 + .453\,\overline{hsGPA} + .0094\,\overline{ACT} = 1.29 + .453(3.4) + .0094(24.2) \approx 3.06$. You can check the average of colGPA in GPA1 to verify this to the second decimal place.
Question 3.3: No. The variable shareA is not an exact linear function of expendA and expendB, even though it is an exact nonlinear function: shareA = 100·[expendA/(expendA + expendB)]. Therefore, it is legitimate to have expendA, expendB, and shareA as explanatory variables.

Question 3.4: As we discussed in Section 3.4, if we are interested in the effect of $x_1$ on $y$, correlation among the other explanatory variables ($x_2$, $x_3$, and so on) does not affect $\text{Var}(\hat\beta_1)$. These variables are included as controls, and we do not have to worry about collinearity among the control variables. Of course, we are controlling for them primarily because we think they are correlated with attendance, but this is necessary to perform a ceteris paribus analysis.

Chapter 4

Question 4.1: Under these assumptions, the Gauss-Markov assumptions are satisfied: $u$ is independent of the explanatory variables, so $E(u|x_1, \dots, x_k) = E(u)$ and $\text{Var}(u|x_1, \dots, x_k) = \text{Var}(u)$. Further, it is easily seen that $E(u) = 0$. Therefore, MLR.4 and MLR.5 hold. The classical linear model assumptions are not satisfied, because $u$ is not normally distributed (which is a violation of MLR.6).

Question 4.2: $H_0$: $\beta_1 = 0$, $H_1$: $\beta_1 < 0$.

Question 4.3: Because $\hat\beta_1 = .56 > 0$ and we are testing against $H_1$: $\beta_1 > 0$, the one-sided p-value is one-half of the two-sided p-value, or .043.

Question 4.4: $H_0$: $\beta_5 = \beta_6 = \beta_7 = \beta_8 = 0$; $k = 8$ and $q = 4$. The restricted version of the model is

$$score = \beta_0 + \beta_1classize + \beta_2expend + \beta_3tchcomp + \beta_4enroll + u.$$

Question 4.5: The $F$ statistic for testing exclusion of ACT is $[(.291 - .183)/(1 - .291)](680 - 3) \approx 103.13$. Therefore, the absolute value of the $t$ statistic is about 10.16. The $t$ statistic on ACT is negative, because $\hat\beta_{ACT}$ is negative, so $t_{ACT} = -10.16$.

Question 4.6: Not by much. The $F$ test for joint significance of droprate and gradrate is easily computed from the R-squareds in the table: $F = [(.361 - .353)/(1 - .361)](402/2) \approx 2.52$. The 10% critical value is obtained from Table G.3a as 2.30, while the 5% critical value from Table G.3b is 3.00. The p-value is about .082. Thus, droprate and gradrate are jointly significant at the 10% level, but not at the 5% level. In any case, controlling for these variables has a minor effect on the b/s coefficient.
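Both of these answers use the R-squared form of the $F$ statistic. A small sketch (Python with SciPy; the helper function name is ours) that reproduces the arithmetic in Questions 4.5 and 4.6:

```python
from scipy import stats

def f_from_r2(r2_ur, r2_r, q, df_denom):
    """F statistic in R-squared form: [(R2_ur - R2_r)/q] / [(1 - R2_ur)/df_denom]."""
    return ((r2_ur - r2_r) / q) / ((1 - r2_ur) / df_denom)

# Question 4.5: dropping ACT (q = 1), denominator df = 680 - 3 = 677.
print(f_from_r2(0.291, 0.183, 1, 677))   # about 103.1; |t| = sqrt(F) is about 10.16

# Question 4.6: dropping droprate and gradrate (q = 2), denominator df = 402.
F = f_from_r2(0.361, 0.353, 2, 402)
print(F, stats.f.sf(F, 2, 402))          # about 2.52, p-value about .082
```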
Chapter 5

Question 5.1: This requires some assumptions. It seems reasonable to assume that $\beta_2 > 0$ (score depends positively on priGPA) and Cov(skipped, priGPA) < 0 (skipped and priGPA are negatively correlated). This means that $\beta_2\delta_1 < 0$, which means that plim $\tilde\beta_1 < \beta_1$. Because $\beta_1$ is thought to be negative (or, at least, nonpositive), a simple regression is likely to overestimate the importance of skipping classes.

Question 5.2: $\hat\beta_j \pm 1.96\,\text{se}(\hat\beta_j)$ is the asymptotic 95% confidence interval. Or, we can replace 1.96 with 2.

Chapter 6

Question 6.1: Because fincdol = 1,000·faminc, the coefficient on fincdol will be the coefficient on faminc divided by 1,000, or .0927/1,000 = .0000927. The standard error also drops by a factor of 1,000, so the $t$ statistic does not change, nor do any of the other OLS statistics. For readability, it is better to measure family income in thousands of dollars.

Question 6.2: We can do this generally. The equation is $\log(y) = \beta_0 + \beta_1\log(x_1) + \beta_2x_2 + \dots$, where $x_2$ is a proportion rather than a percentage. Then, ceteris paribus, $\Delta\log(y) = \beta_2\Delta x_2$, so $100\cdot\Delta\log(y) = \beta_2(100\cdot\Delta x_2)$, or $\%\Delta y \approx \beta_2(100\cdot\Delta x_2)$. Now, because $\Delta x_2$ is the change in the proportion, $100\cdot\Delta x_2$ is a percentage point change. In particular, if $\Delta x_2 = .01$, then $100\cdot\Delta x_2 = 1$, which corresponds to a one percentage point change. But then $\beta_2$ is the percentage change in $y$ when $100\cdot\Delta x_2 = 1$.

Question 6.3: The new model would be

$$stndfnl = \beta_0 + \beta_1atndrte + \beta_2priGPA + \beta_3ACT + \beta_4priGPA^2 + \beta_5ACT^2 + \beta_6priGPA\cdot atndrte + \beta_7ACT\cdot atndrte + u.$$

Therefore, the partial effect of atndrte on stndfnl is $\beta_1 + \beta_6priGPA + \beta_7ACT$. This is what we multiply by $\Delta atndrte$ to obtain the ceteris paribus change in stndfnl.

Question 6.4: From equation (6.21), $\bar R^2 = 1 - \hat\sigma^2/[\text{SST}/(n - 1)]$. For a given sample and a given dependent variable, $\text{SST}/(n - 1)$ is fixed. When we use different sets of explanatory variables, only $\hat\sigma^2$ changes. As $\hat\sigma^2$ decreases, $\bar R^2$ increases. If we make $\hat\sigma$, and therefore $\hat\sigma^2$, as small as possible, we are making $\bar R^2$ as large as possible.

Question 6.5: One possibility is to collect data on annual earnings for a sample of actors, along with profitability of the movies in which they each appeared. In a simple regression analysis, we could relate earnings to profitability. But we should probably control for other factors that may affect salary, such as age, gender, and the kinds of movies in which the actors performed. Methods for including qualitative factors in regression models are considered in Chapter 7.

Chapter 7

Question 7.1: No, because it would not be clear when party is one and when it is zero. A better name would be something like Dem, which is one for Democratic candidates and zero for Republicans. Or, Rep, which is one for Republicans and zero for Democrats.

Question 7.2: With outfield as the base group, we would include the dummy variables frstbase, scndbase, thrdbase, shrtstop, and catcher.

Question 7.3: The null in this case is $H_0$: $\delta_1 = \delta_2 = \delta_3 = \delta_4 = 0$, so that there are four restrictions. As usual, we would use an $F$ test (where $q = 4$ and $k$ depends on the number of other explanatory variables).

Question 7.4: Because tenure appears as a quadratic, we should allow separate quadratics for men and women. That is, we would add the explanatory variables female·tenure and female·tenure².

Question 7.5: We plug pcnv = 0, avgsen = 0, tottime = 0, ptime86 = 0, qemp86 = 4, black = 1, and hispan = 0 into equation (7.31): $\widehat{arr86} = .380 - .038(4) + .170 = .398$, or almost .4. It is hard to know whether this is reasonable. For someone with no prior convictions who was employed throughout the year, this estimate might seem high, but remember that the population consists of men who were already arrested at least once prior to 1986.

Chapter 8

Question 8.1: This statement is clearly false. For example, in equation (8.7), the usual standard error for black is .147, while the heteroskedasticity-robust standard error is .118.

Question 8.2: The $F$ test would be obtained by regressing $\hat u^2$ on marrmale, marrfem, and singfem (singmale is the base group). With $n = 526$ and three independent variables in this regression, the df are 3 and 522.
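The test described in Question 8.2 (regress the squared OLS residuals on a set of explanatory variables and test their joint significance with an $F$ statistic) is straightforward to code. A minimal sketch, with simulated data standing in for the wage example; the function name and data are ours, not from the text:

```python
import numpy as np
from scipy import stats

def het_f_test(u_hat, Z):
    """Regress squared residuals on the columns of Z (plus an intercept)
    and return the F statistic and p-value for their joint significance."""
    n = len(u_hat)
    y = u_hat**2
    Za = np.column_stack([np.ones(n), Z])
    b = np.linalg.lstsq(Za, y, rcond=None)[0]
    ssr_ur = np.sum((y - Za @ b)**2)
    ssr_r = np.sum((y - y.mean())**2)       # restricted model: intercept only
    q, df = Z.shape[1], n - Za.shape[1]
    F = ((ssr_r - ssr_ur) / q) / (ssr_ur / df)
    return F, stats.f.sf(F, q, df)

# Illustration: three dummies (standing in for marrmale, marrfem, singfem), n = 526.
rng = np.random.default_rng(9)
Z = rng.integers(0, 2, size=(526, 3)).astype(float)
u_hat = rng.standard_normal(526) * (1.0 + 0.5 * Z[:, 0])  # variance shifts with a dummy
print(het_f_test(u_hat, Z))
```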
Question 8.3: Certainly the outcome of the statistical test suggests some cause for concern. A $t$ statistic of 2.96 is very significant, and it implies that there is heteroskedasticity in the wealth equation. As a practical matter, we know that the WLS standard error, .063, is substantially below the heteroskedasticity-robust standard error for OLS, .104, and so the heteroskedasticity seems to be practically important. (Plus, the nonrobust OLS standard error is .061, which is too optimistic.) Therefore, even if we simply adjust the OLS standard error for heteroskedasticity of unknown form, there are nontrivial implications.

Question 8.4: The 1% critical value in the $F$ distribution with $(2, \infty)$ df is 4.61. An $F$ statistic of 11.15 is well above the 1% critical value, and so we strongly reject the null hypothesis that the transformed errors, $u_i/\sqrt{h_i}$, are homoskedastic. (In fact, the p-value is less than .00002, which is obtained from the $F_{2,804}$ distribution.) This means that our model for $\text{Var}(u|x)$ is inadequate for fully eliminating the heteroskedasticity in $u$.

Chapter 9

Question 9.1: These are binary variables, and squaring them has no effect: $black^2 = black$, and $hispan^2 = hispan$.

Question 9.2: When educ·IQ is in the equation, the coefficient on educ, say, $\beta_1$, measures the effect of educ on log(wage) when IQ = 0. (The partial effect of education is $\beta_1 + \beta_9IQ$.) There is no one in the population of interest with an IQ close to zero. At the average population IQ, which is 100, the estimated return to education from column (3) is $.018 + .00034(100) = .052$, which is almost what we obtain as the coefficient on educ in column (2).

Question 9.3: No. If $educ^*$ is an integer, which means someone has no education past the previous grade completed, the measurement error is zero. If $educ^*$ is not an integer, $educ < educ^*$, so the measurement error is negative. At a minimum, $e_1$ cannot have zero mean, and $e_1$ and $educ^*$ are probably correlated.

Question 9.4: An incumbent's decision not to run may be systematically related to how he or she expects to do in the election. Therefore, we may only have a sample of incumbents who are stronger, on average, than all possible incumbents who could run. This results in a sample selection problem if the population of interest includes all incumbents. If we are only interested in the effects of campaign expenditures on election outcomes for incumbents who seek reelection, there is no sample selection problem.

Chapter 10

Question 10.1: The impact propensity is .48, while the long-run propensity is .48 − .15 + .32 = .65.
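The arithmetic in Question 10.1 generalizes to any finite distributed lag model: the impact propensity is the coefficient on the contemporaneous value of the variable, and the long-run propensity is the sum of all the lag coefficients. A two-line sketch (Python; the coefficients are the ones quoted in the answer):

```python
# Finite distributed lag coefficients (delta_0, delta_1, delta_2) from Question 10.1.
deltas = [0.48, -0.15, 0.32]
impact_propensity = deltas[0]        # 0.48
long_run_propensity = sum(deltas)    # 0.48 - 0.15 + 0.32 = 0.65
print(impact_propensity, long_run_propensity)
```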
seriesthen zt and zt21 can be highly correlated For example the correlation between unemt and unemt21 in PHILLIPS is 75 Question 104 No because a linear time trend with a1 0 becomes more and more negative as t gets large Since gfr cannot be negative a linear time trend with a negative trend coefficient cannot represent gfr in all future time periods Question 105 The intercept for March is b0 1 d2 Seasonal dummy variables are strictly exog enous because they follow a deterministic pattern For example the months do not change based upon whether either the explanatory variables or the dependent variables change Chapter 11 Question 111 i No because E1yt2 5 d0 1 d1t depends on t ii Yes because yt 2 E1yt2 5 et is an iid sequence Question 112 We plug inf e t 5 1122inft21 1 1122inft22 into inft 2 inf e t 5 b11unemt 2 m02 1 et and rearrange inft 2 1122 1inft21 1 inft222 5 b0 1 b1unemt 1 et where b0 5 2b1m0 as before Therefore we would regress yt on unemt where yt 5 inft 2 1122 1inft21 1 inft222 Note that we lose the first two observations in constructing yt Question 113 No because ut and ut21 are correlated In particular Cov1utut212 5 E3 1et 1 a1et212 1et21 1 a1et222 4 5 a1E1e2 t212 5 a1s2 e 2 0 if a1 2 0 If the errors are serially correlated the model cannot be dynamically complete Chapter 12 Question 121 We use equation 124 Now only adjacent terms are correlated In particular the covariance between xtut and xt11ut11 is xt xt11Cov1utut112 5 xt xt11as2 e Therefore the formula is Var1b 12 5 SST22 x a a n t51 x2 tVar1ut2 1 2 a n21 t51 xt xt11E1utut112 b 5 s2SSTx 1 12SST2 x2 a n21 t51 as2 e xt xt11 5 s2SSTx 1 as2 e12SST2 x2 a n21 t51 xt xt11 where s2 5 Var1ut2 5 s2 e 1 a2 1s2 e 5 s2 e11 1 a2 12 Unless xt and xt11 are uncorrelated in the sample the second term is nonzero whenever a1 2 0 Notice that if xt and xt11 are positively correlated and a 0 the true variance is actually smaller than the usual variance When the equation is in levels as opposed to being differenced the typical case is a 0 with positive correlation between xt and xt11 Question 122 r 6 196se1r 2 where se1r 2 is the standard error reported in the regression Or we could use the heteroskedasticityrobust standard error Showing that this is asymptotically valid is complicated because the OLS residuals depend on b j but it can be done Copyright 2016 Cengage Learning All Rights Reserved May not be copied scanned or duplicated in whole or in part Due to electronic rights some third party content may be suppressed from the eBook andor eChapters Editorial review has deemed that any suppressed content does not materially affect the overall learning experience Cengage Learning reserves the right to remove additional content at any time if subsequent rights restrictions require it Appendix F Answers to Chapter Questions 739 Question 123 The model we have in mind is ut 5 r1ut21 1 r4ut24 1 et and we want to test H0 r1 5 0 r4 5 0 against the alternative that H0 is false We would run the regression of u t on u t21 and u t24 to obtain the usual F statistic for joint significance of the two lags We are testing two restrictions Question 124 We would probably estimate the equation using first differences as r 5 92 is close enough to 1 to raise questions about the levels regression See Chapter 18 for more discussion Question 125 Because there is only one explanatory variable the White test is easy to com pute Simply regress u 2 t on returnt21 and return2 t21 with an intercept as always and compute the F test for joint significance of returnt21 and return2 
Chapter 13

Question 13.1: Yes, assuming that we have controlled for all relevant factors. The coefficient on black is 1.076, and, with a standard error of .174, it is not statistically different from 1. The 95% confidence interval is from about .735 to 1.417.

Question 13.2: The coefficient on highearn shows that, in the absence of any change in the earnings cap, high earners spend much more time, on the order of 29.2% on average [because exp(.256) − 1 ≈ .292], on workers' compensation.

Question 13.3: First, $E(v_{i1}) = E(a_i + u_{i1}) = E(a_i) + E(u_{i1}) = 0$. Similarly, $E(v_{i2}) = 0$. Therefore, the covariance between $v_{i1}$ and $v_{i2}$ is simply

$$E(v_{i1}v_{i2}) = E[(a_i + u_{i1})(a_i + u_{i2})] = E(a_i^2) + E(a_iu_{i1}) + E(a_iu_{i2}) + E(u_{i1}u_{i2}) = E(a_i^2),$$

because all of the covariance terms are zero by assumption. But $E(a_i^2) = \text{Var}(a_i)$, because $E(a_i) = 0$. This causes positive serial correlation across time in the errors within each $i$, which biases the usual OLS standard errors in a pooled OLS regression.

Question 13.4: Because $\Delta admn = admn90 - admn85$ is the difference in binary indicators, it can be $-1$ if, and only if, $admn90 = 0$ and $admn85 = 1$. In other words, Washington state had an administrative per se law in 1985, but it was repealed by 1990.

Question 13.5: No, just as it does not cause bias and inconsistency in a time series regression with strictly exogenous explanatory variables. There are two reasons it is a concern. First, serial correlation in the errors in any equation generally biases the usual OLS standard errors and test statistics. Second, it means that pooled OLS is not as efficient as estimators that account for the serial correlation (as in Chapter 12).

Chapter 14

Question 14.1: Whether we use first differencing or the within transformation, we will have trouble estimating the coefficient on $kids_{it}$. For example, using the within transformation, if $kids_{it}$ does not vary for family $i$, then $\ddot{kids}_{it} = kids_{it} - \overline{kids}_i = 0$ for $t = 1, 2, 3$. As long as some families have variation in $kids_{it}$, then we can compute the fixed effects estimator, but the kids coefficient could be very imprecisely estimated. This is a form of multicollinearity in fixed effects estimation (or first differencing estimation).
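The point of Question 14.1 is easiest to see by applying the time-demeaning directly. A minimal sketch (Python with NumPy; the two hypothetical kids series are ours):

```python
import numpy as np

# Within (time-demeaning) transformation for one family observed T = 3 times:
# a regressor with no within-family variation is wiped out entirely, which is
# why a time-constant kids_it contributes nothing to fixed effects estimation.
kids_constant = np.array([2.0, 2.0, 2.0])
kids_varying = np.array([1.0, 2.0, 3.0])
print(kids_constant - kids_constant.mean())   # [0. 0. 0.]
print(kids_varying - kids_varying.mean())     # [-1. 0. 1.]
```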
omitted variables analysis see Chapter 3 OLS has an upward bias when the explanatory variable union is positively correlated with the omitted variable 1ai2 Thus belonging to a union appears to be positively related to timeconstant unobserved factors that affect wage Question 144 Not if all sisters within a family have the same mother and father Then because the parents race variables would not change by sister they would be differenced away in equation 1413 Chapter 15 Question 151 Probably not In the simple equation 1518 years of education is part of the error term If some men who were assigned low draft lottery numbers obtained additional schooling then lottery number and education are negatively correlated which violates the first requirement for an instrumental variable in equation 154 Question 152 i For equation 1527 we require that high school peer group effects carry over to college Namely for a given SAT score a student who went to a high school where smoking marijuana was more popular would smoke more marijuana in college Even if the identification con dition equation 1527 holds the link might be weak ii We have to assume that percentage of students using marijuana at a students high school is not correlated with unobserved factors that affect college grade point average Although we are some what controlling for high school quality by including SAT in the equation this might not be enough Perhaps high schools that did a better job of preparing students for college also had fewer students smoking marijuana Or marijuana usage could be correlated with average income levels These are of course empirical questions that we may or may not be able to answer Question 153 Although prevalence of the NRA and subscribers to gun magazines are probably correlated with the presence of gun control legislation it is not obvious that they are uncorrelated with unobserved factors that affect the violent crime rate In fact we might argue that a population interested in guns is a reflection of high crime rates and controlling for economic and demographic variables is not sufficient to capture this It would be hard to argue persuasively that these are truly exogenous in the violent crime equation Question 154 As usual there are two requirements First it should be the case that growth in government spending is systematically related to the party of the president after netting out the investment rate and growth in the labor force In other words the instrument must be partially cor related with the endogenous explanatory variable While we might think that government spend ing grows more slowly under Republican presidents this certainly has not always been true in the United States and would have to be tested using the t statistic on REPt21 in the reduced form Copyright 2016 Cengage Learning All Rights Reserved May not be copied scanned or duplicated in whole or in part Due to electronic rights some third party content may be suppressed from the eBook andor eChapters Editorial review has deemed that any suppressed content does not materially affect the overall learning experience Cengage Learning reserves the right to remove additional content at any time if subsequent rights restrictions require it Appendix F Answers to Chapter Questions 741 gGovt 5 p0 1 p1REPt21 1 p2INVRATt 1 p3gLABt 1 vt We must assume that the party of the president has no separate effect on gGDP This would be violated if for example monetary policy differs systematically by presidential party and has a separate effect on GDP growth 
Chapter 16

Question 16.1: Probably not. It is because firms choose price and advertising expenditures jointly that we are not interested in the experiment where, say, advertising changes exogenously and we want to know the effect on price. Instead, we would model price and advertising each as a function of demand and cost variables. This is what falls out of the economic theory.

Question 16.2: We must assume two things. First, money supply growth should appear in equation (16.22), so that it is partially correlated with inf. Second, we must assume that money supply growth does not appear in equation (16.23). If we think we must include money supply growth in equation (16.23), then we are still short an instrument for inf. Of course, the assumption that money supply growth is exogenous can also be questioned.

Question 16.3: Use the Hausman test from Chapter 15. In particular, let $\hat{v}_2$ be the OLS residuals from the reduced form regression of open on log(pcinc) and log(land). Then, use an OLS regression of inf on open, log(pcinc), and $\hat{v}_2$, and compute the t statistic for significance of $\hat{v}_2$. If $\hat{v}_2$ is significant, the 2SLS and OLS estimates are statistically different.

Question 16.4: The demand equation looks like

$\log(fish_t) = \beta_0 + \beta_1 \log(prcfish_t) + \beta_2 \log(inc_t) + \beta_3 \log(prcchick_t) + \beta_4 \log(prcbeef_t) + u_{t1},$

where logarithms are used so that all elasticities are constant. By assumption, the demand function contains no seasonality, so the equation does not contain monthly dummy variables (say, $feb_t, mar_t, \dots, dec_t$, with January as the base month). Also by assumption, the supply of fish is seasonal, which means that the supply function does depend on at least some of the monthly dummy variables. Even without solving the reduced form for log(prcfish), we conclude that it depends on the monthly dummy variables. Since these are exogenous, they can be used as instruments for log(prcfish) in the demand equation. Therefore, we can estimate the demand-for-fish equation using the monthly dummies as IVs for log(prcfish). Identification requires that at least one monthly dummy variable appears with a nonzero coefficient in the reduced form for log(prcfish).

Chapter 17

Question 17.1: $H_0\colon \beta_4 = \beta_5 = \beta_6 = 0$, so that there are three restrictions and therefore three df in the LR or Wald test.

Question 17.2: We need the partial derivative of $\Phi(\hat\beta_0 + \hat\beta_1 nwifeinc + \hat\beta_2 educ + \hat\beta_3 exper + \hat\beta_4 exper^2 + \dots)$ with respect to exper, which is $\phi(\cdot)(\hat\beta_3 + 2\hat\beta_4\, exper)$, where $\phi(\cdot)$ is evaluated at the given values and the initial level of experience. Therefore, we need to evaluate the standard normal probability density at $.270 - .012(20.13) + .131(12.3) + .123(10) - .0019(10^2) - .053(42.5) - .868(0) + .036(1) \approx .463$, where we plug in the initial level of experience (10). But $\phi(.463) = (2\pi)^{-1/2}\exp[-(.463)^2/2] \approx .358$. Next, we multiply this by $\hat\beta_3 + 2\hat\beta_4\, exper$, evaluated at $exper = 10$. The partial effect using the calculus approximation is $.358[.123 - 2(.0019)(10)] \approx .030$. In other words, at the given values of the explanatory variables and starting at $exper = 10$, the next year of experience increases the probability of labor force participation by about .03.
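The arithmetic in Question 17.2 can be reproduced directly. The short script below is my addition, using scipy's standard normal density in place of the hand calculation; all coefficient and covariate values are copied from the answer above.

```python
# Reproduces Question 17.2: the probit partial effect of experience is
# phi(index) * (b3 + 2*b4*exper), evaluated at exper = 10.
from scipy.stats import norm

index = (0.270 - 0.012 * 20.13 + 0.131 * 12.3 + 0.123 * 10
         - 0.0019 * 10**2 - 0.053 * 42.5 - 0.868 * 0 + 0.036 * 1)
print(round(index, 3))              # about 0.463

density = norm.pdf(index)           # phi(.463), about .358
slope = 0.123 - 2 * 0.0019 * 10     # b3 + 2*b4*exper at exper = 10
print(round(density * slope, 3))    # about 0.030
```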
Question 17.3: No. The number of extramarital affairs is a nonnegative integer, which presumably takes on zero or small numbers for a substantial fraction of the population. It is not realistic to use a Tobit model, which, while allowing a pileup at zero, treats y as being continuously distributed over positive values. Formally, assuming that $y = \max(0, y^*)$, where $y^*$ is normally distributed, is at odds with the discreteness of the number of extramarital affairs when $y > 0$.

Question 17.4: The adjusted standard errors are the usual Poisson MLE standard errors multiplied by $\hat\sigma = \sqrt{2} \approx 1.41$, so the adjusted standard errors will be about 41% higher. The quasi-LR statistic is the usual LR statistic divided by $\hat\sigma^2 = 2$, so it will be one-half of the usual LR statistic.

Question 17.5: By assumption, $mvp_i = \beta_0 + \mathbf{x}_i\boldsymbol\beta + u_i$, where, as usual, $\mathbf{x}_i\boldsymbol\beta$ denotes a linear function of the exogenous variables. Now, observed wage is the largest of the minimum wage and the marginal value product, so $wage_i = \max(minwage_i, mvp_i)$, which is very similar to equation (17.34), except that the max operator has replaced the min operator.

Chapter 18

Question 18.1: We can plug these values directly into equation (18.1) and take expectations. First, because $z_s = 0$ for all $s < 0$, $y_{-1} = \alpha + u_{-1}$. Then, $z_0 = 1$, so $y_0 = \alpha + \delta_0 + u_0$. For $h \ge 1$, $y_h = \alpha + \delta_{h-1} + \delta_h + u_h$. Because the errors have zero expected values, $E(y_{-1}) = \alpha$, $E(y_0) = \alpha + \delta_0$, and $E(y_h) = \alpha + \delta_{h-1} + \delta_h$ for all $h \ge 1$. As $h \to \infty$, $\delta_h \to 0$. It follows that $E(y_h) \to \alpha$ as $h \to \infty$; that is, the expected value of $y_h$ returns to the expected value before the increase in z at time zero. This makes sense: although the increase in z lasted for two periods, it is still a temporary increase.

Question 18.2: Under the described setup, $\Delta y_t$ and $\Delta x_t$ are i.i.d. sequences that are independent of one another. In particular, $\Delta y_t$ and $\Delta x_t$ are uncorrelated. If $\hat\gamma_1$ is the slope coefficient from regressing $\Delta y_t$ on $\Delta x_t$, $t = 1, 2, \dots, n$, then plim $\hat\gamma_1 = 0$. This is as it should be, as we are regressing one I(0) process on another I(0) process, and they are uncorrelated. We write the equation $\Delta y_t = \gamma_0 + \gamma_1 \Delta x_t + e_t$, where $\gamma_0 = \gamma_1 = 0$. Because $\{e_t\}$ is independent of $\{\Delta x_t\}$, the strict exogeneity assumption holds. Moreover, $\{e_t\}$ is serially uncorrelated and homoskedastic. By Theorem 11.2 in Chapter 11, the t statistic for $\hat\gamma_1$ has an approximate standard normal distribution. If $e_t$ is normally distributed, the classical linear model assumptions hold and the t statistic has an exact t distribution.

Question 18.3: Write $x_t = x_{t-1} + a_t$, where $\{a_t\}$ is I(0). By assumption, there is a linear combination, say $s_t = y_t - \beta x_t$, which is I(0). Now, $y_t - \beta x_{t-1} = y_t - \beta(x_t - a_t) = s_t + \beta a_t$. Because $s_t$ and $a_t$ are I(0) by assumption, so is $s_t + \beta a_t$.

Question 18.4: Just use the sum of squared residuals form of the F test (and assume homoskedasticity). The restricted SSR is obtained by regressing $\Delta hy6_t - \Delta hy3_{t-1} + (hy6_{t-1} - hy3_{t-2})$ on a constant. Notice that $\alpha_0$ is the only parameter to estimate in $\Delta hy6_t = \alpha_0 + \gamma_0 \Delta hy3_{t-1} + \delta(hy6_{t-1} - hy3_{t-2})$ when the restrictions are imposed. The unrestricted sum of squared residuals is obtained from equation (18.39).

Question 18.5: We are fitting two equations: $\hat y_t = \hat\alpha + \hat\beta t$ and $\hat y_t = \hat\gamma + \hat\delta\, year_t$. We can obtain the relationship between the parameters by noting that $year_t = t + 49$. Plugging this into the second equation gives $\hat y_t = \hat\gamma + \hat\delta(t + 49) = (\hat\gamma + 49\hat\delta) + \hat\delta t$. Matching the slope and intercept with the first equation gives $\hat\delta = \hat\beta$, so that the slopes on t and $year_t$ are identical, and $\hat\alpha = \hat\gamma + 49\hat\delta$. Generally, when we use year rather than t, the intercept will change, but the slope will not. (You can verify this by using one of the time series data sets, such as HSEINV or INVEN.) Whether we use t or some measure of year does not change fitted values, and, naturally, it does not change forecasts of future values. The intercept simply adjusts appropriately to different ways of including a trend in the regression.
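The parameter relationships in Question 18.5 are easy to check numerically. The sketch below, my addition, uses a simulated trending series rather than the HSEINV or INVEN data sets; only the algebraic identities (slope unchanged, intercept shifted by 49 times the slope) come from the answer.

```python
# Demo of Question 18.5: regressing y on t versus year = t + 49 changes only
# the intercept. The series is simulated for illustration.
import numpy as np

rng = np.random.default_rng(2)
t = np.arange(1, 51, dtype=float)
year = t + 49
y = 10 + 0.3 * t + rng.normal(size=t.size)

beta, alpha = np.polyfit(t, y, 1)        # y on t: slope, intercept
delta, gamma = np.polyfit(year, y, 1)    # y on year: slope, intercept

print(np.isclose(beta, delta))                 # slopes identical
print(np.isclose(alpha, gamma + 49 * delta))   # intercepts differ by 49*delta
```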
Appendix G  Statistical Tables

Table G.1  Cumulative Areas under the Standard Normal Distribution
[The printed table gives the cumulative standard normal probability for z from −3.0 to 3.0: each row fixes z to one decimal place, and the ten columns give the second decimal digit (0 through 9). The full grid of values does not reproduce legibly here.]
Examples: If Z ~ Normal(0, 1), then P(Z ≤ −1.32) = .0934 and P(Z ≤ 1.84) = .9671.
Source: This table was generated using the Stata function normal.

Table G.2  Critical Values of the t Distribution
[Critical values for one-tailed significance levels .10, .05, .025, .01, and .005 (equivalently, two-tailed levels .20, .10, .05, .02, and .01), for 1 through 30, 40, 60, 90, 120, and infinite degrees of freedom.]
Examples: The 1% critical value for a one-tailed test with 25 df is 2.485. The 5% critical value for a two-tailed test with large (> 120) df is 1.96.
Source: This table was generated using the Stata function invttail.

Table G.3a  10% Critical Values of the F Distribution
[Critical values for numerator degrees of freedom 1 through 10 and denominator degrees of freedom 10 through 30, 40, 60, 90, 120, and infinity.]
Example: The 10% critical value for numerator df = 2 and denominator df = 40 is 2.44.
Source: This table was generated using the Stata function invFtail.

Table G.3b  5% Critical Values of the F Distribution
[Same layout as Table G.3a.]
Example: The 5% critical value for numerator df = 4 and large (infinite) denominator df is 2.37.
Source: This table was generated using the Stata function invFtail.

Table G.3c  1% Critical Values of the F Distribution
[Same layout as Table G.3a.]
Example: The 1% critical value for numerator df = 3 and denominator df = 60 is 4.13.
Source: This table was generated using the Stata function invFtail.

Table G.4  Critical Values of the Chi-Square Distribution
[Critical values for significance levels .10, .05, and .01, for 1 through 30 degrees of freedom.]
Example: The 5% critical value with df = 8 is 15.51.
Source: This table was generated using the Stata function invchi2tail.
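The printed tables were generated with Stata's normal, invttail, invFtail, and invchi2tail functions. The same values can be obtained from any statistical library; the snippet below is a scipy-based equivalent of my own that reproduces the tables' worked examples.

```python
# Reproduces the worked examples from Tables G.1-G.4 with scipy instead of
# the printed tables.
from scipy.stats import norm, t, f, chi2

print(norm.cdf(-1.32))           # Table G.1: about .0934
print(norm.cdf(1.84))            # Table G.1: about .9671
print(t.ppf(1 - .01, 25))        # Table G.2: 1% one-tailed, 25 df, about 2.485
print(t.ppf(1 - .025, 10_000))   # Table G.2: 5% two-tailed, large df, about 1.96
print(f.ppf(1 - .10, 2, 40))     # Table G.3a: about 2.44
print(f.ppf(1 - .01, 3, 60))     # Table G.3c: about 4.13
print(chi2.ppf(1 - .05, 8))      # Table G.4: about 15.51
```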
Glossary

A

Adjusted R-Squared: A goodness-of-fit measure in multiple regression analysis that penalizes additional explanatory variables by using a degrees of freedom adjustment in estimating the error variance.
Alternative Hypothesis: The hypothesis against which the null hypothesis is tested.
AR(1) Serial Correlation: The errors in a time series regression model follow an AR(1) model.
Asymptotic Bias: See inconsistency.
Asymptotic Confidence Interval: A confidence interval that is approximately valid in large sample sizes.
Asymptotic Normality: The sampling distribution of a properly normalized estimator converges to the standard normal distribution.
Asymptotic Properties: Properties of estimators and test statistics that apply when the sample size grows without bound.
Asymptotic Standard Error: A standard error that is valid in large samples.
Asymptotic t Statistic: A t statistic that has an approximate standard normal distribution in large samples.
Asymptotic Variance: The square of the value by which we must divide an estimator in order to obtain an asymptotic standard normal distribution.
Asymptotically Efficient: For consistent estimators with asymptotically normal distributions, the estimator with the smallest asymptotic variance.
Asymptotically Uncorrelated: A time series process in which the correlation between random variables at two points in time tends to zero as the time interval between them increases. See also weakly dependent.
Attenuation Bias: Bias in an estimator that is always toward zero; thus, the expected value of an estimator with attenuation bias is less in magnitude than the absolute value of the parameter.
Augmented Dickey-Fuller Test: A test for a unit root that includes lagged changes of the variable as regressors.
Autocorrelation: See serial correlation.
Autoregressive Conditional Heteroskedasticity (ARCH): A model of dynamic heteroskedasticity where the variance of the error term, given past information, depends linearly on the past squared errors.
Autoregressive Process of Order One [AR(1)]: A time series model whose current value depends linearly on its most recent value plus an unpredictable disturbance.
Auxiliary Regression: A regression used to compute a test statistic, such as the test statistics for heteroskedasticity and serial correlation, or any other regression that does not estimate the model of primary interest.
Average: The sum of n numbers divided by n.
Average Marginal Effect: See average partial effect.
Average Partial Effect: For nonconstant partial effects, the partial effect averaged across the specified population.
Average Treatment Effect: A treatment, or policy, effect averaged across the population.

B

Balanced Panel: A panel data set where all years (or periods) of data are available for all cross-sectional units.
Base Group: The group represented by the overall intercept in a multiple regression model that includes dummy explanatory variables.
Base Period: For index numbers, such as price or production indices, the period against which all other time periods are measured.
Base Value: The value assigned to the base period for constructing an index number; usually the base value is 1 or 100.
Benchmark Group: See base group.
Bernoulli (or Binary) Random Variable: A random variable that takes on the values zero or one.
Best Linear Unbiased Estimator (BLUE): Among all linear unbiased estimators, the one with the smallest variance. OLS is BLUE, conditional on the sample values of the explanatory variables, under the Gauss-Markov assumptions.
Beta Coefficients: See standardized coefficients.
deviation has a distribution that tends to standard normal as the sample size grows Ceteris Paribus All other relevant factors are held fixed ChiSquare Distribution A probability distribution obtained by adding the squares of independent standard normal ran dom variables The number of terms in the sum equals the degrees of freedom in the distribution ChiSquare Random Variable A random variable with a chisquare distribution Chow Statistic An F statistic for testing the equality of re gression parameters across different groups say men and women or time periods say before and after a policy change Classical ErrorsinVariables CEV A measurement error model where the observed measure equals the actual vari able plus an independent or at least an uncorrelated mea surement error Classical Linear Model The multiple linear regres sion model under the full set of classical linear model assumptions Classical Linear Model CLM Assumptions The ideal set of assumptions for multiple regression analysis for cross sectional analysis Assumptions MLR1 through MLR6 and for time series analysis Assumptions TS1 through TS6 The assumptions include linearity in the parameters no perfect collinearity the zero conditional mean assump tion homoskedasticity no serial correlation and normality of the errors Cluster Effect An unobserved effect that is common to all units usually people in the cluster Cluster Sample A sample of natural clusters or groups that usually consist of people Clustering The act of computing standard errors and test statistics that are robust to cluster correlation either due to cluster sampling or to time series correlation in panel data CochraneOrcutt CO Estimation A method of estimat ing a multiple linear regression model with AR1 errors and strictly exogenous explanatory variables unlike Prais Winsten CochraneOrcutt does not use the equation for the first time period Coefficient of Determination See Rsquared Cointegration The notion that a linear combination of two series each of which is integrated of order one is inte grated of order zero Column Vector A vector of numbers arranged as a column Composite Error Term In a panel data model the sum of the timeconstant unobserved effect and the idiosyncratic error Conditional Distribution The probability distribution of one random variable given the values of one or more other random variables Conditional Expectation The expected or average value of one random variable called the dependent or explained variable that depends on the values of one or more other variables called the independent or explanatory variables Conditional Forecast A forecast that assumes the future val ues of some explanatory variables are known with certainty Conditional Median The median of a response variable con ditional on some explanatory variables Conditional Variance The variance of one random variable given one or more other random variables Copyright 2016 Cengage Learning All Rights Reserved May not be copied scanned or duplicated in whole or in part Due to electronic rights some third party content may be suppressed from the eBook andor eChapters Editorial review has deemed that any suppressed content does not materially affect the overall learning experience Cengage Learning reserves the right to remove additional content at any time if subsequent rights restrictions require it Glossary 758 Confidence Interval CI A rule used to construct a random interval so that a certain percentage of all data sets deter mined by the confidence level yields an interval that con 
tains the population value Confidence Level The percentage of samples in which we want our confidence interval to contain the population value 95 is the most common confidence level but 90 and 99 are also used Consistency An estimator converges in probability to the correct population value as the sample size grows Consistent Estimator An estimator that converges in proba bility to the population parameter as the sample size grows without bound Consistent Test A test where under the alternative hypoth esis the probability of rejecting the null hypothesis con verges to one as the sample size grows without bound Constant Elasticity Model A model where the elasticity of the dependent variable with respect to an explanatory vari able is constant in multiple regression both variables ap pear in logarithmic form Contemporaneously Homoskedastic Describes a time se ries or panel data applications in which the variance of the error term conditional on the regressors in the same time period is constant Contemporaneously Exogenous Describes a time series or panel data application in which a regressor is contempora neously exogenous if it is uncorrelated with the error term in the same time period although it may be correlated with the errors in other time periods Continuous Random Variable A random variable that takes on any particular value with probability zero Control Group In program evaluation the group that does not participate in the program Control Variable See explanatory variable Corner Solution Response A nonnegative dependent vari able that is roughly continuous over strictly positive values but takes on the value zero with some regularity Correlated Random Effects An approach to panel data analysis where the correlation between the unobserved ef fect and the explanatory variables is modeled usually as a linear relationship Correlation Coefficient A measure of linear dependence be tween two random variables that does not depend on units of measurement and is bounded between 1 and 1 Count Variable A variable that takes on nonnegative integer values Covariance A measure of linear dependence between two random variables Covariance Stationary A time series process with constant mean and variance where the covariance between any two random variables in the sequence depends only on the dis tance between them Covariate See explanatory variable Critical Value In hypothesis testing the value against which a test statistic is compared to determine whether or not the null hypothesis is rejected CrossSectional Data Set A data set collected by sampling a population at a given point in time Cumulative Distribution Function cdf A function that gives the probability of a random variable being less than or equal to any specified real number Cumulative Effect At any point in time the change in a re sponse variable after a permanent increase in an explana tory variableusually in the context of distributed lag models D Data Censoring A situation that arises when we do not al ways observe the outcome on the dependent variable be cause at an upper or lower threshold we only know that the outcome was above or below the threshold See also censored regression model Data Frequency The interval at which time series data are collected Yearly quarterly and monthly are the most com mon data frequencies Data Mining The practice of using the same data set to estimate numerous models in a search to find the best model DavidsonMacKinnon Test A test that is used for testing a model against a nonnested alternative it can be 
Degrees of Freedom (df): In multiple regression analysis, the number of observations minus the number of estimated parameters.
Denominator Degrees of Freedom: In an F test, the degrees of freedom in the unrestricted model.
Dependent Variable: The variable to be explained in a multiple regression model (and a variety of other models).
Derivative: The slope of a smooth function, as defined using calculus.
Descriptive Statistic: A statistic used to summarize a set of numbers; the sample average, sample median, and sample standard deviation are the most common.
Deseasonalizing: The removing of the seasonal components from a monthly or quarterly time series.
Detrending: The practice of removing the trend from a time series.
Diagonal Matrix: A matrix with zeros for all off-diagonal entries.
Dickey-Fuller Distribution: The limiting distribution of the t statistic in testing the null hypothesis of a unit root.
Dickey-Fuller (DF) Test: A t test of the unit root null hypothesis in an AR(1) model. See also augmented Dickey-Fuller test.
Difference in Slopes: A description of a model where some slope parameters may differ by group or time period.
Difference-in-Differences Estimator: An estimator that arises in policy analysis with data for two time periods. One version of the estimator applies to independently pooled cross sections, and another to panel data sets.
Difference-Stationary Process: A time series sequence that is I(0) in its first differences.
Diminishing Marginal Effect: The marginal effect of an explanatory variable becomes smaller as the value of the explanatory variable increases.
Discrete Random Variable: A random variable that takes on at most a finite or countably infinite number of values.
Distributed Lag Model: A time series model that relates the dependent variable to current and past values of an explanatory variable.
Disturbance: See error term.
Downward Bias: The expected value of an estimator is below the population value of the parameter.
Dummy Dependent Variable: See binary response model.
Dummy Variable: A variable that takes on the value zero or one.
Dummy Variable Regression: In a panel data setting, the regression that includes a dummy variable for each cross-sectional unit, along with the remaining explanatory variables. It produces the fixed effects estimator.
Dummy Variable Trap: The mistake of including too many dummy variables among the independent variables; it occurs when an overall intercept is in the model and a dummy variable is included for each group.
Duration Analysis: An application of the censored regression model where the dependent variable is time elapsed until a certain event occurs, such as the time before an unemployed person becomes reemployed.
Durbin-Watson (DW) Statistic: A statistic used to test for first order serial correlation in the errors of a time series regression model under the classical linear model assumptions.
Dynamically Complete Model: A time series model where no further lags of either the dependent variable or the explanatory variables help to explain the mean of the dependent variable.
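As a worked illustration of the difference-in-differences entry above (standard notation, not a quotation from the text): with a treatment group T, a control group C, and periods 1 (before) and 2 (after), the simplest version of the estimator compares average changes across groups:

$$\hat{\delta} = (\bar{y}_{T,2} - \bar{y}_{T,1}) - (\bar{y}_{C,2} - \bar{y}_{C,1}),$$

which nets out both the permanent group difference and the common time effect.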
E
Econometric Model: An equation relating the dependent variable to a set of explanatory variables and unobserved disturbances, where unknown population parameters determine the ceteris paribus effect of each explanatory variable.
Economic Model: A relationship derived from economic theory or less formal economic reasoning.
Economic Significance: See practical significance.
Elasticity: The percentage change in one variable given a 1% ceteris paribus increase in another variable.
Empirical Analysis: A study that uses data in a formal econometric analysis to test a theory, estimate a relationship, or determine the effectiveness of a policy.
Endogeneity: A term used to describe the presence of an endogenous explanatory variable.
Endogenous Explanatory Variable: An explanatory variable in a multiple regression model that is correlated with the error term, either because of an omitted variable, measurement error, or simultaneity.
Endogenous Sample Selection: Nonrandom sample selection where the selection is related to the dependent variable, either directly or through the error term in the equation.
Endogenous Variables: In simultaneous equations models, variables that are determined by the equations in the system.
Engle-Granger Test: A test of the null hypothesis that two time series are not cointegrated; the statistic is obtained as the Dickey-Fuller statistic using OLS residuals.
Engle-Granger Two-Step Procedure: A two-step method for estimating error correction models whereby the cointegrating parameter is estimated in the first stage, and the error correction parameters are estimated in the second.
Error Correction Model: A time series model in first differences that also contains an error correction term, which works to bring two I(1) series back into long-run equilibrium.
Error Term: The variable in a simple or multiple regression equation that contains unobserved factors which affect the dependent variable. The error term may also include measurement errors in the observed dependent or independent variables.
Error Variance: The variance of the error term in a multiple regression model.
Errors-in-Variables: A situation where either the dependent variable or some independent variables are measured with error.
Estimate: The numerical value taken on by an estimator for a particular sample of data.
Estimator: A rule for combining data to produce a numerical value for a population parameter; the form of the rule does not depend on the particular sample obtained.
Event Study: An econometric analysis of the effects of an event, such as a change in government regulation or economic policy, on an outcome variable.
Excluding a Relevant Variable: In multiple regression analysis, leaving out a variable that has a nonzero partial effect on the dependent variable.
Exclusion Restrictions: Restrictions which state that certain variables are excluded from the model (or have zero population coefficients).
Exogenous Explanatory Variable: An explanatory variable that is uncorrelated with the error term.
Exogenous Sample Selection: A sample selection that either depends on exogenous explanatory variables or is independent of the error term in the equation of interest.
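A short worked example tying the elasticity entry above to the constant elasticity model defined earlier: in the log-log form below, the slope is the elasticity, so a 1% increase in x changes y by about β₁%.

$$\log(y) = \beta_0 + \beta_1 \log(x) + u, \qquad \beta_1 = \frac{\partial \log(y)}{\partial \log(x)} \approx \frac{\%\Delta y}{\%\Delta x}.$$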
Exogenous Variable: Any variable that is uncorrelated with the error term in the model of interest.
Expected Value: A measure of central tendency in the distribution of a random variable, including an estimator.
Experiment: In probability, a general term used to denote an event whose outcome is uncertain. In econometric analysis, it denotes a situation where data are collected by randomly assigning individuals to control and treatment groups.
Experimental Data: Data that have been obtained by running a controlled experiment.
Experimental Group: See treatment group.
Explained Sum of Squares (SSE): The total sample variation of the fitted values in a multiple regression model.
Explained Variable: See dependent variable.
Explanatory Variable: In regression analysis, a variable that is used to explain variation in the dependent variable.
Exponential Function: A mathematical function defined for all values that has an increasing slope but a constant proportionate change.
Exponential Smoothing: A simple method of forecasting a variable that involves a weighting of all previous outcomes on that variable.
Exponential Trend: A trend with a constant growth rate.
F
F Distribution: The probability distribution obtained by forming the ratio of two independent chi-square random variables, where each has been divided by its degrees of freedom.
F Random Variable: A random variable with an F distribution.
F Statistic: A statistic used to test multiple hypotheses about the parameters in a multiple regression model.
Feasible GLS (FGLS) Estimator: A GLS procedure where variance or correlation parameters are unknown and therefore must first be estimated. See also generalized least squares estimator.
Finite Distributed Lag (FDL) Model: A dynamic model where one or more explanatory variables are allowed to have lagged effects on the dependent variable.
First Difference: A transformation on a time series constructed by taking the difference of adjacent time periods, where the earlier time period is subtracted from the later time period.
First-Differenced (FD) Equation: In time series or panel data models, an equation where the dependent and independent variables have all been first differenced.
First-Differenced (FD) Estimator: In a panel data setting, the pooled OLS estimator applied to first differences of the data across time.
First Order Autocorrelation: For a time series process ordered chronologically, the correlation coefficient between pairs of adjacent observations.
First Order Conditions: The set of linear equations used to solve for the OLS estimates.
Fitted Values: The estimated values of the dependent variable when the values of the independent variables for each observation are plugged into the OLS regression line.
Fixed Effect: See unobserved effect.
Fixed Effects Estimator: For the unobserved effects panel data model, the estimator obtained by applying pooled OLS to a time-demeaned equation.
Fixed Effects Model: An unobserved effects panel data model where the unobserved effects are allowed to be arbitrarily correlated with the explanatory variables in each time period.
Fixed Effects Transformation: For panel data, the time-demeaned data.
Forecast Error: The difference between the actual outcome and the forecast of the outcome.
Forecast Interval: In forecasting, a confidence interval for a yet unrealized future value of a time series variable. See also prediction interval.
Frisch-Waugh Theorem: The general algebraic result that provides multiple regression analysis with its "partialling out" interpretation.
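A minimal numerical check of the Frisch-Waugh (partialling out) result, using simulated data and numpy only; all variable names are illustrative.

import numpy as np

rng = np.random.default_rng(1)
n = 500
x2 = rng.normal(size=n)
x1 = 0.6 * x2 + rng.normal(size=n)           # x1 correlated with x2
y = 2.0 + 1.5 * x1 - 0.8 * x2 + rng.normal(size=n)

# Full multiple regression of y on a constant, x1, and x2.
X = np.column_stack([np.ones(n), x1, x2])
beta = np.linalg.lstsq(X, y, rcond=None)[0]

# Partial out the constant and x2 from x1, then regress y on the residuals.
Z = np.column_stack([np.ones(n), x2])
r1 = x1 - Z @ np.linalg.lstsq(Z, x1, rcond=None)[0]
b1_partial = (r1 @ y) / (r1 @ r1)

print(beta[1], b1_partial)  # the two estimates of the x1 coefficient coincide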
Functional Form Misspecification: A problem that occurs when a model has omitted functions of the explanatory variables (such as quadratics) or uses the wrong functions of either the dependent variable or some explanatory variables.
G
Gauss-Markov Assumptions: The set of assumptions (Assumptions MLR.1 through MLR.5 or TS.1 through TS.5) under which OLS is BLUE.
Gauss-Markov Theorem: The theorem that states that, under the five Gauss-Markov assumptions (for cross-sectional or time series models), the OLS estimator is BLUE (conditional on the sample values of the explanatory variables).
Generalized Least Squares (GLS) Estimator: An estimator that accounts for a known structure of the error variance (heteroskedasticity), serial correlation pattern in the errors, or both, via a transformation of the original model.
Geometric (or Koyck) Distributed Lag: An infinite distributed lag model where the lag coefficients decline at a geometric rate.
Goodness-of-Fit Measure: A statistic that summarizes how well a set of explanatory variables explains a dependent or response variable.
Granger Causality: A limited notion of causality where past values of one series, x_t, are useful for predicting future values of another series, y_t, after past values of y_t have been controlled for.
Growth Rate: The proportionate change in a time series from the previous period. It may be approximated as the difference in logs or reported in percentage form.
H
Heckit Method: An econometric procedure used to correct for sample selection bias due to incidental truncation or some other form of nonrandomly missing data.
Heterogeneity Bias: The bias in OLS due to omitted heterogeneity (or omitted variables).
Heteroskedasticity: The variance of the error term, given the explanatory variables, is not constant.
Heteroskedasticity of Unknown Form: Heteroskedasticity that may depend on the explanatory variables in an unknown, arbitrary fashion.
Heteroskedasticity-Robust F Statistic: An F-type statistic that is (asymptotically) robust to heteroskedasticity of unknown form.
Heteroskedasticity-Robust LM Statistic: An LM statistic that is robust to heteroskedasticity of unknown form.
Heteroskedasticity-Robust Standard Error: A standard error that is (asymptotically) robust to heteroskedasticity of unknown form.
Heteroskedasticity-Robust t Statistic: A t statistic that is (asymptotically) robust to heteroskedasticity of unknown form.
Highly Persistent: A time series process where outcomes in the distant future are highly correlated with current outcomes.
Homoskedasticity: The errors in a regression model have constant variance conditional on the explanatory variables.
Hypothesis Test: A statistical test of the null, or maintained, hypothesis against an alternative hypothesis.
I
Idempotent Matrix: A square matrix where multiplication of the matrix by itself equals itself.
Identification: A population parameter, or set of parameters, can be consistently estimated.
Identified Equation: An equation whose parameters can be consistently estimated, especially in models with endogenous explanatory variables.
Identity Matrix: A square matrix where all diagonal elements are one and all off-diagonal elements are zero.
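A minimal sketch of computing the heteroskedasticity-robust standard errors described above, on simulated data; the "HC1" option in statsmodels corresponds to a common degrees-of-freedom-adjusted form, and all names are illustrative.

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
n = 300
x = rng.uniform(0, 4, size=n)
u = rng.normal(size=n) * (0.5 + x)   # error variance depends on x: heteroskedasticity
y = 1.0 + 2.0 * x + u

X = sm.add_constant(x)
usual = sm.OLS(y, X).fit()                  # usual (nonrobust) standard errors
robust = sm.OLS(y, X).fit(cov_type="HC1")   # heteroskedasticity-robust standard errors
print(usual.bse)
print(robust.bse)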
Idiosyncratic Error: In panel data models, the error that changes over time as well as across units (say, individuals, firms, or cities).
Impact Elasticity: In a distributed lag model, the immediate percentage change in the dependent variable given a 1% increase in the independent variable.
Impact Multiplier: See impact propensity.
Impact Propensity: In a distributed lag model, the immediate change in the dependent variable given a one-unit increase in the independent variable.
Incidental Truncation: A sample selection problem whereby one variable, usually the dependent variable, is only observed for certain outcomes of another variable.
Inclusion of an Irrelevant Variable: The including of an explanatory variable in a regression model that has a zero population parameter in estimating an equation by OLS.
Inconsistency: The difference between the probability limit of an estimator and the parameter value.
Inconsistent: Describes an estimator that does not converge (in probability) to the correct population parameter as the sample size grows.
Independent Random Variables: Random variables whose joint distribution is the product of the marginal distributions.
Independent Variable: See explanatory variable.
Independently Pooled Cross Section: A data set obtained by pooling independent random samples from different points in time.
Index Number: A statistic that aggregates information on economic activity, such as production or prices.
Infinite Distributed Lag (IDL) Model: A distributed lag model where a change in the explanatory variable can have an impact on the dependent variable into the indefinite future.
Influential Observations: See outliers.
Information Set: In forecasting, the set of variables that we can observe prior to forming our forecast.
In-Sample Criteria: Criteria for choosing forecasting models that are based on goodness-of-fit within the sample used to obtain the parameter estimates.
Instrument: See instrumental variable.
Instrument Exogeneity: In instrumental variables estimation, the requirement that an instrumental variable is uncorrelated with the error term.
Instrument Relevance: In instrumental variables estimation, the requirement that an instrumental variable helps to partially explain variation in the endogenous explanatory variable.
Instrumental Variable (IV): In an equation with an endogenous explanatory variable, an IV is a variable that does not appear in the equation, is uncorrelated with the error in the equation, and is partially correlated with the endogenous explanatory variable.
Instrumental Variables (IV) Estimator: An estimator in a linear model used when instrumental variables are available for one or more endogenous explanatory variables.
Integrated of Order One [I(1)]: A time series process that needs to be first-differenced in order to produce an I(0) process.
Integrated of Order Zero [I(0)]: A stationary, weakly dependent time series process that, when used in regression analysis, satisfies the law of large numbers and the central limit theorem.
Interaction Effect: In multiple regression, the partial effect of one explanatory variable depends on the value of a different explanatory variable.
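For the simplest case of the instrumental variables entries above, with one endogenous regressor x and one instrument z, the IV estimator of the slope can be written (in standard notation, not tied to a particular equation in the text) as

$$\hat{\beta}_1^{\,IV} = \frac{\sum_{i=1}^{n}(z_i - \bar{z})(y_i - \bar{y})}{\sum_{i=1}^{n}(z_i - \bar{z})(x_i - \bar{x})},$$

the sample analog of Cov(z, y)/Cov(z, x); instrument relevance requires the population counterpart of the denominator to be nonzero.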
Interaction Term: An independent variable in a regression model that is the product of two explanatory variables.
Intercept: In the equation of a line, the value of the y variable when the x variable is zero.
Intercept Parameter: The parameter in a multiple linear regression model that gives the expected value of the dependent variable when all the independent variables equal zero.
Intercept Shift: The intercept in a regression model differs by group or time period.
Internet: A global computer network that can be used to access information and download databases.
Interval Estimator: A rule that uses data to obtain lower and upper bounds for a population parameter. See also confidence interval.
Inverse: For an n × n matrix, its inverse (if it exists) is the n × n matrix for which pre- and post-multiplication by the original matrix yields the identity matrix.
Inverse Mills Ratio: A term that can be added to a multiple regression model to remove sample selection bias.
J
Joint Distribution: The probability distribution determining the probabilities of outcomes involving two or more random variables.
Joint Hypotheses Test: A test involving more than one restriction on the parameters in a model.
Jointly Insignificant: Failure to reject, using an F test at a specified significance level, that all coefficients for a group of explanatory variables are zero.
Jointly Statistically Significant: The null hypothesis that two or more explanatory variables have zero population coefficients is rejected at the chosen significance level.
Just Identified Equation: For models with endogenous explanatory variables, an equation that is identified, but would not be identified with one fewer instrumental variable.
K
Kurtosis: A measure of the thickness of the tails of a distribution based on the fourth moment of the standardized random variable; the measure is usually compared to the value for the standard normal distribution, which is three.
L
Lag Distribution: In a finite or infinite distributed lag model, the lag coefficients graphed as a function of the lag length.
Lagged Dependent Variable: An explanatory variable that is equal to the dependent variable from an earlier time period.
Lagged Endogenous Variable: In a simultaneous equations model, a lagged value of one of the endogenous variables.
Lagrange Multiplier (LM) Statistic: A test statistic with large-sample justification that can be used to test for omitted variables, heteroskedasticity, and serial correlation, among other model specification problems.
Large Sample Properties: See asymptotic properties.
Latent Variable Model: A model where the observed dependent variable is assumed to be a function of an underlying latent, or unobserved, variable.
Law of Iterated Expectations: A result from probability that relates unconditional and conditional expectations.
Law of Large Numbers (LLN): A theorem that says that the average from a random sample converges in probability to the population average; the LLN also holds for stationary and weakly dependent time series.
Leads and Lags Estimator: An estimator of a cointegrating parameter in a regression with I(1) variables, where the current, some past, and some future first differences in the explanatory variable are included as regressors.
Least Absolute Deviations (LAD): A method for estimating the parameters of a multiple regression model based on minimizing the sum of the absolute values of the residuals.
Least Squares Estimator: An estimator that minimizes a sum of squared residuals.
Level-Level Model: A regression model where the dependent variable and the independent variables are in level (or original) form.
Level-Log Model: A regression model where the dependent variable is in level form and (at least some of) the independent variables are in logarithmic form.
Likelihood Ratio Statistic: A statistic that can be used to test single or multiple hypotheses when the constrained and unconstrained models have been estimated by maximum likelihood. The statistic is twice the difference in the unconstrained and constrained log-likelihoods.
Limited Dependent Variable (LDV): A dependent or response variable whose range is restricted in some important way.
Linear Function: A function where the change in the dependent variable, given a one-unit change in an independent variable, is constant.
Linear Probability Model (LPM): A binary response model where the response probability is linear in its parameters.
Linear Time Trend: A trend that is a linear function of time.
Linear Unbiased Estimator: In multiple regression analysis, an unbiased estimator that is a linear function of the outcomes on the dependent variable.
Linearly Independent Vectors: A set of vectors such that no vector can be written as a linear combination of the others in the set.
Log Function: A mathematical function, defined only for strictly positive arguments, with a positive but decreasing slope.
Logarithmic Function: A mathematical function defined for positive arguments that has a positive but diminishing slope.
Logit Model: A model for binary response where the response probability is the logit function evaluated at a linear function of the explanatory variables.
Log-Level Model: A regression model where the dependent variable is in logarithmic form and the independent variables are in level (or original) form.
Log-Likelihood Function: The sum of the log-likelihoods, where the log-likelihood for each observation is the log of the density of the dependent variable given the explanatory variables; the log-likelihood function is viewed as a function of the parameters to be estimated.
Log-Log Model: A regression model where the dependent variable and (at least some of) the explanatory variables are in logarithmic form.
Longitudinal Data: See panel data.
Long-Run Elasticity: The long-run propensity in a distributed lag model with the dependent and independent variables in logarithmic form; thus, the long-run elasticity is the eventual percentage increase in the explained variable given a permanent 1% increase in the explanatory variable.
Long-Run Multiplier: See long-run propensity.
Long-Run Propensity (LRP): In a distributed lag model, the eventual change in the dependent variable given a permanent, one-unit increase in the independent variable.
Loss Function: A function that measures the loss when a forecast differs from the actual outcome; the most common examples are absolute value loss and squared loss.
M
Marginal Effect: The effect on the dependent variable that results from changing an independent variable by a small amount.
Martingale: A time series process whose expected value, given all past outcomes on the series, simply equals the most recent value.
Martingale Difference Sequence: The first difference of a martingale. It is unpredictable (or has a zero mean), given past values of the sequence.
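The four functional-form entries above (level-level, level-log, log-level, log-log) are commonly summarized in one table; for a simple regression with slope β1, the approximate interpretations are:

Model        Dependent variable   Independent variable   Interpretation of β1
Level-level  y                    x                      Δy = β1 Δx
Level-log    y                    log(x)                 Δy = (β1/100) %Δx
Log-level    log(y)               x                      %Δy = (100 β1) Δx
Log-log      log(y)               log(x)                 %Δy = β1 %Δx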
Matched Pair Sample: A sample where each observation is matched with another, as in a sample consisting of a husband and wife, or a set of two siblings.
Matrix: An array of numbers.
Matrix Multiplication: An algorithm for multiplying together two conformable matrices.
Matrix Notation: A convenient mathematical notation, grounded in matrix algebra, for expressing and manipulating the multiple regression model.
Maximum Likelihood Estimation (MLE): A broadly applicable estimation method where the parameter estimates are chosen to maximize the log-likelihood function.
Maximum Likelihood Estimator: An estimator that maximizes the log of the likelihood function.
Mean: See expected value.
Mean Absolute Error (MAE): A performance measure in forecasting, computed as the average of the absolute values of the forecast errors.
Mean Independent: The key requirement in multiple regression analysis, which says the unobserved error has a mean that does not change across subsets of the population defined by different values of the explanatory variables.
Mean Squared Error (MSE): The expected squared distance that an estimator is from the population value; it equals the variance plus the square of any bias.
Measurement Error: The difference between an observed variable and the variable that belongs in a multiple regression equation.
Median: In a probability distribution, it is the value where there is a 50% chance of being below the value and a 50% chance of being above it. In a sample of numbers, it is the middle value after the numbers have been ordered.
Method of Moments Estimator: An estimator obtained by using the sample analog of population moments; ordinary least squares and two stage least squares are both method of moments estimators.
Micronumerosity: A term introduced by Arthur Goldberger to describe properties of econometric estimators with small sample sizes.
Minimum Variance Unbiased Estimator: An estimator with the smallest variance in the class of all unbiased estimators.
Missing at Random: In multiple regression analysis, a missing data mechanism where the reason data are missing may be correlated with the explanatory variables but is independent of the error term.
Missing Completely at Random (MCAR): In multiple regression analysis, a missing data mechanism where the reason data are missing is statistically independent of the values of the explanatory variables as well as the unobserved error.
Missing Data: A data problem that occurs when we do not observe values on some variables for certain observations (individuals, cities, time periods, and so on) in the sample.
Misspecification Analysis: The process of determining likely biases that can arise from omitted variables, measurement error, simultaneity, and other kinds of model misspecification.
Moving Average Process of Order One [MA(1)]: A time series process generated as a linear function of the current value and one lagged value of a zero-mean, constant variance, uncorrelated stochastic process.
Multicollinearity: A term that refers to correlation among the independent variables in a multiple regression model; it is usually invoked when some correlations are large, but an actual magnitude is not well defined.
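The decomposition stated in the mean squared error entry above can be written compactly as

$$\mathrm{MSE}(\hat{\theta}) = \mathrm{E}\big[(\hat{\theta} - \theta)^2\big] = \mathrm{Var}(\hat{\theta}) + \big[\mathrm{Bias}(\hat{\theta})\big]^2,$$

so an estimator with a small bias can have a smaller MSE than an unbiased estimator with a large variance.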
Multiple Hypotheses Test: A test of a null hypothesis involving more than one restriction on the parameters.
Multiple Linear Regression (MLR) Model: A model linear in its parameters, where the dependent variable is a function of independent variables plus an error term.
Multiple Regression Analysis: A type of analysis that is used to describe estimation of and inference in the multiple linear regression model.
Multiple Restrictions: More than one restriction on the parameters in an econometric model.
Multiple-Step-Ahead Forecast: A time series forecast of more than one period into the future.
Multiplicative Measurement Error: Measurement error where the observed variable is the product of the true unobserved variable and a positive measurement error.
Multivariate Normal Distribution: A distribution for multiple random variables where each linear combination of the random variables has a univariate (one-dimensional) normal distribution.
N
n-R-Squared Statistic: See Lagrange multiplier statistic.
Natural Experiment: A situation where the economic environment, sometimes summarized by an explanatory variable, exogenously changes, perhaps inadvertently, due to a policy or institutional change.
Natural Logarithm: See logarithmic function.
Nominal Variable: A variable measured in nominal or current dollars.
Nonexperimental Data: Data that have not been obtained through a controlled experiment.
Nonlinear Function: A function whose slope is not constant.
Nonnested Models: Two (or more) models where no model can be written as a special case of the other by imposing restrictions on the parameters.
Nonrandom Sample: A sample obtained other than by sampling randomly from the population of interest.
Nonstationary Process: A time series process whose joint distributions are not constant across different epochs.
Normal Distribution: A probability distribution commonly used in statistics and econometrics for modeling a population. Its probability distribution function has a bell shape.
Normality Assumption: The classical linear model assumption which states that the error (or dependent variable) has a normal distribution, conditional on the explanatory variables.
Null Hypothesis: In classical hypothesis testing, we take this hypothesis as true and require the data to provide substantial evidence against it.
Numerator Degrees of Freedom: In an F test, the number of restrictions being tested.
O
Observational Data: See nonexperimental data.
OLS: See ordinary least squares.
OLS Intercept Estimate: The intercept in an OLS regression line.
OLS Regression Line: The equation relating the predicted value of the dependent variable to the independent variables, where the parameter estimates have been obtained by OLS.
OLS Slope Estimate: A slope in an OLS regression line.
Omitted Variable Bias: The bias that arises in the OLS estimators when a relevant variable is omitted from the regression.
Omitted Variables: One or more variables, which we would like to control for, have been omitted in estimating a regression model.
One-Sided Alternative: An alternative hypothesis that states that the parameter is greater than (or less than) the value hypothesized under the null.
One-Step-Ahead Forecast: A time series forecast one period into the future.
One-Tailed Test: A hypothesis test against a one-sided alternative.
Online Databases: Databases that can be accessed via a computer network.
Online Search Services: Computer software that allows the Internet, or databases on the Internet, to be searched by topic, name, title, or keywords.
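A concrete illustration of the OLS entries above (and the ordinary least squares definition just below): a minimal numpy sketch that minimizes the sum of squared residuals by solving the normal equations. All names are illustrative.

import numpy as np

rng = np.random.default_rng(3)
n = 100
x = rng.normal(size=n)
y = 3.0 + 2.0 * x + rng.normal(size=n)

# OLS minimizes the sum of squared residuals; the first order conditions
# are the normal equations X'X b = X'y.
X = np.column_stack([np.ones(n), x])
b = np.linalg.solve(X.T @ X, X.T @ y)
residuals = y - X @ b
print(b)                      # intercept and slope estimates
print(residuals @ residuals)  # sum of squared residuals at the minimum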
Order Condition: A necessary condition for identifying the parameters in a model with one or more endogenous explanatory variables: the total number of exogenous variables must be at least as great as the total number of explanatory variables.
Ordinal Variable: A variable where the ordering of the values conveys information but the magnitude of the values does not.
Ordinary Least Squares (OLS): A method for estimating the parameters of a multiple linear regression model. The ordinary least squares estimates are obtained by minimizing the sum of squared residuals.
Outliers: Observations in a data set that are substantially different from the bulk of the data, perhaps because of errors or because some data are generated by a different model than most of the other data.
Out-of-Sample Criteria: Criteria used for choosing forecasting models which are based on a part of the sample that was not used in obtaining parameter estimates.
Over Controlling: In a multiple regression model, including explanatory variables that should not be held fixed when studying the ceteris paribus effect of one or more other explanatory variables; this can occur when variables that are themselves outcomes of an intervention or a policy are included among the regressors.
Overall Significance of a Regression: A test of the joint significance of all explanatory variables appearing in a multiple regression equation.
Overdispersion: In modeling a count variable, the variance is larger than the mean.
Overidentified Equation: In models with endogenous explanatory variables, an equation where the number of instrumental variables is strictly greater than the number of endogenous explanatory variables.
Overidentifying Restrictions: The extra moment conditions that come from having more instrumental variables than endogenous explanatory variables in a linear model.
Overspecifying a Model: See inclusion of an irrelevant variable.
P
p-Value: The smallest significance level at which the null hypothesis can be rejected. Equivalently, the largest significance level at which the null hypothesis cannot be rejected.
Pairwise Uncorrelated Random Variables: A set of two or more random variables where each pair is uncorrelated.
Panel Data: A data set constructed from repeated cross sections over time. With a balanced panel, the same units appear in each time period. With an unbalanced panel, some units do not appear in each time period, often due to attrition.
Parameter: An unknown value that describes a population relationship.
Parsimonious Model: A model with as few parameters as possible for capturing any desired features.
Partial Derivative: For a smooth function of more than one variable, the slope of the function in one direction.
Partial Effect: The effect of an explanatory variable on the dependent variable, holding other factors in the regression model fixed.
Partial Effect at the Average (PEA): In models with nonconstant partial effects, the partial effect evaluated at the average values of the explanatory variables.
Percent Correctly Predicted: In a binary response model, the percentage of times the prediction of zero or one coincides with the actual outcome.
Percentage Change: The proportionate change in a variable, multiplied by 100.
Percentage Point Change: The change in a variable that is measured as a percentage.
Perfect Collinearity: In multiple regression, one independent variable is an exact linear function of one or more other independent variables.
Plug-In Solution to the Omitted Variables Problem: A proxy variable is substituted for an unobserved omitted variable in an OLS regression.
Point Forecast: The forecasted value of a future outcome.
Poisson Distribution: A probability distribution for count variables.
Poisson Regression Model: A model for a count dependent variable where the dependent variable, conditional on the explanatory variables, is nominally assumed to have a Poisson distribution.
Policy Analysis: An empirical analysis that uses econometric methods to evaluate the effects of a certain policy.
Pooled Cross Section: A data configuration where independent cross sections, usually collected at different points in time, are combined to produce a single data set.
Pooled OLS Estimation: OLS estimation with independently pooled cross sections, panel data, or cluster samples, where the observations are pooled across time (or group) as well as across the cross-sectional units.
Population: A well-defined group (of people, firms, cities, and so on) that is the focus of a statistical or econometric analysis.
Population Model: A model, especially a multiple linear regression model, that describes a population.
Population R-Squared: In the population, the fraction of the variation in the dependent variable that is explained by the explanatory variables.
Population Regression Function: See conditional expectation.
Positive Definite: A symmetric matrix such that all quadratic forms, except the trivial one that must be zero, are strictly positive.
Positive Semi-Definite: A symmetric matrix such that all quadratic forms are nonnegative.
Power of a Test: The probability of rejecting the null hypothesis when it is false; the power depends on the values of the population parameters under the alternative.
Practical Significance: The practical or economic importance of an estimate, which is measured by its sign and magnitude, as opposed to its statistical significance.
Prais-Winsten (PW) Estimation: A method of estimating a multiple linear regression model with AR(1) errors and strictly exogenous explanatory variables; unlike Cochrane-Orcutt, Prais-Winsten uses the equation for the first time period in estimation.
Predetermined Variable: In a simultaneous equations model, either a lagged endogenous variable or a lagged exogenous variable.
Predicted Variable: See dependent variable.
Prediction: The estimate of an outcome obtained by plugging specific values of the explanatory variables into an estimated model, usually a multiple regression model.
Prediction Error: The difference between the actual outcome and a prediction of that outcome.
Prediction Interval: A confidence interval for an unknown outcome on a dependent variable in a multiple regression model.
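A rough sketch of the prediction interval entry above (standard large-sample form, not a quotation from the text): a 95% prediction interval for a new outcome at given explanatory values combines the sampling error in the point prediction with the error variance,

$$\hat{y}^0 \pm t_{.025}\,\Big[\widehat{\mathrm{Var}}(\hat{y}^0) + \hat{\sigma}^2\Big]^{1/2},$$

which is wider than the corresponding confidence interval for the conditional mean because of the σ̂² term.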
Predictor Variable: See explanatory variable.
Probability Density Function (pdf): A function that, for discrete random variables, gives the probability that the random variable takes on each value; for continuous random variables, the area under the pdf gives the probability of various events.
Probability Limit: The value to which an estimator converges as the sample size grows without bound.
Probit Model: A model for binary responses where the response probability is the standard normal cdf evaluated at a linear function of the explanatory variables.
Program Evaluation: An analysis of a particular private or public program using econometric methods to obtain the causal effect of the program.
Proportionate Change: The change in a variable relative to its initial value; mathematically, the change divided by the initial value.
Proxy Variable: An observed variable that is related, but not identical, to an unobserved explanatory variable in multiple regression analysis.
Pseudo R-Squared: Any number of goodness-of-fit measures for limited dependent variable models.
Q
Quadratic Form: A mathematical function where the vector argument both pre- and post-multiplies a square, symmetric matrix.
Quadratic Functions: Functions that contain squares of one or more explanatory variables; they capture diminishing or increasing effects on the dependent variable.
Qualitative Variable: A variable describing a nonquantitative feature of an individual, a firm, a city, and so on.
Quasi-Demeaned Data: In random effects estimation for panel data, it is the original data in each time period minus a fraction of the time average; these calculations are done for each cross-sectional observation.
Quasi-Differenced Data: In estimating a regression model with AR(1) serial correlation, it is the difference between the current time period and a multiple of the previous time period, where the multiple is the parameter in the AR(1) model.
Quasi-Experiment: See natural experiment.
Quasi-Likelihood Ratio Statistic: A modification of the likelihood ratio statistic that accounts for possible distributional misspecification, as in a Poisson regression model.
Quasi-Maximum Likelihood Estimation (QMLE): Maximum likelihood estimation where the log-likelihood function may not correspond to the actual conditional distribution of the dependent variable.
R
R-Bar Squared: See adjusted R-squared.
R-Squared: In a multiple regression model, the proportion of the total sample variation in the dependent variable that is explained by the independent variables.
R-Squared Form of the F Statistic: The F statistic for testing exclusion restrictions expressed in terms of the R-squareds from the restricted and unrestricted models.
Random Coefficient (Slope) Model: A multiple regression model where the slope parameters are allowed to depend on unobserved unit-specific variables.
Random Effects Estimator: A feasible GLS estimator in the unobserved effects model where the unobserved effect is assumed to be uncorrelated with the explanatory variables in each time period.
Random Effects Model: The unobserved effects panel data model where the unobserved effect is assumed to be uncorrelated with the explanatory variables in each time period.
Random Sample: A sample obtained by sampling randomly from the specified population.
Random Sampling: A sampling scheme whereby each observation is drawn at random from the population. In particular, no unit is more likely to be selected than any other unit, and each draw is independent of all other draws.
Random Variable: A variable whose outcome is uncertain.
Random Vector: A vector consisting of random variables.
Random Walk: A time series process where next period's value is obtained as this period's value, plus an independent (or at least an uncorrelated) error term.
Random Walk with Drift: A random walk that has a constant (or drift) added in each period.
Rank Condition: A sufficient condition for identification of a model with one or more endogenous explanatory variables.
Rank of a Matrix: The number of linearly independent columns in a matrix.
Rational Distributed Lag (RDL) Model: A type of infinite distributed lag model where the lag distribution depends on relatively few parameters.
Real Variable: A monetary value measured in terms of a base period.
Reduced Form Equation: A linear equation where an endogenous variable is a function of exogenous variables and unobserved errors.
Reduced Form Error: The error term appearing in a reduced form equation.
Reduced Form Parameters: The parameters appearing in a reduced form equation.
Regressand: See dependent variable.
Regression Specification Error Test (RESET): A general test for functional form in a multiple regression model; it is an F test of joint significance of the squares, cubes, and perhaps higher powers of the fitted values from the initial OLS estimation.
Regression through the Origin: Regression analysis where the intercept is set to zero; the slopes are obtained by minimizing the sum of squared residuals, as usual.
Regressor: See explanatory variable.
Rejection Region: The set of values of a test statistic that leads to rejecting the null hypothesis.
Rejection Rule: In hypothesis testing, the rule that determines when the null hypothesis is rejected in favor of the alternative hypothesis.
Relative Change: See proportionate change.
Resampling Method: A technique for approximating standard errors (and distributions of test statistics) whereby a series of samples are obtained from the original data set and estimates are computed for each subsample.
Residual: The difference between the actual value and the fitted (or predicted) value; there is a residual for each observation in the sample used to obtain an OLS regression line.
Residual Analysis: A type of analysis that studies the sign and size of residuals for particular observations after a multiple regression model has been estimated.
Residual Sum of Squares: See sum of squared residuals.
Response Probability: In a binary response model, the probability that the dependent variable takes on the value one, conditional on explanatory variables.
Response Variable: See dependent variable.
Restricted Model: In hypothesis testing, the model obtained after imposing all of the restrictions required under the null.
Retrospective Data: Data collected based on past, rather than current, information.
Root Mean Squared Error (RMSE): Another name for the standard error of the regression in multiple regression analysis.
Row Vector: A vector of numbers arranged as a row.
S
Sample Average: The sum of n numbers divided by n; a measure of central tendency.
Sample Correlation: For outcomes on two random variables, the sample covariance divided by the product of the sample standard deviations.
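Written out, the sample correlation entry above is

$$r_{xy} = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i-\bar{x})^2}\,\sqrt{\sum_{i=1}^{n}(y_i-\bar{y})^2}},$$

which, like its population counterpart, is unit-free and bounded between −1 and 1.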
Sample Correlation Coefficient: An estimate of the population correlation coefficient from a sample of data.
Sample Covariance: An unbiased estimator of the population covariance between two random variables.
Sample Regression Function (SRF): See OLS regression line.
Sample Selection Bias: Bias in the OLS estimator which is induced by using data that arise from endogenous sample selection.
Sample Standard Deviation: A consistent estimator of the population standard deviation.
Sample Variance: An unbiased, consistent estimator of the population variance.
Sampling Distribution: The probability distribution of an estimator over all possible sample outcomes.
Sampling Standard Deviation: The standard deviation of an estimator, that is, the standard deviation of a sampling distribution.
Sampling Variance: The variance in the sampling distribution of an estimator; it measures the spread in the sampling distribution.
Scalar Multiplication: The algorithm for multiplying a scalar (number) by a vector or matrix.
Scalar Variance-Covariance Matrix: A variance-covariance matrix where all off-diagonal terms are zero and the diagonal terms are the same positive constant.
Score Statistic: See Lagrange multiplier statistic.
Seasonal Dummy Variables: A set of dummy variables used to denote the quarters or months of the year.
Seasonality: A feature of monthly or quarterly time series where the average value differs systematically by season of the year.
Seasonally Adjusted: Monthly or quarterly time series data where some statistical procedure, possibly regression on seasonal dummy variables, has been used to remove the seasonal component.
Selected Sample: A sample of data obtained not by random sampling but by selecting on the basis of some observed or unobserved characteristic.
Self-Selection: Deciding on an action based on the likely benefits, or costs, of taking that action.
Semi-Elasticity: The percentage change in the dependent variable given a one-unit increase in an independent variable.
Sensitivity Analysis: The process of checking whether the estimated effects and statistical significance of key explanatory variables are sensitive to inclusion of other explanatory variables, functional form, dropping of potentially outlying observations, or different methods of estimation.
Sequentially Exogenous: A feature of an explanatory variable in time series (or panel data) models where the error term in the current time period has a zero mean conditional on all current and past explanatory variables; a weaker version is stated in terms of zero correlations.
Serial Correlation: In a time series or panel data model, correlation between the errors in different time periods.
Serial Correlation-Robust Standard Error: A standard error for an estimator that is (asymptotically) valid whether or not the errors in the model are serially correlated.
Serially Uncorrelated: The errors in a time series or panel data model are pairwise uncorrelated across time.
Short-Run Elasticity: The impact propensity in a distributed lag model when the dependent and independent variables are in logarithmic form.
Significance Level: The probability of a Type I error in hypothesis testing.
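A minimal sketch of obtaining the serial correlation-robust standard errors mentioned above for a time series regression, on simulated data; statsmodels' "HAC" covariance with a chosen lag truncation is one common (Newey-West) implementation, and the names and lag choice are illustrative.

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(4)
T = 200
x = rng.normal(size=T)
# Build an AR(1) error so the regression errors are serially correlated.
u = np.zeros(T)
for t in range(1, T):
    u[t] = 0.5 * u[t - 1] + rng.normal()
y = 1.0 + 0.8 * x + u

X = sm.add_constant(x)
fit = sm.OLS(y, X).fit(cov_type="HAC", cov_kwds={"maxlags": 4})
print(fit.bse)  # serial correlation-robust (Newey-West) standard errors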
Simple Linear Regression Model: A model where the dependent variable is a linear function of a single independent variable, plus an error term.
Simultaneity: A term that means at least one explanatory variable in a multiple linear regression model is determined jointly with the dependent variable.
Simultaneity Bias: The bias that arises from using OLS to estimate an equation in a simultaneous equations model.
Simultaneous Equations Model (SEM): A model that jointly determines two or more endogenous variables, where each endogenous variable can be a function of other endogenous variables as well as of exogenous variables and an error term.
Skewness: A measure of how far a distribution is from being symmetric, based on the third moment of the standardized random variable.
Slope: In the equation of a line, the change in the y variable when the x variable increases by one.
Slope Parameter: The coefficient on an independent variable in a multiple regression model.
Smearing Estimate: A retransformation method particularly useful for predicting the level of a response variable when a linear model has been estimated for the natural log of the response variable.
Spreadsheet: Computer software used for entering and manipulating data.
Spurious Correlation: A correlation between two variables that is not due to causality, but perhaps to the dependence of the two variables on another unobserved factor.
Spurious Regression Problem: A problem that arises when regression analysis indicates a relationship between two or more unrelated time series processes simply because each has a trend, is an integrated time series (such as a random walk), or both.
Square Matrix: A matrix with the same number of rows as columns.
Stable AR(1) Process: An AR(1) process where the parameter on the lag is less than one in absolute value. The correlation between two random variables in the sequence declines to zero at a geometric rate as the distance between the random variables increases, and so a stable AR(1) process is weakly dependent.
Standard Deviation: A common measure of spread in the distribution of a random variable.
Standard Deviation of β̂j: A common measure of spread in the sampling distribution of β̂j.
Standard Error: Generically, an estimate of the standard deviation of an estimator.
Standard Error of β̂j: An estimate of the standard deviation in the sampling distribution of β̂j.
Standard Error of the Estimate: See standard error of the regression.
Standard Error of the Regression (SER): In multiple regression analysis, the estimate of the standard deviation of the population error, obtained as the square root of the sum of squared residuals over the degrees of freedom.
Standard Normal Distribution: The normal distribution with mean zero and variance one.
Standardized Coefficients: Regression coefficients that measure the standard deviation change in the dependent variable given a one standard deviation increase in an independent variable.
Standardized Random Variable: A random variable transformed by subtracting off its expected value and dividing the result by its standard deviation; the new random variable has mean zero and standard deviation one.
Static Model: A time series model where only contemporaneous explanatory variables affect the dependent variable.
Stationary Process: A time series process where the marginal and all joint distributions are invariant across time.
Statistical Inference: The act of testing hypotheses about population parameters.
Statistical Significance: The importance of an estimate as measured by the size of a test statistic, usually a t statistic.
Statistically Different from Zero: See statistically significant.
Statistically Insignificant: Failure to reject the null hypothesis that a population parameter is equal to zero, at the chosen significance level.
Statistically Significant: Rejecting the null hypothesis that a parameter is equal to zero against the specified alternative, at the chosen significance level.
Stochastic Process: A sequence of random variables indexed by time.
Stratified Sampling: A nonrandom sampling scheme whereby the population is first divided into several nonoverlapping, exhaustive strata, and then random samples are taken from within each stratum.
Strict Exogeneity: An assumption that holds in a time series or panel data model when the explanatory variables are strictly exogenous.
Strictly Exogenous: A feature of explanatory variables in a time series or panel data model where the error term at any time period has zero expectation, conditional on the explanatory variables in all time periods; a less restrictive version is stated in terms of zero correlations.
Strongly Dependent: See highly persistent.
Structural Equation: An equation derived from economic theory or from less formal economic reasoning.
Structural Error: The error term in a structural equation, which could be one equation in a simultaneous equations model.
Structural Parameters: The parameters appearing in a structural equation.
Studentized Residuals: The residuals computed by excluding each observation, in turn, from the estimation, divided by the estimated standard deviation of the error.
Sum of Squared Residuals (SSR): In multiple regression analysis, the sum of the squared OLS residuals across all observations.
Summation Operator: A notation, denoted by Σ, used to define the summing of a set of numbers.
Symmetric Distribution: A probability distribution characterized by a probability density function that is symmetric around its median value, which must also be the mean value (whenever the mean exists).
Symmetric Matrix: A square matrix that equals its transpose.
T
t Distribution: The distribution of the ratio of a standard normal random variable and the square root of an independent chi-square random variable, where the chi-square random variable is first divided by its df.
t Ratio: See t statistic.
t Statistic: The statistic used to test a single hypothesis about the parameters in an econometric model.
Test Statistic: A rule used for testing hypotheses, where each sample outcome produces a numerical value.
Text Editor: Computer software that can be used to edit text files.
Text (ASCII) File: A universal file format that can be transported across numerous computer platforms.
Time-Demeaned Data: Panel data where, for each cross-sectional unit, the average over time is subtracted from the data in each time period.
Time Series Data: Data collected over time on one or more variables.
Time Series Process: See stochastic process.
Time Trend: A function of time that is the expected value of a trending time series process.
Tobit Model: A model for a dependent variable that takes on the value zero with positive probability but is roughly continuously distributed over strictly positive values. See also corner solution response.
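For the most common use of the t statistic entry above (testing H0: βj = aj on one coefficient in a multiple regression), the statistic takes the familiar form

$$t = \frac{\hat{\beta}_j - a_j}{\mathrm{se}(\hat{\beta}_j)},$$

with aj = 0 giving the usual test of statistical significance.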
Top Coding: A form of data censoring where the value of a variable is not reported when it is above a given threshold; we only know that it is at least as large as the threshold.
Total Sum of Squares (SST): The total sample variation in a dependent variable about its sample average.
Trace of a Matrix: For a square matrix, the sum of its diagonal elements.
Transpose: For any matrix, the new matrix obtained by interchanging its rows and columns.
Treatment Group: In program evaluation, the group that participates in the program.
Trending Process: A time series process whose expected value is an increasing or a decreasing function of time.
Trend-Stationary Process: A process that is stationary once a time trend has been removed; it is usually implicit that the detrended series is weakly dependent.
True Model: The actual population model relating the dependent variable to the relevant independent variables, plus a disturbance, where the zero conditional mean assumption holds.
Truncated Normal Regression Model: The special case of the truncated regression model where the underlying population model satisfies the classical linear model assumptions.
Truncated Regression Model: A linear regression model for cross-sectional data in which the sampling scheme entirely excludes, on the basis of outcomes on the dependent variable, part of the population.
Two-Sided Alternative: An alternative where the population parameter can be either less than or greater than the value stated under the null hypothesis.
Two Stage Least Squares (2SLS) Estimator: An instrumental variables estimator where the IV for an endogenous explanatory variable is obtained as the fitted value from regressing the endogenous explanatory variable on all exogenous variables.
Two-Tailed Test: A test against a two-sided alternative.
Type I Error: A rejection of the null hypothesis when it is true.
Type II Error: The failure to reject the null hypothesis when it is false.
U
Unbalanced Panel: A panel data set where certain years (or periods) of data are missing for some cross-sectional units.
Unbiased Estimator: An estimator whose expected value (or mean of its sampling distribution) equals the population value, regardless of the population value.
Uncentered R-Squared: The R-squared computed without subtracting the sample average of the dependent variable when obtaining the total sum of squares (SST).
Unconditional Forecast: A forecast that does not rely on knowing, or assuming values for, future explanatory variables.
Uncorrelated Random Variables: Random variables that are not linearly related.
Underspecifying a Model: See excluding a relevant variable.
Unidentified Equation: An equation with one or more endogenous explanatory variables where sufficient instrumental variables do not exist to identify the parameters.
Unit Root Process: A highly persistent time series process where the current value equals last period's value, plus a weakly dependent disturbance.
Unobserved Effect: In a panel data model, an unobserved variable in the error term that does not change over time. For cluster samples, an unobserved variable that is common to all units in the cluster.
Unobserved Effects Model: A model for panel data or cluster samples where the error term contains an unobserved effect.
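A minimal sketch of the two-step logic behind the 2SLS entry above, using simulated data and numpy only; in practice one would use a routine that reports the correct 2SLS standard errors, since running the second stage by OLS manually gives the right coefficients but not the right standard errors. All names are illustrative.

import numpy as np

rng = np.random.default_rng(5)
n = 1000
z = rng.normal(size=n)                 # instrument: exogenous and relevant
v = rng.normal(size=n)
x = 1.0 + 0.8 * z + v                  # endogenous regressor
u = 0.6 * v + rng.normal(size=n)       # error correlated with x through v
y = 2.0 + 1.5 * x + u

# First stage: regress x on all exogenous variables (here, a constant and z).
Z = np.column_stack([np.ones(n), z])
xhat = Z @ np.linalg.lstsq(Z, x, rcond=None)[0]

# Second stage: regress y on the first-stage fitted values.
X2 = np.column_stack([np.ones(n), xhat])
beta_2sls = np.linalg.lstsq(X2, y, rcond=None)[0]

# OLS on the original x is biased; 2SLS recovers a slope near 1.5.
X = np.column_stack([np.ones(n), x])
beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]
print(beta_ols[1], beta_2sls[1])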
Unobserved Heterogeneity: See unobserved effect.
Unrestricted Model: In hypothesis testing, the model that has no restrictions placed on its parameters.
Upward Bias: The expected value of an estimator is greater than the population parameter value.

V

Variance: A measure of spread in the distribution of a random variable.
Variance-Covariance Matrix: For a random vector, the positive semidefinite matrix defined by putting the variances down the diagonal and the covariances in the appropriate off-diagonal entries.
Variance-Covariance Matrix of the OLS Estimator: The matrix of sampling variances and covariances for the vector of OLS coefficients.
Variance Inflation Factor: In multiple regression analysis under the Gauss-Markov assumptions, the term in the sampling variance affected by correlation among the explanatory variables.
Variance of the Prediction Error: The variance in the error that arises when predicting a future value of the dependent variable based on an estimated multiple regression equation.
Vector Autoregressive (VAR) Model: A model for two or more time series where each variable is modeled as a linear function of past values of all variables, plus disturbances that have zero means given all past values of the observed variables.

W

Wald Statistic: A general test statistic for testing hypotheses in a variety of econometric settings; typically, the Wald statistic has an asymptotic chi-square distribution.
Weak Instruments: Instrumental variables that are only slightly correlated with the relevant endogenous explanatory variable or variables.
Weakly Dependent: A term that describes a time series process where some measure of dependence between random variables at two points in time (such as correlation) diminishes as the interval between the two points in time increases.
Weighted Least Squares (WLS) Estimator: An estimator used to adjust for a known form of heteroskedasticity, where each squared residual is weighted by the inverse of the (estimated) variance of the error.
White Test: A test for heteroskedasticity that involves regressing the squared OLS residuals on the OLS fitted values and on the squares of the fitted values; in its most general form, the squared OLS residuals are regressed on the explanatory variables, the squares of the explanatory variables, and all the nonredundant interactions of the explanatory variables. (A code sketch of the special case appears after this glossary.)
Within Estimator: See fixed effects estimator.
Within Transformation: See fixed effects transformation.

Y

Year Dummy Variables: For data sets with a time series component, dummy (binary) variables equal to one in the relevant year and zero in all other years.

Z

Zero Conditional Mean Assumption: A key assumption used in multiple regression analysis that states that, given any values of the explanatory variables, the expected value of the error equals zero. See Assumptions MLR.4, TS.3, and TS.3′ in the text.
Zero Matrix: A matrix where all entries are zero.
Zero-One Variable: See dummy variable.
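A few of the definitions above are easier to see in symbols. The following display is an informal sketch in standard notation (ours, not reproduced from the text, so it may differ from the text's equation numbering and typesetting): the sum of squared residuals and total sum of squares for a sample of size n, the t statistic for testing H0: βj = aj, the time-demeaning transformation used with panel data, and the canonical unit root process with weakly dependent disturbance e_t.

$$
\mathrm{SSR}=\sum_{i=1}^{n}\hat u_i^{\,2},\qquad
\mathrm{SST}=\sum_{i=1}^{n}(y_i-\bar y)^2,\qquad
t=\frac{\hat\beta_j-a_j}{\mathrm{se}(\hat\beta_j)},\qquad
\ddot y_{it}=y_{it}-\bar y_i,\qquad
y_t=y_{t-1}+e_t .
$$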
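The White test entry describes a concrete procedure, so a minimal code sketch may help. This is our illustration, not the text's: it implements only the special case (squared OLS residuals regressed on the fitted values and their squares) using plain NumPy; the function names and the simulated data are hypothetical.

```python
import numpy as np

def ols_fit(X, y):
    """OLS by least squares: coefficients, fitted values, residuals."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    yhat = X @ beta
    return beta, yhat, y - yhat

def white_test_special(X, y):
    """Special case of the White test: regress the squared OLS residuals
    on the fitted values and their squares; return LM = n * R-squared,
    asymptotically chi-square(2) under the null of homoskedasticity."""
    n = len(y)
    _, yhat, uhat = ols_fit(X, y)
    u2 = uhat ** 2
    Z = np.column_stack([np.ones(n), yhat, yhat ** 2])
    _, _, e = ols_fit(Z, u2)                      # auxiliary regression
    r2 = 1.0 - (e @ e) / ((u2 - u2.mean()) @ (u2 - u2.mean()))
    return n * r2

# Hypothetical usage on simulated heteroskedastic data:
rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 1.0 + 2.0 * x + rng.normal(size=200) * np.abs(1.0 + x)
X = np.column_stack([np.ones_like(x), x])
print(white_test_special(X, y))   # compare with the chi2(2) 5% value, 5.99
```

A large statistic relative to the chi-square(2) critical value is evidence of heteroskedasticity; in practice one would use a dedicated econometrics package rather than this sketch.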
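The two stage least squares entry can be made concrete in the same spirit. A sketch under the same assumptions (NumPy only; names hypothetical), for a single endogenous regressor: the first stage regresses the endogenous variable on all exogenous variables, and the second stage replaces it with the first-stage fitted values. The point estimates below are valid 2SLS estimates, but standard errors from a naive second-stage OLS would not be.

```python
import numpy as np

def tsls(y, x_exog, x_endog, z_instr):
    """Two stage least squares for one endogenous regressor.
    x_exog: n x k exogenous regressors (include a constant column);
    x_endog: length-n endogenous regressor; z_instr: n x m instruments."""
    # First stage: regress the endogenous variable on ALL exogenous
    # variables (included exogenous regressors plus the instruments).
    W = np.column_stack([x_exog, z_instr])
    pi, *_ = np.linalg.lstsq(W, x_endog, rcond=None)
    x_hat = W @ pi                       # first-stage fitted values
    # Second stage: OLS of y on the exogenous regressors and x_hat.
    X2 = np.column_stack([x_exog, x_hat])
    beta, *_ = np.linalg.lstsq(X2, y, rcond=None)
    return beta                          # last entry is the 2SLS coefficient
```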
Index

Numbers
2SLS: See two stage least squares
401(k) plans: asymptotic normality, 155–156; comparison of simple and multiple regression estimates, 70; statistical vs. practical significance, 121; WLS estimation, 259

A
ability and wage: causality, 12; excluding ability from model, 78–83; IV for ability, 481; mean independence, 23; proxy variable for ability, 279–285
adaptive expectations, 353, 355
adjusted R-squareds, 181–184, 374
AFDC participation, 231
age: financial wealth and, 257–259, 263; smoking and, 261–262
aggregate consumption function, 511–514
air pollution and housing prices: beta coefficients, 175–176; logarithmic forms, 171–173; quadratic functions, 175–177; t test, 118
alcohol drinking, 230
alternative hypotheses: defined, 694; one-sided, 110–114, 695; two-sided, 114–115, 695
antidumping filings and chemical imports: AR(3) serial correlation, 381; dummy variables, 327–328; forecasting, 596, 597, 598; PW estimation, 384; seasonality, 336–338
apples, ecolabeled, 180–181
AR(1) models: consistency example, 350–351; testing for, after 2SLS estimation, 486
AR(1) serial correlation: correcting for, 381–387; testing for, 376–381
AR(2) models: EMH example, 352; forecasting example, 352, 397
ARCH model, 393–394
AR(q) serial correlation: correcting for, 386–387; testing for, 379–380
arrests: asymptotic normality, 155–156; average sentence length and, 249; goodness-of-fit, 72; heteroskedasticity-robust LM statistic, 249; linear probability model, 227–228; normality assumption and, 107; Poisson regression, 545–546
ASCII files, 609
assumptions: classical linear model (CLM), 106; establishing unbiasedness of OLS, 73–77, 317–320; homoskedasticity, 45–48, 82–83, 89, 363; matrix notation, 723–726; for multiple linear regressions, 73–77, 82, 89, 152; normality, 105–108, 322; for simple linear regressions, 40–45, 45–48; for time series regressions, 317–323, 348–354, 363; zero mean and zero correlation, 152
asymptotically uncorrelated sequences, 346–348
asymptotic bias, deriving, 153–154
asymptotic confidence interval, 157
asymptotic efficiency of OLS, 161–162
asymptotic normality of estimators in general, 683–684
asymptotic normality of OLS: for multiple linear regressions, 156–158; for time series regressions, 351–354
asymptotic properties: See large sample properties
asymptotics, OLS: See OLS asymptotics
asymptotic sample properties of estimators, 681–684
asymptotic standard errors, 157
asymptotic t statistics, 157
asymptotic variance, 156
attenuation bias, 291, 292
attrition, 441
augmented Dickey-Fuller test, 576
autocorrelation, 320–322. See also serial correlation
autoregressive conditional heteroskedasticity (ARCH) model, 393–394
autoregressive model of order two (AR(2)): See AR(2) models
autoregressive process of order one (AR(1)), 347
auxiliary regression, 159
average, using summation operator, 629
average marginal effect (AME), 286, 532
average partial effect (APE), 286, 532, 540
average treatment effect, 410

B
balanced panel, 420
baseball players' salaries: nonnested models, 183; testing exclusion restrictions, 127–132
base group, 208
base period and value, 326
base value, 326
beer: price and demand, 185–186; taxes and traffic fatalities, 184
benchmark group, 208
Bernoulli random variables, 646–647
best linear unbiased estimator (BLUE), 89
beta coefficients, 169–170
between estimators, 435
bias: attenuation, 291, 292; heterogeneity, 413; omitted variable, 78–83; simultaneity in OLS, 503–504
biased estimators, 677–678
biased toward zero, 80
binary random variable, 646
binary response models: See logit and probit models
binary variables (see also qualitative information): defined, 206; random, 646–647
binomial distribution, 651
birth weight: AFDC participation and, 231; asymptotic standard error, 158; data scaling, 166–168; F statistic, 133–134; IV estimation, 470–471
bivariate linear regression model: See simple regression model
BLUE (best linear unbiased estimator), 89
bootstrap standard error, 204
Breusch-Godfrey test, 381
Breusch-Pagan test for heteroskedasticity, 251

C
calculus, differential, 640–642
campus crimes, t test, 116–117
causality, 10–14
cdf (cumulative distribution functions), 648–649
censored regression models, 547–552
Center for Research in Security Prices (CRSP), 608
central limit theorem, 684
CEO salaries: in multiple regressions: motivation for multiple regression, 63–64; nonnested models, 183–184; predicting, 192, 193–194; writing in population form, 74; returns on equity and: fitted values and residuals, 32; goodness-of-fit, 35; OLS estimates, 29–30; sales and, constant elasticity model, 39
ceteris paribus, 10–14, 66, 67–68
chemical firms, nonnested models, 183
chemical imports: See antidumping filings and chemical imports
chi-square distribution: critical values table, 749; discussions, 669, 717
Chow tests: differences across groups, 223–224; heteroskedasticity and, 247–248; for panel data, 423–424; for structural change across time, 406
cigarettes: See smoking
city crimes (see also crimes): law enforcement and, 13; panel data, 9–10
classical errors-in-variables (CEV), 290
classical linear model (CLM) assumptions, 106
clear-up rate, distributed lag estimation, 416–417
clusters, 449–450: effect, 449; sample, 449
Cochrane-Orcutt (CO) estimation, 383, 391, 395
coefficient of determination: See R-squareds
cointegration, 580–584
college admission, omitting unobservables, 285
college GPA: beta coefficients, 169–170; fitted values and intercept, 68; gender and, 221–224; goodness-of-fit, 71; heteroskedasticity-robust F statistic, 247–248; interaction effect, 178–179; interpreting equations, 66; with measurement error, 292; partial effect, 67; population regression function, 23; predicted, 187–188, 189; with single dummy variable, 209–210; t test, 115
college proximity as IV for education, 473–474
colleges, junior vs. four-year, 124–127
collinearity, perfect, 74–76
column vectors, 709
commute time and freeway width, 702–703
compact discs, demand for, 732
complete cases estimator, 293
composite error, 413; term, 441
Compustat, 608
computer ownership: college GPA and, 209–210; determinants of, 267
computers, grants to buy: reducing error variance, 185–186; R-squared size, 180–181
computer usage and wages: with interacting terms, 218; proxy variable in, 282–283
conceptual framework, 615
conditional distributions: features, 652–658; overview, 649, 651–653
conditional expectations, 661–665
conditional forecasts, 587
conditional median, 300–302
conditional variances, 665
confidence intervals: 95% rule of thumb for, 691; asymptotic, 157; asymptotic, for nonnormal populations, 692–693; hypothesis testing and, 701–702; interval estimation and, 687–693; main discussions, 122–123, 687–688; for mean from normally distributed population, 689–691; for predictions, 186–189
consistency of estimators in general, 681–683
consistency of OLS: in multiple regressions, 150–154; sampling selection and, 553–554; in time series regressions, 348–351, 372
consistent tests, 703
constant dollars, 326
constant elasticity model, 39, 75, 638
constant terms, 21
consumer price index (CPI), 323
consumption: See under family income
contemporaneously exogenous variables, 318
continuous random variables, 648–649
control group, 210
control variables, 21. See also independent variables
corner solution response, 525
corrected R-squareds, 181–184
correlated random effects, 445–447
correlation, 22–23; coefficients, 659–660
count variables, 543–547
county crimes, multiyear panel data, 422–423
covariances, 658–659
covariance stationary processes, 345–346
covariates, 21
CPI (consumer price index), 323
crimes (see also arrests): on campuses, t test, 116–117; in cities, law enforcement and, 13; in cities, panel data, 9–10; clear-up rate, 416–417; in counties, multiyear panel data, 422–423; earlier data, use of, 283–284; econometric model of, 4–5; economic model of, 3, 160, 275–277; functional form misspecification, 275–277; housing prices and, beta coefficients, 175–176; LM statistic, 160; prison population and, SEM, 515–516; unemployment and, two-period panel data, 412–417
criminologists, 607
critical values: discussions, 110, 695; tables of, 743–749
crop yields and fertilizers: causality, 11, 12; simple equation, 21–22
cross-sectional analysis, 612
cross-sectional data (see also panel data; pooled cross sections; regression analysis): Gauss-Markov assumptions and, 82, 354; main discussion, 5–7; time series data vs., 312–313
CRSP (Center for Research in Security Prices), 608
cumulative areas under standard normal distribution, 743–744
cumulative distribution functions (cdf), 648–649
cumulative effect, 316
current dollars, 326
cyclical unemployment, 353

D
data: collection, 608–611; economic, types of, 5–12; experimental vs. nonexperimental, 2; frequency, 7
data issues (see also misspecification): measurement error, 287–292; missing data, 293–294; multicollinearity, 83–86, 293–294; nonrandom samples, 294–295; outliers and influential observations, 296–300; random slopes, 285–287; unobserved explanatory variables, 279–285
data mining, 613
data scaling, effects on OLS statistics, 166–170
Davidson-MacKinnon test, 278
deficits: See interest rates
degrees of freedom (df): chi-square distributions with n df, 669; for fixed effects estimator, 436; for OLS estimators, 88
dependent variables (see also regression analysis; specific event studies): defined, 21; measurement error in, 289–292
derivatives, 635
descriptive statistics, 629
deseasonalizing data, 337
detrending, 334–335
diagonal matrices, 710
Dickey-Fuller distribution, 575
Dickey-Fuller (DF) test, 575–578; augmented, 576
difference-in-differences estimator, 408, 410
difference in slopes, 218–224
difference-stationary processes, 358
differencing: panel data with more than two periods, 420–425; two-period, 412–417; serial correlation and, 387–388
differential calculus, 640–642
diminishing marginal effects, 635
discrete random variables, 646–647
disturbance terms, 4, 21, 63
disturbance variances, 45
downward bias, 80
drug usage, 230
drunk driving laws and fatalities, 419
dummy variables (see also qualitative information; year dummy variables): defined, 206; regression, 438–439; trap, 208
duration analysis, 549–551
Durbin-Watson test, 378–379, 381
dynamically complete models, 360–363

E
Engle-Granger test, 581–582
earnings of veterans, IV estimation, 469
EconLit, 606, 607
econometric analysis in projects, 611–614
econometric models, 4–5. See also economic models
econometrics, 1–2. See also specific topics
economic growth and government policies, 7
economic models, 2–5
economic significance: See practical significance
economic vs. statistical significance, 120–124, 702–703
economists, types of, 606–607
education: birth weight and, 133–134; fertility and: 2SLS, 487; with discrete dependent variables, 231–232; independent cross sections, 404–405; gender wage gap and, 405–406; IV for, 463, 473–474; logarithmic equation, 639; return to: 2SLS, 477; differencing, 448; fixed effects estimation, 438; independent cross sections, 405–406; IQ and, 281–282; IV estimation, 467–469; testing for endogeneity, 482; testing overidentifying restrictions, 482; wages and (see under wages); return to education over time, 405–406; smoking and, 261–262; women and, 225–227 (see also under women in labor force)
efficiency: asymptotic, 161–162; of estimators in general, 679–680; of OLS with serially correlated errors, 373–374
efficient markets hypothesis (EMH): asymptotic analysis example, 352–353; heteroskedasticity and, 393
elasticity, 39, 637–638
elections: See voting outcomes
EMH: See efficient markets hypothesis (EMH)
empirical analysis: data collection, 608–611; econometric analysis, 611–614; literature review, 607–608; posing question, 605–607; sample projects, 621–625; steps in, 2–5; writing paper, 614–621
employment and unemployment (see also wages): arrests and, 227–228; crimes and, 412–417; enterprise zones and, 422; estimating average rate, 675; forecasting, 589, 591–592, 594; inflation and (see under inflation); in Puerto Rico, logarithmic form, 323–324; time series data, 7–8; women and (see women in labor force)
endogenous explanatory variables (see also instrumental variables; simultaneous equations models; two stage least squares): defined, 76, 274; in logit and probit models, 536; sample selection and, 557; testing for, 481–482
endogenous sample selection, 294
Engle-Granger two-step procedure, 586
enrollment, t test, 116–117
enterprise zones: business investments and, 696–697; unemployment and, 422
error correction models, 584–586
errors-in-variables problem, 479–481, 512
error terms, 4, 21, 63
error variances: adding regressors to reduce, 185–186; defined, 45, 83; estimating, 48–50
estimated GLS: See feasible GLS
estimation and estimators (see also first differencing; fixed effects; instrumental variables; logit and probit models; OLS (ordinary least squares); random effects; Tobit model): advantages of multiple over simple regression, 60–64; asymptotic sample properties of, 681–684; changing independent variables simultaneously, 68; defined, 675; difference-in-differences, 408–410; finite sample properties of, 675–680; LAD, 300–302; language of, 90–91; method of moments approach, 25–26; misspecifying models, 78–83; sampling distributions of OLS estimators, 105–108
event studies, 325, 327–328
Excel, 610
excluding relevant variables, 78–83
exclusion restrictions, 127: for 2SLS, 475; general linear, 136–137; Lagrange multiplier (LM) statistic, 158–160; overall significance of regressions, 135; for SEM, 510–511; testing, 127–132
exogenous explanatory variables, 76
exogenous sample selection, 294, 553
expectations augmented Phillips curve, 353–354, 377, 378
expectations hypothesis, 14
expected values, 652–654, 716
experience, wage and: causality, 12; interpreting equations, 67; motivation for multiple regression, 61; omitted variable bias, 81; partial effect, 642; quadratic functions, 173–175, 636; women and, 225–227
experimental data, 2
experimental group, 210
experiments, defined, 645
explained sum of squares (SSE), 34, 70
explained variables (see also dependent variables): defined, 21
explanatory variables, 21. See also independent variables
exponential function, 639
exponential smoothing, 587
exponential trends, 330–331

F
family income (see also savings): birth weight and: asymptotic standard error, 158; data scaling, 166–168; college GPA and, 292; consumption and: motivation for multiple regression, 62, 63; perfect collinearity and, 75
farmers and pesticide usage, 185
F distribution: critical values table, 746–748; discussions, 670, 671, 717
FDL (finite distributed lag) models, 314–316, 350, 416–417
feasible GLS: with heteroskedasticity and AR(1) serial correlations, 395; main discussion, 258–263; OLS vs., 385–386
Federal Bureau of Investigation, 608
fertility rate: education and, 487; forecasting, 597; over time, 404–405; tax exemption and: with binary variables, 324–325; cointegration, 582–583; FDL model, 314–316; first differences, 363–364; serial correlation, 362; trends, 333
fertility studies, with discrete dependent variable, 231–232
fertilizers: land quality and, 23; soybean yields and: causality, 11, 12; simple equation, 21–22
final exam scores: interaction effect, 178–179; skipping classes and, 464–465
financial wealth: nonrandom sampling, 294–295; WLS estimation, 257–259, 263
finite distributed lag (FDL) models, 314–316, 350
finite sample properties: of estimators, 675–680; of OLS in matrix form, 723–726
firm sales: See sales
first-differenced equations, 414
first-differenced estimator, 414
first differencing: defined, 414; fixed effects vs., 439–440; I(1) time series and, 358; panel data, pitfalls in, 423–424
first order autocorrelation, 359
first order conditions, 27, 65, 642, 721
fitted values (see also OLS (ordinary least squares)): in multiple regressions, 68–69; in simple regressions, 27, 32
fixed effects: defined, 413; dummy variable regression, 438–439; estimation, 435–441; first differencing vs., 439–440; random effects vs., 444–445; transformation, 435; with unbalanced panels, 440–441
forecast error, 586
forecasting: multiple-step-ahead, 592–594; one-step-ahead, 588; overview and definitions, 586–587; trending, seasonal, and integrated processes, 594–598; types of models used for, 587–588
forecast intervals, 588
free throw shooting, 651–652
freeway width and commute time, 702–703
frequency, data, 7
frequency distributions, 401(k) plans, 155
F statistics (see also F tests): defined, 129; heteroskedasticity-robust, 247–248
F tests (see also Chow tests; F statistics): F and t statistics, 132–133; functional form misspecification and, 275–279; general linear restrictions, 136–137; LM tests and, 160; overall significance of regressions, 135; p-values for, 134–135; reporting regression results, 137–138; R-squared form, 133–134; testing exclusion restrictions, 127–132
functional forms: in multiple regressions: with interaction terms, 177–179; logarithmic, 171–173; misspecification, 275–279; quadratic, 173–177; in simple regressions, 36–40; in time series regressions, 323–324

G
Gaussian distribution, 665
Gauss-Markov assumptions: for multiple linear regressions, 73–77, 82; for simple linear regressions, 40–44, 45–48; for time series regressions, 319–322
Gauss-Markov Theorem: for multiple linear regressions, 89–90; for OLS in matrix form, 725–726
GDL (geometric distributed lag), 571–572
GDP: See gross domestic product (GDP)
gender: oversampling, 295; wage gap, 405–406
gender gap: independent cross sections, 405–406; panel data, 405–406
generalized least squares (GLS) estimators: for AR(1) models, 383–387; with heteroskedasticity and AR(1) serial correlations, 395; when heteroskedasticity function must be estimated, 258–263; when heteroskedasticity is known up to a multiplicative constant, 255–256
geometric distributed lag (GDL), 571–572
GLS estimators: See generalized least squares (GLS) estimators
Goldberger, Arthur, 85
goodness-of-fit (see also predictions; R-squareds): change in unit of measurement and, 37; in multiple regressions, 70–71; overemphasizing, 184–185; percent correctly predicted, 227, 530; in simple regressions, 35–36; in time series regressions, 374
Google Scholar, 606
government policies, economic growth and, 6, 8–9
GPA: See college GPA
Granger, Clive W. J., 150
Granger causality, 590
gross domestic product (GDP): data frequency for, 7; government policies and, 6; high persistence, 355–357; in real terms, 326; seasonal adjustment of, 336; unit root test, 578
growth rate, 331
gun control laws, 230

H
HAC standard errors, 389
Hartford School District, 190
Hausman test, 262, 444
Head Start participation, 230
Heckit method, 556
heterogeneity bias, 413
heteroskedasticity (see also weighted least squares estimation): 2SLS with, 484–485; consequences of, for OLS, 243–244; defined, 45; HAC standard errors, 389; heteroskedasticity-robust procedures, 244–249; linear probability model and, 265–267; robust F statistic, 247; robust LM statistic, 248; robust t statistic, 246; for simple linear regressions, 45–48; testing for, 249–254; for time series regressions, 363; in time series regressions, 391–395; of unknown form, 244; in wage equation, 46
highly persistent time series: deciding whether I(0) or I(1), 359–360; description of, 354–363; transformations on, 358–360
histogram, 401(k) plan participation, 155
homoskedasticity: for IV estimation, 466–467; for multiple linear regressions, 82–83, 89; for OLS in matrix form, 724; for time series regressions, 319–322, 351–352
hourly wages: See wages
housing prices and expenditures: general linear restrictions, 136–137; heteroskedasticity: BP test, 251–252; White test, 252–254; incinerators and: inconsistency in OLS, 153; pooled cross sections, 407–411; income and, 631; inflation, 572–574; investment and: computing R-squared, 334–335; spurious relationship, 332–333; over controlling, 185; with qualitative information, 211; RESET, 278–279; savings and, 502
hypotheses (see also hypothesis testing): about single linear combination of parameters, 124–127; after 2SLS estimation, 479; expectations, 14; language of classical testing, 120; in logit and probit models, 529–530; multiple linear restrictions (see F tests); residual analysis, 190; stating in empirical analysis, 4
hypothesis testing: about mean in normal population, 695–696; asymptotic tests for nonnormal populations, 698; computing and using p-values, 698–700; confidence intervals and, 701–702; in matrix form, Wald statistics for, 730–731; overview and fundamentals, 693–695; practical vs. statistical significance, 702–703

I
I(0) and I(1) processes, 359–360
idempotent matrices, 715
identification: defined, 465; in systems with three or more equations, 510–511; in systems with two equations, 504–510
identified equation, 505
identity matrices, 710
idiosyncratic error, 413
IDL (infinite distributed lag) models, 569–574
IIP (index of industrial production), 326–327
impact propensity/multiplier, 315
incidental truncation, 553, 554–558
incinerators and housing prices: inconsistency in OLS, 153; pooled cross sections, 407–411
including irrelevant variables, 77–78
income (see also wages): family (see family income); housing expenditure and, 631; PIH, 513–514; savings and (see under savings)
inconsistency in OLS, deriving, 153–154
inconsistent estimators, 681
independence, joint distributions and, 649–651
independently pooled cross sections (see also pooled cross sections): across time, 403–407; defined, 402
independent variables (see also regression analysis; specific event studies): changing simultaneously, 68; defined, 21; measurement error in, 289–291; in misspecified models, 78–83; random, 650; simple vs. multiple regression, 61–64
index numbers, 324–327
industrial production, index of (IIP), 326–327
infant mortality rates, outliers, 299–300
inference: in multiple regressions: confidence intervals, 122–124; statistical, with IV estimator, 466–469; in time series regressions, 322–323, 373–374
infinite distributed lag models, 569–574
inflation: from 1948 to 2003, 313; openness and, 508, 509–510; random walk model for, 355; unemployment and: expectations augmented Phillips curve, 353–354; forecasting, 589; static Phillips curve, 314, 322–323; unit root test, 577
influential observations, 296–300
information set, 587
in-sample criteria, 591
instrumental variables: computing R-squared after estimation, 471; in multiple regressions, 471–475; overview and definitions, 462, 463, 465; properties with poor instrumental variable, 469–471; in simple regressions, 462–471; solutions to errors-in-variables problems, 479–481; statistical inference, 466–469
integrated of order zero/one processes, 358–360
integrated processes, forecasting, 594–598
interaction effect, 177–179
interaction terms, 217–218
intercept parameter, 21
intercepts (see also OLS estimators; regression analysis): change in unit of measurement and, 36–37; defined, 21, 630; in regressions on a constant, 51; in regressions through origin, 50–51
intercept shifts, 207
interest rates: differencing, 387–388; inference under CLM assumptions, 323; T-bill (see T-bill rates)
internet services, 606
interval estimation, 674, 687–688
inverse Mills ratio, 538
inverse of matrix, 713
IQ: ability and, 279–283, 284–285; nonrandom sampling, 294–295
irrelevant variables, including, 77–78
IV: See instrumental variables
J
JEL: See Journal of Economic Literature (JEL)
job training: sample model as self-selection problem, 3; worker productivity and, program evaluation, 229; as self-selection problem, 230
joint distributions: features of, 652–658; independence and, 649–651
joint hypotheses tests, 127
jointly statistically significant/insignificant, 130
joint probability, 649
Journal of Economic Literature (JEL), 606
junior colleges vs. universities, 124–127
just identified equations, 511

K
Koyck distributed lag, 571–572
kurtosis, 658

L
labor economists, 605, 607
labor force: See employment and unemployment; women in labor force
labor supply and demand, 500–501
labor supply function, 639
LAD (least absolute deviations) estimation, 300–302
lag distribution, 315
lagged dependent variables: as proxy variables, 283–284; serial correlation and, 374–375
lagged endogenous variables, 591–592
lagged explanatory variables, 316
Lagrange multiplier (LM) statistics (see also heteroskedasticity): heteroskedasticity-robust, 248–249; main discussion, 158–160
land quality and fertilizers, 23
large sample properties, 681–683
latent variable models, 526
law enforcement: city crime levels and, causality, 13; murder rates and, SEM, 501–502
law of iterated expectations, 664
law of large numbers, 682
law school rankings: as dummy variables, 216–217; residual analysis, 190
leads and lags estimators, 584
least absolute deviations (LAD) estimation, 300–302
least squares estimator, 686
likelihood ratio statistic, 529
limited dependent variables: censored and truncated regression models, 547–552; corner solution response (see Tobit model); count response, Poisson regression for, 543–547; overview, 524–525; sample selection corrections, 554–558
linear functions, 630–631
linear independence, 714
linear in parameters assumption: for OLS in matrix form, 723–724; for simple linear regressions, 40, 44; for time series regressions, 317–318
linearity and weak dependence assumption, 348–349
linear probability model (LPM) (see also limited dependent variables): heteroskedasticity and, 265–266; main discussion, 224–229
linear regression model, 40, 64
linear relationship among independent variables, 83–86
linear time trends, 330
literature review, 607–608
loan approval rates: F and t statistics, 150; multicollinearity, 85; program evaluation, 230
logarithms: in multiple regressions, 171–173; natural, overview, 736–739; predicting y when log(y) is dependent, 191–193; qualitative information and, 211–212; real dollars and, 327; in simple regressions, 37–39; in time series regressions, 323–324
log function, 636
logit and probit models: interpreting estimates, 530–536; maximum likelihood estimation of, 528–529; specifying, 525–528; testing multiple hypotheses, 529–530
log-likelihood functions, 529
longitudinal data: See panel data
long-run elasticity, 324
long-run multiplier: See long-run propensity (LRP)
long-run propensity (LRP), 316
loss functions, 586
LRP (long-run propensity), 316
lunch program and math performance, 44–45
M
macroeconomists, 606
MAE (mean absolute error), 591
marginal effect, 630
marital status: See qualitative information
martingale difference sequence, 574
martingale functions, 587
matched pair samples, 449
mathematical statistics: See statistics
math performance and lunch program, 44–45
matrices (see also OLS in matrix form): addition, 710; basic definitions, 709–710; differentiation of linear and quadratic forms, 715; idempotent, 715; linear independence and rank of, 714; moments and distributions of random vectors, 716–717; multiplication, 711–712; operations, 710–713; quadratic forms and positive definite, 714–715
matrix notation, 721
maximum likelihood estimation, 528–529, 685–686
MCAR (missing completely at random), 293
mean, using summation operator, 629–630
mean absolute error (MAE), 591
mean independence, 23
mean squared error (MSE), 680
measurement error: IV solutions to, 479–481; properties of OLS under, 287–292
men, return to education, 468
measures of association, 658
measures of central tendency, 655–657
measures of variability, 656
median, 630, 655
method of moments approach, 25–26, 685
micronumerosity, 85
military personnel survey, oversampling in, 295
minimum variance unbiased estimators, 106, 686, 727
minimum wages: causality, 13; employment/unemployment and: AR(1) serial correlation, testing for, 377–378; detrending, 334–335; logarithmic form, 323–324; SC-robust standard error, 391; in Puerto Rico, effects of, 7–8
minorities and loans: See loan approval rates
missing at random, 294
missing completely at random (MCAR), 293
missing data, 293–294
misspecification: in empirical projects, 613; functional form, 275–279; unbiasedness and, 78–83; variances, 86–87
motherhood, teenage, 448–449
moving average process of order one (MA(1)), 346
MSE (mean squared error), 680
multicollinearity: 2SLS and, 477; among explanatory variables, 293; main discussion, 83–86
multiple hypotheses tests, 127
multiple linear regression (MLR) model, 63
multiple regression analysis (see also data issues; estimation and estimators; heteroskedasticity; hypotheses; OLS (ordinary least squares); predictions; R-squareds): adding regressors to reduce error variance, 185–186; advantages over simple regression, 60–64; confidence intervals, 122–124; interpreting equations, 67; null hypothesis, 108; omitted variable bias, 78–83; over controlling, 184–185
multiple regressions (see also qualitative information): beta coefficients, 169; hypotheses with more than one parameter, 124–127; misspecified functional forms, 275; motivation for multiple regression, 61, 62; nonrandom sampling, 294–295; normality assumption and, 107; productivity and, 360; quadratic functions, 173–177; with qualitative information: of baseball players, race and, 220–221; computer usage and, 218; with different slopes, 218–221; education and, 218–220; gender and, 207–211, 212–214, 218–221; with interacting terms, 218; law school rankings and, 216–217; with log(y) dependent variable, 213–214; marital status and, 219–220; with multiple dummy variables, 212–213; with ordinal variables, 215–217; physical attractiveness and, 216–217; random effects model, 443–444; random slope model, 285; reporting results, 137–138; t test, 110; with unobservables, general approach, 284–285; with unobservables, using proxy, 279–285; working individuals in 1976, 6
multiple restrictions, 127
multiple-step-ahead forecast, 587, 592–594
multiplicative measurement error, 289
multivariate normal distribution, 716–717
municipal bond interest rates, 214–215
murder rates: SEM, 501–502; static Phillips curve, 314

N
natural experiments, 410, 469
natural logarithms, 736–739. See also logarithms
netted out, 69
nominal dollars, 326
nominal vs. real, 326
nonexperimental data, 2
nonlinear functions, 634–640
nonlinearities, incorporating in simple regressions, 37–39
nonnested models: choosing between, 182–184; functional form misspecification and, 278–279
nonrandom samples, 294–295, 553
nonstationary time series processes, 345–346
no perfect collinearity assumption: in matrix form, 723; for multiple linear regressions, 74–76, 77; for time series regressions, 318, 349
normal distribution, 665–669
normality assumption: for multiple linear regressions, 105–108; for time series regressions, 322
normality of errors assumption, 726
normality of estimators in general, asymptotic, 683–684
normality of OLS, asymptotic: in multiple regressions, 154–160; in time series regressions, 351–354
normal sampling distributions: for multiple linear regressions, 107–108; for time series regressions, 322–323
no serial correlation assumption (see also serial correlation): for OLS in matrix form, 724–725; for time series regressions, 320–322, 351–352
n-R-squared statistic, 159
null hypothesis, 108–110, 694. See also hypotheses
numerator degrees of freedom, 129

O
observational data, 2
OLS (ordinary least squares): cointegration and, 583–584; comparison of simple and multiple regression estimates, 69–70; consistency (see consistency of OLS); logit and probit vs., 533–535; in multiple regressions: algebraic properties, 64–72; computational properties, 64–66; effects of data scaling, 166–170; fitted values and residuals, 68; goodness-of-fit, 70–71; interpreting equations, 65–66; Lagrange multiplier (LM) statistic, 158–160; measurement error and, 287–292; normality, 154–160; partialling out, 69; regression through origin, 73; statistical properties, 73–81; Poisson vs., 545, 546–547; in simple regressions: algebraic properties, 32–34; defined, 27; deriving estimates, 24–32; statistical properties, 45–50; units of measurement, changing, 36–37; simultaneity bias in, 503–504; in time series regressions: correcting for serial correlation, 383–386; FGLS vs., 385–386; finite sample properties, 317–323; normality, 351–354; SC-robust standard errors, 388–391; with serially correlated errors, properties of, 373–375; Tobit vs., 540–542
OLS and Tobit estimates, 540–542
OLS asymptotics: in matrix form, 728–731; in multiple regressions: consistency, 150–154; efficiency, 161–162; overview, 149–150; in time series regressions, consistency, 348–354
OLS estimators (see also heteroskedasticity): defined, 40; in multiple regressions: efficiency of, 89–90; variances of, 81–89; sampling distributions of, 105–108; in simple regressions: expected value of, 73–81; unbiasedness of, 40–45, 77; variances of, 45–48; in time series regressions: sampling distributions of, 322–323; unbiasedness of, 317–323; variances of, 320–322
OLS in matrix form: asymptotic analysis, 728–731; finite sample properties, 723–726; overview, 720–722; statistical inference, 726–728; Wald statistics for testing multiple hypotheses, 730–731
OLS intercept estimates, defined, 65–66
OLS regression line (see also OLS (ordinary least squares)): defined, 28; in multiple regressions, 65
OLS slope estimates, defined, 65
omitted variable bias (see also instrumental variables): general discussions, 78–83; using proxy variables, 279–285
one-sided alternatives, 695
one-step-ahead forecasts, 586, 588
one-tailed tests, 110, 696. See also t tests
online databases, 609
online search services, 607–608
order condition, 479, 507
ordinal variables, 214–217
outliers: guarding against, 300–302; main discussion, 296–300
out-of-sample criteria, 591
overall significance of regressions, 135
over controlling, 184–185
overdispersion, 545
overidentified equations, 511
overidentifying restrictions, testing, 482–485
overspecifying the model, 78

P
pairwise uncorrelated random variables, 660–661
panel data: applying 2SLS to, 487–488; applying methods to other structures, 448–450; correlated random effects, 445–447; differencing with more than two periods, 420–425; fixed effects, 435–441; independently pooled cross sections vs., 403; organizing, 417; overview, 9–10; pitfalls in first differencing, 424; random effects, 441–445; simultaneous equations models with, 514–516; two-period analysis, 417–419; two-period policy analysis with, 417–419; unbalanced, 440–441
Panel Study of Income Dynamics, 608
parameters: defined, 4, 674; estimation, general approach to, 684–686
partial derivatives, 641
partial effect, 66, 67–68
partial effect at average (PEA), 531–532
partialling out, 69
partitioned matrix multiplication, 712–713
pdf (probability density functions), 647
percentage point change, 634
percentages, 633–634; change, 633
percent correctly predicted, 227, 530
perfect collinearity, 74–76
permanent income hypothesis, 513–514
pesticide usage, over controlling, 185
physical attractiveness and wages, 215–216
pizzas, expected revenue, 654
plug-in solution to the omitted variables problem, 280
point estimates, 674
point forecasts, 588
Poisson distribution, 544, 545
Poisson regression model, 543–547
policy analysis: with pooled cross sections, 407–412; with qualitative information, 210, 229–231; with two-period panel data, 417–419
pooled cross sections (see also independently pooled cross sections): applying 2SLS to, 487–488; overview, 8; policy analysis with, 407–412
population, defined, 674
population model, defined, 73
population regression function (PRF), 23
population R-squareds, 181
positive definite and semidefinite matrices, defined, 715
poverty rate: in absence of suitable proxies, 285; excluding from model, 80
power of test, 694
practical significance, 120
practical vs. statistical significance, 120–124, 702–703
Prais-Winsten (PW) estimation, 383–384, 386, 390
predetermined variables, 592
predicted variables, 21. See also dependent variables
prediction error, 188
predictions: confidence intervals for, 186–189; with heteroskedasticity, 264–266; residual analysis, 190; for y when log(y) is dependent, 191–193
predictor variables, 23. See also independent variables
price index, 326–327
prisons: population and crime rates, 515–516; recidivism, 549–551
probability (see also conditional distributions; joint distributions): features of distributions, 652–658; independence, 649–651; joint, 649; normal and related distributions, 665–669; overview, 645; random variables and their distributions, 645–649
probability density function (pdf), 647
probability limits, 681–683
probit model: See logit and probit models
productivity: See worker productivity
program evaluation, 210, 229–231
projects: See empirical analysis
property taxes and housing prices, 8
proportions, 733–734
proxy variables, 279–285
pseudo R-squareds, 531
public finance study researchers, 606
Puerto Rico, employment in: detrending, 334–335; logarithmic form, 323–324; time series data, 7–8
p-values: computing and using, 698–700; for F tests, 134–135; for t tests, 118–120

Q
QMLE (quasi-maximum likelihood estimation), 728
quadratic form for matrices, 714–715, 716
quadratic function, 634–636
quadratic time trends, 331
qualitative information (see also linear probability model (LPM)): in multiple regressions: allowing for different slopes, 218–221; binary dependent variable, 224–229; describing, 205–206; discrete dependent variables, 231–232; interactions among dummy variables, 217; with log(y) dependent variable, 211–212; multiple dummy independent variables, 212–217; ordinal variables, 214–217; overview, 205; policy analysis and program evaluation, 229–231; proxy variables, 282–283; single dummy independent variable, 206–212; testing for differences in regression functions across groups, 221–224; in time series regressions: main discussion, 324–329; seasonal, 336–338
quantile regression, 302
quasi-demeaned data, 442
quasi-differenced data, 382, 390
quasi-experiment, 410
quasi-natural experiments, 410, 469
quasi-likelihood ratio statistic, 546
quasi-maximum likelihood estimation (QMLE), 545, 728

R
R²j, 83–86
race: arrests and, 229; baseball player salaries and, 220–221; discrimination in hiring: asymptotic confidence interval, 692–693; hypothesis testing, 698; p-value, 701
random coefficient model, 285–287
random effects: correlated, 445–447; estimator, 442; fixed effects vs., 444–445; main discussion, 441–445
random sampling: assumption: for multiple linear regressions, 74; for simple linear regressions, 40–41, 42, 44; cross-sectional data and, 5–7; defined, 675
random slope model, 285–287
random variables, 645–649
random vectors, 716
random walks, 354
rank condition, 479, 497, 506–507
rank of matrix, 714
rational distributed lag models, 572–574
R&D and sales: confidence intervals, 123–124; nonnested models, 182–184; outliers, 296–298
RDL (rational distributed lag) models, 572–574
real dollars, 326
recidivism, duration analysis, 549–551
reduced form equations, 473, 504
reduced form error, 504
reduced form parameters, 504
regressands, 21. See also dependent variables
regression analysis, 50–51. See also multiple regression analysis; simple regression model; time series data
regression specification error test (RESET), 277–278
regression through origin, 50–52
regressors, 21, 185–186. See also independent variables
rejection region, 695
rejection rule, 110. See also t tests
relative change, 633
relative efficiency, 679–680
relevant variables, excluding, 78–83
reporting multiple regression results, 137–138
resampling method, 203
rescaling, 166–168
RESET (regression specification error test), 277–278
residual analysis, 190
residuals (see also OLS (ordinary least squares)): in multiple regressions, 68, 297–298; in simple regressions, 27, 32, 48; studentized, 297–298
residual sum of squares (SSR): See sum of squared residuals
response probability, 225, 525
response variables, 21. See also dependent variables
restricted model, 128–129. See also F tests
retrospective data, 2
returns on equity and CEO salaries: fitted values and residuals, 32; goodness-of-fit, 35; OLS estimates, 29–30
RMSE (root mean squared error), 50, 88, 591
robust regression, 302
rooms and housing prices: beta coefficients, 175–176; interaction effect, 177–179; quadratic functions, 175–177; residual analysis, 190
root mean squared error (RMSE), 50, 88, 591
row vectors, 709
R-squareds (see also predictions): adjusted, 181–184, 374; after IV estimation, 471; change in unit of measurement and, 37; in fixed effects estimation, 437, 438–439; for F statistic, 133–134; in multiple regressions, main discussion, 70–73; for probit and logit models, 531; for PW estimation, 383–384; in regressions through origin, 50–51, 73; in simple regressions, 35–36; size of, 180–181; in time series regressions, 374; trending dependent variables and, 334–335; uncentered, 214

S
salaries: See CEO salaries; income; wages
sales: CEO salaries and: constant elasticity model, 39; nonnested models, 183–184; motivation for multiple regression, 63–64; R&D and (see R&D and sales)
sales tax increase, 634
sample average, 675
sample correlation coefficient, 685
sample covariance, 685
sample regression function (SRF), 28, 65
sample selection corrections, 553–558
sample standard deviation, 683
sample variation in the explanatory variable assumption, 42, 44
sampling, nonrandom, 293–300
sampling distributions: defined, 676; of OLS estimators, 105–108
sampling standard deviation, 693
sampling variances: of estimators in general, 678–679; of OLS estimators: for multiple linear regressions, 82, 83; for simple linear regressions, 47–48
savings: housing expenditures and, 502; income and: heteroskedasticity, 254–256; scatterplot, 25; measurement error in, 289; with nonrandom sample, 294–295
scalar multiplication, 710
scalar variance-covariance matrices, 724
scatterplots: R&D and sales, 297–298; savings and income, 25; wage and education, 27
school lunch program and math performance, 44–45
school size and student performance, 113–114
score statistic, 158–160
scrap rates and job training: 2SLS, 487; confidence interval, 700–701; confidence interval and hypothesis testing, 702; fixed effects estimation, 436–437; measurement error in, 289; program evaluation, 229; p-value, 700–701; statistical vs. practical significance, 121–122; two-period panel data, 418; unbalanced panel data, 441
seasonal dummy variables, 337
seasonality: forecasting, 594–598; serial correlation and, 381; of time series, 336–338
seasonally adjusted patterns, 336
selected samples, 553
self-selection problems, 230
SEM: See simultaneous equations models
semi-elasticity, 39, 639
sensitivity analysis, 613
sequential exogeneity, 363
serial correlation: correcting for, 381–387; differencing and, 387–389; heteroskedasticity and, 395; lagged dependent variables and, 374–375; no serial correlation assumption, 320–322, 351–354; properties of OLS with, 373–375; testing for, 376–381
serial correlation-robust standard errors, 388–391
serially uncorrelated, 360
short-run elasticity, 324
significance level, 110
simple linear regression model, 20
simple regression model, 20–24 (see also OLS (ordinary least squares)): incorporating nonlinearities in, 37–39; IV estimation, 462–471; multiple regression vs., 60–63; regression on a constant, 51; regression through origin, 50–51
simultaneity bias, 504
simultaneous equations models: bias in OLS, 503–504; identifying and estimating structural equations, 504–510; overview and nature of, 499–503; with panel data, 514–516; systems with more than two equations, 510–511; with time series, 511–514
skewness, 658
sleeping vs. working tradeoff, 415–416
slopes (see also OLS estimators; regression analysis): change in unit of measurement and, 36–37, 39; defined, 21, 630; parameter, 21; qualitative information and, 218–221; random, 285–287; in regressions on a constant, 51; in regressions through origin, 50–51
smearing estimates, 191
smoking: birth weight and: asymptotic standard error, 158; data scaling, 166–170; cigarette taxes and consumption, 411–412; demand for cigarettes, 261–262; IV estimation, 470; measurement error, 292
Social Sciences Citation Index, 606
soybean yields and fertilizers: causality, 11, 12; simple equation, 21–22
specification search, 613
spreadsheets, 610
spurious regression, 332–333, 578–580
square matrices, 709–710
SRF (sample regression function), 28, 65
SSE (explained sum of squares), 34, 70–71
SSR (residual sum of squares): See sum of squared residuals
SST (total sum of squares), 34, 70–71
SSTj (total sample variation in xj), 83
stable AR(1) processes, 347
standard deviation: of β̂j, 89–90; defined, 45, 657; estimating, 49; properties of, 657
standard error of the regression (SER), 50, 88
standard errors: asymptotic, 157; of β̂j, 88; heteroskedasticity-robust, 246–247; of OLS estimators, 87–89; of β̂1, 50; serial correlation-robust, 388–391
standardized coefficients, 169–170
standardized random variables, 657–658
standardized test scores: beta coefficients, 169; collinearity, 74–75; interaction effect, 178–179; motivation for multiple regression, 61, 62; omitted variable bias, 80, 81; omitting unobservables, 285; residual analysis, 190
standard normal distribution, 666–668, 743–744
static models, 314, 350
static Phillips curve, 314, 322–323, 377, 378, 386
stationary time series processes, 345–346
statistical inference: with IV estimator, 466–469; for OLS in matrix form, 726–728
statistical significance: defined, 115; economic/practical significance vs., 120–124, 702; joint, 130
statistical tables, 743–749
statistics (see also hypothesis testing): asymptotic properties of estimators, 681–684; finite sample properties of estimators, 675–680; interval estimation and confidence intervals, 687–693; notation, 703; overview and definitions, 674–675; parameter estimation, general approaches to, 684–686
stepwise regression, 614
stochastic process, 313, 345
stock prices and trucking regulations, 325
stock returns, 393, 394. See also efficient markets hypothesis (EMH)
stratified sampling, 295
strict exogeneity assumption, 414–420, 570
strictly exogenous variables: correcting for serial correlation, 381–387; testing for serial correlation, 376–381
strict stationarity, 345
strongly dependent time series: See highly persistent time series
structural equations: definitions, 471, 500, 501, 504; identifying and estimating, 504–510
structural error, 501
structural parameters, 504
student enrollment, t test, 116–117
studentized residuals, 298
student performance (see also college GPA; final exam scores; standardized test scores): in math, lunch program and, 44–45; school expenditures and, 85; school size and, 113–114
style hints for empirical papers, 619–621
summation operator, 628–630
sum of squared residuals (see also OLS (ordinary least squares)): in multiple regressions, 70–71; in simple regressions, 34
supply shock, 353
Survey of Consumer Finances, 608
symmetric matrices, 712
systematic part, defined, 24
system estimation methods, 511

T
tables, statistical, 743–749
tax exemption: See under fertility rate
T-bill rates: cointegration, 580–584; error correction model, 585; inflation and deficits (see under interest rates); random walk characterization of, 355, 356; unit root test, 576
t distribution: critical values table, 745; discussions, 108–110, 669–670, 717; for standardized estimators, 108–110
teachers, salary-pension tradeoff, 137–138
teenage motherhood, 448–449
tenure (see also wages): interpreting equations, 67; motivation for multiple regression, 63–64
testing overidentifying restrictions, 482–485
test scores as indicators of ability, 481
test statistic, 695
text editor, 609
text files and editors, 608–609
theorems: asymptotic efficiency of OLS, 162; for time series regressions, 351–354; consistency of OLS: for multiple linear regressions, 150–154; for time series regressions, 348–351; Gauss-Markov: for multiple linear regressions, 89–90; for time series regressions, 320–322; normal sampling distributions, 107–108; for OLS in matrix form: Gauss-Markov, 725–726; statistical inference, 726–728; unbiasedness, 726; variance-covariance matrix of OLS estimator, 724–725; sampling variances of OLS estimators: for simple linear regressions, 47–48; for time series regressions, 320–322; unbiased estimation of σ²: for multiple linear regressions, 88–89; for time series regressions, 321; unbiasedness of OLS: for multiple linear regressions, 77; for time series regressions, 317–320
theoretical framework, 615
three stage least squares, 511
time-demeaned data, 435
time series data: absence of serial correlation, 360–363; applying 2SLS to, 485–486; cointegration, 580–584; dynamically complete models, 360–363; error correction models, 584–586; examples of models, 313–316; functional forms, 323–324; heteroskedasticity in, 391–395; highly persistent (see highly persistent time series); homoskedasticity assumption for, 363–364; infinite distributed lag models, 569–574; nature of, 312–313; OLS (see under OLS (ordinary least squares); OLS estimators); overview, 7–8; in panel data, 9–10; in pooled cross sections, 8–9; with qualitative information (see under qualitative information); seasonality, 336–338; simultaneous equations models with, 511–514; spurious regression, 578–580; stationary and nonstationary, 345–346; unit roots, testing for, 574–579; weakly dependent, 346–348
time trends: See trends
time-varying error, 413
Tobit model: interpreting estimates, 537–542; overview, 536–537; specification issues in, 543
top coding, 548
total sample variation in xj, 83
total sum of squares (SST), 34, 70–71
trace of matrix, 713
traffic fatalities, beer taxes and, 184
training grants (see also job training): program evaluation, 229; single dummy variable, 210–211
transpose of matrix, 712
treatment group, 210
trends: characterizing trending time series, 329–332; detrending, 334–335; forecasting, 594–598; high persistence vs., 352; R-squared and trending dependent variable, 334–335; seasonality and, 337–338; time, 329; using trending variables, 332–333
trend-stationary processes, 348
trucking regulations and stock prices, 325
true model, defined, 74
truncated normal regression model, 551
truncated regression models, 548, 551–552
t statistics (see also t tests): asymptotic, 157; defined, 109, 696; F statistic and, 132–133; heteroskedasticity-robust, 246–247
t tests (see also t statistics): for AR(1) serial correlation, 376–378; null hypothesis, 108–110; one-sided alternatives, 110–114; other hypotheses about βj, 116–118; overview, 108–110; p-values for, 118–120
two-period panel data: analysis, 417–419; policy analysis with, 417–419
two-sided alternatives, 695–696
two stage least squares: applied to pooled cross sections and panel data, 487–488; applied to time series data, 485–486; with heteroskedasticity, 485–486; multiple endogenous explanatory variables, 478–479; for SEM, 508–510, 511; single endogenous explanatory variable, 475–477; testing multiple hypotheses after estimation, 479; testing for endogeneity, 481–482
two-tailed tests, 115, 697. See also t tests
Type I error, 694
Type II error, 694

U
u (unobserved term): CEV assumption and, 292; forgoing specifying models with, 284–285; general discussions, 4–5, 21–23; in time series regressions, 319; using proxy variables for, 279–285
unanticipated inflation, 353
unbalanced panels, 440–441
unbiased estimation of σ²: for multiple linear regressions, 88–89; for simple linear regressions, 49; for time series regressions, 321
unbiasedness: in general, 677–678; of OLS in matrix form, 724; in multiple regressions, 77; for simple linear regressions, 43–44; in simple regressions, 40–44; in time series regressions, 317–323, 373–375; of σ̂², 726
VAR model, 589, 597–598
vector autoregressive model, 589, 597–598
vectors, defined, 709
veterans, earnings of, 469
voting outcomes
  campaign expenditures and, deriving OLS estimate, 31
  economic performance and, 328–329
  perfect collinearity, 75–76

W
wages
  causality, 13–14
  education and
    2SLS, 488
    conditional expectation, 661–665
    heteroskedasticity, 46–47
    independent cross sections, 405–406
    nonlinear relationship, 37–39
    OLS estimates, 30–31
    partial effect, 641
    rounded averages, 33
    scatterplot, 27
    simple equation, 22
  experience and. See under experience
  with heteroskedasticity-robust standard errors, 246–247
  labor supply and demand, 500–501
  labor supply function, 639
  multiple regressions. See also qualitative information
    homoskedasticity, 82–83
Wald test statistics, 529–530, 537, 730–731
weak instruments, 471
weakly dependent time series, 346–348
wealth. See financial wealth
weighted least squares estimation
  linear probability model, 265–267
  overview, 254
  prediction and prediction intervals, 264–265
  for time series regressions, 390, 393–394
  when assumed heteroskedasticity function is wrong, 262–264
  when heteroskedasticity function must be estimated, 258–263
  when heteroskedasticity is known up to a multiplicative constant, 254–259
White test for heteroskedasticity, 252–254
within estimators, 435. See also fixed effects
within transformation, 435
women in labor force
  heteroskedasticity, 265–267
  LPM, logit, and probit estimates, 533–535
  return to education
    2SLS, 477
    IV estimation, 467
    testing for endogeneity, 482
    testing overidentifying restrictions, 482
  sample selection correction, 556–557
women's fertility. See fertility rate
worker compensation laws and weeks out of work, 411
worker productivity
  job training and, program evaluation, 229
  sample model, 4
  in U.S., trend in, 331
  wages and, 360
working vs. sleeping tradeoff, 415–416
working women. See women in labor force
writing empirical papers, 614–621
  conceptual or theoretical framework, 615
  conclusions, 618–619
  data description, 617–618
  econometric models and estimation methods, 615–617
  introduction, 614–615
  results section, 618
  style hints, 619–621

Y
year dummy variables
  in fixed effects model, 436–438
  pooling independent cross sections across time, 403–407
  in random effects model, 443–444

Z
zero conditional mean assumption
  homoskedasticity vs., 45
  for multiple linear regressions, 62–63, 76–77
  for OLS in matrix form, 724
  for simple linear regressions, 23–24, 42, 44
  for time series regressions, 318–319, 349
zero mean and zero correlation assumption, 152
zero-one variables, 206. See also qualitative information